A prompt is pinned to its model and its eval data — like pyproject.toml declares deps and poetry.lock pins what actually works. When either dependency moves, the prompt goes stale; driftless regenerates it, validates on holdout, and opens a PR with the evidence.
One drifts from the outside, one from the inside. Either way, the same automatable job keeps your prompt in sync.
Providers deprecate and retire models on their own schedule. A deadline you don't control forces a migration before things break.
deprecation
Your team adds new labeled data (say, from customer feedback) and what counts as "correct" shifts. The prompt needs re-tuning to match.
data_change
Either way it's the same job: re-tune the prompt, validate on held-out data, open a PR with the evidence.
driftless is both: lock regeneration when dependencies change, delivered as an automated, evidence-backed PR.
You can bump versions in pyproject.toml, but until you run poetry lock the lockfile is stale — install may not match reality. Same here: swap the model or update labels and the prompt is out of sync with what "correct" means.
driftless.yml = declared deps · prompts/ = lock
Poetry doesn't regenerate the lock for you; you run it yourself. Dependabot watches upstream and opens a PR that your CI tests. driftless watches model lifecycle and dataset drift, runs your eval, repairs the prompt, and opens a PR with metrics attached.
migrate / refine → open-pr
Unlike Poetry's resolver, prompt "locking" is empirical: behavior isn't declared, it's measured on your eval. That's the loop driftless automates.
When a provider retires a model, every workflow that depends on it breaks or silently degrades. Changing the model ID is the easy part — the new model behaves differently, so prompts, format instructions, and few-shot examples often need to change too. And before you ship, you need proof the replacement still meets your quality bar.
Declare how to run your eval, how to override the model, which files may be edited, and what quality bar must hold — in a single driftless.yml. The tool drives the rest.
Find probable LLM usage and flag at-risk models.
Scaffold a migration-ready contract from the scan.
Score baseline vs. target on your real eval.
Repair prompts/configs, iterate, validate on holdout.
Produce an evidence-backed markdown scorecard.
Open a PR — or an issue when blocked.
The customer owns the workflow. The tool orchestrates it. We shell out to your eval command with the model overridden — we never reimplement your pre/post-processing.
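As a rough illustration of "shell out with the model overridden", the sketch below runs an eval command unchanged and swaps only one environment variable. It is not driftless's actual implementation: the `run_eval` helper is hypothetical, `gpt-4o-mini` is an example target, and the child process here is a stand-in for a real eval script such as `python evals/run_eval.py`.

```python
import os
import subprocess
import sys


def run_eval(command: list[str], env_var: str, model: str) -> subprocess.CompletedProcess:
    """Run the eval command as-is, overriding only the model environment variable.

    Hypothetical helper: the eval's own pre/post-processing is left untouched.
    """
    env = {**os.environ, env_var: model}
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Stand-in eval command: a child process that reports which model it was given.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['SUPPORT_CLASSIFIER_MODEL'])",
    ]
    result = run_eval(child, "SUPPORT_CLASSIFIER_MODEL", "gpt-4o-mini")
    print(result.stdout.strip())
```

Because the override travels through the environment, the eval script only needs to read its model ID from that variable; nothing else about the workflow has to change.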
The engine only edits files you declare editable, applies edits in a sandbox, and never commits a change that fails on a holdout split it never tuned on.
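One way to keep a holdout split that the repair loop never sees is to assign each example by hashing a stable ID, so the split does not change between runs. This is a minimal sketch under that assumption; how driftless actually builds its split is not stated here, and `split_of`, `partition`, and the 20% fraction are illustrative.

```python
import hashlib


def split_of(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Assign an example to 'tune' or 'holdout' by hashing its ID (stable across runs)."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return "holdout" if bucket < holdout_fraction else "tune"


def partition(example_ids, holdout_fraction: float = 0.2):
    """Split IDs into (tune, holdout) lists; the two lists never overlap."""
    tune, holdout = [], []
    for example_id in example_ids:
        target = holdout if split_of(example_id, holdout_fraction) == "holdout" else tune
        target.append(example_id)
    return tune, holdout
```

A repair loop would iterate only on the `tune` IDs and score the final candidate once on `holdout` before anything is committed.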
No synthetic benchmarks. Replacements are scored through your actual eval command, with F1, schema-error rate, refusals, latency, and cost.
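To make two of those metrics concrete, here is a small sketch of how F1 and schema-error rate could be computed from eval outputs. It assumes macro-averaged F1 over classification labels and treats "schema error" as "not a JSON object with the required keys"; the real tool may average differently and would more likely validate against a full JSON Schema.

```python
import json


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def schema_error_rate(raw_outputs: list[str], required_keys: set[str]) -> float:
    """Fraction of outputs that are not JSON objects containing all required keys."""

    def is_valid(raw: str) -> bool:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and required_keys.issubset(obj)

    if not raw_outputs:
        return 0.0
    return sum(not is_valid(raw) for raw in raw_outputs) / len(raw_outputs)
```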
Every migration ships a scorecard: current vs. naive vs. migrated, the exact edits made, unmet thresholds, and holdout validation — straight into the PR body.
An LLM-backed generator fixes prompts by default — provider-neutral across OpenAI and Anthropic. Customize the prompt, or plug in your own deterministic generator.
Decide when to migrate — deprecation deadlines, cost savings, quality gains, or new models — with per-trigger thresholds and a snooze list.
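A deprecation trigger with a threshold and a snooze list might reduce to a check like the one below. This is purely illustrative: the function name, its parameters, and the 60-day window are assumptions, not driftless's policy schema.

```python
from datetime import date


def should_open_pr(
    workflow: str,
    retires_on: date,
    today: date,
    max_days_before_retirement: int,
    snoozed: set[str],
) -> bool:
    """Return True when a non-snoozed workflow's model retires within the window."""
    if workflow in snoozed:
        return False
    days_left = (retires_on - today).days
    return days_left <= max_days_before_retirement
```

With a 60-day window, a model retiring in 41 days would trigger a migration unless the workflow is on the snooze list.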
The CLI is the engine — testable locally before any CI is involved. A composite GitHub Action runs the same commands on a schedule.
driftless produces a markdown report that becomes the PR body. Reviewers see exactly what changed, why, and the measured impact on held-out data — so approving an automated migration feels like reviewing a teammate's PR.
| Metric | Current | Naive swap | Migrated |
|---|---|---|---|
| model | 3.5-turbo | 4o-mini | 4o-mini |
| F1 | 1.00 | 0.00 | 1.00 |
| schema err | 0% | 100% | 0% |
| refusals | 0% | 0% | 0% |
| holdout | — | — | ✓ pass |
workflows:
  support_classifier:
    run:
      command: "python evals/run_eval.py"
      input_path: evals/inputs.jsonl
      output_path: evals/outputs.jsonl
    model:
      current: gpt-3.5-turbo
      env_var: SUPPORT_CLASSIFIER_MODEL
    files:
      editable: [prompts/system.md, prompts/examples.yml]
    eval:
      labels_path: evals/labels.jsonl
      schema_path: schemas/ticket.schema.json
      thresholds:
        min_f1: 0.9
        max_schema_error_rate: 0.02
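The thresholds in the config above can be read as a simple gate on measured metrics. The sketch below assumes that a `min_` prefix means "at least" and a `max_` prefix means "at most", which is inferred from the key names rather than documented; `unmet_thresholds` is a hypothetical helper.

```python
def unmet_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """List every threshold the measured metrics fail to meet (empty list = all met)."""
    failures = []
    for key, limit in thresholds.items():
        kind, _, name = key.partition("_")  # e.g. "min_f1" -> ("min", "_", "f1")
        value = metrics[name]
        if kind == "min" and value < limit:
            failures.append(f"{name}={value} is below {key}={limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}={value} is above {key}={limit}")
    return failures


THRESHOLDS = {"min_f1": 0.9, "max_schema_error_rate": 0.02}

if __name__ == "__main__":
    # Figures taken from the scorecard's "Naive swap" and "Migrated" columns.
    print(unmet_thresholds({"f1": 0.0, "schema_error_rate": 1.0}, THRESHOLDS))
    print(unmet_thresholds({"f1": 1.0, "schema_error_rate": 0.0}, THRESHOLDS))
```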
# discover at-risk workflows + apply your policy — gates CI
$ driftless plan
Migration plan
┌────────────────────┬──────────────────────┬─────────┬───────────┐
│ Workflow           │ Migrate              │ Retires │ Decision  │
├────────────────────┼──────────────────────┼─────────┼───────────┤
│ support_classifier │ 3.5-turbo → 4o-mini  │ 41d     │ PR (high) │
└────────────────────┴──────────────────────┴─────────┴───────────┘
# exits non-zero when action is needed
# .github/workflows/llm-model-scan.yml
on:
  schedule: [{ cron: "0 9 * * 1" }] # every Monday
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: driftless/action@v1
        with:
          command: plan
A single typed schema drives every command. Typos surface as errors, not silent misbehavior.
We orchestrate your eval with the model overridden — never reimplementing your logic.
Every run emits pass / partial / blocked. A blocked migration still produces an actionable artifact.
Nothing is committed unless it passes on data the repair loop never optimized against.
Install the CLI, point it at your repo, and let evidence-backed migrations come to you as pull requests.
pipx install driftless
driftless init
driftless scan