Documentation
driftless keeps prompts in sync when their dependencies change. Think Poetry (regenerate the lock when model or eval data moves) plus Dependabot (watch, test, open a PR with evidence).
Introduction
LLM providers deprecate and retire models on aggressive timelines. When a model is retired, every workflow that depends on it breaks or silently degrades. Swapping a model ID is rarely enough — the new model behaves differently, so prompts, format instructions, and few-shot examples often need to change too, and you need evidence that the new model still meets your quality bar before you ship.
driftless is CLI-first: the CLI is the engine, and the GitHub Action just invokes the same commands. You describe your model-dependent workflow once in a driftless.yml contract, and the tool orchestrates everything else.
Two analogies: driftless.yml declares deps (model + dataset); editable prompts are the lockfile. When deps drift, regenerate the lock — except LLM behavior is empirical, so driftless scores candidates on your eval instead of running a resolver. Delivery is Dependabot-style: watch → test → PR with metrics.
Core idea: the customer owns the workflow; driftless orchestrates it. We shell out to your eval command with the model overridden — never reimplementing your pre/post-processing.
Installation
Once published, install with pipx (recommended) so the CLI lives in its own isolated environment:
pipx install driftless
For local development from source:
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
LLM-backed repair needs at least one provider key. Install the optional extra and export a key:
pip install -e ".[llm]" # openai + anthropic SDKs
export OPENAI_API_KEY=sk-... # or ANTHROPIC_API_KEY
Requires Python 3.10+. The provider is chosen automatically by which API key is present, so the same contract works across OpenAI and Anthropic.
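As a rough illustration of key-based provider selection, the sketch below picks a provider from whichever key is set. The function name pick_provider and the precedence when both keys are present (OpenAI first) are assumptions made for this example; the source only states that the provider follows the key that is present.

```python
import os
from typing import Mapping, Optional

# Assumption: OpenAI is checked first when both keys exist. The real
# precedence is not documented here.
PROVIDER_KEYS = (
    ("openai", "OPENAI_API_KEY"),
    ("anthropic", "ANTHROPIC_API_KEY"),
)


def pick_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first provider whose API key is set and non-empty."""
    env = os.environ if env is None else env
    for provider, key_name in PROVIDER_KEYS:
        if env.get(key_name):
            return provider
    return None  # no key present: LLM-backed repair is unavailable
```

Passing the environment as a parameter keeps the lookup easy to exercise without touching real credentials.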
Quickstart
From the root of the repo you want to migrate:
# 1. find probable LLM usage and at-risk models
driftless scan
# 2. scaffold a contract for a detected workflow
driftless configure support_classifier
# 3. sanity-check the contract parses and the harness runs
driftless validate -w support_classifier
# 4. score a candidate replacement on your real eval
driftless compare -w support_classifier --to gpt-4o-mini
# 5. repair + validate + produce migrated files
driftless migrate -w support_classifier --to gpt-4o-mini
# 6. render the evidence and open a PR (dry run by default)
driftless report -w support_classifier
driftless open-pr -w support_classifier --create
If you're starting from scratch rather than a scan, driftless init writes a commented driftless.yml template you can fill in.
The workflow contract
A driftless.yml declares one or more model-dependent workflows. Each is a strict, typed schema — unknown keys are rejected, so typos surface as errors rather than silent misbehavior.
workflows:
  support_classifier:
    run:
      command: "python evals/run_eval.py"
      input_path: evals/inputs.jsonl
      output_path: evals/outputs.jsonl
      timeout_seconds: 600
    model:
      current: gpt-3.5-turbo
      target_candidates: [gpt-4o-mini]
      env_var: SUPPORT_CLASSIFIER_MODEL
      config_file: config/llm.yml
      config_path: model
    files:
      editable: [prompts/system.md, prompts/examples.yml]
      readonly: [src/**]
    eval:
      labels_path: evals/labels.jsonl
      schema_path: schemas/ticket.schema.json
      label_field: category
      split: { tuning: 60%, seed: 7 }
    thresholds:
      min_f1: 0.9
      max_schema_error_rate: 0.02
    migration:
      max_iterations: 4
      holdout_required: true
There are two override mechanisms, and a workflow may use both:
- env_var: the harness sets this environment variable to run the workflow under a different model. This is the runtime override used during evaluation.
- config_file + config_path: a dotted path into a JSON/YAML config file. Used when opening a PR, to write the new model ID into the repo.
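The two mechanisms can be pictured with a minimal sketch. The helper names run_with_model, set_dotted, and write_model_id are hypothetical, and the file writer below handles JSON only (the real tool also accepts YAML); this is not driftless's actual implementation.

```python
import json
import os
import subprocess
from pathlib import Path


def run_with_model(command: list, env_var: str, model: str,
                   timeout_seconds: int = 600) -> subprocess.CompletedProcess:
    """Runtime override: run the eval command with env_var set to the model."""
    env = dict(os.environ)
    env[env_var] = model
    return subprocess.run(command, env=env, timeout=timeout_seconds, check=False)


def set_dotted(config: dict, dotted_path: str, value) -> dict:
    """Set a value at a dotted path such as 'llm.model', creating parents."""
    keys = dotted_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config


def write_model_id(config_file: Path, dotted_path: str, model: str) -> None:
    """Repo override: persist the new model ID into a JSON config file."""
    config = json.loads(config_file.read_text())
    set_dotted(config, dotted_path, model)
    config_file.write_text(json.dumps(config, indent=2) + "\n")
```

The environment variable changes nothing on disk, which suits repeated evaluation runs; the config write is the only step that alters the repository.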
Percentages are ergonomic: 60%, 60, and 0.6 all parse to 0.6.
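One way such lenient parsing could work is sketched below. The name parse_fraction is hypothetical, and the rule that bare numbers above 1 are read as percentages is an assumption; how the real parser treats edge cases such as 1 is not stated in the source.

```python
def parse_fraction(value) -> float:
    """Normalize '60%', 60, or 0.6 to a fraction in [0, 1]."""
    if isinstance(value, str):
        text = value.strip()
        if text.endswith("%"):
            return _checked(float(text[:-1]) / 100.0)
        value = float(text)
    number = float(value)
    # Assumption: a bare number above 1 is a percentage, otherwise a fraction.
    return _checked(number / 100.0 if number > 1 else number)


def _checked(fraction: float) -> float:
    """Reject values that cannot be a share of the dataset."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"split fraction out of range: {fraction}")
    return fraction
```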
The migration loop
This is the core of the product. The same loop powers migrate and refine; they differ only in the objective:
- MEET_THRESHOLDS (migrate): a new model regressed; repair until the contract's thresholds: pass. Candidates are ranked by how close they are to passing.
- MAXIMIZE (refine): the dataset changed, so the old thresholds are stale; the model is pinned, the loop chases the best primary metric, then proposes fresh thresholds.
Full algorithm, from run_migration in engine.py (primary = F1 or score/pass-rate; diff_size = changed lines vs. the original editable files; evaluate = apply files in a backup/restore sandbox, run your real workflow, score the split):
run_migration(W, M_target, generator G, objective O, seed):
    # ── Setup ──────────────────────────────────────────────
    tuning, holdout ← split(W.dataset, seed)   # deterministic, seeded
    baseline ← evaluate(M_current, current_files, tuning)
    naive_target ← evaluate(M_target, current_files, tuning)

    # ── Short-circuit: is a bare model swap enough? (migrate) ─
    if O = MEET_THRESHOLDS and passes(baseline, naive_target)
            and passes_on(holdout):
        return model_change_only               # just bump the model ID

    # ── Iterative repair ───────────────────────────────────
    original ← current editable files          # frozen, for diff sizing
    best ← naive_target ; best_files ← {} ; best_size ← 0
    width ← G.num_candidates ; widened ← false # adaptive search width

    for i in 1 .. W.migration.max_iterations:
        clusters ← cluster_failures(best.rows) # group similar errors
        context ← { clusters, failing & correct examples,
                    attempt history, editable + readonly files }
        candidates ← G.generate(context, escalated_width if widened else width)
        if candidates = ∅: break

        improved ← false
        for patch in candidates:
            size ← diff_size(patch, original)
            try:
                check_scope(patch)             # reject edits outside files.editable
                cand ← evaluate(M_target, apply(patch), tuning)
            except error:                      # patch broke the workflow
                log(failed) ; continue         # skip it; never abort the run

            better ← score(cand, O) > score(best, O)
            tie ← score(cand, O) = score(best, O) and size < best_size
            if better or tie:                  # tie → the smaller edit wins
                best, best_files, best_size ← cand, patch.files, size
                improved ← true

        if O = MEET_THRESHOLDS and passes(baseline, best)
                and passes_on(holdout, best_files):
            commit(best_files) ; return pass   # validated on never-tuned data

        if improved: widened ← false           # progress → cheap width
        else if not widened: widened ← true ; continue   # stall → widen once
        else: break                            # stalled at full width

    # ── Resolve outcome ────────────────────────────────────
    if O = MAXIMIZE:                           # refine
        validate best_files on holdout (no-regression vs. current)
        suggest fresh thresholds from holdout metrics
        return pass if best beats naive_target else no_change
    else:                                      # migrate, thresholds unmet
        return partial if best improved over naive_target else blocked
Holdout validation is what makes the result honest: the winning patch must perform on data the loop never optimized against.
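The deterministic, seeded split that holdout validation relies on might look like the sketch below. The function name split_rows and the round-to-nearest cut are assumptions for illustration, not the engine's actual code.

```python
import random


def split_rows(rows: list, tuning_fraction: float, seed: int):
    """Shuffle row indices with a private seeded RNG, then cut once."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)      # same seed, same order, every run
    cut = round(len(rows) * tuning_fraction)
    tuning = [rows[i] for i in order[:cut]]
    holdout = [rows[i] for i in order[cut:]]
    return tuning, holdout
```

Using a dedicated random.Random instance instead of the module-level generator keeps the split reproducible even when other code draws random numbers in the same process.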
Invariants the loop guarantees, regardless of what a generator proposes:
- Sandboxed trials: every candidate is applied via backup → run → restore; the working tree is written only on a committed pass.
- Crash isolation: a candidate that breaks the workflow (e.g. emits invalid YAML/JSON) is logged as a failed attempt and skipped; it can't abort the run.
- Minimal-change tie-breaker: on an exact score tie the smaller edit wins; against the no-op baseline (best_size = 0) a same-scoring patch is rejected, so the loop never makes a change that doesn't help.
- Stall-escalation: a stalled iteration widens the candidate pool once (to max(width × 3, 5)) before giving up, so the search is cheap when easy and broad when stuck.
- Holdout gate: nothing is committed until it clears a split it never tuned against.
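The tie-breaker and stall-escalation rules are small enough to state directly. The sketch below mirrors the better/tie logic and the max(width × 3, 5) widening from the algorithm listing; the function names are illustrative only.

```python
def accepts(cand_score: float, best_score: float,
            cand_size: int, best_size: int) -> bool:
    """A candidate wins on a strictly better score, or on an exact tie
    if its diff is strictly smaller than the current best's."""
    better = cand_score > best_score
    tie = cand_score == best_score and cand_size < best_size
    return better or tie


def escalated_width(width: int) -> int:
    """Candidate pool size used for the single widened retry after a stall."""
    return max(width * 3, 5)
```

Because the no-op baseline has best_size = 0 and no real patch has a negative diff size, a patch that only ties the baseline can never be accepted.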
Migration statuses
| Status | Meaning | Files committed? |
|---|---|---|
| model_change_only | Naive swap already passes thresholds. | No — just a model-ID change. |
| pass | Repair succeeded and holdout validated. | Yes. |
| partial | Improved over the naive swap but below thresholds. | No. |
| blocked | Could not recover quality within budget. | No. |
| no_change | refine: nothing beat the current prompt on the new dataset. | No. |
migrate exits non-zero on partial / blocked, so it gates CI naturally. A blocked migration still produces a full report and files an actionable issue.
Safety guarantees
These are enforced by the engine, not left to the patch generator:
- Edit-scope enforcement: any patch touching a file outside files.editable is rejected before it is applied.
- Sandboxed application: candidate edits are applied with originals backed up and always restored, so evaluation never leaves the repo dirty.
- Holdout gating: nothing is committed unless it passes thresholds on the holdout split.
- No auto-merge, no force-push: open-pr is a dry run by default; --create opens a PR/issue but never merges or pushes to the base branch.
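The first two guarantees can be sketched as a scope check plus a backup-and-restore context manager. This is a minimal illustration under assumptions: the names check_scope and sandboxed, the use of fnmatch for glob matching, and the patch shape (a mapping of relative path to new text) are all hypothetical.

```python
import fnmatch
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


def check_scope(patch_paths, editable_globs) -> None:
    """Raise if any patched path falls outside the editable globs."""
    outside = [p for p in patch_paths
               if not any(fnmatch.fnmatch(p, g) for g in editable_globs)]
    if outside:
        raise PermissionError(f"patch touches non-editable files: {outside}")


@contextmanager
def sandboxed(root: Path, files: dict):
    """Apply {relative path: new text}, yield, then always restore originals."""
    backup_dir = Path(tempfile.mkdtemp(prefix="sandbox-"))
    existed = {}
    try:
        for rel, text in files.items():
            target = root / rel
            was_there = target.exists()
            if was_there:
                backup = backup_dir / rel
                backup.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(target, backup)
            existed[rel] = was_there            # recorded only once backed up
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(text)
        yield
    finally:
        for rel, was_there in existed.items():
            target = root / rel
            if was_there:
                shutil.copy2(backup_dir / rel, target)   # put the original back
            elif target.exists():
                target.unlink()                 # remove files the patch created
        shutil.rmtree(backup_dir, ignore_errors=True)
```

Restoring in a finally clause means a crashing evaluation still leaves the working tree as it was found.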
Triggers & policy
The rest of the tool answers can we migrate; the policy layer answers when should we — the "Dependabot config" of the project. A trigger is only a candidate; whether it's worth it is decided by running your eval.
| Trigger | Tier | Behavior |
|---|---|---|
| deprecation | Forced | Within the warn window it always surfaces: a validated migration opens a PR; a blocked one files an issue. Urgency escalates as the retirement date nears. |
| cost | Opportunistic | Surfaces only if a candidate is sufficiently cheaper with quality within tolerance. |
| quality | Opportunistic | Surfaces only if a candidate measurably improves quality. |
| new_model | Opportunistic | Surfaces a newly released candidate that passes your eval. |
A .driftless/policy.yml configures per-trigger thresholds, candidate allow/deny globs (preview models denied by default), and an ignore snooze list. The plan command wires this together as a CI triage step.
Today discovery emits deprecation triggers from the bundled lifecycle data. Cost / quality / new-model discovery plug in once a richer model catalog (pricing, release dates) is wired up.
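To make the policy layer concrete, here is a hedged sketch of two checks it implies: glob-based candidate filtering and the deprecation warn window. Every name below (candidate_allowed, deprecation_surfaces, DEFAULT_DENY, warn_days) and the "*preview*" pattern are assumptions for illustration; the actual policy.yml schema is not shown in this document.

```python
from datetime import date
from fnmatch import fnmatch

# Assumption: preview models are denied via a glob like this one.
DEFAULT_DENY = ("*preview*",)


def candidate_allowed(model: str, allow=("*",), deny=DEFAULT_DENY) -> bool:
    """A candidate must match an allow glob and no deny glob."""
    allowed = any(fnmatch(model, pattern) for pattern in allow)
    denied = any(fnmatch(model, pattern) for pattern in deny)
    return allowed and not denied


def deprecation_surfaces(retires_on: date, today: date, warn_days: int) -> bool:
    """A deprecation trigger surfaces once retirement is within the warn window."""
    return (retires_on - today).days <= warn_days
```

Deny taking precedence over allow is also an assumption, chosen because it fails closed when the two lists overlap.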
CLI commands
| Command | Purpose |
|---|---|
| init | Scaffold a driftless.yml. |
| scan | Find probable LLM usage and at-risk models. |
| plan | Discover at-risk workflows and apply the migration policy (CI triage). |
| configure <workflow> | Turn a detected workflow into a migration-ready contract. |
| validate -w <w> | Check the contract parses and the harness runs. |
| audit-labels -w <w> | Find duplicate inputs with disagreeing gold labels (--fail for CI). |
| judge-check -w <w> | Measure judge↔human agreement on a calibration set (--enforce to gate). |
| compare -w <w> --to <model> | Baseline vs. target scorecard + threshold checks. |
| migrate -w <w> --to <model> | Repair + validate + produce migrated files. |
| refine -w <w> | Re-optimize the prompt for a changed dataset (model pinned). |
| report [-w <w>] | Render the latest migration report(s). |
| open-pr -w <w> | Open a PR (or issue) whose body is the evidence report: summary, scorecard, unified diffs, attempt log, holdout checks. |
Useful flags on migrate:
- --generator llm|none: the repair strategy (LLM-backed by default; none turns the loop into a dry analysis).
- --to <model>: the target model to migrate to (otherwise the contract's candidates are used).
- --strict-label-audit: block when duplicate/near-duplicate inputs disagree on gold labels (warns by default).
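The label audit behind audit-labels and --strict-label-audit amounts to grouping rows by input and flagging groups with more than one gold label. The sketch below covers exact duplicates only (not near-duplicates); the function name and the input field name are hypothetical, while category matches the label_field in the example contract.

```python
from collections import defaultdict


def disagreeing_duplicates(rows, input_field="input", label_field="category"):
    """Map each duplicated input to its conflicting gold labels."""
    labels_by_input = defaultdict(set)
    for row in rows:
        labels_by_input[row[input_field]].add(row[label_field])
    return {text: sorted(labels)
            for text, labels in labels_by_input.items()
            if len(labels) > 1}
```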
Contract schema reference
| Block | Key fields | Purpose |
|---|---|---|
| run | command, input_path, output_path, timeout_seconds | How to execute the real workflow. |
| model | current, target_candidates, env_var, config_file, config_path | Which model and how to override it. |
| files | editable[], readonly[] | Edit scope for the repair loop. |
| eval | labels_path, schema_path, label_field, id_field, split | How to score outputs. |
| thresholds | min_f1, min_precision, min_recall, max_schema_error_rate, max_cost_increase, max_latency_increase | What must hold to pass. |
| migration | allow_*_edits, max_iterations, holdout_required | What the engine may do. |
| repair | system_prompt(_path), guidance, user_template(_path) | Customize the LLM repair prompt. |
Evaluation metrics
compare and migrate load the output JSONL your command writes, align it with gold labels, validate each record against the JSON schema, and compute:
- Accuracy and macro precision / recall / F1 (per-class breakdown retained for failure clustering).
- Schema error rate: unparseable or schema-invalid records.
- Refusal rate: empty/null labels, a truthy refused field, or values listed in eval.refusal_values.
- Average latency: derived from run duration / record count.
- Total cost: only when the workflow emits a per-record cost_field. Token-based estimates are never fabricated.
By default they compare the current prompt on the current model against the current prompt on the target model. When the prompt was never source-optimized, that delta mixes prompt debt with model drift; see Measuring migration gains.
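For reference, accuracy and the macro-averaged metrics can be computed as in the sketch below. It assumes gold and predicted labels are already aligned by record, and it takes the class set from the gold labels only, so a refusal or unknown prediction counts as a miss for its gold class; whether driftless defines the class set this way is not stated.

```python
def macro_scores(gold: list, predicted: list) -> dict:
    """Accuracy plus macro precision/recall/F1 over the gold label classes."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    pairs = list(zip(gold, predicted))
    per_class = {}
    for cls in sorted(set(gold)):
        tp = sum(1 for g, p in pairs if g == cls and p == cls)
        fp = sum(1 for g, p in pairs if g != cls and p == cls)
        fn = sum(1 for g, p in pairs if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[cls] = {"precision": precision, "recall": recall, "f1": f1}
    n = len(per_class)
    return {
        "accuracy": sum(1 for g, p in pairs if g == p) / len(pairs),
        "precision": sum(c["precision"] for c in per_class.values()) / n,
        "recall": sum(c["recall"] for c in per_class.values()) / n,
        "f1": sum(c["f1"] for c in per_class.values()) / n,
        "per_class": per_class,   # retained for failure clustering
    }
```

Macro averaging weights every class equally, so a rare category that the target model starts missing moves the score as much as a common one.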
Measuring migration gains
compare and migrate score your current prompt on the current model (baseline) and the same prompt on the target (naive_target). That mirrors flipping a model ID in prod without touching the prompt — the right default.
But when the prompt was never tuned to its ceiling on the source model, the delta conflates two effects:
- Prompt debt — under-optimization that would improve on either model.
- Model-induced drift — quality lost because the target behaves differently.
A headline "+0.07 F1 after migration" can be mostly debt, not repair. The refine path (dataset change, model pinned) avoids this; model migration needs a control.
2×2 control
Optimize on the source model first, then switch. Example from the testbed (macro-F1, 290 labels, real API calls, gpt-3.5-turbo → gpt-4o-mini):
| Prompt | Source | Target |
|---|---|---|
| P0 — original hand prompt | 0.922 | 0.904 |
| Psrc* — optimized for source | 0.993 (A) | 0.921 (B) |
| Ptgt* — optimized for target | 1.000 (C) | 0.987 (D) |
- P0 → A = prompt debt on the source (not migration).
- A → B (−0.072) = true model-induced drift from a strong baseline.
- B → D (+0.066) = gain from re-tuning after the switch.
Report migration repair relative to (B) — target model + source-optimized prompt — not the raw hand prompt. If baseline is far below your bar, run refine on the source model first, then compare --to again. Offline simulator regressions in the testbed do not reproduce on real gpt-4o-mini; validate model-switch claims on live APIs.
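The deltas quoted above follow directly from the table. The short script below recomputes them from the six published scores, rounded to three decimals; the variable names are illustrative.

```python
# Macro-F1 values copied from the 2x2 table (plus the P0 row).
scores = {
    ("P0", "source"): 0.922, ("P0", "target"): 0.904,
    ("Psrc*", "source"): 0.993, ("Psrc*", "target"): 0.921,   # A, B
    ("Ptgt*", "source"): 1.000, ("Ptgt*", "target"): 0.987,   # C, D
}

a = scores[("Psrc*", "source")]
b = scores[("Psrc*", "target")]
d = scores[("Ptgt*", "target")]

naive_delta = round(scores[("P0", "target")] - scores[("P0", "source")], 3)
prompt_debt = round(a - scores[("P0", "source")], 3)   # P0 -> A
model_drift = round(b - a, 3)                          # A -> B
retune_gain = round(d - b, 3)                          # B -> D

print(naive_delta, prompt_debt, model_drift, retune_gain)
# prints: -0.018 0.071 -0.072 0.066
```

The naive swap shows a drop of only 0.018, while the controlled comparison shows 0.072 of drift, which is why the source-optimized prompt on the target model (B) is the more honest reference point.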
Full write-up and repro steps: docs/repair-and-generators.md § "Measuring migration gains honestly".
Run viewer
After migrate or refine, inspect the optimization trajectory in a local web UI — iteration metrics chart, failure-cluster trends, and a full attempt log (rationales, scores, diff sizes, accept/reject).
driftless view # opens http://localhost:8777/runs.html
driftless view -w support_classifier
The viewer reads .driftless/migrations/<workflow>.json from the current project. You can also load any result JSON via file picker or drag-and-drop. Static demo: runs.html → Load sample.
Repair & custom generators
The engine defines a PatchGenerator protocol; the repair strategy is swappable. Two ship out of the box:
- LLMPatchGenerator (default): asks an LLM to rewrite the editable files to fix the observed failure clusters. Provider-neutral, requests strict JSON, and varies temperature across candidates.
- NoOpPatchGenerator (--generator none): proposes nothing; the loop becomes a dry analysis, useful offline and for CI gating.
You can customize repair via the contract's repair: block — append domain guidance, fully replace the system prompt, or supply a user_template with {{placeholder}} substitution (placeholders include failure_clusters, failing_examples, editable_files, metrics, and target_model).
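The {{placeholder}} substitution could be implemented along these lines. The function name render_template is hypothetical, and raising on an unknown placeholder is an assumption; the real tool may leave unknown names in place or handle them differently.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, values: dict) -> str:
    """Replace each {{name}} with str(values[name])."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Assumption: unknown placeholders are an error, not silently kept.
            raise KeyError(f"unknown placeholder: {name}")
        return str(values[name])
    return PLACEHOLDER.sub(substitute, template)
```

For example, a user_template such as "Target: {{target_model}}" would render with the model ID the migration is evaluating.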
A generator only ever proposes — the engine owns acceptance. See The migration loop for the full algorithm (sandboxing, crash isolation, the minimal-change tie-breaker, stall-escalation, and holdout gating).
See the in-repo guide docs/repair-and-generators.md for writing your own deterministic, rule-based generator.
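As a rough idea of what a deterministic generator might look like, the sketch below appends a fixed rule to each editable file that lacks it. The names generate, num_candidates, and patch.files come from the algorithm listing; everything else (the Patch dataclass, the context layout with an editable_files mapping, AppendRuleGenerator) is a hypothetical shape, so check the in-repo guide for the real protocol.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Patch:
    files: dict            # {relative path: full new file text}
    rationale: str = ""


class PatchGenerator(Protocol):
    """Hypothetical protocol shape inferred from the algorithm listing."""
    num_candidates: int

    def generate(self, context: dict, width: int) -> list: ...


@dataclass
class AppendRuleGenerator:
    """Rule-based generator: propose adding one instruction line."""
    rule: str
    num_candidates: int = 1

    def generate(self, context: dict, width: int) -> list:
        patches = []
        for path, text in context["editable_files"].items():
            if self.rule in text:
                continue   # rule already present: nothing to propose
            new_text = text.rstrip("\n") + "\n" + self.rule + "\n"
            patches.append(Patch({path: new_text}, f"append rule to {path}"))
        return patches[:width]
```

The generator stays purely propositional: it returns candidate file contents and leaves scoring, scope checks, and acceptance to the engine.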
GitHub Action
A composite GitHub Action wraps the CLI so scans and migrations can run in CI. The same commands you run locally run unchanged in a workflow.
# .github/workflows/llm-model-scan.yml
name: llm-model-scan
on:
  schedule: [{ cron: "0 9 * * 1" }]
  workflow_dispatch: {}
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: driftless-dev/driftless@v0.2.11
        with:
          command: plan
A scheduled plan gates CI when a deprecated model needs attention; a manually-triggered migrate opens a PR (or an issue when blocked) with the evidence attached.