Documentation

driftless keeps prompts in sync when their dependencies change. Think Poetry (regenerate the lock when model or eval data moves) plus Dependabot (watch, test, open a PR with evidence).

Introduction

LLM providers deprecate and retire models on aggressive timelines. When a model is retired, every workflow that depends on it breaks or silently degrades. Swapping a model ID is rarely enough — the new model behaves differently, so prompts, format instructions, and few-shot examples often need to change too, and you need evidence that the new model still meets your quality bar before you ship.

driftless is CLI-first: the CLI is the engine, and the GitHub Action just invokes the same commands. You describe your model-dependent workflow once in a driftless.yml contract, and the tool orchestrates everything else.

Two analogies carry the design. Like Poetry, driftless.yml declares the dependencies (model + dataset) and the editable prompts act as the lockfile: when the dependencies drift, the lock is regenerated. Because LLM behavior is empirical, driftless scores candidates on your eval instead of running a resolver. Delivery is Dependabot-style: watch → test → PR with metrics.

Core idea: the customer owns the workflow; driftless orchestrates it. We shell out to your eval command with the model overridden — never reimplementing your pre/post-processing.

Installation

Once published, install with pipx (recommended) so the CLI lives in its own isolated environment:

pipx install driftless

For local development from source:

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

LLM-backed repair needs at least one provider key. Install the optional extra and export a key:

pip install -e ".[llm]"      # openai + anthropic SDKs
export OPENAI_API_KEY=sk-...   # or ANTHROPIC_API_KEY

Requires Python 3.10+. The provider is chosen automatically by which API key is present, so the same contract works across OpenAI and Anthropic.

Quickstart

From the root of the repo you want to migrate:

# 1. find probable LLM usage and at-risk models
driftless scan

# 2. scaffold a contract for a detected workflow
driftless configure support_classifier

# 3. sanity-check the contract parses and the harness runs
driftless validate -w support_classifier

# 4. score a candidate replacement on your real eval
driftless compare -w support_classifier --to gpt-4o-mini

# 5. repair + validate + produce migrated files
driftless migrate -w support_classifier --to gpt-4o-mini

# 6. render the evidence and open a PR (open-pr is a dry run unless --create is passed)
driftless report  -w support_classifier
driftless open-pr -w support_classifier --create

If you're starting from scratch rather than from a scan, driftless init writes a commented driftless.yml template you can fill in.

The workflow contract

A driftless.yml declares one or more model-dependent workflows. Each is a strict, typed schema — unknown keys are rejected, so typos surface as errors rather than silent misbehavior.

workflows:
  support_classifier:
    run:
      command: "python evals/run_eval.py"
      input_path: evals/inputs.jsonl
      output_path: evals/outputs.jsonl
      timeout_seconds: 600
    model:
      current: gpt-3.5-turbo
      target_candidates: [gpt-4o-mini]
      env_var: SUPPORT_CLASSIFIER_MODEL
      config_file: config/llm.yml
      config_path: model
    files:
      editable: [prompts/system.md, prompts/examples.yml]
      readonly: [src/**]
    eval:
      labels_path: evals/labels.jsonl
      schema_path: schemas/ticket.schema.json
      label_field: category
      split: { tuning: 60%, seed: 7 }
    thresholds:
      min_f1: 0.9
      max_schema_error_rate: 0.02
    migration:
      max_iterations: 4
      holdout_required: true
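
The strict-schema behavior described above can be pictured with a small sketch. It is not the driftless implementation: the Thresholds class, parse_thresholds, and the error message are hypothetical, and only two of the documented threshold fields are modeled.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical subset of the thresholds block, for illustration only.
    min_f1: Optional[float] = None
    max_schema_error_rate: Optional[float] = None


def parse_thresholds(raw: dict) -> Thresholds:
    """Build a Thresholds object, refusing any key the schema does not declare."""
    known = {f.name for f in fields(Thresholds)}
    unknown = sorted(set(raw) - known)
    if unknown:
        # A typo such as "min_fl" fails loudly instead of being ignored.
        raise ValueError(f"unknown key(s) in thresholds: {', '.join(unknown)}")
    return Thresholds(**raw)


ok = parse_thresholds({"min_f1": 0.9, "max_schema_error_rate": 0.02})
print(ok)

try:
    parse_thresholds({"min_fl": 0.9})   # typo: "fl" instead of "f1"
except ValueError as exc:
    print(exc)
```

Rejecting unknown keys up front means a misspelled threshold cannot silently weaken the quality gate.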

There are two override mechanisms, and a workflow may use both:

  • env_var — the harness sets this environment variable to run the workflow under a different model. This is the runtime override used during evaluation.
  • config_file + config_path — a dotted path into a JSON/YAML config file. Used when opening a PR, to write the new model ID into the repo.
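
As a rough illustration of the two mechanisms, the sketch below runs a command with an environment variable overridden and writes a value at a dotted path in an in-memory config. The helper names run_with_model and set_dotted are hypothetical, and the child process is a stand-in for the real eval command; this is not how driftless itself is implemented.

```python
import os
import subprocess
import sys


def run_with_model(command, env_var, model, timeout_seconds=600):
    """Run an eval command with the model override exported in its environment."""
    env = dict(os.environ)      # copy so the parent environment is untouched
    env[env_var] = model        # runtime override, e.g. SUPPORT_CLASSIFIER_MODEL
    return subprocess.run(
        command, env=env, capture_output=True, text=True,
        timeout=timeout_seconds, check=False,
    )


def set_dotted(config, dotted_path, value):
    """Write value at a dotted path such as "llm.model", creating parents as needed."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value


# Stand-in for the real eval command: a child process that echoes the model it sees.
child = [sys.executable, "-c", "import os; print(os.environ['SUPPORT_CLASSIFIER_MODEL'])"]
result = run_with_model(child, "SUPPORT_CLASSIFIER_MODEL", "gpt-4o-mini")
print(result.stdout.strip())

# Stand-in for the parsed contents of config/llm.yml.
config = {"model": "gpt-3.5-turbo", "temperature": 0}
set_dotted(config, "model", "gpt-4o-mini")
print(config)
```

The environment override leaves the repository untouched, which suits evaluation; the config write is the persistent change that belongs in a PR.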

Percentages are ergonomic: 60%, 60, and 0.6 all parse to 0.6.
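
One way such parsing could work is sketched below. The function name parse_fraction is hypothetical, and the rule that a bare number above 1 is read as a percentage is an assumption made for this example, not documented driftless behavior.

```python
def parse_fraction(value) -> float:
    """Normalize "60%", 60, or 0.6 to the fraction 0.6."""
    if isinstance(value, str):
        text = value.strip()
        if text.endswith("%"):
            number = float(text[:-1]) / 100.0   # explicit percent sign
        else:
            number = float(text)
    else:
        number = float(value)

    # Assumption: a bare number above 1 is a percentage (60 means 60%).
    if number > 1.0:
        number /= 100.0

    if not 0.0 <= number <= 1.0:
        raise ValueError(f"not a valid fraction or percentage: {value!r}")
    return number


for raw in ("60%", 60, 0.6):
    print(repr(raw), parse_fraction(raw))
```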

The migration loop

This is the core of the product. The same loop powers migrate and refine; they differ only in the objective:

  • MEET_THRESHOLDS (migrate) — a new model regressed; repair until the contract's thresholds: pass. Candidates are ranked by how close they are to passing.
  • MAXIMIZE (refine) — the dataset changed, so the old thresholds are stale; the model is pinned, the loop chases the best primary metric, then proposes fresh thresholds.

Full algorithm, from run_migration in engine.py (primary = F1 or score/pass-rate; diff_size = changed lines vs. the original editable files; evaluate = apply files in a backup/restore sandbox, run your real workflow, score the split):

run_migration(W, M_target, generator G, objective O, seed):

  # ── Setup ──────────────────────────────────────────────
  tuning, holdout ← split(W.dataset, seed)        # deterministic, seeded
  baseline     ← evaluate(M_current, current_files, tuning)
  naive_target ← evaluate(M_target,  current_files, tuning)

  # ── Short-circuit: is a bare model swap enough? (migrate) ─
  if O = MEET_THRESHOLDS and passes(baseline, naive_target)
     and passes_on(holdout, current_files):
       return model_change_only               # just bump the model ID

  # ── Iterative repair ───────────────────────────────────
  original ← current editable files            # frozen, for diff sizing
  best ← naive_target ; best_files ← {} ; best_size ← 0
  width ← G.num_candidates ; widened ← false   # adaptive search width
  escalated_width ← max(width × 3, 5)          # used once, after a stalled iteration

  for i in 1 .. W.migration.max_iterations:
      clusters   ← cluster_failures(best.rows)  # group similar errors
      context    ← { clusters, failing & correct examples,
                     attempt history, editable + readonly files }
      candidates ← G.generate(context, escalated_width if widened else width)
      if candidates = ∅: break

      improved ← false
      for patch in candidates:
          size ← diff_size(patch, original)
          try:
              check_scope(patch)               # reject edits outside files.editable
              cand ← evaluate(M_target, apply(patch), tuning)
          except error:                        # patch broke the workflow
              log(failed) ; continue           # skip it — never abort the run

          better ← score(cand, O) > score(best, O)
          tie    ← score(cand, O) = score(best, O) and size < best_size
          if better or tie:                    # tie → the smaller edit wins
              best, best_files, best_size ← cand, patch.files, size
              improved ← true

      if O = MEET_THRESHOLDS and passes(baseline, best)
         and passes_on(holdout, best_files):
           commit(best_files) ; return pass    # validated on never-tuned data

      if improved:         widened ← false             # progress → cheap width
      else if not widened: widened ← true ; continue   # stall → widen once
      else:                break                        # stalled at full width

  # ── Resolve outcome ────────────────────────────────────
  if O = MAXIMIZE:                              # refine
      validate best_files on holdout (no-regression vs. current)
      suggest fresh thresholds from holdout metrics
      return pass if best beats naive_target else no_change
  else:                                         # migrate, thresholds unmet
      return partial if best improved over naive_target else blocked

Holdout validation is what makes the result honest: the winning patch must perform on data the loop never optimized against.
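
A deterministic, seeded split of the kind the setup step relies on might look like the sketch below. The function split_ids and the choice to sort before shuffling are assumptions for illustration; the actual split logic lives in the engine.

```python
import random


def split_ids(ids, tuning_fraction, seed):
    """Split record ids into (tuning, holdout) in a reproducible way."""
    ordered = sorted(ids)                    # remove dependence on input order
    random.Random(seed).shuffle(ordered)     # private RNG, global state untouched
    cut = round(len(ordered) * tuning_fraction)
    return ordered[:cut], ordered[cut:]


ids = [f"ticket-{i}" for i in range(10)]
tuning, holdout = split_ids(ids, tuning_fraction=0.6, seed=7)
print(len(tuning), len(holdout))
```

Because the same seed always yields the same partition, a record tuned against in one run cannot drift into the holdout in the next.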

Invariants the loop guarantees, regardless of what a generator proposes:

  • Sandboxed trials — every candidate is applied via backup → run → restore; the working tree is written only on a committed pass.
  • Crash isolation — a candidate that breaks the workflow (e.g. emits invalid YAML/JSON) is logged as a failed attempt and skipped; it can't abort the run.
  • Minimal-change tie-breaker — on an exact score tie the smaller edit wins; against the no-op baseline (best_size = 0) a same-scoring patch is rejected, so the loop never makes a change that doesn't help.
  • Stall-escalation — a stalled iteration widens the candidate pool once (to max(width × 3, 5)) before giving up — cheap when easy, broad when stuck.
  • Holdout gate — nothing is committed until it clears a split it never tuned against.
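
The tie-breaker and the stall-escalation width can be restated as two small functions. This is a sketch that mirrors the pseudocode above; the function names are hypothetical.

```python
def should_accept(cand_score, cand_size, best_score, best_size):
    """Accept a strictly better score, or an exact tie with a smaller diff."""
    if cand_score > best_score:
        return True
    return cand_score == best_score and cand_size < best_size


def escalated_width(width):
    """Candidate count used once after a stalled iteration."""
    return max(width * 3, 5)


# Against the no-op baseline (best_size = 0) a same-scoring patch never wins,
# because no diff can be smaller than zero lines.
print(should_accept(0.90, 12, 0.90, 0))
# On an exact tie between two real patches, the smaller edit wins.
print(should_accept(0.90, 4, 0.90, 12))
print(escalated_width(1), escalated_width(3))
```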

Migration statuses

Status | Meaning | Files committed?
model_change_only | Naive swap already passes thresholds. | No, just a model-ID change.
pass | Repair succeeded and holdout validated. | Yes.
partial | Improved over the naive swap but below thresholds. | No.
blocked | Could not recover quality within budget. | No.
no_change | refine: nothing beat the current prompt on the new dataset. | No.

migrate exits non-zero on partial / blocked, so it gates CI naturally. A blocked migration still produces a full report and files an actionable issue.

Safety guarantees

These are enforced by the engine, not left to the patch generator:

  • Edit-scope enforcement — any patch touching a file outside files.editable is rejected before it is applied.
  • Sandboxed application — candidate edits are applied with originals backed up and always restored, so evaluation never leaves the repo dirty.
  • Holdout gating — nothing is committed unless it passes thresholds on the holdout split.
  • No auto-merge, no force-push: open-pr is a dry run by default; --create opens a PR/issue but never merges or pushes to the base branch.
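
The first two guarantees can be sketched in a few lines. The helpers check_scope and sandbox below are hypothetical stand-ins, fnmatch-style matching is an assumption about how editable globs are interpreted, and the sketch assumes every patched file already exists.

```python
import fnmatch
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path


def check_scope(patch_paths, editable_globs):
    """Reject a patch before it is applied if it touches a non-editable file."""
    for path in patch_paths:
        if not any(fnmatch.fnmatch(path, glob) for glob in editable_globs):
            raise PermissionError(f"patch touches non-editable file: {path}")


@contextmanager
def sandbox(root, new_contents):
    """Apply candidate file contents, then always restore the originals."""
    backup_dir = Path(tempfile.mkdtemp(prefix="sandbox-"))
    try:
        for rel in new_contents:                 # back everything up first
            saved = backup_dir / rel
            saved.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(root / rel, saved)
        for rel, text in new_contents.items():   # then apply the candidate
            (root / rel).write_text(text, encoding="utf-8")
        yield
    finally:
        for rel in new_contents:                 # restore even if the run raised
            saved = backup_dir / rel
            if saved.exists():
                shutil.copy2(saved, root / rel)
        shutil.rmtree(backup_dir, ignore_errors=True)


# Demo on a throwaway project directory.
project = Path(tempfile.mkdtemp(prefix="project-"))
(project / "prompts").mkdir()
prompt = project / "prompts" / "system.md"
prompt.write_text("old prompt\n", encoding="utf-8")

check_scope(["prompts/system.md"], ["prompts/system.md", "prompts/examples.yml"])
with sandbox(project, {"prompts/system.md": "new prompt\n"}):
    during = prompt.read_text(encoding="utf-8")
after = prompt.read_text(encoding="utf-8")
print(repr(during), repr(after))
```

Restoring in a finally block is what keeps a crashing candidate from leaving the working tree dirty.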

Triggers & policy

The rest of the tool answers "can we migrate?"; the policy layer answers "when should we?" and acts as the project's "Dependabot config". A trigger is only a candidate; whether acting on it is worthwhile is decided by running your eval.

Trigger | Tier | Behavior
deprecation | Forced | Within the warn window it always surfaces: a validated migration opens a PR; a blocked one files an issue. Urgency escalates as the retirement date nears.
cost | Opportunistic | Surfaces only if a candidate is sufficiently cheaper with quality within tolerance.
quality | Opportunistic | Surfaces only if a candidate measurably improves quality.
new_model | Opportunistic | Surfaces a newly released candidate that passes your eval.

A .driftless/policy.yml configures per-trigger thresholds, candidate allow/deny globs (preview models denied by default), and an ignore snooze list. The plan command wires this together as a CI triage step.

Today, discovery emits only deprecation triggers, sourced from the bundled lifecycle data. Cost, quality, and new-model discovery will plug in once a richer model catalog (pricing, release dates) is wired up.

CLI commands

Command | Purpose
init | Scaffold a driftless.yml.
scan | Find probable LLM usage and at-risk models.
plan | Discover at-risk workflows and apply the migration policy (CI triage).
configure <workflow> | Turn a detected workflow into a migration-ready contract.
validate -w <w> | Check the contract parses and the harness runs.
audit-labels -w <w> | Find duplicate inputs with disagreeing gold labels (--fail for CI).
judge-check -w <w> | Measure judge↔human agreement on a calibration set (--enforce to gate).
compare -w <w> --to <model> | Baseline vs. target scorecard + threshold checks.
migrate -w <w> --to <model> | Repair + validate + produce migrated files.
refine -w <w> | Re-optimize the prompt for a changed dataset (model pinned).
report [-w <w>] | Render the latest migration report(s).
open-pr -w <w> | Open a PR (or issue) whose body is the evidence report: summary, scorecard, unified diffs, attempt log, holdout checks.

Useful flags on migrate:

  • --generator llm|none — the repair strategy (LLM-backed by default; none turns the loop into a dry analysis).
  • --to <model> — the target model to migrate to (otherwise the contract's candidates are used).
  • --strict-label-audit — block when duplicate/near-duplicate inputs disagree on gold labels (warns by default).
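
The label audit is easy to picture in code. The sketch below groups records by a normalized input and reports inputs whose gold labels disagree. The field name "text", the whitespace and case normalization, and the function name are assumptions for illustration; near-duplicate detection is not attempted here.

```python
from collections import defaultdict


def find_label_conflicts(rows, input_field="text", label_field="category"):
    """Return {normalized input: sorted labels} for inputs with disagreeing labels."""
    labels_by_input = defaultdict(set)
    for row in rows:
        # Assumption: inputs that differ only in case or spacing are duplicates.
        key = " ".join(row[input_field].lower().split())
        labels_by_input[key].add(row[label_field])
    return {
        key: sorted(labels)
        for key, labels in labels_by_input.items()
        if len(labels) > 1
    }


rows = [
    {"text": "I was charged twice", "category": "billing"},
    {"text": "i was  charged twice", "category": "refund"},
    {"text": "App crashes on login", "category": "bug"},
]
print(find_label_conflicts(rows))
```

Conflicting gold labels put a ceiling on achievable F1, so surfacing them before a migration avoids chasing an unreachable threshold.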

Contract schema reference

Block | Key fields | Purpose
run | command, input_path, output_path, timeout_seconds | How to execute the real workflow.
model | current, target_candidates, env_var, config_file, config_path | Which model and how to override it.
files | editable[], readonly[] | Edit scope for the repair loop.
eval | labels_path, schema_path, label_field, id_field, split | How to score outputs.
thresholds | min_f1, min_precision, min_recall, max_schema_error_rate, max_cost_increase, max_latency_increase | What must hold to pass.
migration | allow_*_edits, max_iterations, holdout_required | What the engine may do.
repair | system_prompt(_path), guidance, user_template(_path) | Customize the LLM repair prompt.

Evaluation metrics

By default, compare and migrate score the current prompt on the current model against the same prompt on the target model. When the prompt was never optimized for the source model, that delta mixes prompt debt with model drift; see Measuring migration gains.

Both commands load the output JSONL your command writes, align it with gold labels, validate each record against the JSON schema, and compute:

  • Accuracy + macro precision / recall / F1 (per-class breakdown retained for failure clustering).
  • Schema error rate — unparseable or schema-invalid records.
  • Refusal rate — empty/null labels, a truthy refused field, or values listed in eval.refusal_values.
  • Average latency — derived from run duration / record count.
  • Total cost — only when the workflow emits a per-record cost_field. Token-based estimates are never fabricated.
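
For reference, macro-averaged precision, recall, and F1 can be computed as in the sketch below. It uses the standard definitions with a zero-division value of 0.0 and averages over every class seen in gold or predictions; whether driftless makes the same edge-case choices is an assumption.

```python
def macro_prf(gold, pred):
    """Return macro-averaged (precision, recall, f1) over all observed classes."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    classes = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["billing", "billing", "bug", "bug"]
pred = ["billing", "bug", "bug", "bug"]
precision, recall, f1 = macro_prf(gold, pred)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Macro averaging weights every class equally, so a rare category that regresses on the target model still moves the headline number.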

Measuring migration gains

compare and migrate score your current prompt on the current model (baseline) and the same prompt on the target (naive_target). That mirrors flipping a model ID in prod without touching the prompt — the right default.

But when the prompt was never tuned to its ceiling on the source model, the delta conflates two effects:

  • Prompt debt — under-optimization that would improve on either model.
  • Model-induced drift — quality lost because the target behaves differently.

A headline "+0.07 F1 after migration" can be mostly debt, not repair. The refine path (dataset change, model pinned) avoids this; model migration needs a control.

2×2 control

Optimize on the source model first, then switch. Example from the testbed (macro-F1, 290 labels, real API calls, gpt-3.5-turbo → gpt-4o-mini):

Prompt | Source | Target
P0 (original hand prompt) | 0.922 | 0.904
Psrc* (optimized for source) | 0.993 (A) | 0.921 (B)
Ptgt* (optimized for target) | 1.000 (C) | 0.987 (D)

  • P0 → A = prompt debt on the source (not migration).
  • A → B (−0.072) = true model-induced drift from a strong baseline.
  • B → D (+0.066) = gain from re-tuning after the switch.
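
The decomposition is plain arithmetic on the table above. The snippet below recomputes the three deltas from the reported scores; the variable names are chosen for this example only.

```python
# Macro-F1 scores copied from the 2×2 table (plus the P0 row).
scores = {
    "P0":    {"source": 0.922, "target": 0.904},
    "Psrc*": {"source": 0.993, "target": 0.921},   # A, B
    "Ptgt*": {"source": 1.000, "target": 0.987},   # C, D
}

a = scores["Psrc*"]["source"]
b = scores["Psrc*"]["target"]
d = scores["Ptgt*"]["target"]

prompt_debt = round(a - scores["P0"]["source"], 3)   # P0 → A, on the source model
model_drift = round(b - a, 3)                        # A → B, same prompt, new model
retune_gain = round(d - b, 3)                        # B → D, re-tuned on the target

print(prompt_debt, model_drift, retune_gain)
```

Roughly 0.07 F1 of headroom existed on the source model before any migration, which is why a raw P0 comparison overstates what repair achieved.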

Report migration repair relative to (B) — target model + source-optimized prompt — not the raw hand prompt. If baseline is far below your bar, run refine on the source model first, then compare --to again. Offline simulator regressions in the testbed do not reproduce on real gpt-4o-mini; validate model-switch claims on live APIs.

Full write-up and repro steps: docs/repair-and-generators.md § "Measuring migration gains honestly".

Run viewer

After migrate or refine, inspect the optimization trajectory in a local web UI — iteration metrics chart, failure-cluster trends, and a full attempt log (rationales, scores, diff sizes, accept/reject).

driftless view                    # opens http://localhost:8777/runs.html
driftless view -w support_classifier

The viewer reads .driftless/migrations/<workflow>.json from the current project. You can also load any result JSON via file picker or drag-and-drop. For a static demo, open runs.html and choose Load sample.

Repair & custom generators

The engine defines a PatchGenerator protocol; the repair strategy is swappable. Two ship out of the box:

  • LLMPatchGenerator (default) — asks an LLM to rewrite the editable files to fix the observed failure clusters. Provider-neutral, requests strict JSON, and varies temperature across candidates.
  • NoOpPatchGenerator (--generator none) — proposes nothing; the loop becomes a dry analysis, useful offline and for CI gating.
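
To make the idea of a swappable strategy concrete, here is a hypothetical shape for the protocol and a tiny rule-based generator. Everything in it is an assumption for illustration: the Patch dataclass, the generate(context, width) signature (inferred from the loop pseudocode), the "editable_files" context key, and the reminder text. The real protocol is the one defined by the engine and described in docs/repair-and-generators.md.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Patch:
    files: dict            # hypothetical: path -> full new file contents
    rationale: str = ""


class PatchGenerator(Protocol):
    """Hypothetical protocol shape; the engine's real definition may differ."""
    num_candidates: int

    def generate(self, context: dict, width: int) -> list: ...


class JsonOnlyRuleGenerator:
    """Deterministic rule: append a strict-output reminder to the system prompt."""
    num_candidates = 1
    REMINDER = "Respond with a single JSON object and nothing else."

    def __init__(self, prompt_path="prompts/system.md"):
        self.prompt_path = prompt_path

    def generate(self, context, width):
        prompt = context["editable_files"][self.prompt_path]
        if self.REMINDER in prompt:
            return []                      # nothing left to propose
        patched = prompt.rstrip("\n") + "\n\n" + self.REMINDER + "\n"
        patch = Patch(files={self.prompt_path: patched},
                      rationale="add strict JSON output reminder")
        return [patch][:width]


context = {"editable_files": {"prompts/system.md": "Classify the ticket.\n"}}
generator = JsonOnlyRuleGenerator()
patches = generator.generate(context, width=1)
print(patches[0].files["prompts/system.md"])
```

A generator like this only proposes; whether the patch is kept is still decided by the engine's scoring and holdout checks.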

You can customize repair via the contract's repair: block — append domain guidance, fully replace the system prompt, or supply a user_template with {{placeholder}} substitution (placeholders include failure_clusters, failing_examples, editable_files, metrics, and target_model).
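
The {{placeholder}} substitution can be approximated with a regular expression, as in this sketch. The render_template helper, its tolerance for spaces inside the braces, and the choice to raise on an unknown placeholder are assumptions, not documented driftless behavior.

```python
import re

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, values):
    """Replace each {{name}} with values[name]; fail on unknown names."""
    def replace(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"unknown placeholder: {name}")
        return str(values[name])

    return _PLACEHOLDER.sub(replace, template)


template = "Target model: {{target_model}}\nFailures:\n{{failure_clusters}}"
rendered = render_template(template, {
    "target_model": "gpt-4o-mini",
    "failure_clusters": "- billing tickets labelled as refund",
})
print(rendered)
```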

A generator only ever proposes — the engine owns acceptance. See The migration loop for the full algorithm (sandboxing, crash isolation, the minimal-change tie-breaker, stall-escalation, and holdout gating).

See the in-repo guide docs/repair-and-generators.md for writing your own deterministic, rule-based generator.

GitHub Action

A composite GitHub Action wraps the CLI so scans and migrations can run in CI. The same commands you run locally run unchanged in a workflow.

# .github/workflows/llm-model-scan.yml
name: llm-model-scan
on:
  schedule: [{ cron: "0 9 * * 1" }]
  workflow_dispatch: {}

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: driftless-dev/driftless@v0.2.11
        with:
          command: plan

A scheduled plan gates CI when a deprecated model needs attention; a manually triggered migrate opens a PR (or an issue when blocked) with the evidence attached.
