Poetry-style lock regen · Dependabot-style PRs

Your prompts have dependencies. Keep them in sync.

A prompt is pinned to its model and its eval data, much as pyproject.toml declares dependencies and poetry.lock pins the versions that actually work. When either dependency moves, the prompt goes stale; driftless regenerates it, validates it on holdout data, and opens a PR with the evidence.

Validates on held-out data · Only edits files you allow · Provider-neutral
~/support-classifier-svc

# 1. find at-risk models in your repo
$ driftless scan
! gpt-3.5-turbo   deprecated   retires in 41d   gpt-4o-mini

# 2. test the replacement on your real eval
$ driftless compare -w support_classifier --to gpt-4o-mini
baseline   F1 1.00   schema-err 0%
target     F1 0.00   schema-err 100%   REGRESSION

# 3. auto-repair the prompt, then validate on holdout
$ driftless migrate -w support_classifier --to gpt-4o-mini
iter 1    +raw-JSON rule   schema-err 100% → 0%
iter 2    +refund rule     F1 0.78 → 1.00
holdout   ✓ passed
status    PASS

$ driftless open-pr -w support_classifier --create
opened PR #128 — migrate gpt-3.5-turbo → gpt-4o-mini

The core idea

A prompt has two dependencies — and both go stale.

One drifts from the outside, one from the inside. Either way, the same automatable job keeps your prompt in sync.

External drift

The model

Providers deprecate and retire models on their own schedule. A deadline you don't control forces a migration before things break.

trigger: deprecation

Internal drift

The eval dataset

Your team adds new labeled data — say, from customer feedback — and what counts as "correct" shifts. The prompt needs re-tuning to match.

trigger: data_change

Either way it's the same job: re-tune the prompt, validate on held-out data, open a PR with the evidence.

Two familiar patterns

Poetry explains why. Dependabot explains how.

driftless is both: lock regeneration when dependencies change, delivered as an automated, evidence-backed PR.

Poetry

The lockfile problem

You can bump versions in pyproject.toml, but until you run poetry lock the lockfile is stale: what gets installed no longer matches what you declared. Same here: swap the model or update the labels and the prompt is out of sync with what "correct" means.

driftless.yml = declared deps · prompts/ = lock

Dependabot

The automation problem

Poetry doesn't auto-regenerate the lock for you — you run it yourself. Dependabot watches upstream, tests the change, and opens a PR. driftless watches model lifecycle and dataset drift, runs your eval, repairs the prompt, and opens a PR with metrics attached.

trigger → migrate / refine → open-pr

Unlike Poetry's resolver, prompt "locking" is empirical — behavior isn't declared, it's measured on your eval. That's the loop driftless automates.
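
One way to picture the lockfile analogy is to fingerprint both dependencies (the model ID and the eval labels) and compare the result with the fingerprint recorded when the prompt was last validated. The sketch below is an illustration only, not a description of how driftless detects drift; `dependency_fingerprint` and `is_stale` are hypothetical names.

```python
import hashlib


def dependency_fingerprint(model_id: str, labels: bytes) -> str:
    """Hash the two things a prompt depends on: the model and the eval labels."""
    digest = hashlib.sha256()
    digest.update(model_id.encode("utf-8"))
    digest.update(b"\0")  # separator so ("ab", b"c") and ("a", b"bc") differ
    digest.update(labels)
    return digest.hexdigest()


def is_stale(locked_fingerprint: str, model_id: str, labels: bytes) -> bool:
    """True when either dependency changed since the prompt was last validated."""
    return dependency_fingerprint(model_id, labels) != locked_fingerprint


# Hypothetical label data, recorded at the time the prompt last passed.
labels_v1 = b'{"id": 1, "label": "refund"}\n'
locked = dependency_fingerprint("gpt-3.5-turbo", labels_v1)

print(is_stale(locked, "gpt-3.5-turbo", labels_v1))  # nothing moved
print(is_stale(locked, "gpt-4o-mini", labels_v1))    # external drift: model swapped
print(is_stale(locked, "gpt-3.5-turbo", labels_v1 + b'{"id": 2, "label": "billing"}\n'))  # internal drift
```

A stale fingerprint only says that re-validation is due. Whether the prompt still works is an empirical question that the eval has to answer.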

The problem

A model swap is never just a model swap.

When a provider retires a model, every workflow that depends on it breaks or silently degrades. Changing the model ID is the easy part — the new model behaves differently, so prompts, format instructions, and few-shot examples often need to change too. And before you ship, you need proof the replacement still meets your quality bar.

  • Deprecation deadlines arrive faster than teams can react.
  • Silent quality regressions slip into production after a swap.
  • Prompt rewrites are manual, slow, and easy to get wrong.
  • No evidence means risky merges and nervous reviewers.

How it works

You describe your workflow once. We orchestrate the migration.

Declare how to run your eval, how to override the model, which files may be edited, and what quality bar must hold — in a single driftless.yml. The tool drives the rest.

01

scan

Find probable LLM usage and flag at-risk models.

02

configure

Scaffold a migration-ready contract from the scan.

03

compare

Score baseline vs. target on your real eval.

04

migrate

Repair prompts/configs, iterate, validate on holdout.

05

report

Produce an evidence-backed markdown scorecard.

06

open-pr

Open a PR — or an issue when blocked.

The customer owns the workflow. The tool orchestrates it. We shell out to your eval command with the model overridden — we never reimplement your pre/post-processing.
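
"Shelling out with the model overridden" can be sketched roughly as follows, assuming the eval script reads its model ID from an environment variable. The helper name `run_eval` is hypothetical, and the child process here is a stand-in for a real eval command such as `python evals/run_eval.py`.

```python
import os
import subprocess
import sys


def run_eval(command: list[str], env_var: str, model_id: str) -> subprocess.CompletedProcess:
    """Run the eval command unchanged, overriding only the model environment variable."""
    env = dict(os.environ)   # copy so the parent environment is untouched
    env[env_var] = model_id
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


# Stand-in eval: a child process that just echoes the model it was told to use.
stand_in = [sys.executable, "-c", "import os; print(os.environ['SUPPORT_CLASSIFIER_MODEL'])"]
result = run_eval(stand_in, "SUPPORT_CLASSIFIER_MODEL", "gpt-4o-mini")
print(result.returncode, result.stdout.strip())
```

Because only the environment changes, whatever pre- and post-processing the eval script does runs exactly as it would in production.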

Why teams trust it

Built so an automated change is safe to merge.

Safety by construction

The engine only edits files you declare editable, applies edits in a sandbox, and never commits a change that fails on a holdout split it never tuned on.
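
The editable-files guarantee amounts to an allowlist check before any edit is applied. A minimal sketch, assuming paths are repo-relative and compared exactly; `disallowed_edits` is a hypothetical helper, not part of the driftless API.

```python
from pathlib import PurePosixPath


def disallowed_edits(proposed: list[str], editable: list[str]) -> list[str]:
    """Return every proposed path that is not on the editable allowlist."""
    allowed = {PurePosixPath(p) for p in editable}
    # PurePosixPath collapses "./" and duplicate slashes but keeps "..",
    # so traversal-style paths never match an allowlisted entry.
    return [p for p in proposed if PurePosixPath(p) not in allowed]


editable = ["prompts/system.md", "prompts/examples.yml"]
proposed = ["prompts/system.md", "src/classifier.py"]
print(disallowed_edits(proposed, editable))  # → ['src/classifier.py']
```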

Evaluated on your real workflow

No synthetic benchmarks. Replacements are scored through your actual eval command, with F1, schema-error rate, refusals, latency, and cost.
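
For readers who want the two headline metrics spelled out, here is a small sketch. It assumes macro-averaged F1 over string labels and a per-output pass/fail flag for schema validation; the page does not say which F1 variant driftless reports, so treat that choice as an assumption.

```python
def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


def schema_error_rate(valid_flags: list[bool]) -> float:
    """Fraction of outputs that failed schema validation."""
    return valid_flags.count(False) / len(valid_flags) if valid_flags else 0.0


# Hypothetical ticket labels, for illustration only.
gold = ["refund", "billing", "refund", "other"]
pred = ["refund", "billing", "other", "other"]
print(round(macro_f1(gold, pred), 3))                # → 0.778
print(schema_error_rate([True, True, False, True]))  # → 0.25
```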

Evidence-backed PRs

Every migration ships a scorecard: current vs. naive vs. migrated, the exact edits made, unmet thresholds, and holdout validation — straight into the PR body.

Pluggable repair

An LLM-backed generator fixes prompts by default — provider-neutral across OpenAI and Anthropic. Customize the prompt, or plug in your own deterministic generator.

Policy-driven triggers

Decide when to migrate — deprecation deadlines, cost savings, quality gains, or new models — with per-trigger thresholds and a snooze list.

CLI-first, GitHub-native

The CLI is the engine — testable locally before any CI is involved. A composite GitHub Action runs the same commands on a schedule.

Evidence, not vibes

Every migration is a reviewable scorecard.

driftless produces a markdown report that becomes the PR body. Reviewers see exactly what changed, why, and the measured impact on held-out data — so approving an automated migration feels like reviewing a teammate's PR.

  • Current / naive-swap / migrated metrics side by side
  • The exact prompt/config edits the loop applied
  • Holdout validation — proof it generalizes
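  • Blocked migrations still file an actionable issue

migration · support_classifier · PASS

┌────────────┬───────────┬────────────┬──────────┐
│ Metric     │ Current   │ Naive swap │ Migrated │
├────────────┼───────────┼────────────┼──────────┤
│ model      │ 3.5-turbo │ 4o-mini    │ 4o-mini  │
│ F1         │ 1.00      │ 0.00       │ 1.00     │
│ schema err │ 0%        │ 100%       │ 0%       │
│ refusals   │ 0%        │ 0%         │ 0%       │
└────────────┴───────────┴────────────┴──────────┘
holdout ✓ pass

In practice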

One contract. The whole lifecycle.

workflows:
  support_classifier:
    run:
      command: "python evals/run_eval.py"
      input_path: evals/inputs.jsonl
      output_path: evals/outputs.jsonl
    model:
      current: gpt-3.5-turbo
      env_var: SUPPORT_CLASSIFIER_MODEL
    files:
      editable: [prompts/system.md, prompts/examples.yml]
    eval:
      labels_path: evals/labels.jsonl
      schema_path: schemas/ticket.schema.json
    thresholds:
      min_f1: 0.9
      max_schema_error_rate: 0.02
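
The thresholds block reads as a set of `min_*` and `max_*` bounds on measured metrics. The sketch below shows one way such a gate could be evaluated. The `min_`/`max_` prefix convention is inferred from the two keys in the config above, and `unmet_thresholds` is a hypothetical helper; the metric values are the naive-swap and migrated figures from the scorecard.

```python
def unmet_thresholds(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """List each threshold the measured metrics fail; an empty list means the bar holds."""
    failures = []
    for key, limit in thresholds.items():
        bound, _, metric = key.partition("_")  # "min_f1" -> ("min", "_", "f1")
        value = metrics[metric]
        if bound == "min" and value < limit:
            failures.append(f"{metric} {value} < {limit}")
        elif bound == "max" and value > limit:
            failures.append(f"{metric} {value} > {limit}")
    return failures


thresholds = {"min_f1": 0.9, "max_schema_error_rate": 0.02}
naive_swap = {"f1": 0.0, "schema_error_rate": 1.0}
migrated = {"f1": 1.0, "schema_error_rate": 0.0}

print(unmet_thresholds(naive_swap, thresholds))  # → ['f1 0.0 < 0.9', 'schema_error_rate 1.0 > 0.02']
print(unmet_thresholds(migrated, thresholds))    # → []
```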
# discover at-risk workflows + apply your policy — gates CI
$ driftless plan

  Migration plan
  ┌────────────────────┬──────────────────────┬─────────┬───────────┐
  │ Workflow           │ Migrate              │ Retires │ Decision  │
  ├────────────────────┼──────────────────────┼─────────┼───────────┤
  │ support_classifier │ 3.5-turbo → 4o-mini  │ 41d     │ PR (high) │
  └────────────────────┴──────────────────────┴─────────┴───────────┘

  # exits non-zero when action is needed

# .github/workflows/llm-model-scan.yml
on:
  schedule: [{ cron: "0 9 * * 1" }]   # every Monday

jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: driftless/action@v1
        with:
          command: plan

Design principles

Opinions that make the output trustworthy.

01

The contract is the spine

A single typed schema drives every command. Typos surface as errors, not silent misbehavior.

02

You own the workflow

We orchestrate your eval with the model overridden — never reimplementing your logic.

03

Failure is a first-class output

Every run emits pass / partial / blocked. A blocked migration still produces an actionable artifact.

04

Holdout or it didn't happen

Nothing is committed unless it passes on data the repair loop never optimized against.
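
A holdout split only helps if it stays fixed across runs. One common way to get that is to assign each example to a split by hashing a stable ID, as in the sketch below. The 20% fraction, the ID format, and the `split_of` helper are assumptions for illustration; the page does not describe how driftless builds its holdout.

```python
import hashlib


def split_of(example_id: str, holdout_fraction: float = 0.2) -> str:
    """Assign an example to 'tune' or 'holdout' from a stable hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "tune"


ids = [f"ticket-{n}" for n in range(1000)]
tune = [i for i in ids if split_of(i) == "tune"]
holdout = [i for i in ids if split_of(i) == "holdout"]

# A repair loop would iterate on `tune` only; `holdout` is scored once, at the end.
print(len(tune), len(holdout))
```

Hashing the ID rather than shuffling means newly added examples never move existing ones between splits, so earlier holdout results stay comparable.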

Get started

Stop fearing the next deprecation email.

Install the CLI, point it at your repo, and let evidence-backed migrations come to you as pull requests.

pipx install driftless
driftless init
driftless scan