Build suites from JSONL prompt-response logs instead of inventing cases by hand.
Automatic eval suites from logs you already have
redline
Local-first prompt regression diffs for AI engineers. Point redline at real prompt-response logs, change your prompt, then catch broken JSON, missing URLs, lost tables, refusals, and other structural regressions before they ship.
python -m pip install redline-ai
redline demo --public --compact
First-run proof
One command, ten regressions, no account setup.
The public demo is synthetic and safe to share, but it behaves like a real prompt change: the candidate gets shorter and silently drops required structure, dates, owners, URLs, and refusal behavior.
$ redline demo --public --compact
redline public dogfood: cases=10 regression=10 changed=0 missing=0 neutral=0
Confidence: HIGH | fix blocking cases before shipping
Scope: structural checks only; review factual correctness, tone, hallucinations, and subtle reasoning separately
REGRESSION case_001: candidate lost valid JSON format
REGRESSION case_002: candidate lost markdown table structure
REGRESSION case_006: candidate newly refuses
REGRESSION case_010: candidate lost bullet list structure
Next: inspect the HTML report, mark intentional changes, accept reviewed outputs.
Why it exists
You already have the eval data. redline turns it into a gate.
No cloud account is required for the core loop; CI-friendly checks run locally, cost nothing, and do not depend on hosted judges.
Write JSON, Markdown, JUnit, and self-contained HTML reports for local review, CI artifacts, and GitHub summaries.
Mark intentional changes, accept new baselines, and keep the suite moving with your prompt.
Core loop
From logs to a shipping gate in four commands.
-
01
Collect behavior
redline watch --log logs/prompts.jsonl -
02
Generate the suite
redline suite .redline/logs/prompts.jsonl --out redline-suite.json -
03
Evaluate a prompt change
redline eval --prompt prompts/v2.txt -
04
Review and accept
redline mark ...redline accept ...
Trust model
Calibrated signals, not magic approval.
What redline catches
JSON validity, missing keys, URLs, numbers, entities, tables, lists, code blocks, empty outputs, refusals, and obvious policy wording flips are surfaced with reproducible reasons.
What redline does not pretend
Tone, factual accuracy, hallucination risk, and subtle reasoning quality need pinned requirements or an optional judge. redline tells you when a run has no structural blockers; it does not pretend that means the answer is perfect.
How to close the gap
Pin edge cases with requirements, add product-specific judge rubrics for ambiguous changes, then accept reviewed candidate outputs so the baseline evolves deliberately.
Bring your stack
AI-agnostic first, adapters when you need them.
Replay any app
Use a command that reads stdin and prints stdout, or copy a runner for OpenAI, Anthropic, LiteLLM, HTTP APIs, LangChain, or LlamaIndex.
redline init --runner stdio --copy-runner
Local dashboard
Publish self-contained HTML reports as CI artifacts or browse them locally.
redline dashboard --open
Release confidence
Not just a demo. A certified local product loop.
redline ships with repeatable checks for the paths that matter: clean install, external project CI behavior, report artifacts, history trends, dashboard output, and release packaging.
The release gate builds a wheel, installs it in a fresh virtualenv, and runs the first-user command path from outside the repo.
bash scripts/release_check.sh
A temporary external project verifies strict doctor, strict validate, eval failure, GitHub annotations, history, and dashboard artifacts.
bash scripts/action_smoke.sh
History shows whether blocking regressions are getting better or worse; summary and doctor surface thin coverage and semantic gaps.
redline history --fail-on worse
Build scripts produce wheel and source distributions and run package metadata checks before anything is uploaded.
bash scripts/certify_release.sh
Public alpha ready path
Clone it, run the proof, then point it at one real prompt log.
The fastest way to understand redline is to make it catch one regression in your own AI workflow.
Open source surface