release-gate runs evals, validates agent traces, checks cost budgets, and generates an evidence pack — then gives you one number (0–100) and one decision.
How it works
release-gate slots into your CI/CD pipeline. No backend, no dashboard, no sign-up.
Declare your model, expected usage, daily budget cap, kill switch, eval cases, and trace policies. Takes about 5 minutes — or use release-gate init for an interactive wizard.
One command evaluates safety, cost, access control, fallback, eval quality, and observability. You get a 0–100 readiness score and a PROMOTE / HOLD / BLOCK decision — not just pass/fail YAML warnings.
Compare a baseline report against the candidate. Any dimension that drops more than 10 points — especially safety, fallback, or access control — is flagged as a regression and blocks the release automatically.
One command produces three audit artefacts — a machine-readable JSON report, an executive Markdown summary, and a full HTML dashboard — ready for compliance, security review, or stakeholder sign-off.
Live demo
Run the interactive demo locally — no config file needed. Or explore individual commands below.
pip install release-gate && release-gate demo
The demo needs no governance.yaml and works immediately after install.
To score an example configuration instead, run pip install release-gate, then:
release-gate score examples/governance-safe-pass.yaml
Features
Score, compare, validate traces, resolve live pricing — one tool, one decision before every deploy.
Six weighted dimensions (safety 30%, cost 20%, access control 20%, fallback 15%, eval quality 10%, observability 5%) collapse into one number and one decision: PROMOTE, HOLD, or BLOCK.
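The weighting above can be sketched as a plain weighted sum. This is a minimal illustration, not release-gate's actual code: the dictionary keys, the function name `readiness_score`, and the example per-dimension scores are all assumptions. Only the six weights come from the text.

```python
# Weights as listed above; they sum to 100%.
WEIGHTS = {
    "safety": 0.30,
    "cost": 0.20,
    "access_control": 0.20,
    "fallback": 0.15,
    "eval_quality": 0.10,
    "observability": 0.05,
}


def readiness_score(dimension_scores: dict[str, float]) -> float:
    """Collapse per-dimension scores (each 0-100) into one weighted 0-100 score."""
    missing = WEIGHTS.keys() - dimension_scores.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return round(sum(WEIGHTS[d] * dimension_scores[d] for d in WEIGHTS), 2)


# Hypothetical per-dimension scores, for illustration only.
example = {
    "safety": 90, "cost": 80, "access_control": 100,
    "fallback": 60, "eval_quality": 70, "observability": 50,
}
print(readiness_score(example))  # 81.5
```

Because safety carries 30% of the weight, a 10-point drop in safety moves the total by 3 points, while the same drop in observability moves it by only 0.5. The score thresholds that separate PROMOTE, HOLD, and BLOCK are not stated here, so the sketch stops at the number.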
Compare any two readiness report snapshots. Drops >10 points in any dimension — especially safety, fallback, or access control — automatically BLOCK the release. Ship with a diff, not a guess.
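The regression rule above (a drop of more than 10 points in any dimension) could look roughly like this. The flat `{dimension: score}` shape, the function name `find_regressions`, and the handling of missing dimensions are assumptions made for the sketch; the real report format may differ.

```python
REGRESSION_THRESHOLD = 10  # points, per the rule described above


def find_regressions(baseline: dict[str, float],
                     candidate: dict[str, float]) -> dict[str, float]:
    """Return {dimension: drop} for each dimension that fell by more than the threshold."""
    regressions = {}
    for dim, base in baseline.items():
        # Assumption: a dimension missing from the candidate counts as a score of 0.
        drop = base - candidate.get(dim, 0.0)
        if drop > REGRESSION_THRESHOLD:
            regressions[dim] = drop
    return regressions


# Hypothetical snapshots, for illustration only.
baseline = {"safety": 95, "fallback": 80, "cost": 70}
candidate = {"safety": 80, "fallback": 75, "cost": 72}

regressions = find_regressions(baseline, candidate)
print("BLOCK" if regressions else "no regression", regressions)
# BLOCK {'safety': 15}
```

A drop of exactly 10 points is not flagged in this sketch, since the text says "more than 10".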
Declare behavior test cases in YAML: refuse_or_mask, contains_keywords, valid_json, no_tool_calls. Runs in static mode (CI-safe, no LLM key needed) or live mode with any agent callable.
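Two of the eval types named above, valid_json and contains_keywords, are simple enough to sketch as static checks on an agent's output string. The function names and the case-insensitive matching are assumptions for illustration. The exact semantics of refuse_or_mask and no_tool_calls are not described here, so they are left out.

```python
import json


def check_valid_json(output: str) -> bool:
    """Pass if the agent output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def check_contains_keywords(output: str, keywords: list[str]) -> bool:
    """Pass if every keyword appears in the output.

    Assumption: matching is case-insensitive and all keywords are required.
    """
    lowered = output.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


print(check_valid_json('{"status": "ok"}'))            # True
print(check_valid_json("status: ok"))                  # False
print(check_contains_keywords("I cannot share that.", ["cannot"]))  # True
```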
Feed your agent’s execution trace (JSON or JSONL). Detects forbidden tool calls, allowed-list violations, retry storms, token budget overruns, and tool-call loops before they reach production.
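As a rough picture of what trace validation involves, the sketch below scans a JSONL trace for forbidden tool calls and for the same tool being called too many times in a row. The event schema (a `tool` field per line), the function name `validate_trace`, and the `max_repeat` default are hypothetical and not taken from release-gate.

```python
import json


def validate_trace(jsonl: str, forbidden: set[str], max_repeat: int = 3) -> list[str]:
    """Return human-readable violations found in a JSONL trace."""
    violations = []
    previous, streak = None, 0
    for line_no, line in enumerate(jsonl.splitlines(), start=1):
        if not line.strip():
            continue
        event = json.loads(line)
        tool = event.get("tool")  # assumed field name
        if tool is None:
            previous, streak = None, 0  # a non-tool event breaks any streak
            continue
        if tool in forbidden:
            violations.append(f"line {line_no}: forbidden tool '{tool}'")
        streak = streak + 1 if tool == previous else 1
        previous = tool
        if streak == max_repeat + 1:  # report each loop once
            violations.append(
                f"line {line_no}: '{tool}' called more than {max_repeat} times in a row"
            )
    return violations


# Hypothetical trace: four consecutive searches, then a forbidden call.
trace = "\n".join(
    ['{"tool": "search"}'] * 4 + ['{"tool": "delete_db"}']
)
for violation in validate_trace(trace, forbidden={"delete_db"}):
    print(violation)
```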
One command generates three audit artefacts: readiness_report.json, executive_summary.md, and release-gate-evidence.html. Attach to PRs, compliance tickets, or security reviews.
Stop hardcoding prices. A model: block declares pricing source: static, custom, locked snapshot, OpenRouter live, or LiteLLM. Unknown pricing with on_unknown: hold fails the check — never assumes $0. Works for LLMs, embeddings, and self-hosted models.
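The "never assumes $0" behaviour can be illustrated with a tiny resolver. Everything here is a sketch under assumptions: the price table, the model name, the per-million-token unit, and the function `estimate_cost` are invented for the example. Only the `on_unknown: hold` setting comes from the text.

```python
from typing import Optional

# Hypothetical static table: USD per 1M tokens.
STATIC_PRICES = {"example-model": 0.50}


def estimate_cost(model: str, tokens: int,
                  on_unknown: str = "hold") -> tuple[str, Optional[float]]:
    """Return (status, cost). Unknown pricing never falls back to a cost of 0."""
    price = STATIC_PRICES.get(model)
    if price is None:
        if on_unknown == "hold":
            return ("HOLD", None)
        # Other on_unknown values are not described in the text, so reject them here.
        raise ValueError(f"no price for {model!r} and on_unknown={on_unknown!r}")
    return ("OK", price * tokens / 1_000_000)


print(estimate_cost("example-model", 2_000_000))  # ('OK', 1.0)
print(estimate_cost("mystery-model", 2_000_000))  # ('HOLD', None)
```

Returning `None` rather than `0.0` for the unknown case keeps a missing price from being summed silently into a budget estimate.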
Snapshot live prices into a tamper-evident pricing.lock.json (sha256-protected). CI scores offline, reproducibly. Stale snapshots (> max_age_days) surface as WARN so prices never drift silently.
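One way a tamper-evident, age-checked snapshot could work is sketched below with the standard library. The lock layout (`prices`, `captured_at`, `sha256`), the function names, and the FAIL result on a digest mismatch are assumptions. The actual pricing.lock.json format is not shown here.

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone


def _digest(prices: dict, captured_at: str) -> str:
    """SHA-256 over a canonical (sorted-key) JSON encoding of the snapshot."""
    payload = json.dumps({"prices": prices, "captured_at": captured_at}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_lock(prices: dict, now: datetime) -> dict:
    captured_at = now.isoformat()
    return {"prices": prices, "captured_at": captured_at,
            "sha256": _digest(prices, captured_at)}


def check_lock(lock: dict, now: datetime, max_age_days: int) -> str:
    if _digest(lock["prices"], lock["captured_at"]) != lock["sha256"]:
        return "FAIL"  # assumption: edited content fails outright
    age = now - datetime.fromisoformat(lock["captured_at"])
    return "WARN" if age > timedelta(days=max_age_days) else "PASS"


taken = datetime(2024, 1, 1, tzinfo=timezone.utc)
lock = make_lock({"example-model": 0.50}, taken)  # hypothetical price
print(check_lock(lock, taken + timedelta(days=5), max_age_days=30))   # PASS
print(check_lock(lock, taken + timedelta(days=45), max_age_days=30))  # WARN
```

Sorting keys before hashing makes the digest independent of dictionary ordering, which is what lets CI recompute it reproducibly.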
Normal vs. runaway cost side-by-side. Engineering leaders see the dollars at risk, not YAML warnings. The HTML report uploads as a CI artifact automatically.
Sign governance.yaml with RSA-PSS + SHA-256. Verify in CI that no one changed budget limits or policies after review.
5 lines in your workflow. Exit code 0 = PROMOTE, 10 = HOLD, 1 = BLOCK. The HTML report is auto-uploaded as a CI artifact — your team reviews it without leaving GitHub.
CI/CD Integration
Works with GitHub Actions, GitLab CI, Jenkins, and any shell. All commands return structured exit codes.
| Command | What it does |
|---|---|
| release-gate demo | Interactive demo, no config needed |
| release-gate score | 0–100 score + PROMOTE/HOLD/BLOCK |
| release-gate compare | Regression detection vs baseline |
| release-gate evidence-pack | JSON + Markdown + HTML artefacts |
| release-gate pricing-lock | Snapshot live model prices to lock file |
| release-gate impact | Cost simulation & runaway scenario |
0 = PROMOTE / PASS — deploy it.
10 = HOLD / WARN — review before deploying.
1 = BLOCK / FAIL — do not deploy.
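For pipelines driven from a script rather than a workflow file, the exit codes above can be mapped back to decisions. This is a sketch: the helper names are hypothetical, and treating undocumented exit codes as BLOCK is a conservative assumption. The only command used is release-gate score with a config path, as shown earlier.

```python
import subprocess

# Exit codes as documented above.
EXIT_DECISIONS = {0: "PROMOTE", 10: "HOLD", 1: "BLOCK"}


def decision_from_exit_code(code: int) -> str:
    """Map a release-gate exit code to its decision."""
    # Assumption: any code outside the documented three is treated as BLOCK.
    return EXIT_DECISIONS.get(code, "BLOCK")


def run_gate(config_path: str) -> str:
    """Run `release-gate score <config_path>` and return the decision."""
    result = subprocess.run(["release-gate", "score", config_path])
    return decision_from_exit_code(result.returncode)


print(decision_from_exit_code(10))  # HOLD
```

With release-gate installed, `run_gate("examples/governance-safe-pass.yaml")` would return the decision for that config.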
Every PR gets a readiness report, executive summary, and HTML dashboard — attached automatically so reviewers see the full picture without running anything locally.
Commit readiness_report.json as your baseline. Run release-gate compare on every PR to catch silent degradations in safety or fallback coverage.
Governance checks
Each check maps to a real failure mode — cost explosion, no kill switch, open access, bad inputs, forbidden tool use.
| Check / Layer | What it validates | Blocked when |
|---|---|---|
| ACTION_BUDGET | Estimated daily cost vs. declared budget cap | Cost exceeds max_daily_cost or no budget set |
| BUDGET_SIMULATION | Projected cost with retries, caching & spike multipliers across 10+ models | Projected cost exceeds budget or multipliers are out of range |
| FALLBACK_DECLARED | Kill switch, fallback mode, team owner, runbook URL | Any field missing — no owner means no one gets paged at 3 AM |
| IDENTITY_BOUNDARY | Auth required, rate limit configured, data isolation rules | Auth is optional or rate limit absent — anyone can exhaust budget |
| INPUT_CONTRACT | JSON Schema defined, valid & invalid sample payloads provided | Schema missing (FAIL) or no valid samples (WARN) |
| Evals (behavior) | refuse_or_mask, contains_keywords, valid_json, no_tool_calls — declared in YAML | Critical evals fail (safety category) |
| Trace Validator | Forbidden tools, allowed-list violations, retry storms, token budget, tool loops | Any forbidden tool called or retry storm detected |
| Pricing Resolver | Model token pricing from static table, custom inline, lock file, OpenRouter, or LiteLLM | Pricing unknown & on_unknown: hold — never silently assumes $0 |
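The two budget rows in the table can be pictured with a small projection. The formula below is an assumption made for illustration (retries add cost, cache hits remove it, a spike multiplier scales the result) and is not release-gate's actual model. The parameter names other than max_daily_cost are also hypothetical.

```python
from typing import Optional


def projected_daily_cost(base_cost: float, retry_rate: float,
                         cache_hit_rate: float, spike_multiplier: float) -> float:
    """Hypothetical projection of daily cost in USD."""
    for name, rate in (("retry_rate", retry_rate), ("cache_hit_rate", cache_hit_rate)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} out of range: {rate}")
    if spike_multiplier < 1.0:
        raise ValueError(f"spike_multiplier out of range: {spike_multiplier}")
    return base_cost * (1 + retry_rate) * (1 - cache_hit_rate) * spike_multiplier


def budget_check(projected: float, max_daily_cost: Optional[float]) -> str:
    """FAIL when no budget is set or the projection exceeds it."""
    if max_daily_cost is None:
        return "FAIL"
    return "FAIL" if projected > max_daily_cost else "PASS"


cost = projected_daily_cost(100.0, retry_rate=0.5, cache_hit_rate=0.5, spike_multiplier=2.0)
print(cost, budget_check(cost, max_daily_cost=200.0))  # 150.0 PASS
print(budget_check(cost, max_daily_cost=None))         # FAIL
```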
See it live in 30 seconds
pip install release-gate && release-gate demo
Then set up your own agent: release-gate init