The Agent Quality Toolkit: measure, generate, guard, and improve AI agent performance across your entire codebase in minutes.
From raw codebase to an actionable agent quality score, automated end to end.
Everything you need to evaluate and improve AI agent quality.
Benchmark AI agents on real coding tasks from your repository. Compare Claude, Codex, and any other agent side by side.
Automatically generate AGENTS.md and CLAUDE.md context files that make any AI coding agent perform better on your codebase.
Static analysis for AI diffs and context files. Catch hallucinated imports, broken references, and anti-patterns before they land.
Turn agent failures into reusable rules. Distill lessons from bad diffs into your AGENTS.md automatically.
Model Context Protocol server exposing all toolkit tools. Drop agentkit into any MCP-compatible agent workflow.
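As a rough sketch of that drop-in workflow, the server could be registered with an MCP-compatible client such as Claude Code; note that the `agentkit mcp` subcommand shown here is an assumption for illustration and does not appear in the command table below.

```bash
# Hedged sketch: expose agentkit's tools to an MCP-compatible client.
# `agentkit mcp` is assumed to start the MCP server; confirm the actual
# subcommand with `agentkit --help` before relying on it.
claude mcp add agentkit -- agentkit mcp
```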
The umbrella CLI that ties it all together. One command to run the full pipeline, score your repo, and generate reports.
Quality score in under 60 seconds, no configuration required.
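A minimal sketch of that path, using only commands from the table below (no additional flags are assumed):

```bash
# One-command path to a composite quality score.
agentkit quickstart

# Or the slightly longer route: set up the project, run the full
# pipeline on the current directory, then print the composite score.
agentkit init
agentkit run .
agentkit score
```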
All major commands at a glance.
| Command | Description |
|---|---|
| agentkit quickstart | Fastest path to a composite quality score; start here |
| agentkit run . | Full pipeline analysis on the current directory |
| agentkit analyze github:owner/repo | Analyze any public GitHub repository |
| agentkit benchmark | Compare Claude and Codex on tasks from your codebase |
| agentkit score | Compute and display composite score |
| agentkit gate --min-score 70 | Fail CI if the score falls below the threshold (see the CI sketch after this table) |
| agentkit demo --record | Print VHS tape commands for terminal recording |
| agentkit org github:vercel | Score every public repo in a GitHub org |
| agentkit doctor | Check toolchain health and configuration |
| agentkit init | Initialize agentkit in the current project |
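For continuous integration, a minimal sketch wires the pipeline and the gate into one job; this assumes `agentkit gate` reports a failing score through a non-zero exit code, which the table above does not state explicitly.

```bash
# Hypothetical CI step (any shell-based CI runner): analyze the checkout,
# then fail the job if the composite score is below 70, assuming
# `agentkit gate` exits non-zero when the threshold is not met.
agentkit run .
agentkit gate --min-score 70
```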