High-level snapshot comparing the latest release against the golden baseline (the reference evaluation used as the quality target).
| Metric | What it measures |
|---|---|
| Unit test pass rate | Percentage of generated unit tests that pass. Higher means more reliable code generation. |
| Contract tests | API compliance checks against the OpenAPI spec (passed/total). 88/88 = full compliance. |
| Lint findings | Static analysis warnings in generated code. Lower is better — 0 means clean code. |
| Qualitative score | AI-graded documentation quality on a 0–1 scale (higher is better). |
| Execution time | Wall-clock time for the full evaluation run. Lower means faster generation. |
| Total tokens | Total LLM tokens consumed (input + output). Lower means more cost-efficient. |

| Metric | Golden | Latest (v0.1.5) | vs Golden |
|---|---|---|---|
| Unit test pass rate | 100.0% (180/180) | 100.0% (175/175) | = |
| Contract tests | 88/88 | 88/88 | = |
| Lint findings | 0 | 0 | = |
| Qualitative score | 0.854 | 0.898 | +0.044 |
| Execution time | 23.8m | 17.9m | -5.9m |
| Total tokens | 18.39M | 13.66M | -4.74M |
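The "vs Golden" column is the latest value minus the golden value, with "=" for no change. A minimal sketch of that arithmetic follows; the function name `vs_golden` and its rounding behaviour are illustrative assumptions, not the report generator's actual code.

```python
def vs_golden(golden: float, latest: float, digits: int = 3) -> str:
    """Format latest - golden as a signed delta, or "=" when unchanged."""
    delta = round(latest - golden, digits)
    if delta == 0:
        return "="
    # The "+" flag forces an explicit sign on positive deltas.
    return f"{delta:+.{digits}f}"


print(vs_golden(0.854, 0.898))            # qualitative score row
print(vs_golden(23.8, 17.9, digits=1))    # execution time row, in minutes
print(vs_golden(0, 0))                    # lint findings row
```

Deltas recomputed from the rounded figures shown in the table can differ in the last digit from the reported ones (for example the token row), presumably because the report subtracts unrounded values.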
Measures whether the code generated by each rules version actually works correctly. This is the most fundamental quality gate — code that doesn’t pass its own tests is broken.
Unit tests validate individual functions and components in isolation. The AIDLC rules instruct the AI to generate both source code and test suites.
Pass/Total = tests that passed out of total generated. Rate = pass percentage (100% = all tests passing). Failures = tests that ran but produced wrong results.
| Version | Pass/Total | Rate | Failures |
|---|---|---|---|
| v0.1.0 | 250/250 | 100.0% | 0 |
| v0.1.1 | 194/194 | 100.0% | 0 |
| v0.1.2 | 180/180 | 100.0% | 0 |
| v0.1.3 | 126/126 | 100.0% | 0 |
| v0.1.4 | 156/156 | 100.0% | 0 |
| v0.1.5 | 175/175 | 100.0% | 0 |
Contract tests verify that the generated API implementation matches its OpenAPI specification. Each test sends a request to an endpoint and checks that the HTTP status code and response shape match the spec.
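The check described above can be sketched as a pure function that compares an observed response with what the spec expects. This is only an illustration: `Expectation`, `check_contract`, and the idea of representing "response shape" as a set of required top-level keys are assumptions, and the actual test harness is not shown in this report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """What the spec says one request should return (hypothetical structure)."""
    method: str
    path: str
    status: int
    required_keys: frozenset = frozenset()


def check_contract(exp: Expectation, status: int, body: dict) -> list:
    """Return a list of human-readable deviations; an empty list means a pass."""
    problems = []
    if status != exp.status:
        problems.append(f"expected {exp.status}, got {status}")
    missing = exp.required_keys - set(body)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    return problems


# Divide-by-zero is expected to be rejected with a 400.
divide_by_zero = Expectation("POST", "/api/v1/arithmetic/divide", 400)
print(check_contract(divide_by_zero, 200, {}))  # one deviation: wrong status
print(check_contract(divide_by_zero, 400, {}))  # no deviations
```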
88 endpoints are tested per version. Pass/Total = endpoints that returned the expected status code. Rate = pass percentage (100% = full spec compliance).
Failures = number of endpoints that deviated from the spec; the specific endpoints are listed after the table.
| Version | Pass/Total | Rate | Failures |
|---|---|---|---|
| v0.1.0 | 88/88 | 100.0% | 0 |
| v0.1.1 | 88/88 | 100.0% | 0 |
| v0.1.2 | 88/88 | 100.0% | 0 |
| v0.1.3 | 85/88 | 96.6% | 3 |
| v0.1.4 | 88/88 | 100.0% | 0 |
| v0.1.5 | 88/88 | 100.0% | 0 |
Failed endpoints (v0.1.3, the only version with contract failures):

- POST /api/v1/arithmetic/add: expected 422, got 200 (add missing field → 422)
- POST /api/v1/arithmetic/divide: expected 400, got 200 (divide by zero → error)
- POST /api/v1/arithmetic/modulo: expected 400, got 200 (modulo by zero → error)

Measures the quality of generated documentation by comparing it against human-authored reference documents. An AI evaluator scores each document on completeness, accuracy, and clarity, producing a 0–1 score (1.0 = perfect match to reference quality).
The weighted average across all evaluated documents. This is the single best indicator of how well the rules produce documentation.
Scores above 0.90 are considered strong; below 0.70 signals significant gaps.
Golden baseline: 0.854
| Version | Overall | vs Golden |
|---|---|---|
| v0.1.0 | 0.860 | +0.006 |
| v0.1.1 | 0.888 | +0.033 |
| v0.1.2 | 0.893 | +0.038 |
| v0.1.3 | 0.866 | +0.012 |
| v0.1.4 | 0.891 | +0.037 |
| v0.1.5 | 0.898 | +0.044 |
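The overall score is described as a weighted average of per-document scores, but the weights are not given in this report. The sketch below shows the general calculation only; the function name and the example weights are hypothetical.

```python
def weighted_average(scores: dict, weights: dict) -> float:
    """Weighted mean of per-document scores; every scored document needs a weight."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Two v0.1.5 scores from this report, combined with made-up weights.
scores = {"requirements.md": 0.97, "build-instructions.md": 0.87}
weights = {"requirements.md": 2.0, "build-instructions.md": 1.0}  # assumption
print(round(weighted_average(scores, weights), 3))
```

With equal weights this reduces to a plain mean, which is a reasonable first guess when the real weighting is unknown.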
Documents are grouped by SDLC phase. Inception covers early-stage design artifacts (requirements, architecture plans, component designs) — these are generated first and set the foundation.
Construction covers build-time artifacts (build instructions, test instructions, build-and-test summaries) — these depend on inception outputs being correct.
A drop in inception quality often cascades into construction.
| Version | Inception | Construction |
|---|---|---|
| v0.1.0 | 0.880 | 0.840 |
| v0.1.1 | 0.894 | 0.882 |
| v0.1.2 | 0.921 | 0.864 |
| v0.1.3 | 0.886 | 0.846 |
| v0.1.4 | 0.890 | 0.892 |
| v0.1.5 | 0.879 | 0.918 |
Individual quality scores for each generated document across all versions. This reveals which specific documents are consistently strong, improving, or problematic. Documents scoring below 0.70 (red) are the top candidates for rules improvements.
| Document | v0.1.0 | v0.1.1 | v0.1.2 | v0.1.3 | v0.1.4 | v0.1.5 |
|---|---|---|---|---|---|---|
| application-design-plan.md | N/A | 0.96 | 1.00 | 0.96 | 1.00 | 0.95 |
| build-and-test-summary.md | 0.93 | 0.95 | 0.90 | 0.90 | 0.97 | 0.95 |
| build-instructions.md | 0.75 | 0.75 | 0.78 | 0.88 | 0.77 | 0.87 |
| component-dependency.md | 0.97 | 0.95 | 0.96 | 0.96 | 1.00 | 0.95 |
| component-methods.md | 0.93 | 0.90 | 0.96 | 0.98 | 0.93 | 0.96 |
| components.md | 1.00 | 0.98 | 1.00 | 0.97 | 0.98 | 0.98 |
| execution-plan.md | 0.97 | 0.91 | 0.98 | 0.93 | 0.97 | 0.97 |
| integration-test-instructions.md | 0.85 | 0.87 | 0.82 | 0.70 | 0.88 | 0.91 |
| requirement-verification-questions.md | 0.38 | 0.54 | 0.54 | 0.38 | 0.36 | 0.28 |
| requirements.md | 1.00 | 1.00 | 0.97 | 1.00 | 0.97 | 0.97 |
| sci-calc-code-generation-plan.md | 0.97 | 0.98 | 0.92 | 0.98 | 0.98 | 0.98 |
| services.md | 0.91 | 0.91 | 0.96 | 0.91 | 0.91 | 0.97 |
| unit-test-instructions.md | 0.70 | 0.86 | 0.90 | 0.77 | 0.86 | 0.88 |

Legend: green ≥ 0.90 · yellow 0.70–0.89 · red < 0.70
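The color bands can be expressed as a small threshold function. This is a sketch using the thresholds from the legend; the function name `band` is illustrative, and treating anything between 0.89 and 0.90 as yellow is an assumption that only matters for scores with more than two decimals.

```python
def band(score: float) -> str:
    """Map a 0-1 quality score to the report's color band."""
    if score >= 0.90:
        return "green"
    if score >= 0.70:
        return "yellow"
    return "red"


# v0.1.5 scores taken from the table above.
v015 = {
    "requirements.md": 0.97,
    "build-instructions.md": 0.87,
    "requirement-verification-questions.md": 0.28,
}
for name, score in v015.items():
    print(f"{name}: {band(score)}")
```

Filtering for `band(score) == "red"` gives the candidates for rules improvements; in this data that is requirement-verification-questions.md in every version.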
Tracks whether the generated output includes the same set of documents as the reference. Unmatched Ref = reference documents the AI failed to generate (missing output). Unmatched Candidate = extra documents the AI generated that don’t exist in the reference (unexpected output). Ideally both columns are 0, meaning the AI produced exactly the expected set of documents.
| Version | Unmatched Ref | Unmatched Candidate |
|---|---|---|
| v0.1.0 | 1 | 1 |
| v0.1.1 | 0 | 4 |
| v0.1.2 | 0 | 1 |
| v0.1.3 | 0 | 6 |
| v0.1.4 | 0 | 6 |
| v0.1.5 | 0 | 0 |
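Both columns are set differences between the reference and generated file names. A minimal sketch, assuming documents are matched by exact file name (the report does not say how matching is done); `notes.md` is a made-up extra file.

```python
def unmatched(reference, candidate):
    """Return (unmatched_ref, unmatched_candidate) as sorted lists of names."""
    ref, cand = set(reference), set(candidate)
    return sorted(ref - cand), sorted(cand - ref)


reference = ["requirements.md", "components.md", "services.md"]
candidate = ["requirements.md", "components.md", "notes.md"]  # notes.md is hypothetical

missing, extra = unmatched(reference, candidate)
print(len(missing), missing)  # reference documents that were not generated
print(len(extra), extra)      # generated documents with no reference counterpart
```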
Tracks the computational resources consumed by each evaluation run. These metrics directly affect cost (tokens) and developer wait time (execution time). Lower values are generally better, as long as quality metrics remain stable.
Total LLM tokens consumed during the run, broken down by agent. Total = all tokens across all agents (input + output).
Executor = the agent that generates code and documents. Simulator = the agent that simulates user interactions for testing.
Token count is the primary cost driver — each token represents a unit of LLM usage billed by the provider.
| Version | Total | Executor | Simulator |
|---|---|---|---|
| v0.1.0 | 9.26M | 4.65M | 119.3K |
| v0.1.1 | 13.34M | 6.56M | 266.2K |
| v0.1.2 | 8.34M | 4.15M | 295.5K |
| v0.1.3 | 11.52M | 5.72M | 222.3K |
| v0.1.4 | 11.52M | 5.67M | 251.9K |
| v0.1.5 | 13.66M | 6.88M | 90.2K |
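The token columns mix K and M suffixes, so comparing them programmatically needs a normalization step. The helper below is a sketch; `parse_count` is an illustrative name, and it assumes K = 1,000 and M = 1,000,000.

```python
from decimal import Decimal

_SUFFIXES = {"K": 1_000, "M": 1_000_000}


def parse_count(text: str) -> int:
    """Convert strings such as "13.66M" or "90.2K" to an integer token count."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        # Decimal avoids binary floating-point error in values like 9.26 * 1e6.
        return int(Decimal(text[:-1]) * _SUFFIXES[text[-1]])
    return int(Decimal(text))


# Total tokens, v0.1.4 to v0.1.5.
print(parse_count("13.66M") - parse_count("11.52M"))
# Simulator tokens for v0.1.5.
print(parse_count("90.2K"))
```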
Wall-clock duration of the full evaluation pipeline, broken down by handoff. Each handoff (H1, H2, H3) represents a sequential phase.
H1 is typically code generation (the longest phase), H2 is build/test execution, and H3 is result collection and reporting.
Wall Clock is the total end-to-end time.
| Version | Wall Clock | Handoff Breakdown |
|---|---|---|
| v0.1.0 | 16.0m | H1: 13.8m · H2: 0.9m · H3: 1.3m |
| v0.1.1 | 18.6m | H1: 17.1m · H2: 1.0m · H3: 0.5m |
| v0.1.2 | 15.5m | H1: 11.7m · H2: 1.4m · H3: 2.4m |
| v0.1.3 | 18.8m | H1: 15.8m · H2: 1.3m · H3: 1.7m |
| v0.1.4 | 16.8m | H1: 14.8m · H2: 1.3m · H3: 0.6m |
| v0.1.5 | 17.9m | H1: 15.0m · H2: 0.7m · H3: 2.2m |
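Because the handoffs run sequentially, their durations should add up to the wall-clock time. The sketch below parses one breakdown cell and checks that; `parse_handoffs` is an illustrative helper, not part of the pipeline.

```python
import re


def parse_handoffs(breakdown: str) -> dict:
    """Parse "H1: 13.8m · H2: 0.9m · H3: 1.3m" into {"H1": 13.8, ...} (minutes)."""
    return {
        name: float(minutes)
        for name, minutes in re.findall(r"(H\d+):\s*([\d.]+)m", breakdown)
    }


parts = parse_handoffs("H1: 13.8m · H2: 0.9m · H3: 1.3m")  # v0.1.0 row
total = round(sum(parts.values()), 1)
print(parts)
print(total)                          # compare with the Wall Clock column
print(round(parts["H1"] / total, 3))  # share of time spent in H1
```

A row whose parts differ from the reported wall clock by 0.1m (v0.1.4 in the table) is most likely a display-rounding effect and not missing time.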
Measures how much of the LLM’s context window is being used across API calls. Max = the largest single context seen during the run (approaching the model’s limit risks truncation or degraded output).
Avg = the mean context size across all API calls. Median = the midpoint context size (less affected by outliers than avg).
High context pressure can indicate overly verbose prompts or accumulated conversation history.
| Version | Max | Avg | Median |
|---|---|---|---|
| v0.1.0 | 97.4K | 44.9K | 43.7K |
| v0.1.1 | 138.5K | 57.2K | 50.4K |
| v0.1.2 | 96.4K | 38.8K | 26.6K |
| v0.1.3 | 118.6K | 49.7K | 42.4K |
| v0.1.4 | 109.8K | 48.8K | 48.5K |
| v0.1.5 | 121.7K | 56.6K | 55.2K |
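The three columns are standard summary statistics over per-call context sizes. The sketch below uses made-up sizes to show why the median is less sensitive to a single large call than the average; `context_stats` is an illustrative name.

```python
from statistics import mean, median


def context_stats(sizes):
    """Summarize per-call context sizes (in tokens)."""
    return {"max": max(sizes), "avg": mean(sizes), "median": median(sizes)}


# Hypothetical run: three ordinary calls and one very large one.
sizes = [20_000, 30_000, 40_000, 130_000]
stats = context_stats(sizes)
print(stats["max"])     # the single largest context
print(stats["avg"])     # pulled upward by the 130K call
print(stats["median"])  # stays near the typical call size
```

A large gap between Avg and Median, as in the v0.1.2 row, suggests a few unusually large calls and not uniformly high context usage.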
Static analysis of the generated codebase. These metrics reflect the cleanliness and maintainability of the AI-generated code, independent of whether it passes tests.
| Metric | What it measures |
|---|---|
| Lint Findings | Warnings from static analysis (style violations, unused variables, etc.). 0 = clean. |
| Security Findings | Vulnerabilities detected by security scanners (SQL injection, XSS, etc.). N/A if no scanner was configured. |
| Source Files | Number of non-test source files in the generated project. |
| LOC | Total lines of code across all source files. Large swings may indicate generated boilerplate or missing modules. |

| Version | Lint Findings | Security Findings | Source Files | LOC |
|---|---|---|---|---|
| v0.1.0 | 0 | N/A | 977 | 398.5K |
| v0.1.1 | 0 | N/A | 977 | 398.1K |
| v0.1.2 | 0 | N/A | 976 | 397.4K |
| v0.1.3 | 0 | N/A | 977 | 397.7K |
| v0.1.4 | 0 | N/A | 976 | 397.5K |
| v0.1.5 | 0 | N/A | 976 | 397.6K |
Tracks whether the evaluation pipeline itself ran smoothly, independent of output quality.
| Metric | What it measures |
|---|---|
| Error Events | Runtime errors logged during the run (exceptions, timeouts, API failures). 0 = clean run. |
| Handoffs | Number of sequential pipeline phases completed. Typically 3 (generate, build/test, report). A different count may indicate an early abort or retry. |
| Server Startup | Whether the generated application server started successfully. A failure here means the generated code couldn’t even boot, preventing contract tests from running. |

| Version | Error Events | Handoffs | Server Startup |
|---|---|---|---|
| v0.1.0 | 0 | 3 | PASS |
| v0.1.1 | 0 | 3 | PASS |
| v0.1.2 | 0 | 3 | PASS |
| v0.1.3 | 0 | 3 | PASS |
| v0.1.4 | 0 | 3 | PASS |
| v0.1.5 | 0 | 3 | PASS |
Each row shows the change from one release to the next, making it easy to spot which version introduced an improvement or regression. Positive values (+) indicate an increase; negative values (-) indicate a decrease. For Unit Tests and Contract, the value is the change in the number of passing tests, so positive is generally better. Because every release passed 100% of its unit tests, the Unit Tests deltas here reflect changes in how many tests were generated, not new failures. For Qualitative, positive is better (higher quality score). For Tokens and Time, negative is better (more efficient).
| Transition | Unit Tests | Contract | Qualitative | Tokens | Time |
|---|---|---|---|---|---|
| v0.1.0 → v0.1.1 | -56 | +0 | +0.028 | +4.08M | +155s |
| v0.1.1 → v0.1.2 | -14 | +0 | +0.005 | -5.00M | -188s |
| v0.1.2 → v0.1.3 | -54 | -3 | -0.026 | +3.19M | +200s |
| v0.1.3 → v0.1.4 | +30 | +3 | +0.025 | -9.2K | -122s |
| v0.1.4 → v0.1.5 | +19 | +0 | +0.007 | +2.14M | +71s |
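Each transition value is the difference between consecutive releases. A minimal sketch using the passing unit test counts from earlier in this report; the function name `transitions` is illustrative.

```python
def transitions(versions, values):
    """Pair each release with its successor and return (label, delta) tuples."""
    return [
        (f"{versions[i]} → {versions[i + 1]}", values[i + 1] - values[i])
        for i in range(len(values) - 1)
    ]


versions = ["v0.1.0", "v0.1.1", "v0.1.2", "v0.1.3", "v0.1.4", "v0.1.5"]
passing_unit_tests = [250, 194, 180, 126, 156, 175]

for label, delta in transitions(versions, passing_unit_tests):
    print(f"{label}: {delta:+d}")
```

The same function applies to the other columns once their values are parsed into numbers.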
Evaluation results from non-release sources — the main branch and open pull requests. These represent in-progress work that hasn’t been tagged as a release yet. Use this data to preview whether upcoming changes will improve or regress metrics before they ship.
No pre-release data available.