AIDLC Rules Trend Report

6 releases (v0.1.0 through v0.1.5) · awslabs/aidlc-workflows · 2026-03-20T16:04:28.635438+00:00

A. Executive Summary

Qualitative Score: 0.898 (Golden: 0.854)
Contract Tests: 88/88 (100.0% pass rate)
Unit Tests: 100.0% (175/175 passed)
Lint Findings: 0 (Golden: 0)
Execution Time: 17.9m (Golden: 23.8m)
Total Tokens: 13.66M (Golden: 18.39M)

High-level snapshot comparing the latest release against the golden baseline (the reference evaluation used as the quality target).

Metric: What it measures
Unit test pass rate: Percentage of generated unit tests that pass. Higher means more reliable code generation.
Contract tests: API compliance checks against the OpenAPI spec (passed/total). 88/88 = full compliance.
Lint findings: Static analysis warnings in generated code. Lower is better; 0 means clean code.
Qualitative score: AI-graded documentation quality on a 0–1 scale (higher is better).
Execution time: Wall-clock time for the full evaluation run. Lower means faster generation.
Total tokens: Total LLM tokens consumed (input + output). Lower means more cost-efficient.

Metric Golden Latest (v0.1.5) vs Golden
Unit test pass rate 100.0% (180/180) 100.0% (175/175) =
Contract tests 88/88 88/88 =
Lint findings 0 0 =
Qualitative score 0.854 0.898 +0.044
Execution time 23.8m 17.9m -5.9m
Total tokens 18.39M 13.66M -4.74M
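
The "vs Golden" column is the latest value minus the golden value. A minimal Python sketch of that arithmetic follows, using the rounded figures from the table above. The function name `vs_golden` is illustrative and not part of the report tooling. Because it works from rounded values, the token delta comes out as -4.73 rather than the reported -4.74M, which was presumably computed from unrounded counts.

```python
def vs_golden(latest: float, golden: float, digits: int = 3) -> str:
    """Format latest minus golden as a signed delta, or '=' when equal."""
    delta = round(latest - golden, digits)
    if delta == 0:
        return "="
    return f"{delta:+.{digits}f}"


print(vs_golden(0.898, 0.854))            # qualitative score
print(vs_golden(17.9, 23.8, digits=1))    # execution time, minutes
print(vs_golden(13.66, 18.39, digits=2))  # total tokens, millions
print(vs_golden(88, 88))                  # contract tests passed
```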

B. Functional Correctness

Measures whether the code generated by each rules version actually works correctly. This is the most fundamental quality gate — code that doesn’t pass its own tests is broken.

B.1 Unit Tests

Unit tests validate individual functions and components in isolation. The AIDLC rules instruct the AI to generate both source code and test suites.

Pass/Total = tests that passed out of total generated. Rate = pass percentage (100% = all tests passing). Failures = tests that ran but produced wrong results.

Version Pass/Total Rate Failures
v0.1.0 250/250 100.0% 0
v0.1.1 194/194 100.0% 0
v0.1.2 180/180 100.0% 0
v0.1.3 126/126 100.0% 0
v0.1.4 156/156 100.0% 0
v0.1.5 175/175 100.0% 0
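
The Rate and Failures columns can be derived from Pass/Total. The sketch below assumes that every test that did not pass counts as a failure (no skipped or errored tests); the helper name `pass_rate` is hypothetical.

```python
def pass_rate(pass_total: str) -> tuple:
    """Turn a 'passed/total' string into (rate in percent, failure count)."""
    passed, total = (int(part) for part in pass_total.split("/"))
    rate = round(100.0 * passed / total, 1)
    failures = total - passed  # assumption: no skipped or errored tests
    return rate, failures


print(pass_rate("175/175"))  # v0.1.5 unit tests
print(pass_rate("85/88"))    # v0.1.3 contract tests (section B.2)
```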

B.2 Contract Tests (API Compliance)

Contract tests verify that the generated API implementation matches its OpenAPI specification. Each test sends a request to an endpoint and checks that the HTTP status code and response shape match the spec.
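
As a rough illustration of what one such check involves, the sketch below compares a response's status code and top-level keys against an expectation. Everything in it is hypothetical: the `Expectation` class, the `check_response` helper, and the sample responses are not taken from the evaluation harness, and "response shape" is simplified here to a set of required keys.

```python
from dataclasses import dataclass, field


@dataclass
class Expectation:
    """What the spec promises for one endpoint (hypothetical, simplified)."""
    status: int
    required_keys: set = field(default_factory=set)


def check_response(status: int, body: dict, expected: Expectation) -> list:
    """Return a list of deviations; an empty list means the endpoint passes."""
    problems = []
    if status != expected.status:
        problems.append(f"status {status}, expected {expected.status}")
    missing = expected.required_keys - set(body)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    return problems


# Hypothetical expectation and two hypothetical responses.
expected = Expectation(status=200, required_keys={"result"})
print(check_response(200, {"result": 4.0}, expected))    # passes: no deviations
print(check_response(500, {"error": "boom"}, expected))  # fails on status and shape
```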

88 endpoints are tested per version. Pass/Total = endpoints that returned the expected status code. Rate = pass percentage (100% = full spec compliance).

Failures lists the specific endpoints that deviated from the spec.

Version Pass/Total Rate Failures
v0.1.0 88/88 100.0% 0
v0.1.1 88/88 100.0% 0
v0.1.2 88/88 100.0% 0
v0.1.3 85/88 96.6% 3
v0.1.4 88/88 100.0% 0
v0.1.5 88/88 100.0% 0
v0.1.3 failures:

C. Qualitative Evaluation

Measures the quality of generated documentation by comparing it against human-authored reference documents. An AI evaluator scores each document on completeness, accuracy, and clarity, producing a 0–1 score (1.0 = perfect match to reference quality).

C.1 Overall Score

The weighted average across all evaluated documents. This is the single best indicator of how well the rules produce documentation.

Scores above 0.90 are considered strong; below 0.70 signals significant gaps.

Golden baseline: 0.854

Version Overall vs Golden
v0.1.0 0.860 +0.006
v0.1.1 0.888 +0.033
v0.1.2 0.893 +0.038
v0.1.3 0.866 +0.012
v0.1.4 0.891 +0.037
v0.1.5 0.898 +0.044

C.2 Phase Breakdown

Documents are grouped by SDLC phase. Inception covers early-stage design artifacts (requirements, architecture plans, component designs) — these are generated first and set the foundation.

Construction covers build-time artifacts (build instructions, test instructions, build-and-test summaries) — these depend on inception outputs being correct.

A drop in inception quality often cascades into construction.

Version Inception Construction
v0.1.0 0.880 0.840
v0.1.1 0.894 0.882
v0.1.2 0.921 0.864
v0.1.3 0.886 0.846
v0.1.4 0.890 0.892
v0.1.5 0.879 0.918

C.3 Per-Document Heatmap

Individual quality scores for each generated document across all versions. This reveals which specific documents are consistently strong, improving, or problematic. Documents scoring below 0.70 (red) are the top candidates for rules improvements.

Document v0.1.0 v0.1.1 v0.1.2 v0.1.3 v0.1.4 v0.1.5
application-design-plan.md 0.96 1.00 0.96 1.00 0.95
build-and-test-summary.md 0.93 0.95 0.90 0.90 0.97 0.95
build-instructions.md 0.75 0.75 0.78 0.88 0.77 0.87
component-dependency.md 0.97 0.95 0.96 0.96 1.00 0.95
component-methods.md 0.93 0.90 0.96 0.98 0.93 0.96
components.md 1.00 0.98 1.00 0.97 0.98 0.98
execution-plan.md 0.97 0.91 0.98 0.93 0.97 0.97
integration-test-instructions.md 0.85 0.87 0.82 0.70 0.88 0.91
requirement-verification-questions.md 0.38 0.54 0.54 0.38 0.36 0.28
requirements.md 1.00 1.00 0.97 1.00 0.97 0.97
sci-calc-code-generation-plan.md 0.97 0.98 0.92 0.98 0.98 0.98
services.md 0.91 0.91 0.96 0.91 0.91 0.97
unit-test-instructions.md 0.70 0.86 0.90 0.77 0.86 0.88

green ≥ 0.90 · yellow 0.70–0.89 · red < 0.70
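
The color bands can be applied mechanically to find candidates for rules improvements. The sketch below uses three v0.1.5 scores from the table above; the function name `band` is illustrative.

```python
def band(score: float) -> str:
    """Map a 0-1 quality score to the heatmap color band."""
    if score >= 0.90:
        return "green"
    if score >= 0.70:
        return "yellow"
    return "red"


# v0.1.5 scores for three documents, copied from the heatmap.
v015 = {
    "build-instructions.md": 0.87,
    "integration-test-instructions.md": 0.91,
    "requirement-verification-questions.md": 0.28,
}
red_docs = [name for name, score in v015.items() if band(score) == "red"]
print(red_docs)
```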

C.4 Document Coverage

Tracks whether the generated output includes the same set of documents as the reference. Unmatched Ref = reference documents the AI failed to generate (missing output). Unmatched Candidate = extra documents the AI generated that don’t exist in the reference (unexpected output). Ideally both columns are 0, meaning the AI produced exactly the expected set of documents.

Version Unmatched Ref Unmatched Candidate
v0.1.0 1 1
v0.1.1 0 4
v0.1.2 0 1
v0.1.3 0 6
v0.1.4 0 6
v0.1.5 0 0
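
Both columns amount to set differences between the reference and generated document sets. The sketch below assumes documents are matched by exact filename, which may not be how the evaluator pairs them; `coverage` and `extra-notes.md` are made-up names for illustration.

```python
def coverage(reference: set, candidate: set) -> tuple:
    """Return (unmatched reference docs, unmatched candidate docs), sorted."""
    unmatched_ref = sorted(reference - candidate)        # expected but not generated
    unmatched_candidate = sorted(candidate - reference)  # generated but not expected
    return unmatched_ref, unmatched_candidate


reference = {"requirements.md", "components.md", "services.md"}
candidate = {"requirements.md", "components.md", "extra-notes.md"}  # made-up extra file

missing, extra = coverage(reference, candidate)
print(len(missing), len(extra))
```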

D. Efficiency & Cost Metrics

Tracks the computational resources consumed by each evaluation run. These metrics directly affect cost (tokens) and developer wait time (execution time). Lower values are generally better, as long as quality metrics remain stable.

D.1 Token Consumption

Total LLM tokens consumed during the run, broken down by agent. Total = all tokens across all agents (input + output).

Executor = the agent that generates code and documents. Simulator = the agent that simulates user interactions for testing.

Token count is the primary cost driver — each token represents a unit of LLM usage billed by the provider.

Version Total Executor Simulator
v0.1.0 9.26M 4.65M 119.3K
v0.1.1 13.34M 6.56M 266.2K
v0.1.2 8.34M 4.15M 295.5K
v0.1.3 11.52M 5.72M 222.3K
v0.1.4 11.52M 5.67M 251.9K
v0.1.5 13.66M 6.88M 90.2K

D.2 Execution Time

Wall-clock duration of the full evaluation pipeline, broken down by handoff. Each handoff (H1, H2, H3) represents a sequential phase.

H1 is typically code generation (the longest phase), H2 is build/test execution, and H3 is result collection and reporting.

Wall Clock is the total end-to-end time.

Version Wall Clock Handoff Breakdown
v0.1.0 16.0m H1: 13.8m · H2: 0.9m · H3: 1.3m
v0.1.1 18.6m H1: 17.1m · H2: 1.0m · H3: 0.5m
v0.1.2 15.5m H1: 11.7m · H2: 1.4m · H3: 2.4m
v0.1.3 18.8m H1: 15.8m · H2: 1.3m · H3: 1.7m
v0.1.4 16.8m H1: 14.8m · H2: 1.3m · H3: 0.6m
v0.1.5 17.9m H1: 15.0m · H2: 0.7m · H3: 2.2m
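
Since the handoffs run sequentially, their durations should add up to the wall-clock figure. The sketch below parses one breakdown string from the table and sums it; the helper name `handoff_minutes` is hypothetical. Sums can differ from the reported total by 0.1m (v0.1.4 adds up to 16.7m against a reported 16.8m), presumably because each value is rounded independently.

```python
import re


def handoff_minutes(breakdown: str) -> dict:
    """Parse 'H1: 15.0m · H2: 0.7m · H3: 2.2m' into {'H1': 15.0, ...}."""
    pairs = re.findall(r"(H\d+): ([\d.]+)m", breakdown)
    return {name: float(value) for name, value in pairs}


parts = handoff_minutes("H1: 15.0m · H2: 0.7m · H3: 2.2m")  # v0.1.5 row
total = round(sum(parts.values()), 1)
print(parts, total)
```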

D.3 Context Window Pressure

Measures how much of the LLM’s context window is being used across API calls. Max = the largest single context seen during the run (approaching the model’s limit risks truncation or degraded output).

Avg = the mean context size across all API calls. Median = the midpoint context size (less affected by outliers than avg).

High context pressure can indicate overly verbose prompts or accumulated conversation history.

Version Max Avg Median
v0.1.0 97.4K 44.9K 43.7K
v0.1.1 138.5K 57.2K 50.4K
v0.1.2 96.4K 38.8K 26.6K
v0.1.3 118.6K 49.7K 42.4K
v0.1.4 109.8K 48.8K 48.5K
v0.1.5 121.7K 56.6K 55.2K
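
To see why the median is less sensitive to outliers than the average, consider the sketch below. The context sizes are made-up numbers chosen for illustration, not data from any run.

```python
import statistics

# Made-up context sizes in tokens for five API calls; the last one is an outlier.
context_sizes = [20_000, 25_000, 30_000, 35_000, 120_000]

summary = {
    "max": max(context_sizes),
    "avg": statistics.mean(context_sizes),
    "median": statistics.median(context_sizes),
}
print(summary)
```

The single large call pulls the average well above the median, while the median stays at the typical call size.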

E. Code Quality

Static analysis of the generated codebase. These metrics reflect the cleanliness and maintainability of the AI-generated code, independent of whether it passes tests.

Metric: What it measures
Lint Findings: Warnings from static analysis (style violations, unused variables, etc.). 0 = clean.
Security Findings: Vulnerabilities detected by security scanners (SQL injection, XSS, etc.). N/A if no scanner was configured.
Source Files: Number of non-test source files in the generated project.
LOC: Total lines of code across all source files. Large swings may indicate generated boilerplate or missing modules.

Version Lint Findings Security Findings Source Files LOC
v0.1.0 0 N/A 977 398.5K
v0.1.1 0 N/A 977 398.1K
v0.1.2 0 N/A 976 397.4K
v0.1.3 0 N/A 977 397.7K
v0.1.4 0 N/A 976 397.5K
v0.1.5 0 N/A 976 397.6K

F. Stability & Reliability

Tracks whether the evaluation pipeline itself ran smoothly, independent of output quality.

Metric: What it measures
Error Events: Runtime errors logged during the run (exceptions, timeouts, API failures). 0 = clean run.
Handoffs: Number of sequential pipeline phases completed. Typically 3 (generate, build/test, report). A different count may indicate an early abort or retry.
Server Startup: Whether the generated application server started successfully. A failure here means the generated code couldn’t even boot, preventing contract tests from running.

Version Error Events Handoffs Server Startup
v0.1.0 0 3 PASS
v0.1.1 0 3 PASS
v0.1.2 0 3 PASS
v0.1.3 0 3 PASS
v0.1.4 0 3 PASS
v0.1.5 0 3 PASS

G. Version-over-Version Deltas

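Each row shows the change from one release to the next, making it easy to spot which version introduced an improvement or regression. Positive values (+) indicate an increase; negative values (-) indicate a decrease. Unit Tests and Contract show the change in the number of passing tests. For Contract, where the total is fixed at 88, positive is better. For Unit Tests, every version passes 100% of its tests (section B.1), so the delta tracks how many tests were generated rather than a change in pass rate. For Qualitative, positive is better (higher quality score). For Tokens and Time, negative is better (more efficient).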

Transition Unit Tests Contract Qualitative Tokens Time
v0.1.0 → v0.1.1 -56 +0 +0.028 +4.08M +155s
v0.1.1 → v0.1.2 -14 +0 +0.005 -5.00M -188s
v0.1.2 → v0.1.3 -54 -3 -0.026 +3.19M +200s
v0.1.3 → v0.1.4 +30 +3 +0.025 -9.2K -122s
v0.1.4 → v0.1.5 +19 +0 +0.007 +2.14M +71s
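
The Unit Tests and Contract columns can be reproduced by differencing consecutive rows of the tables in sections B.1 and B.2. The sketch below does that; the helper name `deltas` is illustrative.

```python
versions = ["v0.1.0", "v0.1.1", "v0.1.2", "v0.1.3", "v0.1.4", "v0.1.5"]
unit_passed = [250, 194, 180, 126, 156, 175]  # from section B.1
contract_passed = [88, 88, 88, 85, 88, 88]    # from section B.2


def deltas(values: list) -> list:
    """Difference between each value and the one before it."""
    return [new - old for old, new in zip(values, values[1:])]


transitions = zip(versions, versions[1:])
for (old, new), unit, contract in zip(transitions, deltas(unit_passed), deltas(contract_passed)):
    print(f"{old} → {new}: unit tests {unit:+d}, contract {contract:+d}")
```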

H. Pre-Release Data Points

Evaluation results from non-release sources — the main branch and open pull requests. These represent in-progress work that hasn’t been tagged as a release yet. Use this data to preview whether upcoming changes will improve or regress metrics before they ship.

No pre-release data available.