Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
typescript-dependency-easy-001 | easy | 0.740 | | $0.0010
typescript-explain-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: b1aa1fa3-578a-4644-ad0c-01b96b904848 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:05:41.615313+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
b1aa1fa3-578a-4644-ad0c-01b96b904848 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:41.615313+00:00
492c9f1e-0e57-4094-af06-4be5a8f11daa | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:41.550343+00:00
487c51aa-7d35-4d0a-bf55-033a4e43a014 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:41.519559+00:00
f898f6e3-9514-4fd3-b22d-cd9f507915a6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:41.469300+00:00
eb83e7a7-2d70-4fdc-9de9-7307f22e5df4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:41.422240+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 32 | ################################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
b1aa1fa3-578a-4644-ad0c-01b96b904848 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.615313+00:00
492c9f1e-0e57-4094-af06-4be5a8f11daa | typescript-explain-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.550343+00:00
487c51aa-7d35-4d0a-bf55-033a4e43a014 | typescript-dependency-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.519559+00:00
f898f6e3-9514-4fd3-b22d-cd9f507915a6 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.469300+00:00
eb83e7a7-2d70-4fdc-9de9-7307f22e5df4 | typescript-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.422240+00:00
250665af-4f7b-41d7-9ab8-f06146a0cbd0 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.376636+00:00
3f837978-0397-431c-91cd-014e8e82bf93 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.354868+00:00
d8908e04-54e9-4f32-9408-dfab86f8e883 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.320671+00:00
54ca06b6-326b-400a-bbc3-56e55e585a51 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.276053+00:00
f7edaf6b-d942-41bc-8276-e44521e534ba | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:41.243064+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0057  (coder)
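
How a frontier line like the one above could be derived: the sketch below is a guess at the logic, not the report generator's actual code. It assumes a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low while being strictly better on one of the two. The names Point, pareto_frontier and render are hypothetical, and the bar width of 20 is inferred from the line in this report.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """One (model, pass_rate, cost) entry; hypothetical structure."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates, cheapest first.

    Assumed dominance rule: q dominates p when q is at least as good on
    both axes (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: List[Point], width: int = 20) -> List[str]:
    """Format each point as a text bar, marking frontier points with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        mark = "*" if p in on_frontier else " "
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The only point present in this report.
    for line in render([Point("coder", 1.0, 0.0057)]):
        print(line)
```

With a single model in the report, that model is trivially on the frontier, which is why the only line carries the * marker.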