Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010
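
The header totals can be cross-checked against the per-task rows. The following is a minimal sketch, assuming each row is held as a (task_id, score, cost) tuple of strings; the `summarize` helper and the tuple layout are hypothetical and are not part of the tooling that produced this report. Decimal is used so the cost sum is exact rather than subject to float rounding.

```python
from decimal import Decimal


def summarize(rows):
    """Return (task_count, total_cost, mean_score) for (task_id, score, cost) rows."""
    count = len(rows)
    total_cost = sum((Decimal(cost) for _, _, cost in rows), Decimal("0"))
    mean_score = sum((Decimal(score) for _, score, _ in rows), Decimal("0")) / count
    return count, total_cost, mean_score


# Rows as listed in the task table: seven categories for both languages,
# plus the single python-recovery task. Every row has score 0.740, cost $0.0010.
categories = [
    "security-fix", "bugfix", "performance", "test-writing",
    "multi-file", "refactor", "config",
]
rows = [
    (f"{lang}-{cat}-easy-001", "0.740", "0.0010")
    for cat in categories
    for lang in ("python", "typescript")
]
rows.append(("python-recovery-easy-001", "0.740", "0.0010"))

count, total_cost, mean_score = summarize(rows)
print(f"Tasks: {count} | Cost: ${total_cost} | Mean score: {mean_score:.3f}")
```

Under these assumptions the sketch gives 15 tasks and a total cost of $0.0150, matching the header line.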

Leaderboard Snapshot

Latest run: dc03f728-35db-449e-88fb-81c42d2e7df3 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:18:43.556238+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
dc03f728-35db-449e-88fb-81c42d2e7df3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.556238+00:00
626519fe-a70a-4be2-91a1-7ccbacf53c0a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.488424+00:00
1815a75b-b37a-4bae-ac30-4570e5cb29bd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.414238+00:00
eb94bc2a-f79a-435b-a658-8ecb5955eb04 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.344567+00:00
6a1b3a91-d960-4476-a753-465312836794 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.282359+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 203 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
dc03f728-35db-449e-88fb-81c42d2e7df3 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.556238+00:00
626519fe-a70a-4be2-91a1-7ccbacf53c0a | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.488424+00:00
1815a75b-b37a-4bae-ac30-4570e5cb29bd | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.414238+00:00
eb94bc2a-f79a-435b-a658-8ecb5955eb04 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.344567+00:00
6a1b3a91-d960-4476-a753-465312836794 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.282359+00:00
4eff0a7c-65b8-4a77-b29c-e13fc904e105 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.211020+00:00
7a244139-881a-412b-959b-24bc3fba00cd | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.153842+00:00
a22b24ed-9061-403f-af18-34c47298af9a | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.074632+00:00
32b1c97c-9f4e-4728-ac2a-a2e2a2d9e609 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:42.960353+00:00
6007c7ea-cdf7-4ba6-9641-e7eb371ca67c | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:42.889343+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0055  (coder)
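
For readers unfamiliar with the marking above, a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below is one possible way to compute and render such a listing; the function names, the (label, pass_rate, cost_usd) tuple layout, and the 20-character bar width are assumptions inferred from the line above, not the report generator's actual code. Only the `coder` point comes from this report; the other point is hypothetical and is included only to show a dominated entry.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is (label, pass_rate, cost_usd). Point A dominates point B when
    A's pass rate is >= B's and A's cost is <= B's, with at least one strict.
    """
    frontier = []
    for label, rate, cost in points:
        dominated = any(
            (r >= rate and c <= cost) and (r > rate or c < cost)
            for _, r, c in points
        )
        if not dominated:
            frontier.append((label, rate, cost))
    return frontier


def render(points, width=20):
    """Format each point as a text bar, prefixing frontier points with '*'."""
    front = set(pareto_frontier(points))
    lines = []
    for point in points:
        label, rate, cost = point
        bar = "#" * round(rate * width)
        mark = "*" if point in front else " "
        lines.append(f"{mark} [{bar:<{width}}] {rate:.1%} @ ${cost:.4f}  ({label})")
    return lines


points = [
    ("coder", 1.0, 0.0055),             # from this report
    ("hypothetical-b", 0.8, 0.0100),    # made up for illustration only
]
for line in render(points):
    print(line)
```

With a single model in the run, that model is trivially on the frontier, which is why the only entry above carries the marker.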