Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: 2d2bf9aa-1600-4e52-9eba-43ef1e2bddcb | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:18:44.048554+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
2d2bf9aa-1600-4e52-9eba-43ef1e2bddcb | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:44.048554+00:00
9d1cfef7-980a-404e-87e3-60a683eb1bd1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.959692+00:00
dc03f728-35db-449e-88fb-81c42d2e7df3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.556238+00:00
626519fe-a70a-4be2-91a1-7ccbacf53c0a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.488424+00:00
1815a75b-b37a-4bae-ac30-4570e5cb29bd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:18:43.414238+00:00

Failure Breakdown

Taxonomy | Failures
wrong-logic | 205

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
2d2bf9aa-1600-4e52-9eba-43ef1e2bddcb | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:44.048554+00:00
9d1cfef7-980a-404e-87e3-60a683eb1bd1 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.959692+00:00
dc03f728-35db-449e-88fb-81c42d2e7df3 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.556238+00:00
626519fe-a70a-4be2-91a1-7ccbacf53c0a | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.488424+00:00
1815a75b-b37a-4bae-ac30-4570e5cb29bd | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.414238+00:00
eb94bc2a-f79a-435b-a658-8ecb5955eb04 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.344567+00:00
6a1b3a91-d960-4476-a753-465312836794 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.282359+00:00
4eff0a7c-65b8-4a77-b29c-e13fc904e105 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.211020+00:00
7a244139-881a-412b-959b-24bc3fba00cd | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.153842+00:00
a22b24ed-9061-403f-af18-34c47298af9a | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:18:43.074632+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0055  (coder)
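
For readers unfamiliar with the marker above, the following is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier can be computed. It is not taken from the tool that produced this report: the `RunPoint` type, the `pareto_frontier` function, and the two "hypothetical-*" entries are assumptions made purely for illustration. Only the coder point (100.0% at $0.0055) comes from the report.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One model's aggregate result (hypothetical type, for illustration only)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no cheaper-or-equal point matches or beats on pass rate.

    Exact duplicates (same cost and same pass rate) collapse to a single entry.
    """
    frontier: list[RunPoint] = []
    best_pass_rate = -1.0
    # Walk from cheapest to most expensive; at equal cost, higher pass rate first.
    for point in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        # Keep a point only if it improves on everything at least as cheap.
        if point.pass_rate > best_pass_rate:
            frontier.append(point)
            best_pass_rate = point.pass_rate
    return frontier


if __name__ == "__main__":
    points = [
        RunPoint("coder", 1.0, 0.0055),           # from the report above
        RunPoint("hypothetical-a", 0.5, 0.0100),  # invented for illustration
        RunPoint("hypothetical-b", 0.8, 0.0030),  # invented for illustration
    ]
    for point in pareto_frontier(points):
        print(f"* {point.pass_rate:.1%} @ ${point.cost_usd:.4f}  ({point.model})")
```

With a single model in the data, as in this report, that model is trivially on the frontier, which is consistent with the lone starred row.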