Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010

Leaderboard Snapshot

Latest run: d9729952-b2c9-473c-b9e6-3dd46989d17e | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:42:53.264986+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
d9729952-b2c9-473c-b9e6-3dd46989d17e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:42:53.264986+00:00
9df241f8-7077-4b0e-a929-c28f52991544 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:42:53.201593+00:00
e4920b45-652b-4d0f-84d6-aa09a5b23075 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:42:53.153489+00:00
1fb3a82a-f0f3-4d3b-abcc-76bb4f430556 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:42:53.074526+00:00
29b9952d-be17-467c-857c-c7e5cf93abcc | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:42:53.015458+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 987      | ####################

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
d9729952-b2c9-473c-b9e6-3dd46989d17e | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:53.264986+00:00
9df241f8-7077-4b0e-a929-c28f52991544 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:53.201593+00:00
e4920b45-652b-4d0f-84d6-aa09a5b23075 | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:53.153489+00:00
1fb3a82a-f0f3-4d3b-abcc-76bb4f430556 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:53.074526+00:00
29b9952d-be17-467c-857c-c7e5cf93abcc | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:53.015458+00:00
d67a8f44-681a-4704-bac3-9a8b4f2fa25f | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:52.931557+00:00
43423f5b-3b0a-4b42-ae7e-edbd1f0a43aa | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:52.869220+00:00
3b65a46d-f1f4-4c34-b621-4f6da448aeeb | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:52.785852+00:00
75392ed0-ea4f-4aae-b500-6cff2967fab5 | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:52.717203+00:00
978f47c0-b955-476f-92f0-6b95372e46dd | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:42:52.639038+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0048  (coder)