Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010
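
The total cost in the header can be cross-checked against the per-task rows. The following is a minimal sketch, assuming all 15 rows carry the score 0.740 and cost $0.0010 listed in the table; the variable names are illustrative and not part of the eval harness. `Decimal` is used so the four-decimal cost figures add up without floating-point drift.

```python
from decimal import Decimal

# Per-task figures as listed in the task table (15 rows, all identical).
TASK_COUNT = 15
scores = [Decimal("0.740")] * TASK_COUNT
costs = [Decimal("0.0010")] * TASK_COUNT

# Total cost is the sum of per-task costs; mean score is the plain average.
total_cost = sum(costs)
mean_score = sum(scores) / len(scores)

print(f"Tasks: {len(costs)} | Cost: ${total_cost}")
```

With these inputs the sum comes to $0.0150, which agrees with the header line.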

Leaderboard Snapshot

Latest run: e0145d7a-f7db-4f4b-b507-ed0dffc89a47 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T17:10:33.090260+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
e0145d7a-f7db-4f4b-b507-ed0dffc89a47 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:33.090260+00:00
7e0d9173-9648-415d-aace-a78906c8d555 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:33.020482+00:00
d18d6251-b23c-4eed-a504-bd5948c15ffd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:32.945594+00:00
b5bcb019-d219-4e96-845c-bf88e0e6f5ad | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:32.873338+00:00
b8ce179a-5b5d-4eb5-aab6-a92f4b0a0dcd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:32.809201+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 596      | ####################

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
e0145d7a-f7db-4f4b-b507-ed0dffc89a47 | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:33.090260+00:00
7e0d9173-9648-415d-aace-a78906c8d555 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:33.020482+00:00
d18d6251-b23c-4eed-a504-bd5948c15ffd | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.945594+00:00
b5bcb019-d219-4e96-845c-bf88e0e6f5ad | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.873338+00:00
b8ce179a-5b5d-4eb5-aab6-a92f4b0a0dcd | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.809201+00:00
b7d66a18-1136-4215-ad17-ebec3bd4ae89 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.735834+00:00
68baee16-86e7-4afb-bc28-2695de3c4d36 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.645833+00:00
e8b7ad8c-08fe-4eb3-b20b-f4f516ca21f1 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.564941+00:00
90323d2e-e65f-4ff0-af47-649db2f8f293 | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.484983+00:00
41b48f67-4643-438d-9ee6-68f8520f12ff | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:32.394283+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0051  (coder)
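
How the frontier marker might be derived can be sketched as follows. This is an illustration under stated assumptions, not the report generator's actual code: it assumes a point is on the Pareto frontier when no other point reaches an equal or higher pass rate at an equal or lower cost, and that the bar is 20 characters wide as in the line above. `Point`, `pareto_frontier`, and `render` are hypothetical names.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that are not beaten on both pass rate and cost."""
    frontier: List[Point] = []
    # Walk from cheapest to most expensive; at equal cost, best pass rate first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        # A costlier point only qualifies if it improves the pass rate.
        if not frontier or p.pass_rate > frontier[-1].pass_rate:
            frontier.append(p)
    return frontier


def render(points: List[Point], width: int = 20) -> List[str]:
    """Format one line per point, marking frontier members with '*'."""
    front = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in front else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# The single point from this report; with one model it is trivially on the frontier.
for line in render([Point("coder", 1.0, 0.0051)]):
    print(line)
```

With only one model in the run, the frontier contains that model by definition; the marker becomes informative once several models with different costs are compared.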