Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010
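
The summary cost can be cross-checked against the per-task rows. The following is a minimal sketch, not the harness's own aggregation code: the variable names are hypothetical, the task IDs, scores, and costs are copied from the rows above, and `Decimal` is used only to avoid floating-point drift when summing small dollar amounts.

```python
from decimal import Decimal

# Task categories as they appear in the per-task rows of this report.
CATEGORIES = [
    "security-fix", "bugfix", "performance", "test-writing",
    "multi-file", "refactor", "config", "recovery",
]

# Every category has a python and a typescript task, except "recovery",
# which appears only for python in this report.
task_ids = [
    f"{lang}-{cat}-easy-001"
    for cat in CATEGORIES
    for lang in ("python", "typescript")
    if not (lang == "typescript" and cat == "recovery")
]

# Each row in the report carries the same score and cost.
rows = [(task_id, Decimal("0.740"), Decimal("0.0010")) for task_id in task_ids]

total_cost = sum(cost for _, _, cost in rows)
mean_score = sum(score for _, score, _ in rows) / len(rows)

print(f"Tasks: {len(rows)} | Mean score: {mean_score} | Cost: ${total_cost}")
```

Under these assumptions the row count and summed cost agree with the "Tasks: 15" and "Cost: $0.0150" figures in the summary line. The pass rate cannot be recomputed this way because the Passed column carries no values in the rows.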

Leaderboard Snapshot

Latest run: 137b0f75-a5b2-44b7-96d3-4b05b67f97b1 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:14:33.764114+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
137b0f75-a5b2-44b7-96d3-4b05b67f97b1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:33.764114+00:00
02556f12-9ea8-4ca3-b7f7-e8a0dd9f60c5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:33.691886+00:00
772c7552-31fb-4d20-954a-191242dd935e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:33.651072+00:00
0796e6ba-e684-44d1-95bd-4a2798583d22 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:33.600742+00:00
f433dbc8-5c8d-4ad3-a8e1-e2108d18edc5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:33.549997+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 173      |
#############################################################################################################################################################################

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
137b0f75-a5b2-44b7-96d3-4b05b67f97b1 | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.764114+00:00
02556f12-9ea8-4ca3-b7f7-e8a0dd9f60c5 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.691886+00:00
772c7552-31fb-4d20-954a-191242dd935e | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.651072+00:00
0796e6ba-e684-44d1-95bd-4a2798583d22 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.600742+00:00
f433dbc8-5c8d-4ad3-a8e1-e2108d18edc5 | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.549997+00:00
b60f6ba7-73b4-4905-9667-5783f19bab23 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.503024+00:00
f341fc20-fceb-4831-991c-308a6e4acfa6 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.447917+00:00
7b5012f3-a07b-4b89-a9cf-86b72d637e28 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.378025+00:00
989879df-f3e2-4bfe-99d9-af50d6147b61 | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.330965+00:00
1b8770c1-176f-4307-958b-51d99bc31e9a | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:33.289034+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0055  (coder)
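
For readers unfamiliar with the marker, a Pareto frontier over pass_rate and cost_usd keeps every entry that no other entry beats on both axes at once. The sketch below is one way to compute and render it, not the report generator's actual code: `Point`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar width is inferred from the line above.

```python
from __future__ import annotations

from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: list[Point], width: int = 20) -> list[str]:
    """Format one line per point, marking frontier members with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# The single entry from this report; with one point, it is trivially on the frontier.
for line in render([Point("coder", 1.0, 0.0055)]):
    print(line)
```

With only one model recorded, the frontier contains that model by definition, which is why `coder` carries the `*` marker here.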