Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
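
The totals in the summary line can be cross-checked against the per-task rows. The sketch below assumes the reported cost is the plain sum of per-task costs, which is consistent with the figures shown but is not confirmed by the report itself. The variable names are illustrative only.

```python
from decimal import Decimal

# Per-task values as listed in the table: 15 tasks, each scored 0.740
# and each costing $0.0010. Decimal avoids float rounding noise.
task_costs = [Decimal("0.0010")] * 15
task_scores = [Decimal("0.740")] * 15

# Assumption: the summary cost is the sum of the per-task costs.
total_cost = sum(task_costs, Decimal("0"))
mean_score = sum(task_scores, Decimal("0")) / len(task_scores)

print(f"tasks: {len(task_costs)}")
print(f"total cost: ${total_cost}")  # $0.0150, matching the summary line
print(f"mean score: {mean_score}")
```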

Leaderboard Snapshot

Latest run: 1c453a44-b9b5-4b0d-90a8-1aacf878a009 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:26:58.742229+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
1c453a44-b9b5-4b0d-90a8-1aacf878a009 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.742229+00:00
cff24737-1310-43d4-9c79-ef91872c74c1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.639544+00:00
9057dc27-0005-417d-bb50-020bff22da78 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.526781+00:00
41fe9b02-3a3c-4c73-9c81-206bfc5b6624 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.416555+00:00
9156090b-0dad-4826-a08f-4c1d0e1214f9 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.339417+00:00

Failure Breakdown

Taxonomy | Failures
wrong-logic | 297

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
1c453a44-b9b5-4b0d-90a8-1aacf878a009 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.742229+00:00
cff24737-1310-43d4-9c79-ef91872c74c1 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.639544+00:00
9057dc27-0005-417d-bb50-020bff22da78 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.526781+00:00
41fe9b02-3a3c-4c73-9c81-206bfc5b6624 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.416555+00:00
9156090b-0dad-4826-a08f-4c1d0e1214f9 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.339417+00:00
735095ae-342d-4175-9153-cac6e0bf3f4d | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.242955+00:00
bc77e479-d0c9-480e-9191-fa30a0e4741e | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.115786+00:00
8a62dbb7-342e-4b8a-bba9-25af68314b96 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.028241+00:00
3c026ad9-e69a-4149-8951-7b7a18e5564e | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:57.905714+00:00
b65d6ca8-1750-47ca-9051-cb2797558112 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:57.801017+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0054  (coder)
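
For readers unfamiliar with the frontier marking, here is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed and rendered in the style above. It is not the report generator's actual code. `RunPoint`, `pareto_frontier`, `render`, and the second data point (`hypothetical-b`) are hypothetical names and values introduced for illustration. Only the `coder` point comes from this report.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Keep points that no other point dominates.

    q dominates p when q is at least as good on both axes
    (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Render one line per point, marking frontier members with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


points = [
    RunPoint("coder", 1.0, 0.0054),           # from this report
    RunPoint("hypothetical-b", 0.8, 0.0100),  # invented for illustration
]
for line in render(points):
    print(line)
```

With a single model in the report, that model is trivially on the frontier. The invented second point is both more expensive and less accurate, so it is dominated and left unmarked.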