Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010

Leaderboard Snapshot

Latest run: b5291672-7a75-48fc-8d98-4c8a958553b6 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:26:59.641087+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
b5291672-7a75-48fc-8d98-4c8a958553b6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:59.641087+00:00
c377b3f8-7981-4e76-8ea5-2c6ba4b51d1f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:59.580940+00:00
1c453a44-b9b5-4b0d-90a8-1aacf878a009 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.742229+00:00
cff24737-1310-43d4-9c79-ef91872c74c1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.639544+00:00
9057dc27-0005-417d-bb50-020bff22da78 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:26:58.526781+00:00

Failure Breakdown

Taxonomy    | Failures
wrong-logic | 299

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
b5291672-7a75-48fc-8d98-4c8a958553b6 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:59.641087+00:00
c377b3f8-7981-4e76-8ea5-2c6ba4b51d1f | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:59.580940+00:00
1c453a44-b9b5-4b0d-90a8-1aacf878a009 | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.742229+00:00
cff24737-1310-43d4-9c79-ef91872c74c1 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.639544+00:00
9057dc27-0005-417d-bb50-020bff22da78 | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.526781+00:00
41fe9b02-3a3c-4c73-9c81-206bfc5b6624 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.416555+00:00
9156090b-0dad-4826-a08f-4c1d0e1214f9 | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.339417+00:00
735095ae-342d-4175-9153-cac6e0bf3f4d | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.242955+00:00
bc77e479-d0c9-480e-9191-fa30a0e4741e | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.115786+00:00
8a62dbb7-342e-4b8a-bba9-25af68314b96 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:26:58.028241+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0054  (coder)
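
How the frontier marker could be derived: a minimal sketch, assuming a run is on the Pareto frontier when no other run has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The names RunPoint, pareto_frontier, and render_frontier are hypothetical and not taken from the tool that produced this report; the 20-character bar width is inferred from the line above, and storing pass_rate as a fraction is an assumption.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost_usd) point; a hypothetical record shape."""
    model: str
    pass_rate: float  # fraction in [0, 1] (assumption)
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return the points that no other point dominates.

    q dominates p when q is at least as good on both axes
    (higher-or-equal pass rate, lower-or-equal cost) and strictly
    better on at least one of them.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_frontier(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a text bar, marking frontier points with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in on_frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{marker} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single point shown in this report.
    for line in render_frontier([RunPoint("coder", 1.0, 0.0054)]):
        print(line)
```

With only one model in the report, that model is trivially on the frontier; the dominance check only becomes informative once several models or configurations are compared.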