Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010
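
The header totals can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the header cost is a plain sum of the per-task costs; the variable names are illustrative and not part of the eval tooling. Decimal is used so that the four-decimal dollar amounts add up exactly.

```python
from decimal import Decimal

# Values copied from the task table: 15 tasks, each scored 0.740 at $0.0010.
task_scores = [Decimal("0.740")] * 15
task_costs = [Decimal("0.0010")] * 15

# Assumption: the report's "Cost" field is the sum of the per-task costs.
total_cost = sum(task_costs)
mean_score = sum(task_scores) / len(task_scores)

print(f"Tasks: {len(task_costs)} | Mean score: {mean_score} | Cost: ${total_cost}")
```

Under that assumption the summed cost agrees with the $0.0150 shown in the header line.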

Leaderboard Snapshot

Latest run: 60651b4d-041e-4e14-9b31-caa10f93c637 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:57:51.548162+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
60651b4d-041e-4e14-9b31-caa10f93c637 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:51.548162+00:00
ca8fd887-13ce-4650-adfd-80f2420ad1af | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:51.495917+00:00
b2cec896-34e6-4a24-a9ef-af91d9b55db4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:51.436363+00:00
9fb74dcf-ce81-4b5f-bca1-ee970cab49d6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:51.387453+00:00
91472bad-e42f-4f1a-88aa-484d73e8a159 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:51.335695+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 502      | ####################

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
60651b4d-041e-4e14-9b31-caa10f93c637 | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.548162+00:00
ca8fd887-13ce-4650-adfd-80f2420ad1af | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.495917+00:00
b2cec896-34e6-4a24-a9ef-af91d9b55db4 | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.436363+00:00
9fb74dcf-ce81-4b5f-bca1-ee970cab49d6 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.387453+00:00
91472bad-e42f-4f1a-88aa-484d73e8a159 | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.335695+00:00
513d8dd4-0e04-4f1b-a3aa-4b3c8d6f68d6 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.275569+00:00
34648b74-99d5-473d-8e44-b88694693f87 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.228277+00:00
50318cd7-9cd1-4b2f-9f0c-cb9803f04e1f | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.158995+00:00
fb0b11ef-96ff-4d97-bbe2-a9ed6b4fe913 | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.087693+00:00
a8a8a7ee-04de-49ff-b71f-d8f0c38ec31a | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:51.024734+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0052  (coder)
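
For readers unfamiliar with the marking, a Pareto frontier over (pass_rate, cost_usd) keeps every point that no other point beats on both axes at once. The following is a minimal sketch of that selection, not the report generator's actual code; `Point`, `pareto_frontier`, and the 20-character bar width are assumptions made for illustration. The single data point is the one shown above.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # percent, higher is better
    cost_usd: float   # dollars, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.pass_rate >= b.pass_rate and a.cost_usd <= b.cost_usd
    strictly_better = a.pass_rate > b.pass_rate or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# The only point in this report's Cost Frontier section.
points = [Point("coder", 100.0, 0.0052)]
frontier = pareto_frontier(points)

for p in points:
    marker = "*" if p in frontier else " "
    bar = "#" * round(p.pass_rate / 100 * 20)  # assumed 20-character bar
    print(f"{marker} [{bar:<20}] {p.pass_rate:.1f}% @ ${p.cost_usd:.4f}  ({p.model})")
```

With only one model in the run, that model is trivially on the frontier; the marking becomes informative once several models with different cost and pass-rate trade-offs are compared.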