Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010

Leaderboard Snapshot

Latest run: 90407aba-6b01-4b4d-bf94-7bb58e635059 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:09:17.946900+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
90407aba-6b01-4b4d-bf94-7bb58e635059 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:17.946900+00:00
78b42711-c186-4652-91b9-03a1e28efee2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:17.860692+00:00
4bae9d5d-8b91-46f6-8823-ac975f12ab8f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:17.766524+00:00
4f6f989d-815d-4b65-aed5-b898d1bf8959 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:17.667193+00:00
cf1e9820-b567-4878-ad87-5b70ca0c2c93 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:17.566928+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 799      | [####################]

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
90407aba-6b01-4b4d-bf94-7bb58e635059 | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.946900+00:00
78b42711-c186-4652-91b9-03a1e28efee2 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.860692+00:00
4bae9d5d-8b91-46f6-8823-ac975f12ab8f | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.766524+00:00
4f6f989d-815d-4b65-aed5-b898d1bf8959 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.667193+00:00
cf1e9820-b567-4878-ad87-5b70ca0c2c93 | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.566928+00:00
9acaced2-eb39-4f51-b9d9-f810bfaf80e7 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.488836+00:00
505a8fec-4468-48c9-8cc2-c49e886df414 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.412933+00:00
bcd34ad9-d3e2-450b-9e08-e3b3cafd4b96 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.315459+00:00
38605ee4-39a0-4fdf-acef-6788480f9dc8 | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.224946+00:00
59d43363-8efe-4f16-8099-09523bff17ff | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:17.131510+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0049  (coder)
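
For readers unfamiliar with the marker, the sketch below shows one common way a pass_rate vs cost_usd Pareto frontier can be computed. It is illustrative only and is not taken from the tool that produced this report: the RunPoint type, the pareto_frontier function, and the two extra models in the usage example are hypothetical. Only the "coder" point (100.0% at $0.0049) comes from the report.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical record for one model's aggregate result."""
    model: str
    pass_rate: float  # fraction in [0, 1]; higher is better
    cost_usd: float   # total cost in USD; lower is better


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return points that no cheaper-or-equal point matches or beats on pass rate.

    Exact duplicates of a frontier point are dropped, so each
    (cost, pass_rate) pair appears at most once in the result.
    """
    # Walk from cheapest to most expensive; on equal cost, visit the
    # higher pass rate first so the weaker point is filtered out.
    ordered = sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate))
    frontier: List[RunPoint] = []
    best_pass_rate = float("-inf")
    for point in ordered:
        # Keep a point only if it strictly improves on every cheaper point.
        if point.pass_rate > best_pass_rate:
            frontier.append(point)
            best_pass_rate = point.pass_rate
    return frontier


if __name__ == "__main__":
    runs = [
        RunPoint("coder", 1.0, 0.0049),        # from the report
        RunPoint("model-b", 0.8, 0.0100),      # hypothetical: dominated by coder
        RunPoint("model-c", 0.5, 0.0010),      # hypothetical: cheaper, so kept
    ]
    for point in pareto_frontier(runs):
        print(f"* {point.pass_rate:.1%} @ ${point.cost_usd:.4f}  ({point.model})")
```

With a single recorded model, as in this report, that model is trivially on the frontier.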