Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010

Leaderboard Snapshot

Latest run: 1694ee6f-260b-47fd-9913-3cfed9ff85cd | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:18:55.470679+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
1694ee6f-260b-47fd-9913-3cfed9ff85cd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:18:55.470679+00:00
00f66d5d-b8aa-4213-b34c-d54c6ed4ff00 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:18:55.406525+00:00
d400cf3a-86f5-46af-a02a-bb428618e0ad | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:18:55.339252+00:00
87129a11-eb85-4b77-8a98-98c71a62120a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:18:55.253768+00:00
64a83501-3505-4164-9fc3-841697e0a9df | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:18:55.173360+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 878 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
1694ee6f-260b-47fd-9913-3cfed9ff85cd | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.470679+00:00
00f66d5d-b8aa-4213-b34c-d54c6ed4ff00 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.406525+00:00
d400cf3a-86f5-46af-a02a-bb428618e0ad | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.339252+00:00
87129a11-eb85-4b77-8a98-98c71a62120a | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.253768+00:00
64a83501-3505-4164-9fc3-841697e0a9df | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.173360+00:00
7ad332bc-d4aa-4e2e-9233-fb8586e9c45f | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.104814+00:00
070afb3e-3e10-4c49-adae-0814368b2a6b | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:55.018385+00:00
1923e007-a6cf-4bb7-b5d5-beb50818fcad | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:54.956699+00:00
e45b2171-94ae-4f84-9994-95e3420ec770 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:54.885140+00:00
629b3d70-82a5-4301-8f0a-b03e2b652d3c | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:18:54.815947+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0048  (coder)
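
For readers unfamiliar with the marker above, the following is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed. It is not the report generator's actual code: the RunPoint type, the pareto_frontier function, and the two "hypothetical-*" entries are assumptions introduced purely for illustration. Only the coder point (100.0% at $0.0048) comes from this report.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost) observation. Hypothetical type for this sketch."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point beats on both axes.

    A point is dropped when some other point costs no more and passes at
    least as often. Exact duplicates are kept once.
    """
    frontier: list[RunPoint] = []
    best_pass_rate = float("-inf")
    # Walk from cheapest to most expensive; on equal cost, try the
    # higher pass rate first so the weaker twin is skipped.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


# Example: only "coder" is taken from the report; the other rows are made up.
points = [
    RunPoint("coder", 1.0, 0.0048),
    RunPoint("hypothetical-a", 0.8, 0.0100),  # costlier and worse: dominated
    RunPoint("hypothetical-b", 0.6, 0.0020),  # cheaper but worse: stays on frontier
]
for p in pareto_frontier(points):
    print(f"* {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})")
```

With a single model in the data, as in this report, that one point is trivially on the frontier.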