Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
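
As a quick consistency check, the task count and total cost in the summary line can be recomputed from the per-task rows. The sketch below is illustrative only: the variable names are hypothetical, and it assumes the summary cost is a plain sum of the per-task costs, which the report does not state explicitly.

```python
from decimal import Decimal

# Per-task costs as printed in the table above: 15 rows at $0.0010 each.
# Decimal avoids binary floating-point rounding noise when summing currency.
task_costs = [Decimal("0.0010")] * 15

task_count = len(task_costs)
total_cost = sum(task_costs, Decimal("0"))

# Assumption: the summary cost is the simple sum of the per-task costs.
summary = f"Tasks: {task_count} | Cost: ${total_cost:.4f}"
print(summary)
```

Under that assumption the recomputed values agree with the summary line (15 tasks, $0.0150).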

Leaderboard Snapshot

Latest run: 798e200d-14f9-47a2-99e0-6a7ebc0bef14 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:23:59.665403+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
798e200d-14f9-47a2-99e0-6a7ebc0bef14 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:23:59.665403+00:00
e586c3f1-cf80-4949-a063-5e531d0653cd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:23:59.585503+00:00
22c80a6e-c9d5-4ea2-b0f7-5245dc6b78de | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:23:59.528256+00:00
b84cd43d-74d8-4cce-a498-ffc24cdf634e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:23:59.451244+00:00
e0b73a27-2971-4a59-a887-27afc7078afd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:23:59.382273+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 908 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
798e200d-14f9-47a2-99e0-6a7ebc0bef14 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.665403+00:00
e586c3f1-cf80-4949-a063-5e531d0653cd | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.585503+00:00
22c80a6e-c9d5-4ea2-b0f7-5245dc6b78de | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.528256+00:00
b84cd43d-74d8-4cce-a498-ffc24cdf634e | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.451244+00:00
e0b73a27-2971-4a59-a887-27afc7078afd | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.382273+00:00
0b3f5366-9b01-4704-9aa4-7cbbeb908c34 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.301319+00:00
103ddc6a-b082-4efb-8696-075eca3a7967 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.235662+00:00
46962931-2b38-43e1-a2ee-580eab9fbd99 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.172635+00:00
8719cecf-47bc-4a39-b593-a7fdc8352fa7 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.113062+00:00
f2e9f7cb-744b-4861-a19b-cd5abf853486 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:23:59.052553+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0048  (coder)
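
For readers unfamiliar with the frontier marking, here is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed and rendered in the format above. It is not the report generator's actual code: `RunPoint`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar width is inferred from the line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Format each point as a bar line; frontier points get a leading '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in on_frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# "coder" uses the values from the report; "hypothetical-b" is made up
# purely to show a dominated point (lower pass rate at higher cost).
points = [
    RunPoint("coder", 1.0, 0.0048),
    RunPoint("hypothetical-b", 0.8, 0.0100),
]
for line in render(points):
    print(line)
```

With a single model in the run, that model is trivially on the frontier, which is why the only entry above carries the `*` marker.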