Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010
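
The per-task rows can be cross-checked against the summary line at the top of the report. The following is a minimal sketch, assuming the summary cost is the plain sum of per-task costs and the reported score is the unweighted mean of per-task scores; the report does not state how either figure is derived, and the function name `summarize` is hypothetical.

```python
from statistics import fmean

# Per-task values copied from the task table: 15 rows, each 0.740 / $0.0010.
scores = [0.740] * 15
costs_usd = [0.0010] * 15


def summarize(scores, costs_usd):
    """Aggregate per-task rows (assumed: mean score, summed cost)."""
    return {
        "tasks": len(scores),
        "mean_score": round(fmean(scores), 3),
        "total_cost_usd": round(sum(costs_usd), 4),
    }


summary = summarize(scores, costs_usd)
print(f"Tasks: {summary['tasks']} | Cost: ${summary['total_cost_usd']:.4f}")
```

Under these assumptions the totals agree with the summary line (15 tasks, $0.0150) and with the 0.740 score shown in the leaderboard snapshot.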

Leaderboard Snapshot

Latest run: 0bfaa44c-76b0-4f60-80af-4d39533c78ee | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:27:01.495394+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
0bfaa44c-76b0-4f60-80af-4d39533c78ee | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:01.495394+00:00
61d71e9f-603f-4a3c-bd0f-c89103e948e1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:01.370729+00:00
16a74901-1765-4ff5-b779-d70d863af41f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:01.294506+00:00
ca9c9bcd-0213-40ab-8b90-6d7c8fe260c6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:01.194015+00:00
ec9fc8b9-78db-4a93-bdf9-22c5b9fd6dbd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:01.133723+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 314 | ####################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
0bfaa44c-76b0-4f60-80af-4d39533c78ee | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.495394+00:00
61d71e9f-603f-4a3c-bd0f-c89103e948e1 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.370729+00:00
16a74901-1765-4ff5-b779-d70d863af41f | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.294506+00:00
ca9c9bcd-0213-40ab-8b90-6d7c8fe260c6 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.194015+00:00
ec9fc8b9-78db-4a93-bdf9-22c5b9fd6dbd | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.133723+00:00
d2bee63d-2a2a-4e80-b149-ce3273b4fb27 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:01.046022+00:00
9fbf0040-dfc2-466a-956d-08f4388decb5 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:00.955678+00:00
c5f22aa7-2ebf-4831-bb21-c115a7ca4195 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:00.881808+00:00
b5471786-1cd7-4f7d-acff-fff9ee396bec | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:00.767129+00:00
7db1a379-7720-4357-8b7f-068f984d502b | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:00.688678+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0054  (coder)
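
For readers unfamiliar with the marker, a Pareto frontier over (pass_rate, cost_usd) keeps every run that no other run beats on both axes at once. The following is a minimal sketch of that selection and of the bar rendering used above. The names `RunPoint`, `pareto_frontier`, and `render` are hypothetical, and the 20-character bar width is an assumption inferred from the line above, not taken from the tool that produced this report.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points not dominated by any other point.

    q dominates p when q is at least as good on both axes
    (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points, width=20):
    """Format one line per point, prefixing frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        marker = "*" if p in frontier else " "
        lines.append(
            f"{marker} [{bar}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# The single point shown in this report.
for line in render([RunPoint("coder", 1.0, 0.0054)]):
    print(line)
```

With only one model in the snapshot, that model is trivially on the frontier; the marker becomes informative once several models with different cost and pass-rate trade-offs are recorded.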