Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
typescript-bugfix-easy-001        easy  0.740          $0.0010
python-performance-easy-001       easy  0.740          $0.0010
typescript-performance-easy-001   easy  0.740          $0.0010
python-test-writing-easy-001      easy  0.740          $0.0010
typescript-test-writing-easy-001  easy  0.740          $0.0010
python-multi-file-easy-001        easy  0.740          $0.0010
typescript-multi-file-easy-001    easy  0.740          $0.0010
python-refactor-easy-001          easy  0.740          $0.0010
typescript-refactor-easy-001      easy  0.740          $0.0010
python-config-easy-001            easy  0.740          $0.0010
typescript-config-easy-001        easy  0.740          $0.0010
python-recovery-easy-001          easy  0.740          $0.0010

Leaderboard Snapshot

Latest run: 200431d2-d112-434e-9017-34ae7a807852 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:30:50.698483+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
200431d2-d112-434e-9017-34ae7a807852  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:30:50.698483+00:00
a704664f-e391-4e5b-be78-195a5f27508d  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:30:50.635234+00:00
21a6efed-e39d-4e43-94cf-093768d8828a  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:30:50.562654+00:00
e237292a-807d-40f4-a7d1-18ed94ab26e6  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:30:50.496104+00:00
dad627b2-be64-4b0a-abf5-c5d2491343ef  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:30:50.362965+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-logic  391       [####################]

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
200431d2-d112-434e-9017-34ae7a807852  python-recovery-easy-001          wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.698483+00:00
a704664f-e391-4e5b-be78-195a5f27508d  typescript-config-easy-001        wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.635234+00:00
21a6efed-e39d-4e43-94cf-093768d8828a  python-config-easy-001            wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.562654+00:00
e237292a-807d-40f4-a7d1-18ed94ab26e6  typescript-refactor-easy-001      wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.496104+00:00
dad627b2-be64-4b0a-abf5-c5d2491343ef  python-refactor-easy-001          wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.362965+00:00
9d328720-ad85-44e2-8cfc-359ad3778668  typescript-multi-file-easy-001    wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.238341+00:00
11fc19e0-9778-4d1a-a794-708062ade6f8  python-multi-file-easy-001        wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.141267+00:00
6496b2d3-c5b3-406e-8d4c-5594de4a1d26  typescript-test-writing-easy-001  wrong-logic  0.740  $0.0010  2026-04-27T16:30:50.024577+00:00
28ddb56e-1cec-4bf0-9a3a-aa4e7fb75392  python-test-writing-easy-001      wrong-logic  0.740  $0.0010  2026-04-27T16:30:49.957293+00:00
bfc170c2-826d-42c2-b272-9b6bade5d2d5  typescript-performance-easy-001   wrong-logic  0.740  $0.0010  2026-04-27T16:30:49.869788+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0053  (coder)
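
For readers unfamiliar with the "*" marker, here is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed and rendered in this layout. It is not taken from the reporting tool. The names Point, pareto_frontier, and render are hypothetical, and the 20-character bar width is an assumption read off the line above.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """One model's aggregate result (hypothetical structure)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates.

    q dominates p when q's pass_rate is at least as high and its cost
    is at least as low, with one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: list[Point], width: int = 20) -> list[str]:
    """Format each point as a text bar, marking frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # Only the "coder" entry comes from the report above.
    for line in render([Point("coder", 1.0, 0.0053)]):
        print(line)
```

With a single model in the snapshot, that model is trivially on the frontier, which is consistent with the one starred line above.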