Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010

Leaderboard Snapshot

Latest run: 4cf00ad7-5ab4-442b-8d83-fb6468e159f6 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:11:20.721221+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
4cf00ad7-5ab4-442b-8d83-fb6468e159f6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:20.721221+00:00
76739be0-01b9-4e04-ba1b-73043688cc49 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:20.641555+00:00
074e8f44-9f47-40aa-b86a-9b02dbeca517 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:20.567569+00:00
668e9c0e-7227-4a68-aaf3-238caeb9ca24 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:20.490452+00:00
76a235cb-219e-4059-8c49-97f17a243eff | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:20.404881+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 846 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
4cf00ad7-5ab4-442b-8d83-fb6468e159f6 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.721221+00:00
76739be0-01b9-4e04-ba1b-73043688cc49 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.641555+00:00
074e8f44-9f47-40aa-b86a-9b02dbeca517 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.567569+00:00
668e9c0e-7227-4a68-aaf3-238caeb9ca24 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.490452+00:00
76a235cb-219e-4059-8c49-97f17a243eff | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.404881+00:00
e04c51ed-73ee-4e16-b451-d5003f31a8c1 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.330411+00:00
91885cca-e8df-4c3f-bf4d-9fcc87e9fd8a | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.247999+00:00
e89c9ecf-0f0b-4cf5-891b-f5a3a361ee13 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.178676+00:00
dbcf33bf-2577-45b1-9e78-24c0bec20292 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.110603+00:00
5ad49db0-6025-4a03-825f-22521964a021 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:20.038417+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0049  (coder)
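
For readers unfamiliar with the `*` marker, the following is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed. It is not the report generator's actual code; the `RunPoint` type and `pareto_frontier` function are hypothetical names introduced only for illustration. A point is kept when no other point has a pass rate at least as high and a cost at least as low, with one of the two strictly better.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical container for one model's aggregate result."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates, sorted by cost.

    q dominates p when q is at least as good on both axes (higher or equal
    pass rate, lower or equal cost) and strictly better on at least one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


# The only point in this report; with a single model it is trivially
# on the frontier, which matches the lone starred row above.
points = [RunPoint("coder", 1.0, 0.0049)]
frontier = pareto_frontier(points)
```

With more than one model in the snapshot, only the rows returned by a function like this would carry the `*`.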