Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: 78157d6a-993d-4304-9177-ec6bba2db79d | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:11:14.795210+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
78157d6a-993d-4304-9177-ec6bba2db79d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:14.795210+00:00
4f06c1c9-dd51-4fa6-a22b-a92caa381a13 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:14.712107+00:00
bf89b3d1-d1b6-4e25-9eb9-edd609213e4a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:14.635280+00:00
442b1f7b-ccc5-41f4-accb-2e41de625f47 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:14.549953+00:00
ad68fe51-eed1-48a7-9359-4b26148745a6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:11:14.482294+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 831 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
78157d6a-993d-4304-9177-ec6bba2db79d | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.795210+00:00
4f06c1c9-dd51-4fa6-a22b-a92caa381a13 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.712107+00:00
bf89b3d1-d1b6-4e25-9eb9-edd609213e4a | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.635280+00:00
442b1f7b-ccc5-41f4-accb-2e41de625f47 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.549953+00:00
ad68fe51-eed1-48a7-9359-4b26148745a6 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.482294+00:00
d543d7f9-133f-4250-9619-c5f8ca522233 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.407451+00:00
ed627a87-0e23-4419-a834-0aa8ffd9046d | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.321848+00:00
f1a6c8dc-5d56-42ac-a9d6-a56822b12f80 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.241930+00:00
661aee82-6c51-4d7d-b62a-859c06324f60 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.156629+00:00
ac84a97d-6198-4a26-9f35-f925dc1004c6 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:11:14.089522+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0049  (coder)
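
The frontier marking above could be reproduced roughly as follows. This is a minimal sketch, not the report generator's actual code: `RunPoint`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar width is an assumption read off the line above. A point is treated as on the frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical record for one model on the cost frontier."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return the points that no other point dominates."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a text bar, prefixing frontier points with '*'."""
    marked = set(pareto_frontier(points))
    lines = []
    for p in points:
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        mark = "*" if p in marked else " "
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines
```

With a single point such as `RunPoint("coder", 1.0, 0.0049)`, the only entry is trivially on the frontier, which matches the one starred line in this report.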