Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010

Leaderboard Snapshot

Latest run: 55502cad-a93f-409f-9916-d2194648f6b4 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:51:58.409225+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
55502cad-a93f-409f-9916-d2194648f6b4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:51:58.409225+00:00
113a67a4-1928-449c-a5bd-583cf13c486c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:51:58.351601+00:00
ffd0cc0d-37fa-4b84-ba83-a14bc7112b44 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:51:58.275776+00:00
4f1da62f-7f6b-488f-9cbb-8de0c1fb0062 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:51:58.201661+00:00
8fe76932-1f57-42f2-af1a-093034ea7747 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:51:58.130385+00:00

Failure Breakdown

Taxonomy | Failures
wrong-logic | 470

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
55502cad-a93f-409f-9916-d2194648f6b4 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.409225+00:00
113a67a4-1928-449c-a5bd-583cf13c486c | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.351601+00:00
ffd0cc0d-37fa-4b84-ba83-a14bc7112b44 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.275776+00:00
4f1da62f-7f6b-488f-9cbb-8de0c1fb0062 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.201661+00:00
8fe76932-1f57-42f2-af1a-093034ea7747 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.130385+00:00
ca7848bc-a0c6-4043-8183-997bf9ed79ab | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:58.067317+00:00
aaa6e10b-7f60-4bf6-8c02-31529e5a6fe3 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:57.999680+00:00
42d81c55-dc42-46e4-b847-6a1f108221f3 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:57.921194+00:00
a6734676-cc76-4022-b060-78600e4b905a | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:57.852459+00:00
9b6a78d8-96fc-4b00-920a-170a74674395 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:51:57.787163+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0052  (coder)
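
For readers unfamiliar with the marker, the sketch below shows one common way a pass_rate vs cost_usd Pareto frontier could be computed and rendered. It is an illustration only, not the report generator's actual code: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is an assumption read off the single line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost) observation; a hypothetical structure."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float   # total cost of the run in USD


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates, sorted by cost.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Format each point as a text bar, marking frontier points with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in on_frontier else " "
        bar = "#" * round(p.pass_rate * width)  # assumed bar scaling
        lines.append(
            f"{marker} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single point reported in this section.
    for line in render([RunPoint("coder", 1.0, 0.0052)]):
        print(line)
```

With only one model in the report, that point is trivially on the frontier; the marker becomes informative once several models with different cost and pass-rate trade-offs are compared.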