Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy | 0.740 |        | $0.0010
python-performance-easy-001      | easy | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy | 0.740 |        | $0.0010
python-refactor-easy-001         | easy | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy | 0.740 |        | $0.0010
python-config-easy-001           | easy | 0.740 |        | $0.0010
typescript-config-easy-001       | easy | 0.740 |        | $0.0010
python-recovery-easy-001         | easy | 0.740 |        | $0.0010

Leaderboard Snapshot

Latest run: 87a8fb8a-5bb5-43ab-8cdf-b7c474b6e33b | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:27:10.832083+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
87a8fb8a-5bb5-43ab-8cdf-b7c474b6e33b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:10.832083+00:00
5b597ec2-8fdd-42c0-ae58-accaf9bd69f0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:10.752012+00:00
66c88760-1727-4236-8b8f-cbc2ea1e10a0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:10.672320+00:00
e455a13e-b41e-4d8f-843b-7a1eb262fd52 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:10.584493+00:00
b1ef165b-cc41-49b5-b728-814a63dad93a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:10.483850+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 329      | [####################] 100.0% of failures

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
87a8fb8a-5bb5-43ab-8cdf-b7c474b6e33b | python-recovery-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.832083+00:00
5b597ec2-8fdd-42c0-ae58-accaf9bd69f0 | typescript-config-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.752012+00:00
66c88760-1727-4236-8b8f-cbc2ea1e10a0 | python-config-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.672320+00:00
e455a13e-b41e-4d8f-843b-7a1eb262fd52 | typescript-refactor-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.584493+00:00
b1ef165b-cc41-49b5-b728-814a63dad93a | python-refactor-easy-001         | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.483850+00:00
b109295b-e544-4f7d-8b0b-f599dfda1230 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.402702+00:00
603e694a-2430-4c1c-8c93-30fb447e7d37 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.313367+00:00
bb248260-2060-4fd0-8bc1-158d9c81ad9e | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.122575+00:00
0fc22c89-be06-41d8-9f5e-b4d1d37b9a2a | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:10.007009+00:00
91fce6ba-b1f3-4d5b-80fc-9281be14021e | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:09.929528+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0054  (coder)
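
How the frontier marker is derived is not stated in this report. The sketch below shows one conventional way to do it, assuming a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low, with one of the two strictly better. The names RunPoint, pareto_frontier, and render are hypothetical and are not taken from the tool that produced this report.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost) point; a hypothetical record type."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return the points that no other point dominates, cheapest first.

    q dominates p when q.pass_rate >= p.pass_rate and
    q.cost_usd <= p.cost_usd, with at least one inequality strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a bar line, prefixing frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single point shown in this report.
    for line in render([RunPoint("coder", 1.0, 0.0054)]):
        print(line)
```

With a single model, that model is on the frontier by definition, so the marker only becomes informative once several models or configurations are compared.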