Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
typescript-recovery-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
typescript-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
typescript-explain-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-medium-001 | medium | 0.740 | | $0.0010
shell-security-fix-medium-001 | medium | 0.740 | | $0.0010
python-bugfix-medium-001 | medium | 0.740 | | $0.0010
shell-bugfix-medium-001 | medium | 0.740 | | $0.0010
python-performance-medium-001 | medium | 0.740 | | $0.0010
shell-performance-medium-001 | medium | 0.740 | | $0.0010
python-test-writing-medium-001 | medium | 0.740 | | $0.0010
shell-test-writing-medium-001 | medium | 0.740 | | $0.0010
python-multi-file-medium-001 | medium | 0.740 | | $0.0010
shell-multi-file-medium-001 | medium | 0.740 | | $0.0010
python-refactor-medium-001 | medium | 0.740 | | $0.0010
shell-refactor-medium-001 | medium | 0.740 | | $0.0010
python-config-medium-001 | medium | 0.740 | | $0.0010
shell-config-medium-001 | medium | 0.740 | | $0.0010
python-recovery-medium-001 | medium | 0.740 | | $0.0010
shell-recovery-medium-001 | medium | 0.740 | | $0.0010
python-dependency-medium-001 | medium | 0.740 | | $0.0010
shell-dependency-medium-001 | medium | 0.740 | | $0.0010
python-explain-medium-001 | medium | 0.740 | | $0.0010
shell-explain-medium-001 | medium | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
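
The header total of $0.0500 is consistent with 50 task rows at $0.0010 each. The following is a minimal sketch of that cross-check, assuming the header cost is a plain sum of the per-task Cost column (the report does not state how it is aggregated). The helper name `summarize_costs` is hypothetical and not part of any tool referenced here.

```python
from decimal import Decimal
from typing import List, Tuple


def summarize_costs(per_task_costs: List[str]) -> Tuple[int, str]:
    """Return (task count, total cost formatted like the report header).

    Assumes each entry is a string such as "$0.0010", as in the Cost column.
    Decimal is used so that summing many small amounts stays exact.
    """
    total = sum(
        (Decimal(cost.lstrip("$")) for cost in per_task_costs),
        Decimal("0"),
    )
    return len(per_task_costs), f"${total:.4f}"


# 50 rows at $0.0010 each, as listed in the task table.
print(summarize_costs(["$0.0010"] * 50))  # (50, '$0.0500')
```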

Leaderboard Snapshot

Latest run: 64cca1c7-5f43-4641-8fe3-f99307b08a42 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:12:39.670399+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
64cca1c7-5f43-4641-8fe3-f99307b08a42 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:39.670399+00:00
f7b64aad-bc49-41a9-88b6-304643b2734b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:39.632293+00:00
989a8276-0e72-4734-b9ce-93a568bbf8a5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:39.587045+00:00
a4a42e87-bf9c-44bc-a979-ded55a223abf | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:39.546216+00:00
d1ca5486-7313-443d-a562-076cef545823 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:39.515874+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 300 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
64cca1c7-5f43-4641-8fe3-f99307b08a42 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.670399+00:00
f7b64aad-bc49-41a9-88b6-304643b2734b | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.632293+00:00
989a8276-0e72-4734-b9ce-93a568bbf8a5 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.587045+00:00
a4a42e87-bf9c-44bc-a979-ded55a223abf | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.546216+00:00
d1ca5486-7313-443d-a562-076cef545823 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.515874+00:00
c6f15644-8fa8-4c4d-a284-3572f12f372f | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.481685+00:00
ac09cb57-03b0-4e66-819c-6d2a75c22db9 | typescript-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.447263+00:00
33eb0cca-9b57-4891-8cb0-48e20be23c82 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.411512+00:00
0bc14d31-8fa9-4f63-82e4-08715fc70851 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.369674+00:00
55eaf4c6-99b3-4dd4-b90e-d7dbfc26c76e | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:39.335777+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0057  (coder)
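
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and renders lines in the same layout as the one above. It is an assumption about how such a frontier could be computed, not the report generator's actual code; `RunPoint`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar width is inferred from the line above.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Keep points that no other point dominates.

    q dominates p when q is at least as good on both axes
    (higher pass_rate, lower cost_usd) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a bar line, marking frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# The single point shown in this report.
for line in render([RunPoint("coder", 1.0, 0.0057)]):
    print(line)
```

With only one model in the snapshot, that point is trivially on the frontier; the marker becomes informative once several models or configurations are compared.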