Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010
typescript-recovery-easy-001 | easy | 0.740 |  | $0.0010
python-dependency-easy-001 | easy | 0.740 |  | $0.0010
typescript-dependency-easy-001 | easy | 0.740 |  | $0.0010
python-explain-easy-001 | easy | 0.740 |  | $0.0010
typescript-explain-easy-001 | easy | 0.740 |  | $0.0010
python-security-fix-medium-001 | medium | 0.740 |  | $0.0010
shell-security-fix-medium-001 | medium | 0.740 |  | $0.0010
python-bugfix-medium-001 | medium | 0.740 |  | $0.0010
shell-bugfix-medium-001 | medium | 0.740 |  | $0.0010
python-performance-medium-001 | medium | 0.740 |  | $0.0010
shell-performance-medium-001 | medium | 0.740 |  | $0.0010
python-test-writing-medium-001 | medium | 0.740 |  | $0.0010
shell-test-writing-medium-001 | medium | 0.740 |  | $0.0010
python-multi-file-medium-001 | medium | 0.740 |  | $0.0010
shell-multi-file-medium-001 | medium | 0.740 |  | $0.0010
python-refactor-medium-001 | medium | 0.740 |  | $0.0010
shell-refactor-medium-001 | medium | 0.740 |  | $0.0010
python-config-medium-001 | medium | 0.740 |  | $0.0010
shell-config-medium-001 | medium | 0.740 |  | $0.0010
python-recovery-medium-001 | medium | 0.740 |  | $0.0010
shell-recovery-medium-001 | medium | 0.740 |  | $0.0010
python-dependency-medium-001 | medium | 0.740 |  | $0.0010
shell-dependency-medium-001 | medium | 0.740 |  | $0.0010
python-explain-medium-001 | medium | 0.740 |  | $0.0010
shell-explain-medium-001 | medium | 0.740 |  | $0.0010
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
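
The header totals (Tasks, Cost) can be cross-checked against the per-task rows. The following is a minimal sketch of that check, not the eval tool's own code: `TaskRow` and `summarize` are hypothetical names, the task IDs are placeholders, and the pass rate is left out because this report does not state the pass threshold.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TaskRow:
    # Hypothetical row shape mirroring the table columns above.
    task_id: str
    band: str
    score: Decimal
    cost_usd: Decimal


def summarize(rows):
    """Return (task_count, total_cost_usd, mean_score) for a list of TaskRow."""
    if not rows:
        raise ValueError("no rows to summarize")
    # Decimal avoids binary floating-point drift when summing small dollar amounts.
    total_cost = sum((r.cost_usd for r in rows), Decimal("0"))
    mean_score = sum((r.score for r in rows), Decimal("0")) / len(rows)
    return len(rows), total_cost, mean_score


# Placeholder rows: 50 tasks, each scored 0.740 at $0.0010, as in the table.
rows = [
    TaskRow(f"placeholder-{i:03d}", "easy", Decimal("0.740"), Decimal("0.0010"))
    for i in range(50)
]
count, total_cost, mean_score = summarize(rows)
print(count, f"${total_cost:.4f}", f"{mean_score:.3f}")  # 50 $0.0500 0.740
```

With 50 rows at $0.0010 each, the total matches the $0.0500 shown in the summary line.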

Leaderboard Snapshot

Latest run: c9856959-fa08-4544-b9e4-b2e9650048b2 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:12:36.587672+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
c9856959-fa08-4544-b9e4-b2e9650048b2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:36.587672+00:00
967bafb7-1a0e-4876-905a-e945b55b96f8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:36.509948+00:00
cdfe0939-a2ca-42eb-951c-1c48da4ab1a6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:36.477797+00:00
b8adbade-f56d-41f6-acc5-7f4bfaf32b8c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:36.440007+00:00
2fed615b-7e40-4caf-862c-1f1ed66ee608 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:36.402666+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 250 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
c9856959-fa08-4544-b9e4-b2e9650048b2 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.587672+00:00
967bafb7-1a0e-4876-905a-e945b55b96f8 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.509948+00:00
cdfe0939-a2ca-42eb-951c-1c48da4ab1a6 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.477797+00:00
b8adbade-f56d-41f6-acc5-7f4bfaf32b8c | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.440007+00:00
2fed615b-7e40-4caf-862c-1f1ed66ee608 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.402666+00:00
067b2c7f-0ab7-4c13-9f79-0b27f64c0275 | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.358023+00:00
b052cbe6-2dce-4961-8db2-efba3892cfa9 | typescript-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.286569+00:00
f6e3e3ec-536b-4c94-aaef-b615c33a7a49 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.248434+00:00
5b8d891d-559e-4624-841b-4dc687c5d591 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.208617+00:00
621bcdeb-d33e-4247-ac69-a833a6de8fca | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:36.166612+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0057  (coder)
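
A point lies on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low, with at least one of the two strictly better. The sketch below illustrates that selection rule under stated assumptions: `pareto_frontier` is a hypothetical helper, not the report generator's code, and the two `hypothetical-*` candidates are invented so the example has something to compare against.

```python
def pareto_frontier(points):
    """Return the points not dominated by any other point, cheapest first.

    Each point is (model, pass_rate, cost_usd). Higher pass_rate is better,
    lower cost_usd is better.
    """
    frontier = []
    for name, rate, cost in points:
        # A point is dominated if another is no worse on both axes
        # and strictly better on at least one.
        dominated = any(
            o_rate >= rate and o_cost <= cost and (o_rate > rate or o_cost < cost)
            for _, o_rate, o_cost in points
        )
        if not dominated:
            frontier.append((name, rate, cost))
    return sorted(frontier, key=lambda p: p[2])


candidates = [
    ("coder", 1.00, 0.0057),           # from the report above
    ("hypothetical-a", 0.90, 0.0080),  # invented: worse and pricier than coder
    ("hypothetical-b", 0.80, 0.0030),  # invented: cheaper but lower pass rate
]
for name, rate, cost in pareto_frontier(candidates):
    print(f"* {rate:.1%} @ ${cost:.4f}  ({name})")
```

Here `hypothetical-a` is dropped because `coder` beats it on both axes, while `hypothetical-b` stays on the frontier because nothing else is cheaper.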