Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
typescript-recovery-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
typescript-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
typescript-explain-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-medium-001 | medium | 0.740 | | $0.0010
shell-security-fix-medium-001 | medium | 0.740 | | $0.0010
python-bugfix-medium-001 | medium | 0.740 | | $0.0010
shell-bugfix-medium-001 | medium | 0.740 | | $0.0010
python-performance-medium-001 | medium | 0.740 | | $0.0010
shell-performance-medium-001 | medium | 0.740 | | $0.0010
python-test-writing-medium-001 | medium | 0.740 | | $0.0010
shell-test-writing-medium-001 | medium | 0.740 | | $0.0010
python-multi-file-medium-001 | medium | 0.740 | | $0.0010
shell-multi-file-medium-001 | medium | 0.740 | | $0.0010
python-refactor-medium-001 | medium | 0.740 | | $0.0010
shell-refactor-medium-001 | medium | 0.740 | | $0.0010
python-config-medium-001 | medium | 0.740 | | $0.0010
shell-config-medium-001 | medium | 0.740 | | $0.0010
python-recovery-medium-001 | medium | 0.740 | | $0.0010
shell-recovery-medium-001 | medium | 0.740 | | $0.0010
python-dependency-medium-001 | medium | 0.740 | | $0.0010
shell-dependency-medium-001 | medium | 0.740 | | $0.0010
python-explain-medium-001 | medium | 0.740 | | $0.0010
shell-explain-medium-001 | medium | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: 42bce1ae-c13b-4fab-a3b2-66046d333adb | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:27:18.067977+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
42bce1ae-c13b-4fab-a3b2-66046d333adb | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:18.067977+00:00
4574785f-7767-430e-83b2-f3b1e0b39bfc | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:17.836205+00:00
8eedd7b3-8260-404c-a0c1-a575b1a2f90b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:17.726656+00:00
af15d949-1fee-473c-8e18-a2136dbd2ca6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:17.584418+00:00
47c44d25-5344-4bb0-ba11-7d44d45f2848 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:27:17.420972+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 700 | ####################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
42bce1ae-c13b-4fab-a3b2-66046d333adb | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:18.067977+00:00
4574785f-7767-430e-83b2-f3b1e0b39bfc | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.836205+00:00
8eedd7b3-8260-404c-a0c1-a575b1a2f90b | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.726656+00:00
af15d949-1fee-473c-8e18-a2136dbd2ca6 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.584418+00:00
47c44d25-5344-4bb0-ba11-7d44d45f2848 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.420972+00:00
48b7562e-eea5-4ee8-8de0-4b81cf6833b3 | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.334352+00:00
baba7e2c-128a-411b-8967-2515efc52979 | typescript-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.227381+00:00
1da723ba-d735-4bd2-8953-636784b1014a | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.133150+00:00
f8373f60-b96b-4c10-9b8a-e83571af670e | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:17.053236+00:00
afbd58f8-ca1b-4f2d-b77c-55743afb892a | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:27:16.968150+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0055  (coder)
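
For readers unfamiliar with the marking, a run sits on the Pareto frontier when no other run has both a pass rate at least as high and a cost at least as low. The sketch below is one way such a chart could be produced. It is not the report generator's actual code: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is inferred from the line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost_usd) observation. Hypothetical type."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no cheaper-or-equal point matches or beats."""
    frontier: list[RunPoint] = []
    best_pass_rate = -1.0
    # Walk from cheapest to most expensive; at equal cost, best pass rate first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        # Keep a point only if it improves on every cheaper point seen so far.
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Format each point as a bar line, marking frontier members with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in on_frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single data point shown in this report.
    for line in render([RunPoint("coder", 1.0, 0.0055)]):
        print(line)
```

With only one model in the chart, that model is trivially on the frontier, which is why the single line above carries the marker.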