Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID                          | Band   | Score | Passed | Cost
python-security-fix-easy-001     | easy   | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy   | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy   | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy   | 0.740 |        | $0.0010
python-performance-easy-001      | easy   | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy   | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy   | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy   | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy   | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy   | 0.740 |        | $0.0010
python-refactor-easy-001         | easy   | 0.740 |        | $0.0010
typescript-refactor-easy-001     | easy   | 0.740 |        | $0.0010
python-config-easy-001           | easy   | 0.740 |        | $0.0010
typescript-config-easy-001       | easy   | 0.740 |        | $0.0010
python-recovery-easy-001         | easy   | 0.740 |        | $0.0010
typescript-recovery-easy-001     | easy   | 0.740 |        | $0.0010
python-dependency-easy-001       | easy   | 0.740 |        | $0.0010
typescript-dependency-easy-001   | easy   | 0.740 |        | $0.0010
python-explain-easy-001          | easy   | 0.740 |        | $0.0010
typescript-explain-easy-001      | easy   | 0.740 |        | $0.0010
python-security-fix-medium-001   | medium | 0.740 |        | $0.0010
shell-security-fix-medium-001    | medium | 0.740 |        | $0.0010
python-bugfix-medium-001         | medium | 0.740 |        | $0.0010
shell-bugfix-medium-001          | medium | 0.740 |        | $0.0010
python-performance-medium-001    | medium | 0.740 |        | $0.0010
shell-performance-medium-001     | medium | 0.740 |        | $0.0010
python-test-writing-medium-001   | medium | 0.740 |        | $0.0010
shell-test-writing-medium-001    | medium | 0.740 |        | $0.0010
python-multi-file-medium-001     | medium | 0.740 |        | $0.0010
shell-multi-file-medium-001      | medium | 0.740 |        | $0.0010
python-refactor-medium-001       | medium | 0.740 |        | $0.0010
shell-refactor-medium-001        | medium | 0.740 |        | $0.0010
python-config-medium-001         | medium | 0.740 |        | $0.0010
shell-config-medium-001          | medium | 0.740 |        | $0.0010
python-recovery-medium-001       | medium | 0.740 |        | $0.0010
shell-recovery-medium-001        | medium | 0.740 |        | $0.0010
python-dependency-medium-001     | medium | 0.740 |        | $0.0010
shell-dependency-medium-001      | medium | 0.740 |        | $0.0010
python-explain-medium-001        | medium | 0.740 |        | $0.0010
shell-explain-medium-001         | medium | 0.740 |        | $0.0010
python-security-fix-easy-001     | easy   | 0.740 |        | $0.0010
typescript-security-fix-easy-001 | easy   | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy   | 0.740 |        | $0.0010
typescript-bugfix-easy-001       | easy   | 0.740 |        | $0.0010
python-performance-easy-001      | easy   | 0.740 |        | $0.0010
typescript-performance-easy-001  | easy   | 0.740 |        | $0.0010
python-test-writing-easy-001     | easy   | 0.740 |        | $0.0010
typescript-test-writing-easy-001 | easy   | 0.740 |        | $0.0010
python-multi-file-easy-001       | easy   | 0.740 |        | $0.0010
typescript-multi-file-easy-001   | easy   | 0.740 |        | $0.0010

Leaderboard Snapshot

Latest run: 1722dad2-afdd-4810-91d1-6a856e987009 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T23:48:16.386632+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
1722dad2-afdd-4810-91d1-6a856e987009 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:16.386632+00:00
3d410629-dc02-43e1-8302-d081db3f4af4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:16.341983+00:00
6eac620e-54e1-4fbf-b33d-265211675ea4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:16.282617+00:00
fa046c0a-bef9-4557-81ea-39ee1f90cfdf | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:16.209136+00:00
5c5f02ec-afee-4594-9547-ed524707062d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:16.157373+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-logic | 1500     | [####################]

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
1722dad2-afdd-4810-91d1-6a856e987009 | typescript-multi-file-easy-001   | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.386632+00:00
3d410629-dc02-43e1-8302-d081db3f4af4 | python-multi-file-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.341983+00:00
6eac620e-54e1-4fbf-b33d-265211675ea4 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.282617+00:00
fa046c0a-bef9-4557-81ea-39ee1f90cfdf | python-test-writing-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.209136+00:00
5c5f02ec-afee-4594-9547-ed524707062d | typescript-performance-easy-001  | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.157373+00:00
91c04305-444d-4642-ab6c-a0055caa0c97 | python-performance-easy-001      | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.096663+00:00
2077de92-84d9-487b-b206-937b1fa9cd56 | typescript-bugfix-easy-001       | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:16.031000+00:00
6656d968-a899-4727-867c-8d8eee653f63 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:15.972883+00:00
21ca7fcf-774c-4167-8943-6ea7f8e736cb | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:15.895318+00:00
a581e6e2-a09f-40dc-a9a8-67b7a7f8f563 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:15.839438+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0051  (coder)
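
For readers unfamiliar with the marker, the following is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed and rendered in the format above. It is not taken from the tool that produced this report: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is an assumption inferred from the line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost_usd) sample; a hypothetical type for illustration."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point beats on both cost and pass rate."""
    frontier: list[RunPoint] = []
    best_pass_rate = -1.0
    # Walk from cheapest to most expensive; at equal cost, higher pass rate first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        # A point is kept only if it improves on every cheaper-or-equal point.
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Format each point as a text bar, prefixing frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single data point shown in this report.
    for line in render([RunPoint("coder", 1.0, 0.0051)]):
        print(line)
```

With a single model, that model is trivially on the frontier, which is why the only entry above carries the marker.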