Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
typescript-recovery-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
typescript-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
typescript-explain-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-medium-001 | medium | 0.740 | | $0.0010
shell-security-fix-medium-001 | medium | 0.740 | | $0.0010
python-bugfix-medium-001 | medium | 0.740 | | $0.0010
shell-bugfix-medium-001 | medium | 0.740 | | $0.0010
python-performance-medium-001 | medium | 0.740 | | $0.0010
shell-performance-medium-001 | medium | 0.740 | | $0.0010
python-test-writing-medium-001 | medium | 0.740 | | $0.0010
shell-test-writing-medium-001 | medium | 0.740 | | $0.0010
python-multi-file-medium-001 | medium | 0.740 | | $0.0010
shell-multi-file-medium-001 | medium | 0.740 | | $0.0010
python-refactor-medium-001 | medium | 0.740 | | $0.0010
shell-refactor-medium-001 | medium | 0.740 | | $0.0010
python-config-medium-001 | medium | 0.740 | | $0.0010
shell-config-medium-001 | medium | 0.740 | | $0.0010
python-recovery-medium-001 | medium | 0.740 | | $0.0010
shell-recovery-medium-001 | medium | 0.740 | | $0.0010
python-dependency-medium-001 | medium | 0.740 | | $0.0010
shell-dependency-medium-001 | medium | 0.740 | | $0.0010
python-explain-medium-001 | medium | 0.740 | | $0.0010
shell-explain-medium-001 | medium | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
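
The summary cost can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the summary figure is a plain sum of the per-task Cost column; the function name `total_cost` is hypothetical and not part of the eval tooling.

```python
from decimal import Decimal


def total_cost(per_task_costs: list[Decimal]) -> Decimal:
    """Sum per-task costs exactly (Decimal avoids float rounding drift)."""
    return sum(per_task_costs, Decimal("0"))


# 50 task rows, each reported at $0.0010 (values taken from the table above).
costs = [Decimal("0.0010")] * 50
total = total_cost(costs)
print(f"${total:.4f}")  # prints "$0.0500"
```

Under that assumption the sum matches the Cost value in the report summary.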

Leaderboard Snapshot

Latest run: bd6f10ac-2f94-4d89-9155-045dbf0b18d7 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:14:37.078591+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
bd6f10ac-2f94-4d89-9155-045dbf0b18d7 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:37.078591+00:00
fddda8e4-2787-408c-8c3f-a48687f86ad6 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:37.021572+00:00
ccf7f6e1-5799-4822-8542-82c0127356f3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:36.971405+00:00
cbf38656-5716-4244-99f3-4331144e3d36 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:36.886328+00:00
f58bc820-e605-401f-8f4d-acda7736a4ba | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:36.814328+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 350 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
bd6f10ac-2f94-4d89-9155-045dbf0b18d7 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.078591+00:00
fddda8e4-2787-408c-8c3f-a48687f86ad6 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.021572+00:00
ccf7f6e1-5799-4822-8542-82c0127356f3 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.971405+00:00
cbf38656-5716-4244-99f3-4331144e3d36 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.886328+00:00
f58bc820-e605-401f-8f4d-acda7736a4ba | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.814328+00:00
b81b274a-f40f-4e75-9152-b8f8d6527c6b | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.740310+00:00
3b2fffb8-8232-4e6e-829b-7f56cc4636ed | typescript-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.686940+00:00
0cfd4276-8533-4dde-b0aa-0470c956013d | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.623191+00:00
032ae8b9-860e-495c-8b88-0b8e6e3d5cb9 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.572760+00:00
980705cc-5aea-4de5-b967-0e2f8d9b31ac | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:36.541799+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0056  (coder)
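
For readers unfamiliar with the marking above, here is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed. It is not the report generator's actual code: the `Point` type and `pareto_frontier` function are hypothetical names, and the two extra models in the usage example are invented purely for illustration. Only the `coder` point comes from this report.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no cheaper-or-equal point matches or beats on pass rate."""
    frontier: list[Point] = []
    best_rate = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, highest pass rate first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        if p.pass_rate > best_rate:
            frontier.append(p)
            best_rate = p.pass_rate
    return frontier


points = [
    Point("coder", 0.0056, 1.0),              # from the report
    Point("hypothetical-cheap", 0.0020, 0.6),  # illustrative only
    Point("hypothetical-slow", 0.0100, 0.9),   # illustrative only
]
for p in pareto_frontier(points):
    print(f"* {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})")
```

In this example `hypothetical-slow` is dropped because `coder` is both cheaper and more accurate. With a single model in the run, as in this report, that model is trivially on the frontier.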