Eval Report: ci-post-merge

Profile: gdm-swebench-lite-v1 | Tasks: 50 | Pass rate: 100.0% | Cost: $0.0500

Task ID | Band | Score | Passed | Cost
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
typescript-dependency-easy-001 | easy | 0.740 | | $0.0010
typescript-explain-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-recovery-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-medium-001 | medium | 0.740 | | $0.0010
python-config-medium-001 | medium | 0.740 | | $0.0010
python-dependency-medium-001 | medium | 0.740 | | $0.0010
python-explain-medium-001 | medium | 0.740 | | $0.0010
python-multi-file-medium-001 | medium | 0.740 | | $0.0010
python-performance-medium-001 | medium | 0.740 | | $0.0010
python-recovery-medium-001 | medium | 0.740 | | $0.0010
python-refactor-medium-001 | medium | 0.740 | | $0.0010
python-security-fix-medium-001 | medium | 0.740 | | $0.0010
python-test-writing-medium-001 | medium | 0.740 | | $0.0010
shell-bugfix-medium-001 | medium | 0.740 | | $0.0010
shell-config-medium-001 | medium | 0.740 | | $0.0010
shell-dependency-medium-001 | medium | 0.740 | | $0.0010
shell-explain-medium-001 | medium | 0.740 | | $0.0010
shell-multi-file-medium-001 | medium | 0.740 | | $0.0010
shell-performance-medium-001 | medium | 0.740 | | $0.0010
shell-recovery-medium-001 | medium | 0.740 | | $0.0010
shell-refactor-medium-001 | medium | 0.740 | | $0.0010
shell-security-fix-medium-001 | medium | 0.740 | | $0.0010
shell-test-writing-medium-001 | medium | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
python-dependency-easy-001 | easy | 0.740 | | $0.0010
python-explain-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: eb12577d-580e-4448-9481-2d7f6980e98d | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:05:44.006636+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
eb12577d-580e-4448-9481-2d7f6980e98d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:44.006636+00:00
991e336b-a00e-4af0-84c6-47236aca1b3e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:43.969852+00:00
d4614acb-0202-4477-a47e-8560775ffc86 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:43.914492+00:00
e122accc-17fa-422a-af75-c3002376f7d0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:43.872828+00:00
3932cf30-cd49-4b13-9a94-c85b953e8ec8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:05:43.833511+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 50 | ##################################################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
eb12577d-580e-4448-9481-2d7f6980e98d | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:44.006636+00:00
991e336b-a00e-4af0-84c6-47236aca1b3e | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.969852+00:00
d4614acb-0202-4477-a47e-8560775ffc86 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.914492+00:00
e122accc-17fa-422a-af75-c3002376f7d0 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.872828+00:00
3932cf30-cd49-4b13-9a94-c85b953e8ec8 | python-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.833511+00:00
11d29efc-a988-43b2-ad51-2bc7666b1616 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.780856+00:00
11da93c1-a989-453a-9dfe-3f87a8fa46c1 | python-explain-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.752301+00:00
e5f21a26-792a-4219-a22b-73dd68cc4ddc | python-dependency-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.723861+00:00
c5fd7251-9f28-4f96-9e84-0867f9cbf09a | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.687131+00:00
ce461b13-95b5-4221-82e3-c0f2f91502a8 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:05:43.620938+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0058  (coder)
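
For readers unfamiliar with the marker, the following is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed and rendered in the format above. It is not the report generator's actual code: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is an assumption read off the line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """One point on the cost frontier (hypothetical structure)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates, sorted by cost.

    A point is dominated when another point has a pass rate at least as
    high and a cost at least as low, with one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Render one text line per point, marking frontier points with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in on_frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{marker} [{bar}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single point reported in this section.
    for line in render([RunPoint("coder", 1.0, 0.0058)]):
        print(line)
```

With only one model in the run, that model is trivially on the frontier, which is why the single row carries the * marker.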