Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
canary-python-security-001        hard  0.310          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: 49252a4a-90b2-422b-a2af-bd3fa86cc2fb | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T02:55:55.484496+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
49252a4a-90b2-422b-a2af-bd3fa86cc2fb  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T02:55:55.484496+00:00
2b4e5e63-15c1-4609-9567-d34fe910666f  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-09T02:55:55.322789+00:00
08dbaa2f-2b4d-4749-bbb0-10c419fb6023  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-09T02:55:55.185211+00:00
58d0c234-2222-4cd3-bba1-41b8504d8637  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T02:55:54.953939+00:00
d9000efe-41a9-48db-8c8d-a56010036ba6  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T02:55:54.614334+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
49252a4a-90b2-422b-a2af-bd3fa86cc2fb  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-09T02:55:55.484496+00:00
2b4e5e63-15c1-4609-9567-d34fe910666f  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-09T02:55:55.322789+00:00
08dbaa2f-2b4d-4749-bbb0-10c419fb6023  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-09T02:55:55.185211+00:00
58d0c234-2222-4cd3-bba1-41b8504d8637  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-09T02:55:54.953939+00:00
d9000efe-41a9-48db-8c8d-a56010036ba6  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-09T02:55:54.614334+00:00
6b395988-15fa-457c-a425-1f6e34a8436b  canary-python-security-001        wrong-file   0.310  $0.0010  2026-05-09T02:55:54.377330+00:00
98f303d5-9567-4633-a4b3-24317239947b  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-09T02:55:54.060351+00:00
4e366bd4-47cc-44e0-9122-d49d9ba79834  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-09T02:55:53.696097+00:00
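
The Failure Breakdown counts can be cross-checked against the taxonomy labels in the Recent Failures rows. The sketch below is one way to do that tally; the function name `breakdown` and the `LABELS` list are illustrative assumptions and are not taken from the tool that produced this report.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
LABELS = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def breakdown(labels: list[str]) -> list[tuple[str, int, str]]:
    """Return (taxonomy, failure count, bar) rows, most frequent first."""
    counts = Counter(labels)
    # One '#' per failure, mirroring the Bar column of the report.
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in breakdown(LABELS):
    print(f"{name:<12} {n:<9} {bar}")
```

Run against the eight rows listed here, the tally gives 5 wrong-file and 3 wrong-logic failures, matching the Failure Breakdown section.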

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
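
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low while being strictly better on one of the two. The following is a minimal sketch of that check and of a bar line in the style shown here. `Point`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar with truncating rounding is an assumption inferred from the single line in this report, not the reporting tool's actual implementation.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no other point dominates (higher pass rate, lower cost)."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(point: Point, on_frontier: bool, width: int = 20) -> str:
    """Format one line; the bar fills int(pass_rate * width) cells (assumed)."""
    filled = int(point.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {point.pass_rate:.1%} @ ${point.cost_usd:.4f}  ({point.model})"


points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the run, that model is trivially on the frontier, which is why the only line in this section carries the `*` marker.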