Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: dea382af-2d48-4038-a0aa-163b27d451d7 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T20:21:14.843520+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
dea382af-2d48-4038-a0aa-163b27d451d7 | coder | 3a1cb59613c43efee035337a7eb0f518754b79e1 | 0.310 | 2026-05-23T20:21:14.843520+00:00
69862265-e653-4dcb-b222-69a69df94815 | coder | 3a1cb59613c43efee035337a7eb0f518754b79e1 | 0.740 | 2026-05-23T20:21:14.795318+00:00
041cdba8-c0fe-406f-b432-016a049e01e5 | coder | 3a1cb59613c43efee035337a7eb0f518754b79e1 | 0.310 | 2026-05-23T20:21:14.731878+00:00
12646b94-95ca-4887-adab-fc0af052b6d2 | coder | 3a1cb59613c43efee035337a7eb0f518754b79e1 | 0.740 | 2026-05-23T20:21:14.676318+00:00
676f22d5-cc31-444e-9d61-91febd3fb66a | coder | 3a1cb59613c43efee035337a7eb0f518754b79e1 | 0.310 | 2026-05-23T20:21:14.605149+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
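
The breakdown above can be reproduced from the taxonomy labels listed under Recent Failures. The following is a minimal sketch, not the report generator's actual code: the function name `failure_breakdown` and the choice of one `#` per failure are assumptions inferred from the bars shown in this report.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures table in this report,
# in the order the rows appear there.
failure_taxonomies = [
    "wrong-file", "wrong-logic", "wrong-file", "wrong-logic",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) rows, most frequent taxonomy first.

    Hypothetical helper: draws one '#' per failure, which is an assumption
    consistent with the bars in this report.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(failure_taxonomies):
    print(f"{name:<11} | {n:<8} | {bar}")
```

With the eight labels above, this yields 5 `wrong-file` and 3 `wrong-logic` failures, which agrees with the counts in the table.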

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
dea382af-2d48-4038-a0aa-163b27d451d7 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:21:14.843520+00:00
69862265-e653-4dcb-b222-69a69df94815 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:21:14.795318+00:00
041cdba8-c0fe-406f-b432-016a049e01e5 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:21:14.731878+00:00
12646b94-95ca-4887-adab-fc0af052b6d2 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:21:14.676318+00:00
676f22d5-cc31-444e-9d61-91febd3fb66a | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:21:14.605149+00:00
a82bc04e-5fd3-4a8c-af51-72a4efe2ba41 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:21:14.543442+00:00
04c7fad1-c705-4679-a3e9-715bdae66ceb | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:21:14.464621+00:00
ac61e722-685f-43e0-bcb4-003a9a2cfea2 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:21:14.420909+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
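
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low while being strictly better on one of the two. The sketch below illustrates that definition and re-renders the line above. It is an assumption-laden illustration, not the report tool's code: the names `Point`, `pareto_frontier`, and `render` are hypothetical, and the 20-character bar with truncation is inferred from the single line shown.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Keep the points that no other point dominates.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    and is strictly better on at least one of the two.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(p, on_frontier, width=20):
    """Format one line; bar width and truncation are assumptions."""
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return f"{mark} [{bar}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"


# The only point in this report: model "coder", 37.5% pass rate, $0.0010.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the snapshot, that model is trivially on the frontier. The marker only becomes informative once several models or configurations are compared.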