Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
---------------------------------|------|-------|--------|--------
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 919a15fa-3368-496f-876f-eb8a27a2ec1b | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T20:07:49.195844+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
-------------------------------------|-------|------------------------------------------|-------|---------------------------------
919a15fa-3368-496f-876f-eb8a27a2ec1b | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:07:49.195844+00:00
1ca83d35-77e5-474c-9f29-a2d0589e8cd1 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.740 | 2026-05-23T20:07:49.101685+00:00
e2e6b00b-fd7b-4037-bf1d-47f37f0e0e93 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:07:49.039952+00:00
1e13ad55-b70e-496c-a2e6-ab5dc2ba0b54 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.740 | 2026-05-23T20:07:48.960329+00:00
b5c1aa67-86cf-4953-aec8-4da5df8cd4fa | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:07:48.911326+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
------------|----------|------
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
-------------------------------------|----------------------------------|-------------|-------|---------|---------------------------------
919a15fa-3368-496f-876f-eb8a27a2ec1b | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:07:49.195844+00:00
1ca83d35-77e5-474c-9f29-a2d0589e8cd1 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:07:49.101685+00:00
e2e6b00b-fd7b-4037-bf1d-47f37f0e0e93 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:07:49.039952+00:00
1e13ad55-b70e-496c-a2e6-ab5dc2ba0b54 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:07:48.960329+00:00
b5c1aa67-86cf-4953-aec8-4da5df8cd4fa | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:07:48.911326+00:00
02239f36-145d-4e10-af2d-1223618f3201 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:07:48.848404+00:00
e651f240-d7ee-4b5f-8d5d-0834ad603e5a | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:07:48.784903+00:00
dea0d64f-720a-4da3-b3ff-deb173330bcd | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:07:48.721645+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
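
Note on reading the frontier line: a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The sketch below shows one way such a line could be produced. It is not the report generator's actual code: the names Point, pareto_frontier, render_bar and render_line are hypothetical, and the choice to truncate (rather than round) the bar fill is an assumption that happens to reproduce the seven-character bar shown for 37.5%.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """One (model, pass_rate, cost_usd) entry; a hypothetical structure."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep points that no other point dominates.

    q dominates p when q is at least as good on both axes
    (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_bar(pass_rate: float, width: int = 20) -> str:
    """Draw a fixed-width bar; the fill is truncated (assumed behaviour)."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def render_line(p: Point, on_frontier: bool) -> str:
    """Format one row in the style of the Cost Frontier section."""
    marker = "*" if on_frontier else " "
    return (
        f"{marker} {render_bar(p.pass_rate)} "
        f"{p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
    )


if __name__ == "__main__":
    # The single point reported above: coder, 37.5% at $0.0010.
    points = [Point("coder", 0.375, 0.0010)]
    frontier = pareto_frontier(points)
    for p in points:
        print(render_line(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier. The marker becomes informative once several models or profiles are compared in the same report.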