Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 66c334dc-304d-4e45-89b1-538202ff7566 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:10:50.952191+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
66c334dc-304d-4e45-89b1-538202ff7566 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:10:50.952191+00:00
b844bf49-851d-435e-843e-cefcfeeef0fa | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:10:50.905685+00:00
22193a76-099e-4646-96e6-9f901c1eeb55 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:10:50.874157+00:00
bc29e129-48c8-43e0-a5bc-22587f0c3d78 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:10:50.828909+00:00
b3f3794f-ecae-4e1a-98a2-93dc0d0651e8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:10:50.782848+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
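
The breakdown above can be reproduced by tallying the taxonomy labels of the rows listed under Recent Failures. The following is a minimal sketch, not the report generator's actual code: the `failures` list is transcribed by hand from this report, and the variable names and output layout are assumptions made for illustration.

```python
from collections import Counter

# (task_id, taxonomy) pairs transcribed from the Recent Failures rows of this report.
failures = [
    ("canary-shell-ops-004", "wrong-file"),
    ("python-bugfix-easy-001", "wrong-logic"),
    ("typescript-security-fix-easy-001", "wrong-logic"),
    ("canary-python-cache-005", "wrong-file"),
    ("canary-typescript-auth-006", "wrong-file"),
    ("canary-python-security-001", "wrong-file"),
    ("canary-typescript-session-003", "wrong-file"),
    ("python-security-fix-easy-001", "wrong-logic"),
]

# Count failures per taxonomy label.
counts = Counter(taxonomy for _, taxonomy in failures)

# One row per taxonomy, most frequent first, with one '#' per failure.
rows = [
    f"{taxonomy:<11} | {n:<8} | {'#' * n}"
    for taxonomy, n in counts.most_common()
]

print(f"{'Taxonomy':<11} | {'Failures':<8} | Bar")
for row in rows:
    print(row)
```

With the eight rows above, the tally gives 5 `wrong-file` and 3 `wrong-logic` failures, matching the table.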

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
66c334dc-304d-4e45-89b1-538202ff7566 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:10:50.952191+00:00
b844bf49-851d-435e-843e-cefcfeeef0fa | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:10:50.905685+00:00
22193a76-099e-4646-96e6-9f901c1eeb55 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:10:50.874157+00:00
bc29e129-48c8-43e0-a5bc-22587f0c3d78 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:10:50.828909+00:00
b3f3794f-ecae-4e1a-98a2-93dc0d0651e8 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:10:50.782848+00:00
6c7e8b43-867f-4aa2-afa6-acd1089a65fa | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:10:50.732749+00:00
79dd447f-41a0-411d-b972-18f07d7d8332 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:10:50.692871+00:00
57c43807-1413-45b8-901d-61c59986cf6b | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:10:50.662870+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
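
The frontier marking above can be sketched as follows. This is an illustrative example under stated assumptions, not the report's implementation: the report does not define dominance, so the sketch assumes the usual rule (a point is dominated if another point costs no more, passes no less, and is strictly better on at least one of the two). The names `Point`, `pareto_frontier`, and `render` are hypothetical, and the bar is assumed to be 20 cells wide with the filled count truncated rather than rounded.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates (assumed rule, see lead-in)."""
    frontier = []
    for p in points:
        dominated = any(
            q.cost_usd <= p.cost_usd
            and q.pass_rate >= p.pass_rate
            and (q.cost_usd < p.cost_usd or q.pass_rate > p.pass_rate)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(p: Point, on_frontier: bool, width: int = 20) -> str:
    """Format one row: frontier marker, ASCII bar, pass rate, cost, model."""
    filled = int(p.pass_rate * width)  # truncate, so 37.5% of 20 cells fills 7
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"


# The single point shown in this report.
points = [Point("coder", 0.0010, 0.375)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model the frontier is trivially that one point. The marker only becomes informative once several models with different cost and pass-rate trade-offs appear in the same chart.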