Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 497d957e-d315-4744-8d98-579c7584b229 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:58:24.744891+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
497d957e-d315-4744-8d98-579c7584b229 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:58:24.744891+00:00
e3689825-4f87-4c2b-b913-f108f42ae820 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:58:24.691747+00:00
bec27586-9f81-4cfe-ad3e-0a079a2c9410 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:58:24.618716+00:00
e85a7890-247f-4cd8-b142-6458f5e3a9a9 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:58:24.550182+00:00
c68537f1-b9ab-4c18-96ee-8705bdabe47a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:58:24.483248+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
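
The counts and bars in this breakdown can be reproduced by tallying the taxonomy labels listed under Recent Failures. The following is a minimal sketch, assuming the labels are available as a plain list; the function name `render_breakdown` and the one-`#`-per-failure bar are illustrative assumptions, not the report generator's actual code.

```python
from collections import Counter


def render_breakdown(taxonomies):
    """Return one 'taxonomy count bar' line per taxonomy, most frequent first."""
    counts = Counter(taxonomies)
    # most_common() sorts by descending count; ties keep first-seen order.
    return [f"{name} {n} {'#' * n}" for name, n in counts.most_common()]


# Taxonomy labels copied from the Recent Failures rows of this report, newest first.
labels = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]

for line in render_breakdown(labels):
    print(line)
```

Run against the eight labels above, this yields 5 for wrong-file and 3 for wrong-logic, matching the breakdown.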

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
497d957e-d315-4744-8d98-579c7584b229 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:58:24.744891+00:00
e3689825-4f87-4c2b-b913-f108f42ae820 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:58:24.691747+00:00
bec27586-9f81-4cfe-ad3e-0a079a2c9410 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:58:24.618716+00:00
e85a7890-247f-4cd8-b142-6458f5e3a9a9 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:58:24.550182+00:00
c68537f1-b9ab-4c18-96ee-8705bdabe47a | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:58:24.483248+00:00
f01c386f-9a89-48ef-857d-52822937679d | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:58:24.400085+00:00
17aa54ff-7502-466a-8682-c098526203f0 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:58:24.330134+00:00
6d424f7f-5fd8-46b0-9218-376d0a506fa6 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:58:24.265264+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
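
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point is both at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below shows one way such a listing could be computed and drawn. The function names `pareto_frontier` and `render_bar`, the 20-cell bar width, and the truncating fill are assumptions for illustration, and the two `hypothetical-*` points are not from this report.

```python
def pareto_frontier(points):
    """Return the labels of points that no other point dominates.

    points: list of (label, cost_usd, pass_rate) tuples.
    A point is dominated when another point is at least as cheap and at
    least as accurate, and strictly better on one of the two.
    """
    frontier = []
    for label, cost, rate in points:
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for _, c, r in points
        )
        if not dominated:
            frontier.append(label)
    return frontier


def render_bar(pass_rate, width=20):
    """Draw a fixed-width bar; truncating to whole cells is an assumption."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# The only point in this report: coder at $0.0010 with a 37.5% pass rate.
points = [("coder", 0.0010, 0.375)]
# Hypothetical extra points, added purely to show dominance; not from the report.
points += [("hypothetical-a", 0.0020, 0.375), ("hypothetical-b", 0.0030, 0.500)]

frontier = pareto_frontier(points)
for label, cost, rate in points:
    mark = "*" if label in frontier else " "
    print(f"{mark} {render_bar(rate)} {rate:.1%} @ ${cost:.4f}  ({label})")
```

With only the single coder point, the frontier trivially contains that point, which is consistent with the `*` on the line above.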