Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0800

Task ID | Band | Score | Passed | Cost
canary-6 | hard | 0.309 |  | $0.0100
canary-0 | hard | 0.309 |  | $0.0100
canary-2 | hard | 0.309 |  | $0.0100
python-bugfix-easy-001 | easy | 0.740 |  | $0.0100
python-config-easy-001 | easy | 0.740 |  | $0.0100
canary-7 | hard | 0.309 |  | $0.0100
canary-12 | hard | 0.309 |  | $0.0100
python-dependency-easy-001 | easy | 0.740 |  | $0.0100

Leaderboard Snapshot

Latest run: 3c4b352e-810e-47d9-84a3-0a8c619447b4 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-26T09:02:40.097797+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
3c4b352e-810e-47d9-84a3-0a8c619447b4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-26T09:02:40.097797+00:00
7738b8c6-889d-4359-acc4-4f38f05fd72f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.309 | 2026-04-26T09:02:40.013384+00:00
458dbee6-3baf-41c0-9f88-a294011eed9c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.309 | 2026-04-26T09:02:39.928069+00:00
16b9da7d-755e-412e-9f9f-a3050605ecc2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-26T09:02:39.830184+00:00
6322e861-34fb-4ea6-8ed8-2cc99aab6313 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-26T09:02:39.757214+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0100  (coder)
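
A minimal sketch of how a frontier line like the one above could be produced. This is an illustration under stated assumptions, not the report generator's actual code: the names RunPoint, pareto_frontier, render_bar, and render_line are hypothetical, the 20-character bar width is read off the line above, and the bar is assumed to truncate rather than round (37.5% of 20 gives 7 filled cells, matching the output shown).

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical record for one model's aggregate result."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Keep points that no other point dominates.

    A point is dominated when another point has a pass rate at least as high
    and a cost at least as low, with one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_bar(pass_rate: float, width: int = 20) -> str:
    """Draw a fixed-width bar; int() truncates, which is an assumption."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def render_line(p: RunPoint, on_frontier: bool) -> str:
    """Format one row, prefixing frontier points with '*'."""
    marker = "*" if on_frontier else " "
    return (
        f"{marker} {render_bar(p.pass_rate)} "
        f"{p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
    )
```

With a single model in the run, that model is trivially on the frontier, which is consistent with the one starred row in this report.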