Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                       | Band | Score | Passed | Cost
python-bugfix-easy-001        | easy | 0.740 |        | $0.0010
canary-typescript-session-003 | hard | 0.310 |        | $0.0010
canary-python-security-001    | hard | 0.310 |        | $0.0010
canary-typescript-auth-006    | hard | 0.310 |        | $0.0010
canary-python-cache-005       | hard | 0.310 |        | $0.0010
python-config-easy-001        | easy | 0.740 |        | $0.0010
python-dependency-easy-001    | easy | 0.740 |        | $0.0010
canary-shell-ops-004          | hard | 0.310 |        | $0.0010
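
The headline figures in the Profile line can be cross-checked against the per-task rows. The sketch below is a minimal illustration, not the harness's own code: the report does not state how a task is marked as passed, so `PASS_THRESHOLD = 0.5` is a hypothetical cutoff chosen only because it separates the 0.740 and 0.310 scores.

```python
# (task_id, band, score, cost_usd) rows copied from the per-task table.
rows = [
    ("python-bugfix-easy-001", "easy", 0.740, 0.0010),
    ("canary-typescript-session-003", "hard", 0.310, 0.0010),
    ("canary-python-security-001", "hard", 0.310, 0.0010),
    ("canary-typescript-auth-006", "hard", 0.310, 0.0010),
    ("canary-python-cache-005", "hard", 0.310, 0.0010),
    ("python-config-easy-001", "easy", 0.740, 0.0010),
    ("python-dependency-easy-001", "easy", 0.740, 0.0010),
    ("canary-shell-ops-004", "hard", 0.310, 0.0010),
]

# Assumption: hypothetical pass cutoff; the real criterion is not in the report.
PASS_THRESHOLD = 0.5

tasks = len(rows)
passed = sum(1 for _, _, score, _ in rows if score >= PASS_THRESHOLD)
pass_rate = passed / tasks
total_cost = sum(cost for *_, cost in rows)

# Format the figures the same way the Profile line does.
summary = f"Tasks: {tasks} | Pass rate: {pass_rate * 100:.1f}% | Cost: ${total_cost:.4f}"
print(summary)
```

Under that assumption, 3 of 8 tasks pass and the eight $0.0010 costs sum to $0.0080, which agrees with the Profile line.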

Leaderboard Snapshot

Latest run: 5be9cc23-1170-4cda-85c6-f15ea1f570cc | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T15:44:15.046881+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
5be9cc23-1170-4cda-85c6-f15ea1f570cc | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T15:44:15.046881+00:00
66def481-7d91-4305-8a90-da35804589c5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T15:44:14.987334+00:00
e536c52f-4907-4bb6-b8fa-f29aacd3237c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T15:44:14.943627+00:00
98a3a464-d329-48d9-b236-62e444d5cc2c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T15:44:14.900759+00:00
05a670a3-9308-4336-af1f-ddcc986fcd29 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T15:44:14.853498+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
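
For reference, a Pareto frontier over (pass_rate, cost_usd) keeps every point that no other point beats on both axes. The sketch below shows one way to compute it and to render rows in the style of the chart above. The `Point` class, `pareto_frontier`, and `render_row` are hypothetical names, and truncating the bar fill to a whole cell is an assumption inferred from the 7-of-20 bar shown for 37.5%.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    model: str        # model label, e.g. "coder"
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float   # cost in US dollars


def pareto_frontier(points):
    """Return the points that are not dominated by any other point.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(p, on_frontier, width=20):
    """Format one chart row; frontier points get a leading '*'."""
    filled = int(p.pass_rate * width)  # assumption: truncate partial cells
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


# The single data point shown in this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
lines = [render_row(p, p in frontier) for p in points]
print("\n".join(lines))
```

With only one model in the run, that point is trivially on the frontier, which is why the sole row carries the `*` marker.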