Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: b79f2497-af82-48e7-bc5d-adf2f6dbdd82 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:59:35.717526+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
b79f2497-af82-48e7-bc5d-adf2f6dbdd82 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:59:35.717526+00:00
35d9af66-31df-432f-9296-15d90c2e901f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:59:35.619848+00:00
34990051-70f1-4e8f-908c-e4453862e313 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:59:35.557560+00:00
ac749e2d-da1a-4520-9d33-f7eb7ed7c91d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:59:35.503235+00:00
6ca49e3d-33eb-4aeb-abc9-bd07e50c7d11 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:59:35.426241+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
b79f2497-af82-48e7-bc5d-adf2f6dbdd82 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:59:35.717526+00:00
35d9af66-31df-432f-9296-15d90c2e901f | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:59:35.619848+00:00
34990051-70f1-4e8f-908c-e4453862e313 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:59:35.557560+00:00
ac749e2d-da1a-4520-9d33-f7eb7ed7c91d | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:59:35.503235+00:00
6ca49e3d-33eb-4aeb-abc9-bd07e50c7d11 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:59:35.426241+00:00
d53de61b-dc40-464f-a1fc-b4389887c2c9 | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:59:35.370612+00:00
dd863ac2-172f-44ef-bfd3-ce347ba41d28 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:59:35.316539+00:00
87c66d09-2ebc-40d8-96ef-66e63f582a8f | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:59:35.255931+00:00
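
The counts under Failure Breakdown can be cross-checked against the eight rows listed here. The snippet below is a minimal sketch of that tally, not the report generator's actual code; the function name failure_breakdown and the hard-coded label list are assumptions made for illustration.

```python
from collections import Counter

# Taxonomy column of the eight Recent Failures rows, top to bottom.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) tuples, most frequent first.

    Hypothetical helper: the bar is one '#' per failure, mirroring
    the Bar column of the Failure Breakdown section.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(taxonomies):
    print(f"{name:<11} | {n:<8} | {bar}")
```

Run against these rows, the tally gives 5 wrong-file and 3 wrong-logic failures, which agrees with the breakdown.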

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
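
For readers unfamiliar with the marker: a point is on the Pareto frontier when no other point has both a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below shows one way such a frontier and its bar rendering could be computed. It is an illustration under assumptions, not the tool's implementation: the Point type, the function names, and the 20-character bar width are hypothetical (the width is inferred from the line above).

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points that no other point dominates.

    q dominates p when q.pass_rate >= p.pass_rate and
    q.cost_usd <= p.cost_usd, with at least one strict inequality.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(point, on_frontier, width=20):
    """Format one point as a text bar; '*' marks frontier members."""
    filled = int(point.pass_rate * width)  # truncates, so 37.5% fills 7 of 20
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return f"{mark} [{bar}] {point.pass_rate:.1%} @ ${point.cost_usd:.4f}  ({point.model})"


# The only point in this report; a lone point is trivially on the frontier.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the run, the frontier contains only that model, which is why the one line above carries the marker.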