Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 8ffa1a1e-7347-40a1-b388-4dc03f3cde6a | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-22T13:57:01.025619+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
8ffa1a1e-7347-40a1-b388-4dc03f3cde6a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T13:57:01.025619+00:00
7793d6f2-6678-41f3-ab6f-3c170b0db867 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-22T13:57:00.940455+00:00
f0054904-b9f2-4a1f-86c7-49b4038f29a3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T13:57:00.847922+00:00
92185a49-98ce-4446-82d5-153ace505470 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-22T13:57:00.770789+00:00
e24e42ac-b988-4aad-978a-7c384dc216de | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T13:57:00.684887+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
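
The counts above can be reproduced from the taxonomy labels listed under Recent Failures. The following is a minimal sketch, not the report generator's actual code: the function name `failure_breakdown` and the one-`#`-per-failure bar rule are assumptions inferred from the rendered output.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows of this report,
# in the order they are listed.
failures = [
    "wrong-file", "wrong-logic", "wrong-file", "wrong-logic",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) rows, most frequent taxonomy first.

    Hypothetical helper: the bar is assumed to be one '#' per failure.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


rows = failure_breakdown(failures)
for name, n, bar in rows:
    print(f"{name:<11} | {n:<8} | {bar}")
```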

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
8ffa1a1e-7347-40a1-b388-4dc03f3cde6a | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-22T13:57:01.025619+00:00
7793d6f2-6678-41f3-ab6f-3c170b0db867 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-22T13:57:00.940455+00:00
f0054904-b9f2-4a1f-86c7-49b4038f29a3 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-22T13:57:00.847922+00:00
92185a49-98ce-4446-82d5-153ace505470 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-22T13:57:00.770789+00:00
e24e42ac-b988-4aad-978a-7c384dc216de | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-22T13:57:00.684887+00:00
de46a676-c089-4d6b-8ac2-6200a6fef221 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-22T13:57:00.596733+00:00
cab23baf-7f28-4bea-820b-677d9e53f3d4 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-22T13:57:00.501579+00:00
fe6ff0bd-b171-4492-ab2f-888acd21d240 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-22T13:57:00.429532+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
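
For readers unfamiliar with the marking, a model is on the Pareto frontier when no other model is at least as cheap and at least as accurate while being strictly better on one of the two. The sketch below illustrates that rule and the bar rendering; it is an assumption-laden illustration, not the report's implementation. The names `pareto_frontier` and `bar`, the 20-character bar width, and truncation of the filled portion are all inferred from the line above.

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other point.

    points: list of (name, cost_usd, pass_rate) tuples.
    A point is dominated if another point has cost <= and pass_rate >=,
    with at least one of the two strictly better.
    """
    frontier = []
    for name, cost, rate in points:
        dominated = any(
            (c <= cost and r >= rate) and (c < cost or r > rate)
            for _, c, r in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def bar(rate, width=20):
    """Render a pass rate in [0, 1] as a fixed-width ASCII bar.

    Assumption: the filled portion is truncated, so 0.375 * 20 = 7.5 -> 7.
    """
    filled = int(rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# The single data point shown in this report.
points = [("coder", 0.0010, 0.375)]
on_frontier = set(pareto_frontier(points))
for name, cost, rate in points:
    mark = "*" if name in on_frontier else " "
    print(f"{mark} {bar(rate)} {rate:.1%} @ ${cost:.4f}  ({name})")
```

With only one model in the run, that model is trivially on the frontier, which is why the sole entry carries the `*`.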