Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
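
The summary line can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the per-task costs are available as a mapping; the names `TASK_COSTS`, `total_cost`, and `passed_count` are hypothetical and not part of the report tooling. The per-task Passed values are not legible in this report, so the passing count is inferred from the stated pass rate rather than read from the rows.

```python
from decimal import Decimal

# Per-task costs copied from the task rows above (hypothetical structure).
TASK_COSTS = {
    "python-security-fix-easy-001": Decimal("0.0010"),
    "canary-typescript-session-003": Decimal("0.0010"),
    "canary-python-security-001": Decimal("0.0010"),
    "canary-typescript-auth-006": Decimal("0.0010"),
    "canary-python-cache-005": Decimal("0.0010"),
    "typescript-security-fix-easy-001": Decimal("0.0010"),
    "python-bugfix-easy-001": Decimal("0.0010"),
    "canary-shell-ops-004": Decimal("0.0010"),
}


def total_cost(costs):
    """Sum per-task costs exactly, avoiding float rounding."""
    return sum(costs.values(), Decimal("0"))


def passed_count(pass_rate_percent, n_tasks):
    """Number of passing tasks implied by a reported pass rate."""
    return int(round(pass_rate_percent / Decimal("100") * n_tasks))


print(f"Tasks: {len(TASK_COSTS)} | Cost: ${total_cost(TASK_COSTS)}")
print(f"Implied passing tasks: {passed_count(Decimal('37.5'), len(TASK_COSTS))}")
```

Under these assumptions, eight tasks at $0.0010 each sum to the reported $0.0080, and a 37.5% pass rate over 8 tasks corresponds to 3 passing tasks.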

Leaderboard Snapshot

Latest run: 8663d562-c186-427d-a4a1-1deee6d7801a | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-08T00:44:16.500153+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
8663d562-c186-427d-a4a1-1deee6d7801a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-08T00:44:16.500153+00:00
bb724fd9-bfa1-4b31-a60b-91f2621f8021 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-08T00:44:16.431564+00:00
2b666102-153b-4d7b-9a29-e6e6682d3bf4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-08T00:44:16.364116+00:00
821014a0-9cd4-4338-8a47-cde05890b363 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-08T00:44:16.311153+00:00
0183f6b2-ff0d-4675-b1ac-a8ec5b476d4b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-08T00:44:16.256482+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
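
The breakdown above is a tally of taxonomy labels with one `#` per failure. A minimal sketch of how such a tally could be produced, assuming the labels arrive as a flat list; `failure_breakdown` is a hypothetical helper, not the report generator's actual code.

```python
from collections import Counter


def failure_breakdown(taxonomies):
    """Return (label, count, bar) rows, most frequent label first."""
    counts = Counter(taxonomies)
    # most_common() orders labels by descending count.
    return [(label, n, "#" * n) for label, n in counts.most_common()]


# Labels matching the counts shown in this report: 5 wrong-file, 3 wrong-logic.
labels = ["wrong-file"] * 5 + ["wrong-logic"] * 3
for label, n, bar in failure_breakdown(labels):
    print(f"{label:<11} | {n:<8} | {bar}")
```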

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
8663d562-c186-427d-a4a1-1deee6d7801a | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-08T00:44:16.500153+00:00
bb724fd9-bfa1-4b31-a60b-91f2621f8021 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-08T00:44:16.431564+00:00
2b666102-153b-4d7b-9a29-e6e6682d3bf4 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-08T00:44:16.364116+00:00
821014a0-9cd4-4338-8a47-cde05890b363 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-08T00:44:16.311153+00:00
0183f6b2-ff0d-4675-b1ac-a8ec5b476d4b | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-08T00:44:16.256482+00:00
bff6e745-870e-4f36-ae38-5a8d475f09ed | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-05-08T00:44:16.187976+00:00
7c7f945c-3707-44d8-bfa7-43bc7c279368 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-08T00:44:16.124475+00:00
0a5ae992-ae38-4857-8423-3887af6207e2 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-08T00:44:16.050681+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
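
A point is on the Pareto frontier when no other point is at least as cheap and at least as accurate while being strictly better on one of the two axes. The sketch below illustrates that rule and the bar rendering; `Point`, `dominates`, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar width with rounding down is an assumption inferred from the line above, not confirmed behavior of the report generator.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.cost_usd <= b.cost_usd and a.pass_rate >= b.pass_rate
    strictly_better = a.cost_usd < b.cost_usd or a.pass_rate > b.pass_rate
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def render(p, on_frontier, width=20):
    """Format one frontier line; width and flooring are assumed."""
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


points = [Point("coder", 0.0010, 0.375)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the run, that model is trivially on the frontier, which is why the only entry carries the `*` marker.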