Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
canary-python-regression-002      hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: 59a86dbe-e69c-45a1-a91c-5f2dd1facd2f | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-22T13:43:12.197678+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
59a86dbe-e69c-45a1-a91c-5f2dd1facd2f  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-22T13:43:12.197678+00:00
e208d309-d6d6-4b06-9433-fd64249ea983  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-22T13:43:12.129405+00:00
94c06b37-f86c-464f-95f6-01760f64fa66  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-22T13:43:12.047611+00:00
34eadd85-8bdf-4838-9413-66fd6bceaa1e  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-22T13:43:11.963076+00:00
497e04ab-63fa-40d1-a0c6-03f373595425  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-22T13:43:11.865224+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
59a86dbe-e69c-45a1-a91c-5f2dd1facd2f  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-22T13:43:12.197678+00:00
e208d309-d6d6-4b06-9433-fd64249ea983  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-22T13:43:12.129405+00:00
94c06b37-f86c-464f-95f6-01760f64fa66  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-22T13:43:12.047611+00:00
34eadd85-8bdf-4838-9413-66fd6bceaa1e  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-22T13:43:11.963076+00:00
497e04ab-63fa-40d1-a0c6-03f373595425  canary-python-regression-002      wrong-file   0.310  $0.0010  2026-05-22T13:43:11.865224+00:00
8dc40c69-cfee-48a7-a915-43c6a111e941  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-22T13:43:11.766318+00:00
eb908e13-a2f3-4081-8418-4c73ee43e1dc  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-22T13:43:11.677706+00:00
663a7916-a83f-408d-8337-31deaa477a6e  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-22T13:43:11.583418+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
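
For readers who want to reproduce the frontier marking, the following is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed and drawn. It is not taken from the report generator: the `Point` type, the `pareto_frontier` and `render_bar` helpers, the 20-character bar width, and the choice to truncate (rather than round) the filled portion are all assumptions made so the output resembles the line above.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """One model's position on the cost/pass-rate plane (hypothetical type)."""
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    q dominates p when q is no more expensive and no less accurate than p,
    and strictly better on at least one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.cost_usd <= p.cost_usd
            and q.pass_rate >= p.pass_rate
            and (q.cost_usd < p.cost_usd or q.pass_rate > p.pass_rate)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render_bar(pass_rate: float, width: int = 20) -> str:
    """Draw a fixed-width ASCII bar; the filled part is truncated (assumption)."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def render_line(p: Point, on_frontier: bool) -> str:
    """Format one row, prefixing frontier points with '*'."""
    marker = "*" if on_frontier else " "
    return (
        f"{marker} {render_bar(p.pass_rate)} "
        f"{p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
    )


if __name__ == "__main__":
    # The only data point in this report: coder at 37.5% and $0.0010.
    points = [Point("coder", 0.0010, 0.375)]
    frontier = set(pareto_frontier(points))
    for p in points:
        print(render_line(p, p in frontier))
```

With a single model in the run, that model is trivially on the frontier. The quadratic dominance check is chosen for clarity; a sort-and-sweep pass would be preferable with many models.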