Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-shell-ops-004 | hard | 0.310 |  | $0.0010
canary-python-cache-005 | hard | 0.310 |  | $0.0010
canary-python-regression-002 | hard | 0.310 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-auth-006 | hard | 0.310 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-session-003 | hard | 0.310 |  | $0.0010
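
The summary figures can be cross-checked against the task rows. The following is a minimal sketch, not the report generator's own logic: the row tuples are copied from the table above, and the names `ROWS`, `total_cost`, and `implied_passes` are hypothetical. It only confirms that eight tasks at $0.0010 sum to the reported $0.0080 and that a 37.5% pass rate over 8 tasks implies 3 passes; the report does not state which tasks those are.

```python
from decimal import Decimal

# (task_id, band, score, cost_usd) as shown in the task table.
# Costs are kept as strings so Decimal can sum them exactly.
ROWS = [
    ("python-security-fix-easy-001", "easy", "0.740", "0.0010"),
    ("canary-shell-ops-004", "hard", "0.310", "0.0010"),
    ("canary-python-cache-005", "hard", "0.310", "0.0010"),
    ("canary-python-regression-002", "hard", "0.310", "0.0010"),
    ("typescript-security-fix-easy-001", "easy", "0.740", "0.0010"),
    ("canary-typescript-auth-006", "hard", "0.310", "0.0010"),
    ("python-bugfix-easy-001", "easy", "0.740", "0.0010"),
    ("canary-typescript-session-003", "hard", "0.310", "0.0010"),
]

# Total cost across all tasks; should match the "Cost" field in the summary.
total_cost = sum(Decimal(cost) for *_, cost in ROWS)

# Number of passing tasks implied by the reported pass rate (37.5%).
implied_passes = Decimal("0.375") * len(ROWS)

print(f"total cost: ${total_cost}")
print(f"implied passes: {implied_passes} of {len(ROWS)}")
```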

Leaderboard Snapshot

Latest run: 2796168f-2df4-4ae4-9a50-d5238a425ced | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T18:44:38.483817+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
2796168f-2df4-4ae4-9a50-d5238a425ced | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T18:44:38.483817+00:00
c111d14a-495d-4072-a64e-72321e6d4218 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-23T18:44:38.372979+00:00
90c0153d-87ed-414e-a8c7-b3c292d0ec52 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T18:44:38.262103+00:00
e39e35d7-178f-4766-845e-4bb9896f1485 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-23T18:44:38.150512+00:00
a84aa12f-8cb3-4f68-8a4b-277b0ae28316 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T18:44:38.039060+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-file | 5 | #####
wrong-logic | 3 | ###
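
The breakdown above is consistent with a simple tally of the taxonomy labels in the Recent Failures rows of this report. A minimal sketch of that tally follows, assuming one `#` per failure; the names `taxonomies`, `counts`, and `bar_lines` are hypothetical and not taken from the tool that produced this report.

```python
from collections import Counter

# Taxonomy labels copied from the eight Recent Failures rows, newest first.
taxonomies = [
    "wrong-file",
    "wrong-logic",
    "wrong-file",
    "wrong-logic",
    "wrong-file",
    "wrong-file",
    "wrong-file",
    "wrong-logic",
]

# Count failures per taxonomy label.
counts = Counter(taxonomies)

# One '#' per failure (assumed bar scale), most frequent label first.
bar_lines = [
    f"{name} | {n} | {'#' * n}" for name, n in counts.most_common()
]

for line in bar_lines:
    print(line)
```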

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
2796168f-2df4-4ae4-9a50-d5238a425ced | canary-typescript-session-003 | wrong-file | 0.310 | $0.0010 | 2026-05-23T18:44:38.483817+00:00
c111d14a-495d-4072-a64e-72321e6d4218 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T18:44:38.372979+00:00
90c0153d-87ed-414e-a8c7-b3c292d0ec52 | canary-typescript-auth-006 | wrong-file | 0.310 | $0.0010 | 2026-05-23T18:44:38.262103+00:00
e39e35d7-178f-4766-845e-4bb9896f1485 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T18:44:38.150512+00:00
a84aa12f-8cb3-4f68-8a4b-277b0ae28316 | canary-python-regression-002 | wrong-file | 0.310 | $0.0010 | 2026-05-23T18:44:38.039060+00:00
fc7ba846-f33e-45c1-8dcd-a50b6bea04f6 | canary-python-cache-005 | wrong-file | 0.310 | $0.0010 | 2026-05-23T18:44:37.927507+00:00
0e0e9f4f-0b8d-400c-be54-5c2b3b0721d2 | canary-shell-ops-004 | wrong-file | 0.310 | $0.0010 | 2026-05-23T18:44:37.816357+00:00
f2b5382b-f01f-4beb-af70-74bd22f73b9c | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T18:44:37.705573+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
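
For readers unfamiliar with the marking, a point is on the Pareto frontier when no other point has both a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and one way the 20-character bar could be drawn. It is an assumption-laden illustration, not the report tool's code: `Point`, `pareto_frontier`, and `render_bar` are hypothetical names, and rounding the bar down is inferred from 37.5% being shown as 7 of 20 cells.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other point."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_bar(pass_rate: float, width: int = 20) -> str:
    """Draw a fixed-width bar; filled cells are rounded down (assumed)."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# The single point shown in this report, with cost as printed on the frontier line.
points = [Point("coder", 0.375, 0.0010)]
on_frontier = set(pareto_frontier(points))

for p in points:
    mark = "*" if p in on_frontier else " "
    print(f"{mark} {render_bar(p.pass_rate)} {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})")
```

With only one model in the run, that model is trivially on the frontier, which matches the single starred line above.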