Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
canary-typescript-session-003 | hard | 0.310 | | $0.0010
canary-python-security-001 | hard | 0.310 | | $0.0010
canary-typescript-auth-006 | hard | 0.310 | | $0.0010
canary-python-cache-005 | hard | 0.310 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
canary-shell-ops-004 | hard | 0.310 | | $0.0010

Leaderboard Snapshot

Latest run: eb1416e1-b482-46df-9ad6-ed591891876f | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-07T23:11:04.859143+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
eb1416e1-b482-46df-9ad6-ed591891876f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-07T23:11:04.859143+00:00
8a3d77b5-8166-4a1b-a3d6-3b1893ee72ce | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-07T23:11:04.742427+00:00
f44373ec-c6cf-4bda-a03a-6ba0a2c4d332 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-07T23:11:04.628948+00:00
11afc00b-176f-42ae-8d47-7b26a77d082c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-07T23:11:04.515904+00:00
5bd78aea-01c2-4f8f-b473-52b4ab79da39 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-07T23:11:04.401030+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-file | 5 | #####
wrong-logic | 3 | ###
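
The counts above can be reproduced by tallying the Taxonomy column of the Recent Failures table. The following is a minimal sketch, assuming the eight taxonomy labels are copied by hand from that table; the names `taxonomies` and `render_breakdown` are illustrative and are not part of the eval tooling.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures table, newest run first.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def render_breakdown(labels):
    """Return one 'taxonomy count bar' line per label, most frequent first."""
    counts = Counter(labels)
    # One '#' per failure (hypothetical rendering rule, inferred from the bars).
    return [f"{name} {n} {'#' * n}" for name, n in counts.most_common()]


for line in render_breakdown(taxonomies):
    print(line)
```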

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
eb1416e1-b482-46df-9ad6-ed591891876f | canary-shell-ops-004 | wrong-file | 0.310 | $0.0010 | 2026-05-07T23:11:04.859143+00:00
8a3d77b5-8166-4a1b-a3d6-3b1893ee72ce | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-07T23:11:04.742427+00:00
f44373ec-c6cf-4bda-a03a-6ba0a2c4d332 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-07T23:11:04.628948+00:00
11afc00b-176f-42ae-8d47-7b26a77d082c | canary-python-cache-005 | wrong-file | 0.310 | $0.0010 | 2026-05-07T23:11:04.515904+00:00
5bd78aea-01c2-4f8f-b473-52b4ab79da39 | canary-typescript-auth-006 | wrong-file | 0.310 | $0.0010 | 2026-05-07T23:11:04.401030+00:00
d24e01e7-ed4b-4d54-bcfd-0bdc1816dccb | canary-python-security-001 | wrong-file | 0.310 | $0.0010 | 2026-05-07T23:11:04.292112+00:00
e7ee9eab-4417-44d5-bb8f-52658015c50f | canary-typescript-session-003 | wrong-file | 0.310 | $0.0010 | 2026-05-07T23:11:04.179294+00:00
0a554670-31a3-4801-b3dd-aa787db257e4 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-07T23:11:04.067439+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
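
One way to compute such a frontier is sketched below. It assumes each model contributes a single (pass_rate, cost_usd) point, that a point is on the frontier when no other point is at least as good on both axes and strictly better on one, and that the 20-character bar rounds down. The function names and the rounding rule are assumptions for illustration; the report does not state how the tooling draws this line.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q["pass_rate"] >= p["pass_rate"]
            and q["cost_usd"] <= p["cost_usd"]
            and (q["pass_rate"] > p["pass_rate"] or q["cost_usd"] < p["cost_usd"])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(point, on_frontier, width=20):
    """Format one row in the same shape as the line above."""
    filled = int(point["pass_rate"] * width)  # assumption: round down
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return (
        f"{marker} [{bar}] {point['pass_rate']:.1%} "
        f"@ ${point['cost_usd']:.4f}  ({point['model']})"
    )


# The single data point shown in this report.
points = [{"model": "coder", "pass_rate": 0.375, "cost_usd": 0.0010}]
frontier = pareto_frontier(points)
for p in points:
    print(render_row(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier; the dominance check only becomes informative once a second model is added.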