Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
canary-python-security-001        hard  0.310          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: c8c974c4-cf19-4b84-9986-a06eea0e9931 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:15:06.071529+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
c8c974c4-cf19-4b84-9986-a06eea0e9931  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-04-27T16:15:06.071529+00:00
e69f5574-6a47-44e7-9daf-ccd4a9cd0706  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:15:06.010995+00:00
1e7f6475-befd-4047-a24d-24470e67f8b0  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:15:05.946876+00:00
2fc42167-3ade-4176-81e5-567249d2febd  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-04-27T16:15:05.885304+00:00
07f319a8-20f6-450e-adc6-a27ed9ad809f  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-04-27T16:15:05.840515+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###
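
The breakdown above is a per-taxonomy count of the rows listed under Recent Failures, with one `#` per failure. A minimal sketch of that tally, assuming the taxonomy labels are available as a plain list; the function name `breakdown` is hypothetical and is not taken from the report generator.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
labels = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def breakdown(labels: list[str]) -> list[tuple[str, int, str]]:
    """Return (taxonomy, failure count, bar) rows, most frequent first."""
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in breakdown(labels):
    # Left-align to match the column layout used in this report.
    print(f"{name:<11}  {n:<8}  {bar}")
```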

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
c8c974c4-cf19-4b84-9986-a06eea0e9931  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-04-27T16:15:06.071529+00:00
e69f5574-6a47-44e7-9daf-ccd4a9cd0706  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-04-27T16:15:06.010995+00:00
1e7f6475-befd-4047-a24d-24470e67f8b0  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-04-27T16:15:05.946876+00:00
2fc42167-3ade-4176-81e5-567249d2febd  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-04-27T16:15:05.885304+00:00
07f319a8-20f6-450e-adc6-a27ed9ad809f  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-04-27T16:15:05.840515+00:00
de6de1ae-80a7-4047-9115-5a53d42fce87  canary-python-security-001        wrong-file   0.310  $0.0010  2026-04-27T16:15:05.790966+00:00
5bc03fb3-0629-4dee-9f5a-0ba187dd8b16  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-04-27T16:15:05.734432+00:00
1ae3ff83-39ad-4b41-b972-81eda113372f  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-04-27T16:15:05.672956+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
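
A point is on the Pareto frontier when no other point is at least as cheap and at least as accurate while being strictly better on one of the two axes. A minimal sketch of that check and of the bar rendering, assuming each model is summarized by a single (cost_usd, pass_rate) pair. The names `Point`, `pareto_frontier`, `render_bar`, and `render_line` are hypothetical, and the 20-character bar with truncation is inferred from the line above rather than taken from the report generator.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates."""

    def dominates(a: Point, b: Point) -> bool:
        # a is no worse on both axes and strictly better on at least one.
        no_worse = a.cost_usd <= b.cost_usd and a.pass_rate >= b.pass_rate
        better = a.cost_usd < b.cost_usd or a.pass_rate > b.pass_rate
        return no_worse and better

    return [p for p in points if not any(dominates(q, p) for q in points)]


def render_bar(pass_rate: float, width: int = 20) -> str:
    # Assumed truncation: 0.375 * 20 = 7.5 fills 7 cells, as in the report.
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def render_line(p: Point, on_frontier: bool) -> str:
    marker = "*" if on_frontier else " "
    return (
        f"{marker} {render_bar(p.pass_rate)} "
        f"{p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"
    )


points = [Point("coder", 0.0010, 0.375)]
frontier = pareto_frontier(points)
for p in points:
    print(render_line(p, p in frontier))
```

With a single model the frontier is trivially that model; the marker becomes informative once several models with different costs and pass rates are plotted together.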