Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-session-003 | hard | 0.310 |  | $0.0010
canary-python-security-001 | hard | 0.310 |  | $0.0010
canary-typescript-auth-006 | hard | 0.310 |  | $0.0010
canary-python-cache-005 | hard | 0.310 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
canary-shell-ops-004 | hard | 0.310 |  | $0.0010

Leaderboard Snapshot

Latest run: aaaaddf0-d0ca-4494-b6de-138845228888 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-28T00:19:33.602590+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
aaaaddf0-d0ca-4494-b6de-138845228888 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-28T00:19:33.602590+00:00
b5d44da7-9c50-4fc0-8ff4-a99fcf6b693e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:19:33.517842+00:00
21b4997b-21e9-4165-85e6-818f4433ea31 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:19:33.430047+00:00
4e5550e3-8f4c-4192-b749-3acfb5ec8749 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-28T00:19:33.332279+00:00
dec31eb1-1011-4e41-aa83-be720edcbb3e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-28T00:19:33.254524+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-file | 5 | #####
wrong-logic | 3 | ###

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
aaaaddf0-d0ca-4494-b6de-138845228888 | canary-shell-ops-004 | wrong-file | 0.310 | $0.0010 | 2026-04-28T00:19:33.602590+00:00
b5d44da7-9c50-4fc0-8ff4-a99fcf6b693e | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:19:33.517842+00:00
21b4997b-21e9-4165-85e6-818f4433ea31 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:19:33.430047+00:00
4e5550e3-8f4c-4192-b749-3acfb5ec8749 | canary-python-cache-005 | wrong-file | 0.310 | $0.0010 | 2026-04-28T00:19:33.332279+00:00
dec31eb1-1011-4e41-aa83-be720edcbb3e | canary-typescript-auth-006 | wrong-file | 0.310 | $0.0010 | 2026-04-28T00:19:33.254524+00:00
f0e5528a-d1f0-4902-b2f2-548ba0ab3f0d | canary-python-security-001 | wrong-file | 0.310 | $0.0010 | 2026-04-28T00:19:33.131174+00:00
a56fe592-8f42-4b08-a620-02441f36a436 | canary-typescript-session-003 | wrong-file | 0.310 | $0.0010 | 2026-04-28T00:19:33.041547+00:00
b7a6e81e-c55d-4c17-b763-7bb58ca28265 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:19:32.951828+00:00
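
The per-taxonomy counts in the Failure Breakdown can be re-derived from the rows above. A minimal Python sketch, assuming the rows are available as (task_id, taxonomy) pairs; the names `recent_failures` and `breakdown` are hypothetical and not part of the report tooling:

```python
from collections import Counter

# (task_id, taxonomy) pairs copied from the Recent Failures rows.
recent_failures = [
    ("canary-shell-ops-004", "wrong-file"),
    ("python-bugfix-easy-001", "wrong-logic"),
    ("typescript-security-fix-easy-001", "wrong-logic"),
    ("canary-python-cache-005", "wrong-file"),
    ("canary-typescript-auth-006", "wrong-file"),
    ("canary-python-security-001", "wrong-file"),
    ("canary-typescript-session-003", "wrong-file"),
    ("python-security-fix-easy-001", "wrong-logic"),
]


def breakdown(rows):
    """Count failures per taxonomy label, most frequent first."""
    counts = Counter(taxonomy for _, taxonomy in rows)
    return counts.most_common()


# One line per taxonomy, with a '#' bar as long as the count.
for taxonomy, n in breakdown(recent_failures):
    print(f"{taxonomy} | {n} | {'#' * n}")
```

On these rows the tally gives 5 wrong-file and 3 wrong-logic, matching the Failure Breakdown.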

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
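
A line like the one above could be produced roughly as follows. This is a sketch under assumptions, not the report generator itself: the names `Point`, `pareto_frontier`, and `render` are hypothetical, a point is taken to be on the frontier when no other point has a pass rate at least as high and a cost at least as low (with one of the two strictly better), and the bar is assumed to be 20 cells wide with the filled count truncated.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points that no other point dominates."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(p, on_frontier, width=20):
    """Format one frontier line; the fill is truncated (assumption)."""
    filled = int(p.pass_rate * width)  # 0.375 * 20 = 7.5 -> 7 cells
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


# The single data point shown in this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model the frontier is trivially that model, which is why the only line carries the * marker.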