Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: e24722e1-9bfc-460e-b35c-b79c68351e3e | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T14:21:45.766120+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
e24722e1-9bfc-460e-b35c-b79c68351e3e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:21:45.766120+00:00
4ad4c11d-82b3-4126-87d3-e4144d95d37b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T14:21:45.691938+00:00
92d63643-1a5b-4ec0-9221-eabf5c8097ec | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T14:21:45.565637+00:00
faac229a-2bcd-41c1-a298-2d809bb63af7 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:21:45.477332+00:00
1d29a647-d1ca-4711-a40f-de51b89ae486 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:21:45.368061+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
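
The breakdown above can be reproduced by tallying the Taxonomy column of the Recent Failures table. The following is a minimal sketch, not the report generator's actual code; the `taxonomies` list and the `render_breakdown` helper are hypothetical names introduced for illustration, and the labels are copied by hand from this report.

```python
from collections import Counter

# Taxonomy labels copied from the eight rows of the Recent Failures table
# in this report (newest first). The list itself is an illustrative input.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def render_breakdown(labels):
    """Return one 'taxonomy | count | bar' line per label, most frequent first."""
    counts = Counter(labels)
    return [f"{name} | {n} | {'#' * n}" for name, n in counts.most_common()]


for line in render_breakdown(taxonomies):
    print(line)
```

One `#` is drawn per failure, so the bar length equals the count in the Failures column.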

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
e24722e1-9bfc-460e-b35c-b79c68351e3e | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:21:45.766120+00:00
4ad4c11d-82b3-4126-87d3-e4144d95d37b | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:21:45.691938+00:00
92d63643-1a5b-4ec0-9221-eabf5c8097ec | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:21:45.565637+00:00
faac229a-2bcd-41c1-a298-2d809bb63af7 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:21:45.477332+00:00
1d29a647-d1ca-4711-a40f-de51b89ae486 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:21:45.368061+00:00
1afb36a4-fd42-4841-ae09-4111a09d5470 | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:21:45.277672+00:00
e5b28c0c-4867-4e0c-843b-fcc2d4d48d34 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:21:45.174470+00:00
aece2191-d52c-4835-b509-1a27e8daf5e6 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:21:45.037117+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
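
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and the 20-character bar; it is not the report tool's implementation. `Point`, `pareto_frontier`, and `render_bar` are hypothetical names, and truncating the bar fill (rather than rounding) is an assumption chosen because it matches the seven `#` characters shown for 37.5%.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_bar(pass_rate: float, width: int = 20) -> str:
    """Draw a fixed-width bar; the fill is truncated (assumed), not rounded."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# The single data point from this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    mark = "*" if p in frontier else " "
    print(f"{mark} {render_bar(p.pass_rate)} {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})")
```

With only one model in the run, that model is trivially on the frontier, which is why the single row carries the `*`.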