Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
canary-python-security-001        hard  0.310          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: ddf9a5c9-8eaa-4ecb-9859-86da5a9e35e2 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-08T22:27:10.512731+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
ddf9a5c9-8eaa-4ecb-9859-86da5a9e35e2  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T22:27:10.512731+00:00
f691fc0c-4e5d-43c4-960d-e6c0b7e9abf5  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-08T22:27:10.382696+00:00
100a9023-6f44-45ba-868c-a48a0d406f49  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-08T22:27:10.250733+00:00
c747e69d-b9f1-4749-8cd5-832f7b206efa  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T22:27:10.112088+00:00
c0ce9794-e8ba-473e-ae01-a973fcf31d09  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T22:27:09.943053+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###
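
The counts in this breakdown can be reproduced by tallying the Taxonomy column of the Recent Failures table. The following is a minimal sketch, not the report generator's actual code: the function name `failure_breakdown` and the one-`#`-per-failure bar scale are assumptions inferred from the bars shown here.

```python
from collections import Counter


def failure_breakdown(taxonomies):
    """Tally failure labels and pair each count with a '#' bar.

    Hypothetical helper: assumes one '#' per failure, which matches
    the bars in this report but is not confirmed by the source.
    """
    counts = Counter(taxonomies)
    # most_common() orders labels by count, highest first.
    return [(name, n, "#" * n) for name, n in counts.most_common()]


# Taxonomy labels copied from the Recent Failures table, in listed order.
labels = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]

for name, n, bar in failure_breakdown(labels):
    print(f"{name:<13}{n:<10}{bar}")
```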

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
ddf9a5c9-8eaa-4ecb-9859-86da5a9e35e2  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-08T22:27:10.512731+00:00
f691fc0c-4e5d-43c4-960d-e6c0b7e9abf5  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-08T22:27:10.382696+00:00
100a9023-6f44-45ba-868c-a48a0d406f49  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-08T22:27:10.250733+00:00
c747e69d-b9f1-4749-8cd5-832f7b206efa  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-08T22:27:10.112088+00:00
c0ce9794-e8ba-473e-ae01-a973fcf31d09  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-08T22:27:09.943053+00:00
755b48f9-ff39-4b00-892a-a0d795340a75  canary-python-security-001        wrong-file   0.310  $0.0010  2026-05-08T22:27:09.834060+00:00
154c0499-e637-4605-b476-e04f16be197f  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-08T22:27:09.721610+00:00
37500f75-748e-44c2-b077-f1b3b4a91c35  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-08T22:27:09.566211+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
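
For readers unfamiliar with the marker: a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The sketch below shows one way such a row could be computed and rendered. The `Point` type, the function names, the 20-character bar width, and the truncating fill rule are assumptions chosen to match the row above, not the generator's confirmed implementation.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(p: Point, on_frontier: bool, width: int = 20) -> str:
    """Render one frontier row; the bar fill is truncated, not rounded (assumed)."""
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


# The single data point from this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render_row(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier, which is why the single row carries the `*` marker.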