Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
canary-python-regression-002      hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
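
The header aggregates can be cross-checked against the per-task rows. The sketch below is illustrative only and is not the report generator's code: it sums the eight listed costs and works out how many passing tasks a 37.5% pass rate implies for 8 tasks. The variable names are assumptions, and `Decimal` is used only to avoid float rounding in the cost sum.

```python
from decimal import Decimal

# Per-task costs copied from the table above: eight rows at $0.0010 each.
task_costs = [Decimal("0.0010")] * 8

# Total cost across all tasks, to compare with the "Cost" field in the header.
total_cost = sum(task_costs, Decimal("0"))

# Pass rate from the header, converted to a fraction, and the number of
# passing tasks that fraction implies for this task count.
pass_rate = Decimal("37.5") / 100
implied_passes = pass_rate * len(task_costs)

print(f"total cost: ${total_cost}")
print(f"implied passes: {implied_passes} of {len(task_costs)}")
```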

Leaderboard Snapshot

Latest run: 5b2c40a9-f434-431f-9ce7-dbda397caadf | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T18:28:22.899929+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
5b2c40a9-f434-431f-9ce7-dbda397caadf  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-23T18:28:22.899929+00:00
3ecf6f32-ee7f-4232-b4b8-0fe84d51cb2d  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-23T18:28:22.838798+00:00
5ba6f42e-80f1-4a9e-b82c-b6fe430c5f7b  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-23T18:28:22.773096+00:00
398b7760-ba77-47c8-93dc-64b6169eaaea  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-23T18:28:22.717215+00:00
3ac42156-17ba-44e9-add1-174171040b9b  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-23T18:28:22.653091+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
5b2c40a9-f434-431f-9ce7-dbda397caadf  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-23T18:28:22.899929+00:00
3ecf6f32-ee7f-4232-b4b8-0fe84d51cb2d  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-23T18:28:22.838798+00:00
5ba6f42e-80f1-4a9e-b82c-b6fe430c5f7b  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-23T18:28:22.773096+00:00
398b7760-ba77-47c8-93dc-64b6169eaaea  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-23T18:28:22.717215+00:00
3ac42156-17ba-44e9-add1-174171040b9b  canary-python-regression-002      wrong-file   0.310  $0.0010  2026-05-23T18:28:22.653091+00:00
339df250-d008-47cf-a0e0-c4e8b86a4b4c  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-23T18:28:22.594187+00:00
ac34ec7a-a6ab-4ffb-9f1b-15ee8c16ad7a  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-23T18:28:22.536140+00:00
65909779-d824-4911-9d0c-2290fbe902af  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-23T18:28:22.479287+00:00
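
The Failure Breakdown counts follow from tallying the Taxonomy column of the rows above. A minimal sketch of that tally, assuming one `#` per failure in the bar; `failure_breakdown` is a hypothetical helper name, not part of any reporting tool referenced here.

```python
from collections import Counter

# (task_id, taxonomy) pairs copied from the Recent Failures rows.
failures = [
    ("canary-typescript-session-003", "wrong-file"),
    ("python-bugfix-easy-001", "wrong-logic"),
    ("canary-typescript-auth-006", "wrong-file"),
    ("typescript-security-fix-easy-001", "wrong-logic"),
    ("canary-python-regression-002", "wrong-file"),
    ("canary-python-cache-005", "wrong-file"),
    ("canary-shell-ops-004", "wrong-file"),
    ("python-security-fix-easy-001", "wrong-logic"),
]


def failure_breakdown(rows):
    """Return (taxonomy, count, bar) tuples, most frequent taxonomy first."""
    counts = Counter(taxonomy for _, taxonomy in rows)
    # Assumption: the bar is one '#' per recorded failure.
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(failures):
    print(f"{name:<12} {n:<9} {bar}")
```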

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
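
For reference, one common way to decide which entries get the `*` marker is a dominance check: a point stays on the frontier unless another point is at least as good on both axes and strictly better on one. The sketch below uses that textbook definition and is not confirmed to be what produced this report. The `Point` type, the function names, the 20-character bar width, and the truncating fill rule are all assumptions, chosen because they reproduce the line above.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points that no other point dominates.

    q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(p, on_frontier, width=20):
    """Format one row in the style of the Cost Frontier section."""
    filled = int(p.pass_rate * width)  # assumed to truncate: 7.5 becomes 7
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


# The single point shown in this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier; the marker becomes informative once several models with different pass rates and costs are compared.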