Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-shell-ops-004 | hard | 0.310 |  | $0.0010
canary-python-cache-005 | hard | 0.310 |  | $0.0010
canary-python-regression-002 | hard | 0.310 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-auth-006 | hard | 0.310 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-session-003 | hard | 0.310 |  | $0.0010

Leaderboard Snapshot

Latest run: 565069e9-c650-49ba-9aec-0f2ae8934b3e | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-26T10:38:51.442824+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
565069e9-c650-49ba-9aec-0f2ae8934b3e | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:38:51.442824+00:00
675bdda5-b8ca-4e56-b9e7-3bb6af13ac77 | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.740 | 2026-05-26T10:38:51.376599+00:00
4920ea21-4d1b-4b02-8a00-aa417e2eb22f | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:38:51.310367+00:00
e80c354c-f18b-472e-980b-3add93a8f0e2 | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.740 | 2026-05-26T10:38:51.244885+00:00
884c5930-c43b-4b06-a88c-0f2d207dc7be | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:38:51.170093+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-file | 5 | #####
wrong-logic | 3 | ###
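
The breakdown above counts failures per taxonomy label and draws one `#` per failure. A minimal sketch of how such a tally could be produced is shown here; the function name `failure_breakdown` is hypothetical and is not taken from the tool that generated this report. The labels are copied from the Recent Failures table.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures table in this report,
# in the order the rows appear there.
failure_labels = [
    "wrong-file", "wrong-logic", "wrong-file", "wrong-logic",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) rows, most frequent label first.

    Hypothetical helper: one '#' per failure is an assumption inferred
    from the bars shown in the report.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(failure_labels):
    print(f"{name} | {n} | {bar}")
```

With the eight labels listed in this report, the tally comes to 5 `wrong-file` and 3 `wrong-logic`, matching the counts in the breakdown.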

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
565069e9-c650-49ba-9aec-0f2ae8934b3e | canary-typescript-session-003 | wrong-file | 0.310 | $0.0010 | 2026-05-26T10:38:51.442824+00:00
675bdda5-b8ca-4e56-b9e7-3bb6af13ac77 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:38:51.376599+00:00
4920ea21-4d1b-4b02-8a00-aa417e2eb22f | canary-typescript-auth-006 | wrong-file | 0.310 | $0.0010 | 2026-05-26T10:38:51.310367+00:00
e80c354c-f18b-472e-980b-3add93a8f0e2 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:38:51.244885+00:00
884c5930-c43b-4b06-a88c-0f2d207dc7be | canary-python-regression-002 | wrong-file | 0.310 | $0.0010 | 2026-05-26T10:38:51.170093+00:00
64228873-ebd3-4d13-88d5-26791eaff50c | canary-python-cache-005 | wrong-file | 0.310 | $0.0010 | 2026-05-26T10:38:51.104096+00:00
6c8d2659-625b-470c-9310-4aacdf40a057 | canary-shell-ops-004 | wrong-file | 0.310 | $0.0010 | 2026-05-26T10:38:51.037977+00:00
d98dbd7c-20d7-4baa-bae6-9bd9426a80f5 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:38:50.964244+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
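
For reference, a Pareto frontier over (pass_rate, cost_usd) keeps every point that no other point beats on both axes at once. The sketch below illustrates that rule and the 20-character bar; the names `Point`, `pareto_frontier`, and `render` are hypothetical, the truncating bar fill is an assumption inferred from the line above, and the second data point is invented purely to show a dominated entry.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Keep points that are not dominated by any other point.

    q dominates p when q is at least as good on both axes
    (pass rate no lower, cost no higher) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(p, on_frontier, width=20):
    """Format one line in the style of the Cost Frontier section.

    Assumption: the filled part of the bar is truncated, not rounded.
    """
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


points = [
    Point("coder", 0.375, 0.0010),         # values from this report
    Point("hypothetical", 0.250, 0.0020),  # invented, for illustration only
]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the report, that model is trivially on the frontier, which is why the only line carries the `*` marker.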