Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 4ea91dac-df81-4ba4-91a3-002671bf4f0c | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T20:38:58.267306+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
4ea91dac-df81-4ba4-91a3-002671bf4f0c | coder | 07ca25bc2d511f5aee15446c60081c184e9c9122 | 0.310 | 2026-05-23T20:38:58.267306+00:00
35d0e18d-9c00-4ead-b299-2a32cac5cf9b | coder | 07ca25bc2d511f5aee15446c60081c184e9c9122 | 0.740 | 2026-05-23T20:38:58.179743+00:00
ac036fcb-902b-46e2-bbd9-7b654d27dd06 | coder | 07ca25bc2d511f5aee15446c60081c184e9c9122 | 0.310 | 2026-05-23T20:38:58.074630+00:00
7c73c1b3-b859-4758-bb79-7534a7836e88 | coder | 07ca25bc2d511f5aee15446c60081c184e9c9122 | 0.740 | 2026-05-23T20:38:57.965181+00:00
66d78fc3-466d-4064-afc8-38c91b1110c7 | coder | 07ca25bc2d511f5aee15446c60081c184e9c9122 | 0.310 | 2026-05-23T20:38:57.854104+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
4ea91dac-df81-4ba4-91a3-002671bf4f0c | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:38:58.267306+00:00
35d0e18d-9c00-4ead-b299-2a32cac5cf9b | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:38:58.179743+00:00
ac036fcb-902b-46e2-bbd9-7b654d27dd06 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:38:58.074630+00:00
7c73c1b3-b859-4758-bb79-7534a7836e88 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:38:57.965181+00:00
66d78fc3-466d-4064-afc8-38c91b1110c7 | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:38:57.854104+00:00
1f2bcd45-6049-4d29-b839-13b72e078b31 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:38:57.758935+00:00
4bce2766-a875-45f7-9fa6-7d61bbbf9f41 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:38:57.647593+00:00
41f9e4dd-12c5-4e5d-bff3-edcd32d9a988 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:38:57.551933+00:00
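
The Failure Breakdown counts can be cross-checked against the Recent Failures rows. The following is a minimal sketch, not the report generator's actual code: the `failures` list is transcribed by hand from the rows above, and the variable names are hypothetical.

```python
from collections import Counter

# (task_id, taxonomy) pairs transcribed from the Recent Failures rows.
failures = [
    ("canary-typescript-session-003", "wrong-file"),
    ("python-bugfix-easy-001", "wrong-logic"),
    ("canary-typescript-auth-006", "wrong-file"),
    ("typescript-security-fix-easy-001", "wrong-logic"),
    ("canary-python-regression-002", "wrong-file"),
    ("canary-python-cache-005", "wrong-file"),
    ("canary-shell-ops-004", "wrong-file"),
    ("python-security-fix-easy-001", "wrong-logic"),
]

# Tally failures per taxonomy label.
counts = Counter(taxonomy for _, taxonomy in failures)

# One line per taxonomy, most frequent first, with one '#' per failure.
lines = [f"{taxonomy:<11} | {n} | {'#' * n}" for taxonomy, n in counts.most_common()]
for line in lines:
    print(line)
```

The tally gives 5 wrong-file and 3 wrong-logic failures, which agrees with the Failure Breakdown section.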

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
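
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low while being strictly better on one of the two. The sketch below illustrates that rule and a bar rendering that reproduces the row above. It is an illustration under assumptions: `Point`, `pareto_frontier`, and `render_row` are hypothetical names, and the 20-character bar width with truncation is inferred from the row shown, not taken from the tool's source.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no other point dominates (higher pass rate, lower cost)."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(p: Point, on_frontier: bool, width: int = 20) -> str:
    # Assumption: the bar truncates, so 37.5% of 20 cells fills 7 cells.
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


# The only point in this report: model "coder" at 37.5% and $0.0010.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
rows = [render_row(p, p in frontier) for p in points]
for row in rows:
    print(row)
```

With a single model in the report, that model is on the frontier by definition, which is why the only row carries the marker.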