Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 7004b004-e8f2-4d5a-9430-bcd8e01c2e50 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:19:23.994747+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
7004b004-e8f2-4d5a-9430-bcd8e01c2e50 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:19:23.994747+00:00
de13f802-0e37-4a61-bc4b-231c904d1b6c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:19:23.832066+00:00
d56a5b7c-32d2-4fe9-930b-618345265f9c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:19:23.742532+00:00
40b4c67f-22f7-4b80-9a44-647285729a66 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:19:23.665995+00:00
138e6d81-c39d-45b6-8f0a-1f54d5ef84f8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:19:23.599118+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
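
The breakdown above can be reproduced by tallying the Taxonomy column of the Recent Failures table. The following is a minimal sketch, not the reporting tool's actual code: the function name `failure_breakdown` and the one-`#`-per-failure bar scale are assumptions chosen to match the bars shown.

```python
from collections import Counter

# Taxonomy labels copied, in order, from the Recent Failures table of this report.
failure_taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) rows, most frequent taxonomy first.

    Hypothetical helper: assumes one '#' per failure in the bar.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(failure_taxonomies):
    print(f"{name:<11} | {n:<8} | {bar}")
```

Tallying from the per-run rows rather than hard-coding the counts keeps the breakdown consistent with the failures table whenever new runs are recorded.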

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
7004b004-e8f2-4d5a-9430-bcd8e01c2e50 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:19:23.994747+00:00
de13f802-0e37-4a61-bc4b-231c904d1b6c | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:19:23.832066+00:00
d56a5b7c-32d2-4fe9-930b-618345265f9c | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:19:23.742532+00:00
40b4c67f-22f7-4b80-9a44-647285729a66 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:19:23.665995+00:00
138e6d81-c39d-45b6-8f0a-1f54d5ef84f8 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:19:23.599118+00:00
de2fea17-f6c9-4c8c-9f21-209ccdc45e9e | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:19:23.536001+00:00
9b695874-530b-4a04-87a2-cd001e842cc1 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:19:23.470781+00:00
7033160b-8101-468f-b02d-168c01f77718 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:19:23.396491+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
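
For readers unfamiliar with the frontier marking, a point is on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low while being strictly better on one of the two. The sketch below illustrates that rule and the line format used above. The function names, the 20-character bar width, and the truncating bar fill are assumptions inferred from the rendered line, not the report generator's actual code.

```python
def pareto_frontier(points):
    """Return the names of non-dominated points.

    points: list of (model, pass_rate, cost_usd) tuples.
    A point is dominated if another point is no worse on both axes
    (higher-or-equal pass rate, lower-or-equal cost) and strictly
    better on at least one of them.
    """
    frontier = []
    for name, rate, cost in points:
        dominated = any(
            r >= rate and c <= cost and (r > rate or c < cost)
            for _, r, c in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def render(points, width=20):
    """Render one line per point, marking frontier members with '*'."""
    front = set(pareto_frontier(points))
    lines = []
    for name, rate, cost in points:
        # Assumption: the fill is truncated, so 0.375 * 20 = 7.5 gives 7 marks.
        filled = int(rate * width)
        bar = "#" * filled + "-" * (width - filled)
        mark = "*" if name in front else " "
        lines.append(f"{mark} [{bar}] {rate:.1%} @ ${cost:.4f}  ({name})")
    return lines


# The single point shown in this report.
for line in render([("coder", 0.375, 0.0010)]):
    print(line)
```

With only one model recorded, that model is trivially on the frontier; the marker becomes informative once runs from several models with different costs are plotted together.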