Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 643b7e7c-8ef8-4c27-b30c-696079dcdc7e | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T03:02:06.485319+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
643b7e7c-8ef8-4c27-b30c-696079dcdc7e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:02:06.485319+00:00
675e0eca-6e38-4acf-93c5-1aff3fcdde6c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T03:02:06.394881+00:00
6777c77d-f408-4d5c-b47c-72ced5377a0c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T03:02:06.286787+00:00
66b1eb7a-98f9-48dc-b888-33574988194b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:02:06.213197+00:00
d8080741-e705-41f0-8b33-575e544a3038 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:02:06.140626+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
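
The counts in this breakdown can be cross-checked by tallying the Taxonomy column of the Recent Failures rows. The following is a minimal Python sketch of that tally, not the report generator's actual code: the function name `failure_breakdown` is hypothetical, and the one-`#`-per-failure bar is an assumption inferred from the output shown.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
TAXONOMIES = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]

def failure_breakdown(labels):
    """Return (taxonomy, count, bar) tuples, most frequent first."""
    counts = Counter(labels)
    # Assumption: the bar is drawn with one '#' per failure.
    return [(name, n, "#" * n) for name, n in counts.most_common()]

for name, n, bar in failure_breakdown(TAXONOMIES):
    print(f"{name} {n} {bar}")
```

Sorting by count keeps the dominant failure mode (here `wrong-file`) on the first row.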

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
643b7e7c-8ef8-4c27-b30c-696079dcdc7e | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:02:06.485319+00:00
675e0eca-6e38-4acf-93c5-1aff3fcdde6c | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:02:06.394881+00:00
6777c77d-f408-4d5c-b47c-72ced5377a0c | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:02:06.286787+00:00
66b1eb7a-98f9-48dc-b888-33574988194b | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:02:06.213197+00:00
d8080741-e705-41f0-8b33-575e544a3038 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:02:06.140626+00:00
c0188816-fcbe-48b7-b603-d4c3ecde83bf | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:02:06.052306+00:00
ecd559c7-f366-45a2-b9e8-08e0b840aa10 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:02:05.933685+00:00
647814c3-3c5a-463f-8f75-b5d0bcfcc6f8 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:02:05.777423+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
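
For readers unfamiliar with the marking, here is a rough Python sketch of how a Pareto frontier over pass_rate and cost_usd could be computed and rendered. It is an illustration under stated assumptions, not the tool's implementation: the names `Point`, `pareto_frontier`, and `render` are hypothetical, a point is treated as dominated when another point is at least as good on both axes and strictly better on one, and the 20-cell bar is assumed to be floored because 37.5% is drawn with 7 filled cells.

```python
from typing import NamedTuple

class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float

def pareto_frontier(points):
    """Keep the points that no other point dominates."""
    def dominates(a, b):
        # a is at least as good on both axes and strictly better on one.
        return (a.pass_rate >= b.pass_rate and a.cost_usd <= b.cost_usd
                and (a.pass_rate > b.pass_rate or a.cost_usd < b.cost_usd))
    return [p for p in points if not any(dominates(q, p) for q in points)]

def render(point, on_frontier, width=20):
    # Assumption: filled cells are floored (int(0.375 * 20) == 7).
    filled = int(point.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return (f"{mark} [{bar}] {point.pass_rate:.1%} "
            f"@ ${point.cost_usd:.4f}  ({point.model})")

# The single data point from this report.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier; the marking becomes informative once several models with different cost and pass-rate trade-offs are compared.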