Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
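
The header totals can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the header cost is the plain sum of the per-task costs and the pass rate is passed tasks divided by total tasks. The function name `summarize` is hypothetical and not part of the eval harness. The per-task pass flags are not visible in the table, so the sketch only derives how many tasks the stated pass rate implies, not which ones.

```python
from decimal import Decimal

# Per-task costs in USD, copied from the eight rows of the task table.
task_costs = [Decimal("0.0010")] * 8


def summarize(costs, pass_rate_percent):
    """Return (total cost, implied number of passed tasks).

    Assumption: pass rate = passed / total, expressed as a percentage.
    """
    total = sum(costs, Decimal("0"))
    implied_passed = Decimal(pass_rate_percent) * len(costs) / 100
    return total, implied_passed


total_cost, implied_passed = summarize(task_costs, "37.5")
print(f"total cost: ${total_cost}")
print(f"implied passed tasks: {implied_passed}")
```

Under these assumptions the total agrees with the $0.0080 in the header, and a 37.5% pass rate over 8 tasks corresponds to 3 passed tasks.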

Leaderboard Snapshot

Latest run: 4344340d-d46a-4dfe-9c22-cce6a5edb821 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T17:55:53.124520+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
4344340d-d46a-4dfe-9c22-cce6a5edb821 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T17:55:53.124520+00:00
d8659e86-fdc0-490a-8146-c89e49b74dfe | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-23T17:55:53.045529+00:00
7152361a-411e-4777-9e86-12a2c0d4e5c4 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T17:55:52.966544+00:00
a34c10d4-d6e9-4ad5-a0db-06050938350d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-23T17:55:52.893281+00:00
c32f7333-d4c7-4910-ad29-017e355e8285 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-23T17:55:52.822943+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
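
The counts above are consistent with tallying the taxonomy labels of the eight records listed under Recent Failures. The following is a sketch of that tally, assuming one `#` per failure and ordering by descending count. The function name `failure_breakdown` is hypothetical; the report does not show how the harness builds this table.

```python
from collections import Counter

# Taxonomy labels of the eight records under "Recent Failures", newest first.
labels = [
    "wrong-file", "wrong-logic", "wrong-file", "wrong-logic",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(taxonomy_labels):
    """Return (taxonomy, count, bar) rows, most frequent taxonomy first."""
    counts = Counter(taxonomy_labels)
    # Assumption: the bar is simply one '#' per recorded failure.
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(labels):
    print(f"{name:<11} | {n:<8} | {bar}")
```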

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
4344340d-d46a-4dfe-9c22-cce6a5edb821 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-23T17:55:53.124520+00:00
d8659e86-fdc0-490a-8146-c89e49b74dfe | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-23T17:55:53.045529+00:00
7152361a-411e-4777-9e86-12a2c0d4e5c4 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-23T17:55:52.966544+00:00
a34c10d4-d6e9-4ad5-a0db-06050938350d | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T17:55:52.893281+00:00
c32f7333-d4c7-4910-ad29-017e355e8285 | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-23T17:55:52.822943+00:00
17061423-64e1-4d55-9b32-15add822d9cc | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-23T17:55:52.731699+00:00
fe437d2b-c21e-4707-9f77-ef86b4932b31 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-23T17:55:52.634634+00:00
c71e0b2b-ebf7-4bd1-9b73-ffb952853b38 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-23T17:55:52.562114+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
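
A Pareto frontier over (pass_rate, cost_usd) keeps every model that no other model beats on both axes at once. The following is a minimal sketch of that selection and of a 20-character bar like the one above, assuming the bar fill is rounded down. The function names `pareto_frontier` and `render_row` are hypothetical; the report does not show the harness's own implementation.

```python
def pareto_frontier(points):
    """Return names of non-dominated (name, pass_rate, cost_usd) points.

    A point is dominated if another point has a pass rate at least as high
    and a cost at least as low, and is strictly better on one of the two.
    """
    frontier = []
    for name, rate, cost in points:
        dominated = any(
            r >= rate and c <= cost and (r > rate or c < cost)
            for _, r, c in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


def render_row(name, rate, cost, on_frontier, width=20):
    """Format one chart row; assumes the filled part is rounded down."""
    filled = int(rate * width)
    bar = "#" * filled + "-" * (width - filled)
    mark = "*" if on_frontier else " "
    return f"{mark} [{bar}] {rate * 100:.1f}% @ ${cost:.4f}  ({name})"


# The single point shown in this report.
points = [("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for name, rate, cost in points:
    print(render_row(name, rate, cost, name in frontier))
```

With only one model in the run, that model is trivially on the frontier, which matches the single starred row.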