Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: b0a09138-da21-47ec-b0c8-e41ede4a1d47 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-22T14:02:00.091149+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
b0a09138-da21-47ec-b0c8-e41ede4a1d47 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T14:02:00.091149+00:00
1aea8745-454d-4599-bce0-80ef1b69968b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-22T14:01:59.971081+00:00
08556792-4126-490c-9646-20ec2618bbe8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T14:01:59.871810+00:00
3a8146aa-b45b-4093-9231-b77908e89eb0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-22T14:01:59.759066+00:00
9cbed913-e610-4a02-930a-a31478bcd8f9 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-22T14:01:59.656905+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
b0a09138-da21-47ec-b0c8-e41ede4a1d47 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-22T14:02:00.091149+00:00
1aea8745-454d-4599-bce0-80ef1b69968b | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-22T14:01:59.971081+00:00
08556792-4126-490c-9646-20ec2618bbe8 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-22T14:01:59.871810+00:00
3a8146aa-b45b-4093-9231-b77908e89eb0 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-22T14:01:59.759066+00:00
9cbed913-e610-4a02-930a-a31478bcd8f9 | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-22T14:01:59.656905+00:00
ea6687ff-9c8b-4ab4-a893-ec5220d81ec8 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-22T14:01:59.549998+00:00
ae9d126e-086b-477a-8be2-65f9f5881aab | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-22T14:01:59.439216+00:00
e7ec4ef8-5321-4191-a8e4-2c53dfa56259 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-22T14:01:59.334293+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
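
For readers unfamiliar with the frontier marking, the following is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed and rendered. It is not taken from the report generator: the names Point, pareto_frontier, and render_line are hypothetical, and the 20-character bar with truncating fill is an assumption inferred from the single line shown above.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    """One model's aggregate result (hypothetical structure)."""
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no cheaper-or-equal point matches or beats on pass rate."""
    frontier = []
    best_pass_rate = -1.0
    # Cheapest first; on equal cost, the higher pass rate comes first.
    for p in sorted(points, key=lambda q: (q.cost_usd, -q.pass_rate)):
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


def render_line(p: Point, on_frontier: bool, width: int = 20) -> str:
    """Render one row; bar width and truncating fill are assumptions."""
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"


if __name__ == "__main__":
    # The only data point in this report: 37.5% pass rate at $0.0010.
    points = [Point("coder", 0.0010, 0.375)]
    frontier = pareto_frontier(points)
    for p in points:
        print(render_line(p, p in frontier))
```

With a single model the frontier is trivially that model, which is why the one row above carries the * marker. The frontier only becomes informative once a second model with a different cost or pass rate appears in the snapshot.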