Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 1c7e3915-393d-47a1-9ed2-f3ccb6674ebf | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T03:50:00.932796+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
1c7e3915-393d-47a1-9ed2-f3ccb6674ebf | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:50:00.932796+00:00
8691e93e-4fff-4069-b071-0c14103b7719 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T03:50:00.860403+00:00
f30607e0-cb8a-458c-b504-52281e41687f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T03:50:00.790360+00:00
eb886834-e763-482c-a5ad-01d284c5e022 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:50:00.720962+00:00
e8fd8454-2a1f-46a6-8f93-e0745799f5e1 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T03:50:00.648715+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
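
The counts above can be reproduced from the taxonomy labels listed under Recent Failures. The following is a minimal sketch, not the report generator's actual code: the variable names and the one-`#`-per-failure bar rule are assumptions inferred from the rendered output.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]

# Tally failures per taxonomy, most frequent first.
counts = Counter(taxonomies)

# Assumed bar rule: one '#' per recorded failure.
rows = [(name, n, "#" * n) for name, n in counts.most_common()]

for name, n, bar in rows:
    print(f"{name:<11} | {n:<8} | {bar}")
```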

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
1c7e3915-393d-47a1-9ed2-f3ccb6674ebf | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:50:00.932796+00:00
8691e93e-4fff-4069-b071-0c14103b7719 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:50:00.860403+00:00
f30607e0-cb8a-458c-b504-52281e41687f | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:50:00.790360+00:00
eb886834-e763-482c-a5ad-01d284c5e022 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:50:00.720962+00:00
e8fd8454-2a1f-46a6-8f93-e0745799f5e1 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:50:00.648715+00:00
121edb22-c1fe-44c7-ad31-61efbdd25b6a | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:50:00.569721+00:00
7482a239-d522-467b-9d6a-2a6f65f94644 | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-09T03:50:00.509532+00:00
ad98a746-9189-4ac4-94e2-d94648a1d128 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-09T03:50:00.437176+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
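
For readers unfamiliar with the `*` marker: a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and a bar rendering in the style of the line above. The names `Point`, `pareto_frontier`, and `render` are hypothetical, and the 20-character bar width with truncating fill is an assumption inferred from the rendered line, not taken from the report generator.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points that no other point dominates.

    q dominates p when q.pass_rate >= p.pass_rate and
    q.cost_usd <= p.cost_usd, with at least one strict inequality.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(point, on_frontier, width=20):
    # Assumed rule: fill truncates toward zero, so 37.5% of 20 gives 7 cells.
    filled = int(point.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return (
        f"{marker} [{bar}] {point.pass_rate:.1%} "
        f"@ ${point.cost_usd:.4f}  ({point.model})"
    )


# The single point shown in this report.
coder = Point("coder", 0.375, 0.0010)
frontier = pareto_frontier([coder])
print(render(coder, coder in frontier))
```

With only one model in the run, that model is trivially on the frontier. The marker becomes informative once several models or configurations are compared.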