Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010
canary-python-security-001       | hard | 0.310 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 492fadd4-8806-4ab3-850a-12b46966a945 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T14:38:02.045650+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
492fadd4-8806-4ab3-850a-12b46966a945 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:38:02.045650+00:00
5b3e538c-30d8-4e51-89ed-d27b72a38302 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T14:38:01.922015+00:00
10029a98-1010-4ab9-b81c-4b4040ef903a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T14:38:01.807000+00:00
72f9b3d9-fb75-4cd1-8cd3-5fc7e1019661 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:38:01.687581+00:00
bf5a99dc-b743-4040-b22e-56c38c75877d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T14:38:01.569198+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
492fadd4-8806-4ab3-850a-12b46966a945 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:38:02.045650+00:00
5b3e538c-30d8-4e51-89ed-d27b72a38302 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:38:01.922015+00:00
10029a98-1010-4ab9-b81c-4b4040ef903a | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:38:01.807000+00:00
72f9b3d9-fb75-4cd1-8cd3-5fc7e1019661 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:38:01.687581+00:00
bf5a99dc-b743-4040-b22e-56c38c75877d | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:38:01.569198+00:00
d4f6c017-fc87-479a-b225-c0b490697661 | canary-python-security-001       | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:38:01.451347+00:00
ab505faa-4626-49c5-bba3-06e12d83fdad | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-09T14:38:01.327117+00:00
21871011-3cad-4362-b995-5171571990b9 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-09T14:38:01.202306+00:00
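
The Failure Breakdown counts can be cross-checked against the eight rows above. The following is a minimal sketch, not the report generator's actual code: the function name `failure_breakdown` is hypothetical, and the taxonomy labels are copied by hand from the Recent Failures rows, newest first.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]

def failure_breakdown(labels):
    """Return (taxonomy, count, bar) tuples, most frequent taxonomy first.

    Hypothetical helper: one '#' per failure is an assumption based on
    the bars shown in the Failure Breakdown section.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]

for name, n, bar in failure_breakdown(taxonomies):
    print(f"{name:<12} {n}  {bar}")
```

Tallying this way gives 5 wrong-file and 3 wrong-logic, which matches the breakdown reported earlier.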

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
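
For readers unfamiliar with the frontier marker, here is a rough sketch of how a Pareto frontier over (pass_rate, cost_usd) points might be computed and rendered in the same shape as the line above. Everything here is an assumption for illustration: the `Point` type, `pareto_frontier`, and `render` are hypothetical names, and the 20-character bar with truncating rounding is inferred from the single line shown, not taken from the tool that produced this report.

```python
from typing import NamedTuple

class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float

def pareto_frontier(points):
    """Keep points that no other point dominates.

    A point dominates another if it is at least as good on both axes
    (higher pass_rate, lower cost) and strictly better on one of them.
    """
    def dominates(a, b):
        return (a.pass_rate >= b.pass_rate and a.cost_usd <= b.cost_usd
                and (a.pass_rate > b.pass_rate or a.cost_usd < b.cost_usd))
    return [p for p in points if not any(dominates(q, p) for q in points)]

def render(p, on_frontier, width=20):
    # Assumption: the bar is truncated, so 0.375 * 20 = 7.5 fills 7 cells.
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"

# The only point in this report: model "coder" at 37.5% and $0.0010.
points = [Point("coder", 0.375, 0.0010)]
frontier = pareto_frontier(points)
for p in points:
    print(render(p, p in frontier))
```

With a single model in the run, that model is trivially on the frontier, which is why the only line in this section carries the `*` marker.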