Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
canary-python-security-001        hard  0.310          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: 548e9c71-ef2b-4de2-9a68-971b458b11b5 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-08T21:30:20.992553+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
548e9c71-ef2b-4de2-9a68-971b458b11b5  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T21:30:20.992553+00:00
c53135bf-27be-452a-a549-752e2c4ebaa0  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-08T21:30:20.887644+00:00
d2d18016-0c68-4289-afee-fa1719c05c80  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-08T21:30:20.787030+00:00
9c218c63-5661-411a-87a8-1923a916cee9  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T21:30:20.701250+00:00
afc76d08-dbab-4588-a634-856a8b50c5dd  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-08T21:30:20.609170+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
548e9c71-ef2b-4de2-9a68-971b458b11b5  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-08T21:30:20.992553+00:00
c53135bf-27be-452a-a549-752e2c4ebaa0  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-08T21:30:20.887644+00:00
d2d18016-0c68-4289-afee-fa1719c05c80  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-08T21:30:20.787030+00:00
9c218c63-5661-411a-87a8-1923a916cee9  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-08T21:30:20.701250+00:00
afc76d08-dbab-4588-a634-856a8b50c5dd  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-08T21:30:20.609170+00:00
ca31763b-837c-4128-b49f-4edf915506e3  canary-python-security-001        wrong-file   0.310  $0.0010  2026-05-08T21:30:20.503866+00:00
9f8fc943-605c-4cd3-bfdd-f8eaa90377b3  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-08T21:30:20.420227+00:00
bb80e034-5d69-45d7-82ad-ee17c61c4516  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-08T21:30:20.338071+00:00
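
The Failure Breakdown counts can be cross-checked against the rows above. The following is a minimal sketch, not the report generator's actual code: the `failure_breakdown` helper and the one-`#`-per-failure bar rule are assumptions inferred from the rendered output.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
taxonomies = [
    "wrong-file", "wrong-logic", "wrong-logic", "wrong-file",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def failure_breakdown(labels):
    """Return (taxonomy, count, bar) rows, most frequent first.

    Hypothetical helper: assumes the bar is one '#' per failure.
    """
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in failure_breakdown(taxonomies):
    print(f"{name:<12} {n:<9} {bar}")
```

Tallying the eight rows this way gives 5 wrong-file and 3 wrong-logic, which matches the Failure Breakdown section.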

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
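
For readers unfamiliar with the frontier marker, here is a hedged sketch of how such a line could be produced. The `pareto_frontier` and `render_bar` helpers are hypothetical, and the 20-character bar with a truncated (not rounded) fill is an assumption inferred from the line above, where 37.5% renders as 7 of 20 cells.

```python
def render_bar(pass_rate, width=20):
    """Render a fixed-width bar; the fill is truncated, not rounded (assumption)."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


def pareto_frontier(points):
    """Return names of points that no other point dominates.

    A point dominates another when it costs no more and passes at least
    as often, and is strictly better on at least one of the two axes.
    """
    frontier = []
    for name, cost, rate in points:
        dominated = any(
            c <= cost and r >= rate and (c < cost or r > rate)
            for _, c, r in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# (model, cost_usd, pass_rate) taken from the report; only one model is present.
points = [("coder", 0.0010, 0.375)]
on_frontier = set(pareto_frontier(points))

for name, cost, rate in points:
    marker = "*" if name in on_frontier else " "
    print(f"{marker} {render_bar(rate)} {rate:.1%} @ ${cost:.4f}  ({name})")
```

With a single model in the snapshot, that model is trivially on the frontier; the marker only becomes informative once several models with different cost and pass-rate trade-offs are recorded.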