Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
canary-typescript-session-003 | hard | 0.310 |  | $0.0010
canary-python-security-001 | hard | 0.310 |  | $0.0010
canary-typescript-auth-006 | hard | 0.310 |  | $0.0010
canary-python-cache-005 | hard | 0.310 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
canary-shell-ops-004 | hard | 0.310 |  | $0.0010

Leaderboard Snapshot

Latest run: 48398cfa-d04d-469d-9708-fc24277cbfe2 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T12:01:59.266902+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
48398cfa-d04d-469d-9708-fc24277cbfe2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T12:01:59.266902+00:00
f2b4c3b3-1d3a-4ce7-b580-ba0f98cd5e39 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T12:01:59.030447+00:00
a6d5792f-0109-4238-a612-ac513a317e7a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-05-09T12:01:58.680548+00:00
a41406e2-4c7a-4ca2-addb-04b4c75eab9f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T12:01:58.445747+00:00
3923abb4-983f-44dd-a9d5-4c94499c9e01 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-05-09T12:01:58.204714+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-file | 5 | #####
wrong-logic | 3 | ###

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
48398cfa-d04d-469d-9708-fc24277cbfe2 | canary-shell-ops-004 | wrong-file | 0.310 | $0.0010 | 2026-05-09T12:01:59.266902+00:00
f2b4c3b3-1d3a-4ce7-b580-ba0f98cd5e39 | python-bugfix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T12:01:59.030447+00:00
a6d5792f-0109-4238-a612-ac513a317e7a | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T12:01:58.680548+00:00
a41406e2-4c7a-4ca2-addb-04b4c75eab9f | canary-python-cache-005 | wrong-file | 0.310 | $0.0010 | 2026-05-09T12:01:58.445747+00:00
3923abb4-983f-44dd-a9d5-4c94499c9e01 | canary-typescript-auth-006 | wrong-file | 0.310 | $0.0010 | 2026-05-09T12:01:58.204714+00:00
af195889-b003-4b5c-8304-d695685fddc9 | canary-python-security-001 | wrong-file | 0.310 | $0.0010 | 2026-05-09T12:01:57.910680+00:00
e1a0aba9-c506-4bfe-a8e4-634bc2542183 | canary-typescript-session-003 | wrong-file | 0.310 | $0.0010 | 2026-05-09T12:01:57.691760+00:00
438af269-d917-4649-a1b6-d54da337ff90 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-09T12:01:57.464716+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
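
For readers unfamiliar with the frontier marker, the sketch below shows one way a line like the one above could be produced. It is an illustration only, not the report generator's actual code: the names Point, pareto_frontier, and render are hypothetical, and the 20-character bar width with truncating fill is an assumption inferred from the single line shown. A point is treated as being on the Pareto frontier when no other point has a pass rate at least as high and a cost at least as low, with one of the two strictly better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One model's position on the cost frontier (hypothetical type)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points):
    """Return the points that no other point dominates.

    A point dominates another if it is no worse on both axes (pass rate
    at least as high, cost at most as high) and strictly better on one.
    """
    def dominates(a, b):
        return (
            a.pass_rate >= b.pass_rate
            and a.cost_usd <= b.cost_usd
            and (a.pass_rate > b.pass_rate or a.cost_usd < b.cost_usd)
        )

    return [p for p in points if not any(dominates(q, p) for q in points)]


def render(points, width=20):
    """Format each point as a text bar, marking frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        # Assumed rounding: truncate, so 37.5% of 20 cells fills 7.
        filled = int(p.pass_rate * width)
        bar = "#" * filled + "-" * (width - filled)
        mark = "*" if p in frontier else " "
        lines.append(
            f"{mark} [{bar}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


print("\n".join(render([Point("coder", 0.375, 0.0010)])))
```

With a single model in the snapshot, that model is trivially on the frontier, which is consistent with the one starred entry here.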