Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
canary-typescript-session-003     hard  0.310          $0.0010
canary-python-security-001        hard  0.310          $0.0010
canary-typescript-auth-006        hard  0.310          $0.0010
canary-python-cache-005           hard  0.310          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
canary-shell-ops-004              hard  0.310          $0.0010

Leaderboard Snapshot

Latest run: fe314120-d6ac-49c9-884c-98c2e7b50bb6 | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-09T00:28:37.019588+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
fe314120-d6ac-49c9-884c-98c2e7b50bb6  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T00:28:37.019588+00:00
e82502a9-ec41-4274-81ce-08cbb667a143  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-09T00:28:36.993097+00:00
d28ada90-1f40-452c-a004-f46ab59122eb  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-05-09T00:28:36.966227+00:00
fca727cd-09c6-4299-bcfb-3ba11fd23b55  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T00:28:36.928139+00:00
a14c182f-d770-4484-912b-32dc1aac683f  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.310  2026-05-09T00:28:36.889855+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-file   5         #####
wrong-logic  3         ###

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
fe314120-d6ac-49c9-884c-98c2e7b50bb6  canary-shell-ops-004              wrong-file   0.310  $0.0010  2026-05-09T00:28:37.019588+00:00
e82502a9-ec41-4274-81ce-08cbb667a143  python-bugfix-easy-001            wrong-logic  0.740  $0.0010  2026-05-09T00:28:36.993097+00:00
d28ada90-1f40-452c-a004-f46ab59122eb  typescript-security-fix-easy-001  wrong-logic  0.740  $0.0010  2026-05-09T00:28:36.966227+00:00
fca727cd-09c6-4299-bcfb-3ba11fd23b55  canary-python-cache-005           wrong-file   0.310  $0.0010  2026-05-09T00:28:36.928139+00:00
a14c182f-d770-4484-912b-32dc1aac683f  canary-typescript-auth-006        wrong-file   0.310  $0.0010  2026-05-09T00:28:36.889855+00:00
9630aff0-e31a-4daa-bc13-1bacb58e8d41  canary-python-security-001        wrong-file   0.310  $0.0010  2026-05-09T00:28:36.863835+00:00
21dacaef-7484-4b52-9b9e-d5a24e1c6e48  canary-typescript-session-003     wrong-file   0.310  $0.0010  2026-05-09T00:28:36.835673+00:00
68e7783d-0adc-4dd1-be5e-023695383195  python-security-fix-easy-001      wrong-logic  0.740  $0.0010  2026-05-09T00:28:36.807544+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
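
How a frontier row like the one above could be derived, as a minimal sketch only. The report does not show its own implementation, so the names `Point`, `pareto_frontier`, and `render_row`, the dominance rule, and the truncating bar fill are all assumptions chosen to reproduce the row as printed.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """One (model, pass_rate, cost_usd) entry; a hypothetical record shape."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates.

    Assumed rule: q dominates p when q has pass_rate >= p's and
    cost_usd <= p's, with at least one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(p: Point, on_frontier: bool, width: int = 20) -> str:
    """Format one row in the style of the Cost Frontier section."""
    # Assumption: the fill is truncated, so 0.375 * 20 = 7.5 gives 7 marks.
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


if __name__ == "__main__":
    # The single entry from this report; with one point the frontier is trivial.
    points = [Point("coder", 0.375, 0.0010)]
    frontier = pareto_frontier(points)
    for p in points:
        print(render_row(p, p in frontier))
```

With only one model in the run, that model is on the frontier by definition; the marker becomes informative once several models with different costs are compared.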