Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                       | Band | Score | Passed | Cost
python-bugfix-easy-001        | easy | 0.740 |        | $0.0010
canary-typescript-session-003 | hard | 0.310 |        | $0.0010
canary-python-security-001    | hard | 0.310 |        | $0.0010
canary-typescript-auth-006    | hard | 0.310 |        | $0.0010
canary-python-cache-005       | hard | 0.310 |        | $0.0010
python-config-easy-001        | easy | 0.740 |        | $0.0010
python-dependency-easy-001    | easy | 0.740 |        | $0.0010
canary-shell-ops-004          | hard | 0.310 |        | $0.0010
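
The header totals can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the eight rows above are the full run and that a task counts as passed when its score is at least 0.5. That threshold is hypothetical: it is chosen only because it reproduces the reported 37.5%, and the report does not state the actual pass criterion. The names `rows`, `summarize`, and `ASSUMED_PASS_THRESHOLD` are illustrative.

```python
# Per-task rows copied from the table above: (task_id, score, cost_usd).
rows = [
    ("python-bugfix-easy-001", 0.740, 0.0010),
    ("canary-typescript-session-003", 0.310, 0.0010),
    ("canary-python-security-001", 0.310, 0.0010),
    ("canary-typescript-auth-006", 0.310, 0.0010),
    ("canary-python-cache-005", 0.310, 0.0010),
    ("python-config-easy-001", 0.740, 0.0010),
    ("python-dependency-easy-001", 0.740, 0.0010),
    ("canary-shell-ops-004", 0.310, 0.0010),
]

# Hypothetical pass criterion; the report does not state the real one.
ASSUMED_PASS_THRESHOLD = 0.5


def summarize(rows, threshold=ASSUMED_PASS_THRESHOLD):
    """Aggregate task count, pass rate, and total cost from per-task rows."""
    tasks = len(rows)
    passed = sum(1 for _, score, _ in rows if score >= threshold)
    total_cost = sum(cost for _, _, cost in rows)
    pass_rate = passed / tasks if tasks else 0.0
    return {"tasks": tasks, "pass_rate": pass_rate, "cost_usd": total_cost}


summary = summarize(rows)
header = (
    f"Tasks: {summary['tasks']} | "
    f"Pass rate: {summary['pass_rate']:.1%} | "
    f"Cost: ${summary['cost_usd']:.4f}"
)
print(header)
```

Under that assumption the three easy tasks pass and the five canary tasks fail, which gives 3 of 8 tasks and a total cost of eight times $0.0010.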

Leaderboard Snapshot

Latest run: a5c60717-5a90-4d04-8ef8-a9812218b7af | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-04-27T16:00:54.295822+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
a5c60717-5a90-4d04-8ef8-a9812218b7af | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:00:54.295822+00:00
7b676907-e55b-45a1-ae6c-c373cb05d09b | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:00:54.263680+00:00
295a362c-a327-4f8a-ba8d-dd9302103116 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:00:54.215820+00:00
3595da29-4f83-420e-a8f6-4cd7ca84309d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:00:54.168436+00:00
1f3a7798-7144-4dca-a44f-3a8322549f7d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.310 | 2026-04-27T16:00:54.100564+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
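
The breakdown appears to be a per-taxonomy tally of the Recent Failures rows, with one `#` per failure. The following is a minimal sketch of that tally, assuming the eight taxonomy labels listed under Recent Failures are the full input. The names `failure_labels` and `breakdown` are illustrative, not part of the reporting tool.

```python
from collections import Counter

# Taxonomy labels copied from the eight Recent Failures rows, newest first.
failure_labels = [
    "wrong-file",
    "wrong-logic",
    "wrong-logic",
    "wrong-file",
    "wrong-file",
    "wrong-file",
    "wrong-file",
    "wrong-logic",
]


def breakdown(labels):
    """Return (taxonomy, count, bar) tuples, most frequent taxonomy first."""
    counts = Counter(labels)
    # One '#' per failure, matching the bar widths shown in the report.
    return [(name, n, "#" * n) for name, n in counts.most_common()]


for name, n, bar in breakdown(failure_labels):
    print(f"{name:<12} {n:>2}  {bar}")
```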

Recent Failures

Run ID                               | Task ID                       | Taxonomy    | Score | Cost    | Created
a5c60717-5a90-4d04-8ef8-a9812218b7af | canary-shell-ops-004          | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:00:54.295822+00:00
7b676907-e55b-45a1-ae6c-c373cb05d09b | python-dependency-easy-001    | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:00:54.263680+00:00
295a362c-a327-4f8a-ba8d-dd9302103116 | python-config-easy-001        | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:00:54.215820+00:00
3595da29-4f83-420e-a8f6-4cd7ca84309d | canary-python-cache-005       | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:00:54.168436+00:00
1f3a7798-7144-4dca-a44f-3a8322549f7d | canary-typescript-auth-006    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:00:54.100564+00:00
8685f9fc-cbff-48f5-9197-b2da4afe0d9f | canary-python-security-001    | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:00:54.055854+00:00
671e00e9-f0b9-4c9f-abdd-ec77f17cbb0a | canary-typescript-session-003 | wrong-file  | 0.310 | $0.0010 | 2026-04-27T16:00:54.015229+00:00
2da271d2-6c7b-4420-9676-45fb147f47f9 | python-bugfix-easy-001        | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:00:53.981147+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
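
With a single model on the chart, the frontier is trivially that one point. The following is a minimal sketch of how such a frontier could be computed and rendered once more models are added, assuming a point is on the frontier when no other point is at least as cheap and at least as accurate while being strictly better on one axis. The 20-character bar width and the truncating fill are inferred from the line above. The names `pareto_frontier` and `render` are illustrative, not the reporting tool's API.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is (model, cost_usd, pass_rate). Point q dominates p when q
    costs no more and passes no less, and is strictly better on one axis.
    """
    frontier = []
    for p in points:
        dominated = any(
            q[1] <= p[1] and q[2] >= p[2] and (q[1] < p[1] or q[2] > p[2])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(point, on_frontier, width=20):
    """Format one chart line; frontier points get a leading '*'."""
    model, cost, rate = point
    filled = int(rate * width)  # truncates, so 37.5% of 20 fills 7 cells
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {rate:.1%} @ ${cost:.4f}  ({model})"


# The single point shown in the report.
points = [("coder", 0.0010, 0.375)]
frontier = pareto_frontier(points)
lines = [render(p, p in frontier) for p in points]
print("\n".join(lines))
```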