Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: 81c3abc5-2c1c-4444-82da-eeb9cec592db | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-26T10:34:34.437769+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
81c3abc5-2c1c-4444-82da-eeb9cec592db | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:34:34.437769+00:00
cbde3298-00d5-40e5-9978-79f7a737301a | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.740 | 2026-05-26T10:34:34.349793+00:00
4c6e81ec-23a2-4339-80c2-99a85642619d | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:34:34.264734+00:00
8b9c0291-45dc-4ecd-893c-02352a242eea | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.740 | 2026-05-26T10:34:34.164910+00:00
37b7d369-9274-4099-997a-0bd54cd66aa4 | coder | 59417e3b6834192b1ea96a6a9010dee3105efd78 | 0.310 | 2026-05-26T10:34:34.066321+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
81c3abc5-2c1c-4444-82da-eeb9cec592db | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-26T10:34:34.437769+00:00
cbde3298-00d5-40e5-9978-79f7a737301a | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:34:34.349793+00:00
4c6e81ec-23a2-4339-80c2-99a85642619d | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-26T10:34:34.264734+00:00
8b9c0291-45dc-4ecd-893c-02352a242eea | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:34:34.164910+00:00
37b7d369-9274-4099-997a-0bd54cd66aa4 | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-26T10:34:34.066321+00:00
6e78c3e9-febc-44e3-957d-72a81e43b58b | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-26T10:34:33.967247+00:00
b457faff-bc19-4e1e-913a-8090a7301970 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-26T10:34:33.882724+00:00
ed45ffc5-783b-4b4e-8be0-d34b80e96f2b | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-26T10:34:33.787052+00:00
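
The Failure Breakdown counts can be cross-checked against the eight Recent Failures rows. The following is a minimal sketch in Python, assuming the taxonomy labels are copied by hand from those rows; the names `recent_failures` and `render_breakdown` are hypothetical and do not come from the reporting tool itself.

```python
from collections import Counter

# Taxonomy labels copied from the Recent Failures rows, newest first.
# This list is an illustrative stand-in for whatever store the report reads.
recent_failures = [
    "wrong-file", "wrong-logic", "wrong-file", "wrong-logic",
    "wrong-file", "wrong-file", "wrong-file", "wrong-logic",
]


def render_breakdown(labels):
    """Return (taxonomy, count, bar) tuples, most frequent taxonomy first."""
    counts = Counter(labels)
    return [(name, n, "#" * n) for name, n in counts.most_common()]


breakdown = render_breakdown(recent_failures)
for name, n, bar in breakdown:
    print(f"{name:<11} | {n} | {bar}")
```

Tallying the labels this way gives 5 wrong-file and 3 wrong-logic failures, which agrees with the Failure Breakdown section.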

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
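
One way such a frontier line could be produced is sketched below in Python. This is an illustration under stated assumptions, not the tool's actual implementation: the helper names `pareto_frontier` and `render_bar`, the 20-character bar width, and the truncating fill rule are all hypothetical choices made to match the single line shown.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    A point is dominated when some other point has a pass rate at least
    as high and a cost at least as low, with one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q["pass_rate"] >= p["pass_rate"]
            and q["cost_usd"] <= p["cost_usd"]
            and (q["pass_rate"] > p["pass_rate"] or q["cost_usd"] < p["cost_usd"])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_bar(pass_rate, width=20):
    """Draw a fixed-width bar; the fill count is truncated (an assumption)."""
    filled = int(pass_rate * width)
    return "[" + "#" * filled + "-" * (width - filled) + "]"


# The single data point reported in this section.
points = [{"model": "coder", "pass_rate": 0.375, "cost_usd": 0.0010}]
frontier = pareto_frontier(points)

for p in points:
    mark = "*" if p in frontier else " "
    print(
        f"{mark} {render_bar(p['pass_rate'])} "
        f"{p['pass_rate']:.1%} @ ${p['cost_usd']:.4f}  ({p['model']})"
    )
```

With only one model in the run, that model is trivially on the frontier, which is why the single row carries the `*` marker.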