Eval Report: ci-nightly

Profile: gdm-swebench-lite-v1 | Tasks: 8 | Pass rate: 37.5% | Cost: $0.0080

Task ID                          | Band | Score | Passed | Cost
python-security-fix-easy-001     | easy | 0.740 |        | $0.0010
canary-shell-ops-004             | hard | 0.310 |        | $0.0010
canary-python-cache-005          | hard | 0.310 |        | $0.0010
canary-python-regression-002     | hard | 0.310 |        | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |        | $0.0010
canary-typescript-auth-006       | hard | 0.310 |        | $0.0010
python-bugfix-easy-001           | easy | 0.740 |        | $0.0010
canary-typescript-session-003    | hard | 0.310 |        | $0.0010

Leaderboard Snapshot

Latest run: c54c26df-6b89-41ee-aaf2-289f7393feec | Latest model: coder | Latest score: 0.310 | Recorded at: 2026-05-23T20:14:11.077810+00:00

Recent Trend

Run ID                               | Model | Git SHA                                  | Score | Created
c54c26df-6b89-41ee-aaf2-289f7393feec | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:14:11.077810+00:00
28bf7eec-29e8-41d7-8b25-a657b2d70d30 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.740 | 2026-05-23T20:14:11.014620+00:00
2ea1bf35-b3f2-4aaf-87df-c4613449a7d8 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:14:10.949929+00:00
9f31c71e-f58e-40e8-bf65-34ba0c3f0ea1 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.740 | 2026-05-23T20:14:10.872611+00:00
2f55af0a-9138-4aa9-aab6-eeb55d45e8f2 | coder | 89f0f5456c5b8670ca70d1a941ab0d7272df1310 | 0.310 | 2026-05-23T20:14:10.809038+00:00

Failure Breakdown

Taxonomy    | Failures | Bar
wrong-file  | 5        | #####
wrong-logic | 3        | ###
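
The breakdown above can be reproduced from the per-task failure rows in this report. The following is a minimal sketch, not the report generator's actual code: the names `FAILURES`, `tally`, and `render` are hypothetical, and the one-`#`-per-failure bar scale is an assumption inferred from the counts shown.

```python
from collections import Counter

# (task_id, taxonomy) pairs copied from the Recent Failures section of this report.
FAILURES = [
    ("canary-typescript-session-003", "wrong-file"),
    ("python-bugfix-easy-001", "wrong-logic"),
    ("canary-typescript-auth-006", "wrong-file"),
    ("typescript-security-fix-easy-001", "wrong-logic"),
    ("canary-python-regression-002", "wrong-file"),
    ("canary-python-cache-005", "wrong-file"),
    ("canary-shell-ops-004", "wrong-file"),
    ("python-security-fix-easy-001", "wrong-logic"),
]


def tally(failures):
    """Count failures per taxonomy label."""
    return Counter(taxonomy for _task_id, taxonomy in failures)


def render(counts):
    """Return one text line per taxonomy, most frequent first.

    Assumption: each '#' in the bar stands for one failure.
    """
    return [
        f"{taxonomy:<12} {n}  {'#' * n}"
        for taxonomy, n in counts.most_common()
    ]


if __name__ == "__main__":
    for line in render(tally(FAILURES)):
        print(line)
```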

Recent Failures

Run ID                               | Task ID                          | Taxonomy    | Score | Cost    | Created
c54c26df-6b89-41ee-aaf2-289f7393feec | canary-typescript-session-003    | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:14:11.077810+00:00
28bf7eec-29e8-41d7-8b25-a657b2d70d30 | python-bugfix-easy-001           | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:14:11.014620+00:00
2ea1bf35-b3f2-4aaf-87df-c4613449a7d8 | canary-typescript-auth-006       | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:14:10.949929+00:00
9f31c71e-f58e-40e8-bf65-34ba0c3f0ea1 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:14:10.872611+00:00
2f55af0a-9138-4aa9-aab6-eeb55d45e8f2 | canary-python-regression-002     | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:14:10.809038+00:00
60bb7a65-1166-41ba-b3b0-c990123fc6d5 | canary-python-cache-005          | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:14:10.763391+00:00
4d789bc7-3381-4fb2-a86d-604bac9203a6 | canary-shell-ops-004             | wrong-file  | 0.310 | $0.0010 | 2026-05-23T20:14:10.697695+00:00
e45b2720-7c38-4a41-bf2a-b10454efb888 | python-security-fix-easy-001     | wrong-logic | 0.740 | $0.0010 | 2026-05-23T20:14:10.635174+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [#######-------------] 37.5% @ $0.0010  (coder)
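
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point has a pass rate at least as high at a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and the row layout above. It is an illustration under assumptions, not the tool's implementation: `Point`, `pareto_frontier`, and `render_row` are hypothetical names, and the 20-character bar with the fill rounded down is inferred from the single row shown.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    q dominates p when q is at least as good on both axes
    (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render_row(p: Point, on_frontier: bool, width: int = 20) -> str:
    """Format one row; the bar fill is rounded down (an assumption)."""
    filled = int(p.pass_rate * width)
    bar = "#" * filled + "-" * (width - filled)
    marker = "*" if on_frontier else " "
    return f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


if __name__ == "__main__":
    # The single data point from this report.
    points = [Point("coder", 0.375, 0.0010)]
    frontier = pareto_frontier(points)
    for p in points:
        print(render_row(p, p in frontier))
```

With only one model in the run, that model is trivially on the frontier, which is why the sole row carries the `*`.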