Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010

Leaderboard Snapshot

Latest run: 62c61274-4525-423a-92b7-7cf6c98fb8d2 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:12:37.460493+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
62c61274-4525-423a-92b7-7cf6c98fb8d2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:37.460493+00:00
43f95163-81ef-48d5-809e-6bac2706afe9 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:37.414952+00:00
2ca5f579-7974-43c8-a592-451b79043102 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:37.375083+00:00
85ebf2c0-7d97-496b-b863-49dda99af6f8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:37.339163+00:00
9fe65b8f-e492-42df-970c-9822afd27c32 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:12:37.297624+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 141 | #############################################################################################################################################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
62c61274-4525-423a-92b7-7cf6c98fb8d2 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.460493+00:00
43f95163-81ef-48d5-809e-6bac2706afe9 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.414952+00:00
2ca5f579-7974-43c8-a592-451b79043102 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.375083+00:00
85ebf2c0-7d97-496b-b863-49dda99af6f8 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.339163+00:00
9fe65b8f-e492-42df-970c-9822afd27c32 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.297624+00:00
649172b2-efac-4cc0-86d6-05ea2cc74217 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.243122+00:00
a9c880ef-6d83-49dd-8086-e64d58709d15 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.198857+00:00
de86b9af-ca94-460f-aac4-9e244030e641 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.160372+00:00
c8ff2255-345c-4bbe-8a11-95391ffafda9 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.128527+00:00
8804c9ca-219c-4d6d-a256-e369a2cb55a4 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:12:37.089207+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0056  (coder)
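
For readers unfamiliar with the marker, the sketch below shows one way a pass_rate vs cost_usd Pareto frontier could be computed and rendered in the format above. It is an illustration only, not the report generator's actual code: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is an assumption inferred from the line above.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical record for one model's aggregate result."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates.

    A point q dominates p when q has pass_rate >= p's and cost_usd <= p's,
    with at least one of the two strictly better.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: list[RunPoint], width: int = 20) -> list[str]:
    """Render one line per point, marking frontier members with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        filled = round(p.pass_rate * width)  # assumed bar width of 20
        bar = "#" * filled + " " * (width - filled)
        mark = "*" if p in frontier else " "
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    for line in render([RunPoint("coder", 1.0, 0.0056)]):
        print(line)
```

With a single model in the run, that model is trivially on the frontier, which is consistent with the lone starred entry here.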