Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
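
The header totals can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the 15 rows above are the complete input; the variable names are illustrative and not taken from the tool that produced this report. `Decimal` is used so that the sum of small dollar amounts is exact.

```python
from decimal import Decimal

# Per-task values copied from the table above: 15 tasks, each scored 0.740
# at a cost of $0.0010. The names here are hypothetical.
task_costs = [Decimal("0.0010")] * 15
task_scores = [Decimal("0.740")] * 15

# Exact decimal arithmetic avoids float rounding noise in the totals.
total_cost = sum(task_costs, Decimal("0"))
mean_score = sum(task_scores, Decimal("0")) / len(task_scores)

print(f"tasks={len(task_costs)} total_cost=${total_cost} mean_score={mean_score}")
```

Under these assumptions the summed cost agrees with the `Cost: $0.0150` figure in the header line.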

Leaderboard Snapshot

Latest run: f952606c-3d17-42d1-a40f-6d643694f2f8 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:14:38.107066+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
f952606c-3d17-42d1-a40f-6d643694f2f8 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:38.107066+00:00
488bced3-a53c-47bf-a1a2-ee59d8536c54 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:38.059176+00:00
0f0dfb1f-dea5-437f-bd35-5039274e5fc3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:38.018320+00:00
e0ceaea6-9215-4971-a4fa-fd668951f5d3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:37.971104+00:00
cde49f3e-69fa-43fd-b6b5-83b66665e5bd | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:14:37.898589+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 188 | ############################################################################################################################################################################################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
f952606c-3d17-42d1-a40f-6d643694f2f8 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:38.107066+00:00
488bced3-a53c-47bf-a1a2-ee59d8536c54 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:38.059176+00:00
0f0dfb1f-dea5-437f-bd35-5039274e5fc3 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:38.018320+00:00
e0ceaea6-9215-4971-a4fa-fd668951f5d3 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.971104+00:00
cde49f3e-69fa-43fd-b6b5-83b66665e5bd | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.898589+00:00
ede6d1b2-c74f-43b1-945a-358ef704c594 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.842594+00:00
0361d55e-90a3-439e-8cfa-6a31a66820b9 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.798594+00:00
c4f63a3a-d2e6-403e-898f-5d5d95ce7bd6 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.740854+00:00
a83e58b2-820a-4c8b-bce3-4059a4418176 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.666799+00:00
c3602330-a740-4523-96d2-fd030da8ea94 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:14:37.618117+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0055  (coder)
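
For readers unfamiliar with the term, a point is on the Pareto frontier when no other point has both a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below shows one way such a marker could be computed and rendered. It is an assumption about the general technique, not the report generator's actual code: `RunPoint`, `pareto_frontier`, `render`, and the second model in the usage example are all hypothetical.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One (model, pass_rate, cost) observation. Hypothetical type."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return the points that no other point dominates, cheapest first.

    q dominates p when q is at least as good on both axes
    (higher pass rate, lower cost) and strictly better on one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a text bar, marking frontier points with '*'."""
    on_frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in on_frontier else " "
        bar = "#" * round(p.pass_rate * width)
        lines.append(
            f"{mark} [{bar:<{width}}] {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# Usage: the 'coder' point comes from this report; 'hypothetical-b' is
# invented purely to show a dominated (unmarked) point.
points = [
    RunPoint("coder", 1.0, 0.0055),
    RunPoint("hypothetical-b", 0.8, 0.0100),
]
for line in render(points):
    print(line)
```

With a single model in the data, as in this report, that model is trivially on the frontier, which is why the only row carries the `*` marker.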