Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
python-bugfix-easy-001 | easy | 0.740 | | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 | | $0.0010
python-performance-easy-001 | easy | 0.740 | | $0.0010
typescript-performance-easy-001 | easy | 0.740 | | $0.0010
python-test-writing-easy-001 | easy | 0.740 | | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 | | $0.0010
python-multi-file-easy-001 | easy | 0.740 | | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 | | $0.0010
python-refactor-easy-001 | easy | 0.740 | | $0.0010
typescript-refactor-easy-001 | easy | 0.740 | | $0.0010
python-config-easy-001 | easy | 0.740 | | $0.0010
typescript-config-easy-001 | easy | 0.740 | | $0.0010
python-recovery-easy-001 | easy | 0.740 | | $0.0010
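
The Tasks and Cost figures in the profile line can be cross-checked against the per-task rows. The following is a minimal sketch of that arithmetic, not the report generator's actual code; the function name `summarize_cost` and the input format (a list of `$`-prefixed cost strings) are assumptions made for illustration.

```python
from decimal import Decimal


def summarize_cost(costs: list[str]) -> str:
    """Sum per-task costs such as "$0.0010" and format a summary fragment.

    Hypothetical helper: Decimal is used so that fifteen costs of 0.0010
    add up to exactly 0.0150 without binary floating-point drift.
    """
    amounts = [Decimal(cost.lstrip("$")) for cost in costs]
    total = sum(amounts, Decimal("0"))
    return f"Tasks: {len(amounts)} | Cost: ${total:.4f}"


# Fifteen tasks at $0.0010 each, as in the table above.
print(summarize_cost(["$0.0010"] * 15))
```

With the fifteen rows above, the sketch gives a task count of 15 and a total of $0.0150, matching the profile line.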

Leaderboard Snapshot

Latest run: 00be8b1d-dd91-486a-b023-131cc5c25d92 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T23:48:06.325770+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
00be8b1d-dd91-486a-b023-131cc5c25d92 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:06.325770+00:00
3542f8ca-a2c8-4e50-ab3c-1e7f62f9fcd9 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:06.263157+00:00
a2a3ac49-d1de-452a-87a8-049214b8ad70 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:06.193680+00:00
bf0b2912-3e32-4137-b227-cc041d130088 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:06.133373+00:00
83bffb64-e7ad-4367-a090-5e5dac896a6e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T23:48:06.063179+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 690 | ####################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
00be8b1d-dd91-486a-b023-131cc5c25d92 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:06.325770+00:00
3542f8ca-a2c8-4e50-ab3c-1e7f62f9fcd9 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:06.263157+00:00
a2a3ac49-d1de-452a-87a8-049214b8ad70 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:06.193680+00:00
bf0b2912-3e32-4137-b227-cc041d130088 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:06.133373+00:00
83bffb64-e7ad-4367-a090-5e5dac896a6e | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:06.063179+00:00
a9588276-2e36-484c-bfe5-e4d2e765fa14 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:05.995912+00:00
f8961a0a-b81d-42a7-ace0-215ff7a7af83 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:05.922732+00:00
f3b5efbf-f144-4cef-bae2-a6f12c89e968 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:05.842799+00:00
b7d1e5fa-8e4a-421d-a134-34ba60327dc6 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:05.772120+00:00
b9901f05-f966-4f0b-a26f-870de10c3472 | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T23:48:05.706258+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0050  (coder)
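
For readers unfamiliar with the marker, a run sits on the Pareto frontier when no other run has a pass rate at least as high and a cost at least as low, with one of the two strictly better. The sketch below illustrates that rule and the line layout shown above. It is an assumption-laden illustration, not the tool's implementation: `RunPoint`, `pareto_frontier`, `render`, and the `baseline` entry are hypothetical names and data.

```python
from typing import NamedTuple


class RunPoint(NamedTuple):
    """Hypothetical record for one model's aggregate result."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[RunPoint]) -> list[RunPoint]:
    """Return the points that no other point dominates.

    q dominates p when q passes at least as often and costs at most as much,
    and is strictly better on at least one of the two axes.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: list[RunPoint]) -> list[str]:
    """Format one line per point, marking frontier members with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in frontier else " "
        bar = "#" * round(p.pass_rate * 20)  # 20-character bar at 100%
        lines.append(
            f"{marker} [{bar:<20}] {p.pass_rate * 100:.1f}% "
            f"@ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# "baseline" is made-up data, included only to show a dominated point.
points = [RunPoint("coder", 1.0, 0.0050), RunPoint("baseline", 0.5, 0.0100)]
for line in render(points):
    print(line)
```

With a single model in the run, as here, that model is trivially on the frontier, which is why the only line carries the marker.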