Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010
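
The task count and total cost in the summary line can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the reported cost is a plain sum of per-task costs; the `summarize` helper and the row layout are hypothetical and not part of the eval tooling.

```python
from decimal import Decimal

# Task IDs copied from the per-task table above.
TASK_IDS = [
    "python-security-fix-easy-001",
    "typescript-security-fix-easy-001",
    "python-bugfix-easy-001",
    "typescript-bugfix-easy-001",
    "python-performance-easy-001",
    "typescript-performance-easy-001",
    "python-test-writing-easy-001",
    "typescript-test-writing-easy-001",
    "python-multi-file-easy-001",
    "typescript-multi-file-easy-001",
    "python-refactor-easy-001",
    "typescript-refactor-easy-001",
    "python-config-easy-001",
    "typescript-config-easy-001",
    "python-recovery-easy-001",
]

# Every row in this report has the same score and cost.
rows = [(task_id, Decimal("0.740"), Decimal("0.0010")) for task_id in TASK_IDS]


def summarize(rows):
    """Hypothetical helper: return (task count, mean score, total cost)."""
    total_cost = sum((cost for _, _, cost in rows), Decimal("0"))
    mean_score = sum((score for _, score, _ in rows), Decimal("0")) / len(rows)
    return len(rows), mean_score, total_cost


count, mean_score, total_cost = summarize(rows)
print(f"Tasks: {count} | Mean score: {mean_score:.3f} | Cost: ${total_cost:.4f}")
# prints: Tasks: 15 | Mean score: 0.740 | Cost: $0.0150
```

Decimal is used instead of float so that summing fifteen values of 0.0010 gives exactly 0.0150.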

Leaderboard Snapshot

Latest run: 54e4f669-7a62-4882-941c-b1a7cb1d9f16 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:57:56.650550+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
54e4f669-7a62-4882-941c-b1a7cb1d9f16 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:56.650550+00:00
5c5825f0-8fdf-4304-a699-f35f0bbf7cc5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:56.594388+00:00
40b823cf-1623-4dd6-a178-e6cc0e601e08 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:56.527497+00:00
b975ef2c-bd15-44a6-bd67-cb5cc3b4e66e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:56.467877+00:00
74042678-559b-4cd7-8bb8-16bb30136b9e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:56.398806+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 517 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
54e4f669-7a62-4882-941c-b1a7cb1d9f16 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.650550+00:00
5c5825f0-8fdf-4304-a699-f35f0bbf7cc5 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.594388+00:00
40b823cf-1623-4dd6-a178-e6cc0e601e08 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.527497+00:00
b975ef2c-bd15-44a6-bd67-cb5cc3b4e66e | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.467877+00:00
74042678-559b-4cd7-8bb8-16bb30136b9e | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.398806+00:00
441e1efb-1ff9-47a0-a100-5ebbba6f2f38 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.345659+00:00
a2198110-8f05-4128-bcf5-2b4a27a6803a | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.250477+00:00
f6906582-d4f9-4ce6-b7e3-d1da8e5fc2b6 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.181405+00:00
29d1cc85-9e4c-48b5-b68b-b9882c380652 | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.121527+00:00
bd9dacba-d3c0-4e7f-a715-6b4fb44309ec | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:56.063378+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0052  (coder)
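
For readers unfamiliar with the `*` marker, here is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed. It assumes a point is on the frontier when no other point has a pass rate at least as high at a cost at least as low. The `Point` type and `pareto_frontier` function are hypothetical, and only the `coder` point comes from this report; the other two points are invented for illustration.

```python
from typing import NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no cheaper-or-equal point matches or beats.

    Points are scanned in order of ascending cost (ties broken by higher
    pass rate first). A point is kept only if its pass rate is strictly
    higher than every point seen so far, so exact duplicates are dropped.
    """
    frontier = []
    best_pass_rate = float("-inf")
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


points = [
    Point("coder", 1.0, 0.0052),           # from the report above
    Point("hypothetical-a", 0.8, 0.0100),  # invented: costlier and worse
    Point("hypothetical-b", 0.6, 0.0030),  # invented: cheaper but worse
]

for p in pareto_frontier(points):
    print(f"* {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.model})")
```

With a single model in the report, that model is trivially on the frontier, which is consistent with the one starred row above.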