Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
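
The Tasks, Pass rate, and Cost figures on the Profile line are consistent with a plain aggregation over the task rows: two rows, and 2 x $0.0010 = $0.0020. The sketch below shows one way that aggregation could be done. It is not taken from the tool that produced this report; `TaskResult`, `summarize`, and `ROWS` are hypothetical names, and the `passed=True` values are an assumption inferred from the 100.0% pass rate, since the Passed column carries no visible value here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One row of the task table (hypothetical structure)."""
    task_id: str
    band: str
    score: float
    passed: bool
    cost_usd: float


def summarize(profile: str, results: list[TaskResult]) -> str:
    """Build a summary line in the same layout as the report's Profile line."""
    tasks = len(results)
    # Pass rate is assumed to be the fraction of tasks flagged as passed.
    pass_rate = sum(r.passed for r in results) / tasks if tasks else 0.0
    # Cost is assumed to be the sum of per-task costs.
    cost = sum(r.cost_usd for r in results)
    return (
        f"Profile: {profile} | Tasks: {tasks} | "
        f"Pass rate: {pass_rate * 100:.1f}% | Cost: ${cost:.4f}"
    )


# Rows copied from the table above; passed=True is an assumption.
ROWS = [
    TaskResult("python-security-fix-easy-001", "easy", 0.740, True, 0.0010),
    TaskResult("typescript-security-fix-easy-001", "easy", 0.740, True, 0.0010),
]

print(summarize("gdm-swebench-lite-v1", ROWS))
```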

Leaderboard Snapshot

Latest run: 59624642-004d-4000-a8d3-284a20a8de85 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-28T00:09:09.602716+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
59624642-004d-4000-a8d3-284a20a8de85 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:09.602716+00:00
894c099b-1747-4560-a3ca-3cdc4f6cd48e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:09.540038+00:00
f01e9f03-b43a-4ffc-90d6-225b4f21a5e3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:08.864666+00:00
51adb134-f9e1-4345-a567-ae6e2e077953 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:08.780054+00:00
2b13a34f-8f24-457f-862e-c4239c17f6fc | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-28T00:09:08.713483+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 769 | ####################

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
59624642-004d-4000-a8d3-284a20a8de85 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:09.602716+00:00
894c099b-1747-4560-a3ca-3cdc4f6cd48e | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:09.540038+00:00
f01e9f03-b43a-4ffc-90d6-225b4f21a5e3 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.864666+00:00
51adb134-f9e1-4345-a567-ae6e2e077953 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.780054+00:00
2b13a34f-8f24-457f-862e-c4239c17f6fc | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.713483+00:00
8909c951-6f70-4bd8-a5cd-b426512ca1f8 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.640041+00:00
5efa0bb9-5047-4ecf-b583-fd768f52aba6 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.568292+00:00
8dff3129-a049-461a-89a1-c01f3224efd1 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.498455+00:00
522700d5-b023-467c-9905-1a7c9a356bfc | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.433007+00:00
a681a556-1537-4ce7-be70-5d85a66f95ae | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-28T00:09:08.357304+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0049  (coder)
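
For readers unfamiliar with the `*` marker: a point is on the Pareto frontier when no other point is at least as good on both axes (higher pass_rate, lower cost_usd) and strictly better on one. The sketch below illustrates that rule and a rendering in the layout used above. It is an illustration under assumptions, not the report generator's code; `Point`, `pareto_frontier`, `render`, and the 20-character bar width are hypothetical, and the second model in the usage example is invented purely to show a dominated point.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """One model's position on the cost frontier (hypothetical structure)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Return the points that no other point dominates."""
    frontier = []
    for p in points:
        # q dominates p if it is no worse on both axes and strictly better on one.
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier


def render(points: list[Point], width: int = 20) -> list[str]:
    """Render one line per point, prefixing frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# "coder" is the point from this report; "baseline" is a made-up comparison.
EXAMPLE = [Point("coder", 1.0, 0.0049), Point("baseline", 0.5, 0.0100)]
for line in render(EXAMPLE):
    print(line)
```

With a single model in the run, as in this report, that model is trivially on the frontier, which is why the only line carries the marker.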