Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010
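
The header totals can be cross-checked against the per-task rows above. The following is a minimal sketch, assuming the reported cost is a plain sum of per-task costs; the `summarize` helper and the mean-score field are illustrative and not part of the report tooling.

```python
from decimal import Decimal

# Per-task rows copied from the report: (task_id, band, score, cost_usd).
rows = [
    ("python-security-fix-easy-001", "easy", Decimal("0.740"), Decimal("0.0010")),
    ("typescript-security-fix-easy-001", "easy", Decimal("0.740"), Decimal("0.0010")),
]


def summarize(rows):
    """Aggregate task count, total cost, and mean score (hypothetical helper)."""
    total_cost = sum((r[3] for r in rows), Decimal("0"))
    mean_score = sum((r[2] for r in rows), Decimal("0")) / len(rows)
    return {"tasks": len(rows), "cost_usd": total_cost, "mean_score": mean_score}


summary = summarize(rows)
print(f"Tasks: {summary['tasks']} | Cost: ${summary['cost_usd']:.4f}")
# prints: Tasks: 2 | Cost: $0.0020
```

Decimal arithmetic is used here so that four-decimal dollar amounts add up exactly instead of picking up binary floating-point noise.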

Leaderboard Snapshot

Latest run: 4a96fcff-455a-43ac-bea2-4ec1c13972e0 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:57:50.326542+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
4a96fcff-455a-43ac-bea2-4ec1c13972e0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:50.326542+00:00
dc6e6c5a-dbfe-4d09-9bdf-3bff429e7e3f | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:50.253552+00:00
1e6b64c6-2d9d-43a7-84cc-0121d8d5df7d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.924832+00:00
0845534a-4330-4e04-b5c3-facfdd0defc2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.881079+00:00
3f03bdec-d9f8-453e-92e3-edcf265c81b3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.824881+00:00

Failure Breakdown

Taxonomy | Failures
wrong-logic | 487

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
4a96fcff-455a-43ac-bea2-4ec1c13972e0 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:50.326542+00:00
dc6e6c5a-dbfe-4d09-9bdf-3bff429e7e3f | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:50.253552+00:00
1e6b64c6-2d9d-43a7-84cc-0121d8d5df7d | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.924832+00:00
0845534a-4330-4e04-b5c3-facfdd0defc2 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.881079+00:00
3f03bdec-d9f8-453e-92e3-edcf265c81b3 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.824881+00:00
7fdea9ca-6729-4d77-a715-de7041235807 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.760924+00:00
2c6b5289-8727-4932-a4f7-a86f54e5b746 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.717418+00:00
23e603ac-dd0f-40a5-95c4-a8618e7faf5a | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.675379+00:00
891f8020-65c4-4ace-9f82-6e8fde0fa560 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.621457+00:00
afdf1cf3-f345-4daf-8496-8c39dad860ea | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.568358+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0052  (coder)
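
For readers unfamiliar with the marker, a point is on the Pareto frontier when no other point is at least as good on both axes (pass rate and cost) and strictly better on one. The sketch below shows one way such a listing could be produced; the `Point`, `pareto_frontier`, and `render` names are hypothetical and are not taken from the tool that generated this report.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points that no other point dominates (higher pass rate, lower cost)."""
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: List[Point], width: int = 20) -> List[str]:
    """Format each point as a text bar, prefixing frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        marker = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{marker} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


# The single data point reported above.
print("\n".join(render([Point("coder", 1.0, 0.0052)])))
```

With only one model in the run, that model is trivially on the frontier, which is why the single row above carries the `*` marker.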