Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 |  | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 |  | $0.0010
python-bugfix-easy-001 | easy | 0.740 |  | $0.0010
typescript-bugfix-easy-001 | easy | 0.740 |  | $0.0010
python-performance-easy-001 | easy | 0.740 |  | $0.0010
typescript-performance-easy-001 | easy | 0.740 |  | $0.0010
python-test-writing-easy-001 | easy | 0.740 |  | $0.0010
typescript-test-writing-easy-001 | easy | 0.740 |  | $0.0010
python-multi-file-easy-001 | easy | 0.740 |  | $0.0010
typescript-multi-file-easy-001 | easy | 0.740 |  | $0.0010
python-refactor-easy-001 | easy | 0.740 |  | $0.0010
typescript-refactor-easy-001 | easy | 0.740 |  | $0.0010
python-config-easy-001 | easy | 0.740 |  | $0.0010
typescript-config-easy-001 | easy | 0.740 |  | $0.0010
python-recovery-easy-001 | easy | 0.740 |  | $0.0010
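
The summary cost can be cross-checked against the per-task rows. The following is a minimal sketch, assuming the summary figure is the plain sum of the per-task costs; the variable names are illustrative and not taken from the eval tool.

```python
from decimal import Decimal

# Assumption: each of the 15 task rows reports a cost of $0.0010,
# and the summary "Cost" is their unweighted sum.
per_task_costs = [Decimal("0.0010")] * 15

# Decimal avoids the rounding noise that float addition would introduce.
total = sum(per_task_costs, Decimal("0"))

print(f"Tasks: {len(per_task_costs)} | Cost: ${total}")
```

Under that assumption the result agrees with the $0.0150 shown in the summary line.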

Leaderboard Snapshot

Latest run: 1e6b64c6-2d9d-43a7-84cc-0121d8d5df7d | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:57:49.924832+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
1e6b64c6-2d9d-43a7-84cc-0121d8d5df7d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.924832+00:00
0845534a-4330-4e04-b5c3-facfdd0defc2 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.881079+00:00
3f03bdec-d9f8-453e-92e3-edcf265c81b3 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.824881+00:00
7fdea9ca-6729-4d77-a715-de7041235807 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.760924+00:00
2c6b5289-8727-4932-a4f7-a86f54e5b746 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:57:49.717418+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 485 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
1e6b64c6-2d9d-43a7-84cc-0121d8d5df7d | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.924832+00:00
0845534a-4330-4e04-b5c3-facfdd0defc2 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.881079+00:00
3f03bdec-d9f8-453e-92e3-edcf265c81b3 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.824881+00:00
7fdea9ca-6729-4d77-a715-de7041235807 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.760924+00:00
2c6b5289-8727-4932-a4f7-a86f54e5b746 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.717418+00:00
23e603ac-dd0f-40a5-95c4-a8618e7faf5a | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.675379+00:00
891f8020-65c4-4ace-9f82-6e8fde0fa560 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.621457+00:00
afdf1cf3-f345-4daf-8496-8c39dad860ea | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.568358+00:00
94eca4f4-17f1-4247-86b8-080263e0784f | python-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.502735+00:00
1c8a3444-2ed9-400f-838a-e43fb04195cc | typescript-performance-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:57:49.460662+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0052  (coder)
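
For readers unfamiliar with the `*` marker: a point is on the Pareto frontier when no other point has both a higher-or-equal pass rate and a lower-or-equal cost (with at least one strictly better). The sketch below shows one common way to compute this. It is not the eval tool's implementation, and the `Point` type, the `pareto_frontier` function, and the two `hypothetical-*` entries are assumptions added for illustration; only the `coder` point comes from this report.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    label: str
    pass_rate: float  # higher is better
    cost_usd: float   # lower is better


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates.

    Points are scanned from cheapest to most expensive; a point is kept
    only if its pass rate strictly exceeds that of every point seen so far.
    Exact duplicates after the first are dropped.
    """
    frontier: List[Point] = []
    best_pass_rate = float("-inf")
    for p in sorted(points, key=lambda q: (q.cost_usd, -q.pass_rate)):
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


points = [
    Point("coder", 1.0, 0.0052),            # from the report
    Point("hypothetical-a", 0.8, 0.0100),   # illustrative only
    Point("hypothetical-b", 0.6, 0.0020),   # illustrative only
]

for p in pareto_frontier(points):
    print(f"* {p.pass_rate:.1%} @ ${p.cost_usd:.4f}  ({p.label})")
```

With a single model in the run, as here, that model is trivially on the frontier.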