Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 15 | Pass rate: 100.0% | Cost: $0.0150

Task ID                           Band  Score  Passed  Cost
python-security-fix-easy-001      easy  0.740          $0.0010
typescript-security-fix-easy-001  easy  0.740          $0.0010
python-bugfix-easy-001            easy  0.740          $0.0010
typescript-bugfix-easy-001        easy  0.740          $0.0010
python-performance-easy-001       easy  0.740          $0.0010
typescript-performance-easy-001   easy  0.740          $0.0010
python-test-writing-easy-001      easy  0.740          $0.0010
typescript-test-writing-easy-001  easy  0.740          $0.0010
python-multi-file-easy-001        easy  0.740          $0.0010
typescript-multi-file-easy-001    easy  0.740          $0.0010
python-refactor-easy-001          easy  0.740          $0.0010
typescript-refactor-easy-001      easy  0.740          $0.0010
python-config-easy-001            easy  0.740          $0.0010
typescript-config-easy-001        easy  0.740          $0.0010
python-recovery-easy-001          easy  0.740          $0.0010

Leaderboard Snapshot

Latest run: bea02f6e-2c66-496c-9813-777f21092ce4 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:10:29.173022+00:00

Recent Trend

Run ID                                Model  Git SHA                                   Score  Created
bea02f6e-2c66-496c-9813-777f21092ce4  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:10:29.173022+00:00
0fe73653-43f9-4821-8dd8-91e05df54a4c  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:10:29.117797+00:00
78dd97d4-d0fc-41c5-96fa-8197f8f46511  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:10:29.075375+00:00
fccd5c50-9dad-4b55-9ece-21dfb063e884  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:10:29.042292+00:00
0a8b3bab-61a9-425d-bcf9-eeab85d5b83e  coder  4669773b4fbe9d507f1396f38777a1b36998faf3  0.740  2026-04-27T16:10:28.967613+00:00

Failure Breakdown

Taxonomy     Failures  Bar
wrong-logic  79        ###############################################################################

Recent Failures

Run ID                                Task ID                           Taxonomy     Score  Cost     Created
bea02f6e-2c66-496c-9813-777f21092ce4  python-recovery-easy-001          wrong-logic  0.740  $0.0010  2026-04-27T16:10:29.173022+00:00
0fe73653-43f9-4821-8dd8-91e05df54a4c  typescript-config-easy-001        wrong-logic  0.740  $0.0010  2026-04-27T16:10:29.117797+00:00
78dd97d4-d0fc-41c5-96fa-8197f8f46511  python-config-easy-001            wrong-logic  0.740  $0.0010  2026-04-27T16:10:29.075375+00:00
fccd5c50-9dad-4b55-9ece-21dfb063e884  typescript-refactor-easy-001      wrong-logic  0.740  $0.0010  2026-04-27T16:10:29.042292+00:00
0a8b3bab-61a9-425d-bcf9-eeab85d5b83e  python-refactor-easy-001          wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.967613+00:00
17bae31a-a633-4ac9-a39d-16bb1a676d5f  typescript-multi-file-easy-001    wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.932289+00:00
1d420357-0a32-4dc0-83c2-1ad3e4033a34  python-multi-file-easy-001        wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.901291+00:00
743d77c6-ee0a-4c0f-a261-12cabef521db  typescript-test-writing-easy-001  wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.870494+00:00
b9b9ded6-52bb-4fb7-839e-0dab197519bc  python-test-writing-easy-001      wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.836001+00:00
3a85cd4e-7898-446a-b327-36b192486b49  typescript-performance-easy-001   wrong-logic  0.740  $0.0010  2026-04-27T16:10:28.798533+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0056  (coder)
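
For readers unfamiliar with the frontier marking, the following is a minimal sketch of how a pass_rate vs cost_usd Pareto frontier could be computed and rendered. It is not the report generator's actual code. The names RunPoint, pareto_frontier, and render_line are hypothetical, the 20-character bar width is inferred from the line above, and the two "hypothetical-*" entries are made-up comparison points that do not appear in this report.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One model's aggregate result (hypothetical structure)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Keep points that no cheaper-or-equal-cost point matches or beats on pass rate."""
    frontier: List[RunPoint] = []
    best_pass_rate = float("-inf")
    # Cheapest first; on equal cost, the higher pass rate comes first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        if p.pass_rate > best_pass_rate:
            frontier.append(p)
            best_pass_rate = p.pass_rate
    return frontier


def render_line(p: RunPoint, on_frontier: bool, width: int = 20) -> str:
    """Format one row in the style of the Cost Frontier section (assumed layout)."""
    mark = "*" if on_frontier else " "
    bar = "#" * round(p.pass_rate * width)
    return f"{mark} [{bar:<{width}}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"


if __name__ == "__main__":
    points = [
        RunPoint("coder", 1.0, 0.0056),            # from the report above
        RunPoint("hypothetical-a", 0.8, 0.0100),   # made up: dominated by coder
        RunPoint("hypothetical-b", 0.6, 0.0020),   # made up: cheaper, lower pass rate
    ]
    frontier = set(pareto_frontier(points))
    for p in points:
        print(render_line(p, p in frontier))
```

With a single model in the snapshot, as here, that model is trivially on the frontier; the marking only becomes informative once several models with different cost and pass-rate trade-offs are recorded.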