Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: 9260d600-ce4f-4bc7-83a9-c1c6272ec07e | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T17:10:30.976101+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
9260d600-ce4f-4bc7-83a9-c1c6272ec07e | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:30.976101+00:00
794e7d9b-0a2d-42c3-804d-8e80c034a003 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:30.894167+00:00
91ab577d-e08f-4c96-872c-7ccaab94f905 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:30.323674+00:00
31e28fa4-c119-478a-95ad-6f6192646ad5 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:30.237233+00:00
93717657-6cd2-4106-adfe-d00bd1d27fd0 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T17:10:29.563951+00:00

Failure Breakdown

Taxonomy | Failures | Bar
wrong-logic | 581 | [####################]

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
9260d600-ce4f-4bc7-83a9-c1c6272ec07e | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:30.976101+00:00
794e7d9b-0a2d-42c3-804d-8e80c034a003 | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:30.894167+00:00
91ab577d-e08f-4c96-872c-7ccaab94f905 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:30.323674+00:00
31e28fa4-c119-478a-95ad-6f6192646ad5 | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:30.237233+00:00
93717657-6cd2-4106-adfe-d00bd1d27fd0 | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.563951+00:00
c57d188f-11b2-4694-ae2c-064000025a6e | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.451745+00:00
308f1d43-57cb-43cb-bcb6-1ad2656fb0b5 | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.382711+00:00
a2a9ec59-45e6-4581-a0b0-a416f8247194 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.313286+00:00
429c779d-c119-4674-ac3a-1f5e2f47a43c | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.237043+00:00
f7ff16cb-1b98-4e0e-bed1-2003bc9afc55 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T17:10:29.157102+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0051  (coder)
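
For readers unfamiliar with the marker, the sketch below shows one way a Pareto frontier over pass_rate and cost_usd could be computed and rendered in the format above. It is a minimal illustration, not the report generator's actual code: the names RunPoint, pareto_frontier, and render are hypothetical, and the 20-character bar width is assumed from the line above.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One model's aggregate result (hypothetical structure)."""
    model: str
    pass_rate: float  # fraction in [0, 1]
    cost_usd: float


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Return the points that no other point dominates.

    A point q dominates p when q is at least as good on both axes
    (pass_rate >= and cost_usd <=) and strictly better on at least one.
    """
    frontier = []
    for p in points:
        dominated = any(
            q.pass_rate >= p.pass_rate
            and q.cost_usd <= p.cost_usd
            and (q.pass_rate > p.pass_rate or q.cost_usd < p.cost_usd)
            for q in points
        )
        if not dominated:
            frontier.append(p)
    # Cheapest first, so the frontier reads left to right by cost.
    return sorted(frontier, key=lambda p: p.cost_usd)


def render(points: List[RunPoint], width: int = 20) -> List[str]:
    """Format each point as a bar line, marking frontier points with '*'."""
    frontier = set(pareto_frontier(points))
    lines = []
    for p in points:
        mark = "*" if p in frontier else " "
        filled = round(p.pass_rate * width)
        bar = "#" * filled + " " * (width - filled)
        lines.append(
            f"{mark} [{bar}] {p.pass_rate * 100:.1f}% @ ${p.cost_usd:.4f}  ({p.model})"
        )
    return lines


if __name__ == "__main__":
    # The single point reported in this section.
    for line in render([RunPoint("coder", 1.0, 0.0051)]):
        print(line)
```

With only one model in the report, that model is trivially on the frontier, which is why the single line above carries the marker.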