Eval Report: ci-pr-smoke

Profile: gdm-swebench-lite-v1 | Tasks: 2 | Pass rate: 100.0% | Cost: $0.0020

Task ID | Band | Score | Passed | Cost
python-security-fix-easy-001 | easy | 0.740 | | $0.0010
typescript-security-fix-easy-001 | easy | 0.740 | | $0.0010

Leaderboard Snapshot

Latest run: 2fb0e14a-5108-461e-8f47-0c0b92c00797 | Latest model: coder | Latest score: 0.740 | Recorded at: 2026-04-27T16:30:51.533855+00:00

Recent Trend

Run ID | Model | Git SHA | Score | Created
2fb0e14a-5108-461e-8f47-0c0b92c00797 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:30:51.533855+00:00
770b3ea4-a256-4796-9724-78ede2f5544c | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:30:51.214557+00:00
200431d2-d112-434e-9017-34ae7a807852 | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:30:50.698483+00:00
a704664f-e391-4e5b-be78-195a5f27508d | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:30:50.635234+00:00
21a6efed-e39d-4e43-94cf-093768d8828a | coder | 4669773b4fbe9d507f1396f38777a1b36998faf3 | 0.740 | 2026-04-27T16:30:50.562654+00:00

Failure Breakdown

Taxonomy | Failures
wrong-logic | 393

Recent Failures

Run ID | Task ID | Taxonomy | Score | Cost | Created
2fb0e14a-5108-461e-8f47-0c0b92c00797 | typescript-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:51.533855+00:00
770b3ea4-a256-4796-9724-78ede2f5544c | python-security-fix-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:51.214557+00:00
200431d2-d112-434e-9017-34ae7a807852 | python-recovery-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.698483+00:00
a704664f-e391-4e5b-be78-195a5f27508d | typescript-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.635234+00:00
21a6efed-e39d-4e43-94cf-093768d8828a | python-config-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.562654+00:00
e237292a-807d-40f4-a7d1-18ed94ab26e6 | typescript-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.496104+00:00
dad627b2-be64-4b0a-abf5-c5d2491343ef | python-refactor-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.362965+00:00
9d328720-ad85-44e2-8cfc-359ad3778668 | typescript-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.238341+00:00
11fc19e0-9778-4d1a-a794-708062ade6f8 | python-multi-file-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.141267+00:00
6496b2d3-c5b3-406e-8d4c-5594de4a1d26 | typescript-test-writing-easy-001 | wrong-logic | 0.740 | $0.0010 | 2026-04-27T16:30:50.024577+00:00

Cost Frontier

pass_rate vs cost_usd (Pareto frontier marked with *)
* [####################] 100.0% @ $0.0053  (coder)
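
For readers unfamiliar with the marker, the following is a minimal sketch of how a Pareto frontier over pass_rate and cost_usd could be computed and rendered in the format above. It is not the report generator's actual code: the names RunPoint, pareto_frontier, and render are assumptions, and the two "hypothetical-" points are invented purely for illustration. Only the coder point comes from this report.

```python
from typing import List, NamedTuple


class RunPoint(NamedTuple):
    """One (model, cost, pass rate) point. Hypothetical type for this sketch."""
    model: str
    cost_usd: float
    pass_rate: float  # fraction in [0, 1]


def pareto_frontier(points: List[RunPoint]) -> List[RunPoint]:
    """Keep points for which no cheaper-or-equal point has an equal or higher pass rate.

    Exact duplicates of a frontier point are dropped as well.
    """
    frontier: List[RunPoint] = []
    # Scan from cheapest to most expensive; at equal cost, best pass rate first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.pass_rate)):
        # A point joins the frontier only if it beats every cheaper point seen so far.
        if not frontier or p.pass_rate > frontier[-1].pass_rate:
            frontier.append(p)
    return frontier


def render(point: RunPoint, on_frontier: bool, width: int = 20) -> str:
    """Format one line in the style of the Cost Frontier section."""
    mark = "*" if on_frontier else " "
    bar = "#" * round(point.pass_rate * width)
    return (f"{mark} [{bar:<{width}}] {point.pass_rate:.1%} "
            f"@ ${point.cost_usd:.4f}  ({point.model})")


points = [
    RunPoint("coder", 0.0053, 1.0),            # from this report
    RunPoint("hypothetical-a", 0.0100, 1.0),   # invented: same pass rate, higher cost
    RunPoint("hypothetical-b", 0.0020, 0.5),   # invented: cheaper, lower pass rate
]
frontier = pareto_frontier(points)
lines = [render(p, p in frontier) for p in points]
for line in lines:
    print(line)
```

In this sketch hypothetical-a is dominated by coder (same pass rate at a higher cost) and gets no marker, while hypothetical-b stays on the frontier because nothing cheaper does better.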