model examples accuracy validity
opus-4.6 5 82.5 100
opus-4.6 3 82.0 100
opus-4.6 1 81.5 100
sonnet-4.6 5 80.5 99.4
sonnet-4.6 3 80.8 99.8
sonnet-4.6 1 80.3 99.8
haiku-4.5 5 76 99.9
haiku-4.5 3 76.1 99.8
haiku-4.5 1 76 99.9
qwen3.5-0.8b-tuned 0 88.8 100
haiku-4.5 0 77.7 100
sonnet-4.6 0 80.4 99.9
opus-4.6 0 81.3 100
gpt-5.4-mini 0 78 100
gpt-5.4 0 82.5 100
