| Model | Mean score | MMLU-Pro (CoT correct) | GPQA (CoT correct) | IFEval (strict accuracy) | WildBench (WB score) | Omni-MATH (accuracy) |
|---|---|---|---|---|---|---|
| GPT-5 mini (2025-08-07) | 0.819 | 0.835 | 0.756 | 0.927 | 0.855 | 0.722 |
| o4-mini (2025-04-16) | 0.812 | 0.820 | 0.735 | 0.929 | 0.854 | 0.720 |
| o3 (2025-04-16) | 0.811 | 0.859 | 0.753 | 0.869 | 0.861 | 0.714 |
| GPT-5 (2025-08-07) | 0.807 | 0.863 | 0.791 | 0.875 | 0.857 | 0.647 |
| Qwen3 235B A22B Instruct 2507 FP8 | 0.798 | 0.844 | 0.726 | 0.835 | 0.866 | 0.718 |
| Grok 4 (0709) | 0.785 | 0.851 | 0.726 | 0.949 | 0.797 | 0.603 |
| Claude 4 Opus (20250514, extended thinking) | 0.780 | 0.875 | 0.709 | 0.849 | 0.852 | 0.616 |
| gpt-oss-120b | 0.770 | 0.795 | 0.684 | 0.836 | 0.845 | 0.688 |
| Kimi K2 Instruct | 0.768 | 0.819 | 0.652 | 0.850 | 0.862 | 0.654 |
| Claude 4 Sonnet (20250514, extended thinking) | 0.766 | 0.843 | 0.706 | 0.840 | 0.838 | 0.602 |
| Claude 4.5 Sonnet (20250929) | 0.762 | 0.869 | 0.686 | 0.850 | 0.854 | 0.553 |
| Claude 4 Opus (20250514) | 0.757 | 0.859 | 0.666 | 0.918 | 0.833 | 0.511 |
| GPT-5 nano (2025-08-07) | 0.748 | 0.778 | 0.679 | 0.932 | 0.806 | 0.547 |
| Gemini 2.5 Pro (03-25 preview) | 0.745 | 0.863 | 0.749 | 0.840 | 0.857 | 0.416 |
| Claude 4 Sonnet (20250514) | 0.733 | 0.843 | 0.643 | 0.839 | 0.825 | 0.513 |
| Grok 3 Beta | 0.727 | 0.788 | 0.650 | 0.884 | 0.849 | 0.464 |
| GPT-4.1 (2025-04-14) | 0.727 | 0.811 | 0.659 | 0.838 | 0.854 | 0.471 |
| Qwen3 235B A22B FP8 Throughput | 0.726 | 0.817 | 0.623 | 0.816 | 0.828 | 0.548 |
| GPT-4.1 mini (2025-04-14) | 0.726 | 0.783 | 0.614 | 0.904 | 0.838 | 0.491 |
| Llama 4 Maverick (17Bx128E) Instruct FP8 | 0.718 | 0.810 | 0.650 | 0.908 | 0.800 | 0.422 |
| Qwen3-Next 80B A3B Thinking | 0.700 | 0.786 | 0.630 | 0.810 | 0.807 | 0.467 |
| DeepSeek-R1-0528 | 0.699 | 0.793 | 0.666 | 0.784 | 0.828 | 0.424 |
| Palmyra X5 | 0.696 | 0.804 | 0.661 | 0.823 | 0.780 | 0.415 |
| Grok 3 mini Beta | 0.679 | 0.799 | 0.675 | 0.951 | 0.651 | 0.318 |
| Gemini 2.0 Flash | 0.679 | 0.737 | 0.556 | 0.841 | 0.800 | 0.459 |
| Claude 3.7 Sonnet (20250219) | 0.674 | 0.784 | 0.608 | 0.834 | 0.814 | 0.330 |
| gpt-oss-20b | 0.674 | 0.740 | 0.594 | 0.732 | 0.737 | 0.565 |
| GLM-4.5-Air-FP8 | 0.670 | 0.762 | 0.594 | 0.812 | 0.789 | 0.391 |
| DeepSeek v3 | 0.665 | 0.723 | 0.538 | 0.832 | 0.831 | 0.403 |
| Gemini 1.5 Pro (002) | 0.657 | 0.737 | 0.534 | 0.837 | 0.813 | 0.364 |
| Claude 3.5 Sonnet (20241022) | 0.653 | 0.777 | 0.565 | 0.856 | 0.792 | 0.276 |
| Llama 4 Scout (17Bx16E) Instruct | 0.644 | 0.742 | 0.507 | 0.818 | 0.779 | 0.373 |
| Gemini 2.0 Flash Lite (02-05 preview) | 0.642 | 0.720 | 0.500 | 0.824 | 0.790 | 0.374 |
| Amazon Nova Premier | 0.637 | 0.726 | 0.518 | 0.803 | 0.788 | 0.350 |
| GPT-4o (2024-11-20) | 0.634 | 0.713 | 0.520 | 0.817 | 0.828 | 0.293 |
| Gemini 2.5 Flash (04-17 preview) | 0.626 | 0.639 | 0.390 | 0.898 | 0.817 | 0.384 |
| Llama 3.1 Instruct Turbo (405B) | 0.618 | 0.723 | 0.522 | 0.811 | 0.783 | 0.249 |
| GPT-4.1 nano (2025-04-14) | 0.616 | 0.550 | 0.507 | 0.843 | 0.811 | 0.367 |
| Palmyra-X-004 | 0.609 | 0.657 | 0.395 | 0.872 | 0.802 | 0.320 |
| Gemini 1.5 Flash (002) | 0.609 | 0.678 | 0.437 | 0.831 | 0.792 | 0.305 |
| Qwen2.5 Instruct Turbo (72B) | 0.599 | 0.631 | 0.426 | 0.806 | 0.802 | 0.330 |
| Mistral Large (2411) | 0.598 | 0.599 | 0.435 | 0.876 | 0.801 | 0.281 |
| Gemini 2.5 Flash-Lite | 0.591 | 0.537 | 0.309 | 0.810 | 0.818 | 0.480 |
| Amazon Nova Pro | 0.591 | 0.673 | 0.446 | 0.815 | 0.777 | 0.242 |
| Palmyra Fin | 0.577 | 0.591 | 0.422 | 0.793 | 0.783 | 0.295 |
| IBM Granite 4.0 Small | 0.575 | 0.569 | 0.383 | 0.890 | 0.739 | 0.296 |
| Llama 3.1 Instruct Turbo (70B) | 0.574 | 0.653 | 0.426 | 0.821 | 0.758 | 0.210 |
| GPT-4o mini (2024-07-18) | 0.565 | 0.603 | 0.368 | 0.782 | 0.791 | 0.280 |
| Mistral Small 3.1 (2503) | 0.558 | 0.610 | 0.392 | 0.750 | 0.788 | 0.248 |
| Amazon Nova Lite | 0.551 | 0.600 | 0.397 | 0.776 | 0.750 | 0.233 |
| Claude 3.5 Haiku (20241022) | 0.549 | 0.605 | 0.363 | 0.792 | 0.760 | 0.224 |
| Qwen2.5 Instruct Turbo (7B) | 0.529 | 0.539 | 0.341 | 0.741 | 0.731 | 0.294 |
| Amazon Nova Micro | 0.522 | 0.511 | 0.383 | 0.760 | 0.743 | 0.214 |
| IBM Granite 4.0 Micro | 0.486 | 0.395 | 0.307 | 0.849 | 0.670 | 0.209 |
| Mixtral Instruct (8x22B) | 0.478 | 0.460 | 0.334 | 0.724 | 0.711 | 0.163 |
| Palmyra Med | 0.476 | 0.411 | 0.368 | 0.767 | 0.676 | 0.156 |
| OLMo 2 32B Instruct March 2025 | 0.475 | 0.414 | 0.287 | 0.780 | 0.734 | 0.161 |
| IBM Granite 3.3 8B Instruct | 0.463 | 0.343 | 0.325 | 0.729 | 0.741 | 0.176 |
| Llama 3.1 Instruct Turbo (8B) | 0.444 | 0.406 | 0.247 | 0.743 | 0.686 | 0.137 |
| OLMo 2 13B Instruct November 2024 | 0.440 | 0.310 | 0.316 | 0.730 | 0.689 | 0.156 |
| OLMo 2 7B Instruct November 2024 | 0.405 | 0.292 | 0.296 | 0.693 | 0.628 | 0.116 |
| Mixtral Instruct (8x7B) | 0.397 | 0.335 | 0.296 | 0.575 | 0.673 | 0.105 |
| Mistral Instruct v0.3 (7B) | 0.376 | 0.277 | 0.303 | 0.567 | 0.660 | 0.072 |
| OLMoE 1B-7B Instruct January 2025 | 0.332 | 0.169 | 0.220 | 0.628 | 0.551 | 0.093 |
| Marin 8B Instruct | 0.325 | 0.188 | 0.168 | 0.632 | 0.477 | 0.160 |
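The "Mean score" column appears to be the unweighted average of the five per-benchmark scores, rounded to three decimals; for example, GPT-5 mini: (0.835 + 0.756 + 0.927 + 0.855 + 0.722) / 5 ≈ 0.819, matching the reported 0.819. The sketch below just restates that arithmetic; the `mean_score` helper is illustrative and not taken from any published evaluation code.

```python
# Minimal sketch (assumption): "Mean score" is the unweighted average of the
# five benchmark scores, rounded to three decimals. The helper name is ours.

def mean_score(mmlu_pro: float, gpqa: float, ifeval: float,
               wildbench: float, omni_math: float) -> float:
    """Return the unweighted mean of the five per-benchmark scores."""
    return round((mmlu_pro + gpqa + ifeval + wildbench + omni_math) / 5, 3)

# Example: GPT-5 mini (2025-08-07), values copied from the table above.
print(mean_score(0.835, 0.756, 0.927, 0.855, 0.722))  # -> 0.819
```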