| Model | Mean win rate | NarrativeQA - F1 | NaturalQuestions (open) - F1 | NaturalQuestions (closed) - F1 | OpenbookQA - EM | MMLU - EM | MATH - Equivalent (CoT) | GSM8K - EM | LegalBench - EM | MedQA - EM | WMT 2014 - BLEU-4 |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| GPT-4o (2024-05-13) | 0.938 | 0.804 | 0.803 | 0.501 | 0.966 | 0.748 | 0.829 | 0.905 | 0.733 | 0.857 | 0.231 |
| GPT-4o (2024-08-06) | 0.928 | 0.795 | 0.793 | 0.496 | 0.968 | 0.738 | 0.853 | 0.909 | 0.721 | 0.863 | 0.225 |
| DeepSeek v3 | 0.908 | 0.796 | 0.765 | 0.467 | 0.954 | 0.803 | 0.912 | 0.94 | 0.718 | 0.809 | 0.209 |
| Claude 3.5 Sonnet (20240620) | 0.885 | 0.746 | 0.749 | 0.502 | 0.972 | 0.799 | 0.813 | 0.949 | 0.707 | 0.825 | 0.229 |
| Amazon Nova Pro | 0.885 | 0.791 | 0.829 | 0.405 | 0.96 | 0.758 | 0.821 | 0.87 | 0.736 | 0.811 | 0.229 |
| GPT-4 (0613) | 0.867 | 0.768 | 0.79 | 0.457 | 0.96 | 0.735 | 0.802 | 0.932 | 0.713 | 0.815 | 0.211 |
| GPT-4 Turbo (2024-04-09) | 0.864 | 0.761 | 0.795 | 0.482 | 0.97 | 0.711 | 0.833 | 0.824 | 0.727 | 0.783 | 0.218 |
| Llama 3.1 Instruct Turbo (405B) | 0.854 | 0.749 | 0.756 | 0.456 | 0.94 | 0.759 | 0.827 | 0.949 | 0.707 | 0.805 | 0.238 |
| Claude 3.5 Sonnet (20241022) | 0.846 | 0.77 | 0.665 | 0.467 | 0.966 | 0.809 | 0.904 | 0.956 | 0.647 | 0.859 | 0.226 |
| Gemini 1.5 Pro (002) | 0.842 | 0.756 | 0.726 | 0.455 | 0.952 | 0.795 | 0.92 | 0.817 | 0.747 | 0.771 | 0.231 |
| Llama 3.2 Vision Instruct Turbo (90B) | 0.819 | 0.777 | 0.739 | 0.457 | 0.942 | 0.703 | 0.791 | 0.936 | 0.68 | 0.769 | 0.224 |
| Gemini 2.0 Flash (Experimental) | 0.813 | 0.783 | 0.722 | 0.443 | 0.946 | 0.717 | 0.901 | 0.946 | 0.674 | 0.73 | 0.212 |
| Llama 3.3 Instruct Turbo (70B) | 0.812 | 0.791 | 0.737 | 0.431 | 0.928 | 0.7 | 0.808 | 0.942 | 0.725 | 0.761 | 0.219 |
| Llama 3.1 Instruct Turbo (70B) | 0.808 | 0.772 | 0.738 | 0.452 | 0.938 | 0.709 | 0.783 | 0.938 | 0.687 | 0.769 | 0.223 |
| Palmyra-X-004 | 0.808 | 0.773 | 0.754 | 0.457 | 0.926 | 0.739 | 0.767 | 0.905 | 0.73 | 0.775 | 0.203 |
| Llama 3 (70B) | 0.793 | 0.798 | 0.743 | 0.475 | 0.934 | 0.695 | 0.663 | 0.805 | 0.733 | 0.777 | 0.225 |
| Qwen2 Instruct (72B) | 0.77 | 0.727 | 0.776 | 0.39 | 0.954 | 0.769 | 0.79 | 0.92 | 0.712 | 0.746 | 0.207 |
| Qwen2.5 Instruct Turbo (72B) | 0.745 | 0.745 | 0.676 | 0.359 | 0.962 | 0.77 | 0.884 | 0.9 | 0.74 | 0.753 | 0.207 |
| Mistral Large 2 (2407) | 0.744 | 0.779 | 0.734 | 0.453 | 0.932 | 0.725 | 0.677 | 0.912 | 0.646 | 0.775 | 0.192 |
| Gemini 1.5 Pro (001) | 0.739 | 0.783 | 0.748 | 0.378 | 0.902 | 0.772 | 0.825 | 0.836 | 0.757 | 0.692 | 0.189 |
| Amazon Nova Lite | 0.708 | 0.768 | 0.815 | 0.352 | 0.928 | 0.693 | 0.779 | 0.829 | 0.659 | 0.696 | 0.204 |
| Mixtral (8x22B) | 0.705 | 0.779 | 0.726 | 0.478 | 0.882 | 0.701 | 0.656 | 0.8 | 0.708 | 0.704 | 0.209 |
| GPT-4o mini (2024-07-18) | 0.701 | 0.768 | 0.746 | 0.386 | 0.92 | 0.668 | 0.802 | 0.843 | 0.653 | 0.748 | 0.206 |
| GPT-4 Turbo (1106 preview) | 0.698 | 0.727 | 0.763 | 0.435 | 0.95 | 0.699 | 0.857 | 0.668 | 0.626 | 0.817 | 0.205 |
| Claude 3 Opus (20240229) | 0.683 | 0.351 | 0.264 | 0.441 | 0.956 | 0.768 | 0.76 | 0.924 | 0.662 | 0.775 | 0.24 |
| Palmyra X V3 (72B) | 0.679 | 0.706 | 0.685 | 0.407 | 0.938 | 0.702 | 0.723 | 0.831 | 0.709 | 0.684 | 0.262 |
| Gemma 2 Instruct (27B) | 0.675 | 0.79 | 0.731 | 0.353 | 0.918 | 0.664 | 0.746 | 0.812 | 0.7 | 0.684 | 0.214 |
| Gemini 1.5 Flash (001) | 0.667 | 0.783 | 0.723 | 0.332 | 0.928 | 0.703 | 0.753 | 0.785 | 0.661 | 0.68 | 0.225 |
| PaLM-2 (Unicorn) | 0.644 | 0.583 | 0.674 | 0.435 | 0.938 | 0.702 | 0.674 | 0.831 | 0.677 | 0.684 | 0.26 |
| Jamba 1.5 Large | 0.637 | 0.664 | 0.718 | 0.394 | 0.948 | 0.683 | 0.692 | 0.846 | 0.675 | 0.698 | 0.203 |
| Qwen1.5 (72B) | 0.608 | 0.601 | 0.758 | 0.417 | 0.93 | 0.647 | 0.683 | 0.799 | 0.694 | 0.67 | 0.201 |
| Solar Pro | 0.602 | 0.753 | 0.792 | 0.297 | 0.922 | 0.679 | 0.567 | 0.871 | 0.67 | 0.698 | 0.169 |
| Palmyra X V2 (33B) | 0.589 | 0.752 | 0.752 | 0.428 | 0.878 | 0.621 | 0.58 | 0.735 | 0.644 | 0.598 | 0.239 |
| Gemini 1.5 Flash (002) | 0.573 | 0.746 | 0.718 | 0.323 | 0.914 | 0.679 | 0.908 | 0.328 | 0.67 | 0.656 | 0.212 |
| Yi (34B) | 0.57 | 0.782 | 0.775 | 0.443 | 0.92 | 0.65 | 0.375 | 0.648 | 0.618 | 0.656 | 0.172 |
| Gemma 2 Instruct (9B) | 0.562 | 0.768 | 0.738 | 0.328 | 0.91 | 0.645 | 0.724 | 0.762 | 0.639 | 0.63 | 0.201 |
| Qwen1.5 Chat (110B) | 0.55 | 0.721 | 0.739 | 0.35 | 0.922 | 0.704 | 0.568 | 0.815 | 0.624 | 0.64 | 0.192 |
| Qwen1.5 (32B) | 0.546 | 0.589 | 0.777 | 0.353 | 0.932 | 0.628 | 0.733 | 0.773 | 0.636 | 0.656 | 0.193 |
| Claude 3.5 Haiku (20241022) | 0.531 | 0.763 | 0.639 | 0.344 | 0.854 | 0.671 | 0.872 | 0.815 | 0.631 | 0.722 | 0.135 |
| PaLM-2 (Bison) | 0.526 | 0.718 | 0.813 | 0.39 | 0.878 | 0.608 | 0.421 | 0.61 | 0.645 | 0.547 | 0.241 |
| Amazon Nova Micro | 0.524 | 0.744 | 0.779 | 0.285 | 0.888 | 0.64 | 0.76 | 0.794 | 0.615 | 0.608 | 0.192 |
| Claude v1.3 | 0.518 | 0.723 | 0.699 | 0.409 | 0.908 | 0.631 | 0.54 | 0.784 | 0.629 | 0.618 | 0.219 |
| Mixtral (8x7B 32K seqlen) | 0.51 | 0.767 | 0.699 | 0.427 | 0.868 | 0.649 | 0.494 | 0.622 | 0.63 | 0.652 | 0.19 |
| Phi-3 (14B) | 0.509 | 0.724 | 0.729 | 0.278 | 0.916 | 0.675 | 0.611 | 0.878 | 0.593 | 0.696 | 0.17 |
| Claude 2.0 | 0.489 | 0.718 | 0.67 | 0.428 | 0.862 | 0.639 | 0.603 | 0.583 | 0.643 | 0.652 | 0.219 |
| DeepSeek LLM Chat (67B) | 0.488 | 0.581 | 0.733 | 0.412 | 0.88 | 0.641 | 0.615 | 0.795 | 0.637 | 0.628 | 0.186 |
| Qwen2.5 Instruct Turbo (7B) | 0.488 | 0.742 | 0.725 | 0.205 | 0.862 | 0.658 | 0.835 | 0.83 | 0.632 | 0.6 | 0.155 |
| Llama 2 (70B) | 0.482 | 0.763 | 0.674 | 0.46 | 0.838 | 0.58 | 0.323 | 0.567 | 0.673 | 0.618 | 0.196 |
| Phi-3 (7B) | 0.473 | 0.754 | 0.675 | 0.324 | 0.912 | 0.659 | 0.703 | - | 0.584 | 0.672 | 0.154 |
| Yi Large (Preview) | 0.471 | 0.373 | 0.586 | 0.428 | 0.946 | 0.712 | 0.712 | 0.69 | 0.519 | 0.66 | 0.176 |
| Command R Plus | 0.441 | 0.735 | 0.711 | 0.343 | 0.828 | 0.59 | 0.403 | 0.738 | 0.672 | 0.567 | 0.203 |
| GPT-3.5 (text-davinci-003) | 0.439 | 0.731 | 0.77 | 0.413 | 0.828 | 0.555 | 0.449 | 0.615 | 0.622 | 0.531 | 0.191 |
| Claude 2.1 | 0.437 | 0.677 | 0.611 | 0.375 | 0.872 | 0.643 | 0.632 | 0.604 | 0.643 | 0.644 | 0.204 |
| Qwen1.5 (14B) | 0.425 | 0.711 | 0.772 | 0.3 | 0.862 | 0.626 | 0.686 | 0.693 | 0.593 | 0.515 | 0.178 |
| Gemini 1.0 Pro (002) | 0.422 | 0.751 | 0.714 | 0.391 | 0.788 | 0.534 | 0.665 | 0.816 | 0.475 | 0.483 | 0.194 |
| Jamba 1.5 Mini | 0.414 | 0.746 | 0.71 | 0.388 | 0.89 | 0.582 | 0.318 | 0.691 | 0.503 | 0.632 | 0.179 |
| Claude Instant 1.2 | 0.399 | 0.616 | 0.731 | 0.343 | 0.844 | 0.631 | 0.499 | 0.721 | 0.586 | 0.559 | 0.194 |
| Llama 3 (8B) | 0.387 | 0.754 | 0.681 | 0.378 | 0.766 | 0.602 | 0.391 | 0.499 | 0.637 | 0.581 | 0.183 |
| Claude 3 Sonnet (20240229) | 0.377 | 0.111 | 0.072 | 0.028 | 0.918 | 0.652 | 0.084 | 0.907 | 0.49 | 0.684 | 0.218 |
| GPT-3.5 Turbo (0613) | 0.358 | 0.655 | 0.678 | 0.335 | 0.838 | 0.614 | 0.667 | 0.501 | 0.528 | 0.622 | 0.187 |
| LLaMA (65B) | 0.345 | 0.755 | 0.672 | 0.433 | 0.754 | 0.584 | 0.257 | 0.489 | 0.48 | 0.507 | 0.189 |
| Arctic Instruct | 0.338 | 0.654 | 0.586 | 0.39 | 0.828 | 0.575 | 0.519 | 0.768 | 0.588 | 0.581 | 0.172 |
| Gemma (7B) | 0.336 | 0.752 | 0.665 | 0.336 | 0.808 | 0.571 | 0.5 | 0.559 | 0.581 | 0.513 | 0.187 |
| GPT-3.5 (text-davinci-002) | 0.336 | 0.719 | 0.71 | 0.394 | 0.796 | 0.568 | 0.428 | 0.479 | 0.58 | 0.525 | 0.174 |
| Mistral NeMo (2402) | 0.333 | 0.731 | 0.65 | 0.265 | 0.822 | 0.604 | 0.668 | 0.782 | 0.415 | 0.59 | 0.177 |
| Mistral Large (2402) | 0.328 | 0.454 | 0.485 | 0.311 | 0.894 | 0.638 | 0.75 | 0.694 | 0.479 | 0.499 | 0.182 |
| Command | 0.327 | 0.749 | 0.777 | 0.391 | 0.774 | 0.525 | 0.236 | 0.452 | 0.578 | 0.445 | 0.088 |
| Llama 3.2 Vision Instruct Turbo (11B) | 0.325 | 0.756 | 0.671 | 0.234 | 0.724 | 0.511 | 0.739 | 0.823 | 0.435 | 0.27 | 0.179 |
| Llama 3.1 Instruct Turbo (8B) | 0.303 | 0.756 | 0.677 | 0.209 | 0.74 | 0.5 | 0.703 | 0.798 | 0.342 | 0.245 | 0.181 |
| Command R | 0.299 | 0.742 | 0.72 | 0.352 | 0.782 | 0.567 | 0.266 | 0.551 | 0.507 | 0.555 | 0.149 |
| Mistral v0.1 (7B) | 0.292 | 0.716 | 0.687 | 0.367 | 0.776 | 0.584 | 0.297 | 0.377 | 0.58 | 0.525 | 0.16 |
| DBRX Instruct | 0.289 | 0.488 | 0.55 | 0.284 | 0.91 | 0.643 | 0.358 | 0.671 | 0.426 | 0.694 | 0.131 |
| Mistral Small (2402) | 0.288 | 0.519 | 0.587 | 0.304 | 0.862 | 0.593 | 0.621 | 0.734 | 0.389 | 0.616 | 0.169 |
| Jamba Instruct | 0.287 | 0.658 | 0.636 | 0.384 | 0.796 | 0.582 | 0.38 | 0.67 | 0.54 | 0.519 | 0.164 |
| Qwen1.5 (7B) | 0.275 | 0.448 | 0.749 | 0.27 | 0.806 | 0.569 | 0.561 | 0.6 | 0.523 | 0.479 | 0.153 |
| Mistral Medium (2312) | 0.268 | 0.449 | 0.468 | 0.29 | 0.83 | 0.618 | 0.565 | 0.706 | 0.452 | 0.61 | 0.169 |
| Claude 3 Haiku (20240307) | 0.263 | 0.244 | 0.252 | 0.144 | 0.838 | 0.662 | 0.131 | 0.699 | 0.46 | 0.702 | 0.148 |
| Yi (6B) | 0.253 | 0.702 | 0.748 | 0.31 | 0.8 | 0.53 | 0.126 | 0.375 | 0.519 | 0.497 | 0.117 |
| Llama 2 (13B) | 0.233 | 0.741 | 0.64 | 0.371 | 0.634 | 0.505 | 0.102 | 0.266 | 0.591 | 0.392 | 0.167 |
| Falcon (40B) | 0.217 | 0.671 | 0.676 | 0.392 | 0.662 | 0.507 | 0.128 | 0.267 | 0.442 | 0.419 | 0.162 |
| Jurassic-2 Jumbo (178B) | 0.215 | 0.728 | 0.65 | 0.385 | 0.688 | 0.483 | 0.103 | 0.239 | 0.533 | 0.431 | 0.114 |
| Mistral Instruct v0.3 (7B) | 0.196 | 0.716 | 0.68 | 0.253 | 0.79 | 0.51 | 0.289 | 0.538 | 0.331 | 0.517 | 0.142 |
| Jurassic-2 Grande (17B) | 0.172 | 0.744 | 0.627 | 0.35 | 0.614 | 0.471 | 0.064 | 0.159 | 0.468 | 0.39 | 0.102 |
| Phi-2 | 0.169 | 0.703 | 0.68 | 0.155 | 0.798 | 0.518 | 0.255 | 0.581 | 0.334 | 0.41 | 0.038 |
| Llama 2 (7B) | 0.152 | 0.686 | 0.612 | 0.333 | 0.544 | 0.425 | 0.097 | 0.154 | 0.502 | 0.392 | 0.144 |
| Luminous Supreme (70B) | 0.145 | 0.743 | 0.656 | 0.299 | 0.284 | 0.316 | 0.078 | 0.137 | 0.452 | 0.276 | 0.102 |
| Command Light | 0.105 | 0.629 | 0.686 | 0.195 | 0.398 | 0.386 | 0.098 | 0.149 | 0.397 | 0.312 | 0.023 |
| Luminous Extended (30B) | 0.078 | 0.684 | 0.611 | 0.253 | 0.272 | 0.248 | 0.04 | 0.075 | 0.421 | 0.276 | 0.083 |
| Falcon (7B) | 0.064 | 0.621 | 0.58 | 0.285 | 0.26 | 0.288 | 0.044 | 0.055 | 0.346 | 0.254 | 0.094 |
| OLMo (7B) | 0.052 | 0.597 | 0.603 | 0.259 | 0.222 | 0.305 | 0.029 | 0.044 | 0.341 | 0.229 | 0.097 |
| Luminous Base (13B) | 0.041 | 0.633 | 0.577 | 0.197 | 0.286 | 0.243 | 0.026 | 0.028 | 0.332 | 0.26 | 0.066 |
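
The "Mean win rate" column is an aggregate ranking metric rather than a raw score: for each scenario, a model's win rate is the fraction of other models it outscores, and the mean win rate averages those per-scenario win rates. Below is a minimal sketch of that computation, assuming ties count as half a win and that models missing a scenario are skipped for it; the `scores` values and all names are hypothetical, not drawn from the table above.

```python
# Hypothetical per-scenario scores: {model: {scenario: score}}.
scores = {
    "model_a": {"mmlu": 0.75, "gsm8k": 0.90},
    "model_b": {"mmlu": 0.70, "gsm8k": 0.95},
    "model_c": {"mmlu": 0.65, "gsm8k": 0.80},
}

def mean_win_rate(scores):
    """Average, over scenarios, of the fraction of other models
    a model outscores on that scenario (ties = half a win)."""
    models = list(scores)
    scenarios = {s for per_model in scores.values() for s in per_model}
    win_rates = {m: [] for m in models}
    for scenario in scenarios:
        # Only compare models that were actually evaluated on this scenario.
        entrants = [m for m in models if scenario in scores[m]]
        if len(entrants) < 2:
            continue
        for m in entrants:
            wins = sum(
                1.0 if scores[m][scenario] > scores[o][scenario]
                else 0.5 if scores[m][scenario] == scores[o][scenario]
                else 0.0
                for o in entrants if o != m
            )
            win_rates[m].append(wins / (len(entrants) - 1))
    return {m: sum(r) / len(r) for m, r in win_rates.items() if r}

print(mean_win_rate(scores))
# {'model_a': 0.75, 'model_b': 0.75, 'model_c': 0.0}
```

Because it is a rank-based aggregate over whichever models are in the comparison pool, a mean win rate shifts when models are added or removed, so values are only comparable within a single snapshot of the table.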