
========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/C++
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 22.0/22.0 [00:18<00:00, 1.19 files/s]
Loaded 500,000 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0        15 (0.0%)        0.0%
                    4         2 (0.0%)        0.0%
                    5        23 (0.0%)        0.0%
                    6       488 (0.1%)        0.1%
                    7     1,046 (0.2%)        0.3%
                    8     2,564 (0.5%)        0.8%
                    9     6,428 (1.3%)        2.1%  ██
                   10     5,672 (1.1%)        3.2%  █
                   11    29,893 (6.0%)        9.2%  █████████
                   12   76,243 (15.2%)       24.5%  ████████████████████████
                   13  154,348 (30.9%)       55.3%  ██████████████████████████████████████████████████
                   14   90,041 (18.0%)       73.4%  █████████████████████████████
                   15   92,720 (18.5%)       91.9%  ██████████████████████████████
                   16    30,374 (6.1%)       98.0%  █████████
                   17     8,482 (1.7%)       99.7%  ██
                   18     1,649 (0.3%)      100.0%
                   19        12 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   500,000 │
│ Mean Length   │    5527.5 │
│ Std Dev       │   56011.5 │
│ Min Length    │         3 │
│ Max Length    │ 7,442,785 │
│ Median Length │    1426.0 │
│ P1            │        91 │
│ P5            │       220 │
│ P10           │       331 │
│ P25           │       639 │
│ P50           │     1,426 │
│ P75           │     3,557 │
│ P90           │     8,663 │
│ P95           │    15,045 │
│ P99           │    46,325 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [3, 331)   49,895 (10.0%)       10.0%  █████████████████████████████████████████████████
           [331, 531)   49,917 (10.0%)       20.0%  █████████████████████████████████████████████████
           [531, 757)   50,058 (10.0%)       30.0%  █████████████████████████████████████████████████
         [757, 1,043)   49,969 (10.0%)       40.0%  █████████████████████████████████████████████████
       [1,043, 1,426)   50,053 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,426, 1,980)   50,066 (10.0%)       60.0%  ██████████████████████████████████████████████████
       [1,980, 2,874)   50,027 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,874, 4,515)   50,008 (10.0%)       80.0%  █████████████████████████████████████████████████
       [4,515, 8,663)   50,003 (10.0%)       90.0%  █████████████████████████████████████████████████
   [8,663, 7,442,786)   50,004 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/C
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 26.0/26.0 [00:20<00:00, 1.28 files/s]
Loaded 488,153 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         9 (0.0%)        0.0%
                    2         1 (0.0%)        0.0%
                    3         2 (0.0%)        0.0%
                    4        16 (0.0%)        0.0%
                    5        60 (0.0%)        0.0%
                    6     1,203 (0.2%)        0.3%
                    7     2,504 (0.5%)        0.8%
                    8     6,105 (1.3%)        2.0%  ██
                    9    12,946 (2.7%)        4.7%  ████
                   10    10,860 (2.2%)        6.9%  ███
                   11   50,052 (10.3%)       17.2%  ██████████████████
                   12   89,082 (18.2%)       35.4%  ████████████████████████████████
                   13  136,477 (28.0%)       63.4%  ██████████████████████████████████████████████████
                   14   75,241 (15.4%)       78.8%  ███████████████████████████
                   15   71,436 (14.6%)       93.4%  ██████████████████████████
                   16    23,824 (4.9%)       98.3%  ████████
                   17     6,676 (1.4%)       99.7%  ██
                   18     1,659 (0.3%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   488,153 │
│ Mean Length   │    6995.2 │
│ Std Dev       │   60885.0 │
│ Min Length    │         2 │
│ Max Length    │ 8,305,354 │
│ Median Length │    1354.0 │
│ P1            │        50 │
│ P5            │       133 │
│ P10           │       217 │
│ P25           │       490 │
│ P50           │     1,354 │
│ P75           │     3,731 │
│ P90           │    10,277 │
│ P95           │    19,546 │
│ P99           │    73,425 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [2, 217)   48,597 (10.0%)       10.0%  █████████████████████████████████████████████████
           [217, 388)   48,942 (10.0%)       20.0%  ██████████████████████████████████████████████████
           [388, 607)   48,804 (10.0%)       30.0%  █████████████████████████████████████████████████
           [607, 913)   48,789 (10.0%)       40.0%  █████████████████████████████████████████████████
         [913, 1,354)   48,931 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,354, 1,970)   48,763 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,970, 2,953)   48,866 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,953, 4,915)   48,814 (10.0%)       80.0%  █████████████████████████████████████████████████
      [4,915, 10,277)   48,829 (10.0%)       90.0%  █████████████████████████████████████████████████
  [10,277, 8,305,355)   48,818 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/C-Sharp
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 14.0/14.0 [00:17<00:00, 1.26s/ files]
Loaded 499,981 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         6 (0.0%)        0.0%
                    5        11 (0.0%)        0.0%
                    6       114 (0.0%)        0.0%
                    7       263 (0.1%)        0.1%
                    8       831 (0.2%)        0.2%
                    9     3,111 (0.6%)        0.9%  █
                   10     4,087 (0.8%)        1.7%  █
                   11    23,075 (4.6%)        6.3%  ███████
                   12   53,933 (10.8%)       17.1%  ██████████████████
                   13  146,011 (29.2%)       46.3%  ██████████████████████████████████████████████████
                   14   92,148 (18.4%)       64.7%  ███████████████████████████████
                   15  128,471 (25.7%)       90.4%  ███████████████████████████████████████████
                   16    31,604 (6.3%)       96.7%  ██████████
                   17    11,799 (2.4%)       99.1%  ████
                   18     3,529 (0.7%)       99.8%  █
                   19       988 (0.2%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,981 │
│ Mean Length   │    3359.2 │
│ Std Dev       │   18619.0 │
│ Min Length    │         3 │
│ Max Length    │ 4,627,860 │
│ Median Length │    1194.0 │
│ P1            │       120 │
│ P5            │       213 │
│ P10           │       290 │
│ P25           │       530 │
│ P50           │     1,194 │
│ P75           │     2,924 │
│ P90           │     6,683 │
│ P95           │    11,176 │
│ P99           │    31,969 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [3, 290)   49,862 (10.0%)       10.0%  █████████████████████████████████████████████████
           [290, 443)   50,110 (10.0%)       20.0%  ██████████████████████████████████████████████████
           [443, 627)   49,883 (10.0%)       30.0%  █████████████████████████████████████████████████
           [627, 866)   50,026 (10.0%)       40.0%  █████████████████████████████████████████████████
         [866, 1,194)   50,078 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,194, 1,661)   49,965 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,661, 2,387)   50,029 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,387, 3,676)   50,026 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,676, 6,683)   50,003 (10.0%)       90.0%  █████████████████████████████████████████████████
   [6,683, 4,627,861)   49,999 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/Go
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 17.0/17.0 [00:16<00:00, 1.02 files/s]
Loaded 499,977 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0        13 (0.0%)        0.0%
                    1         1 (0.0%)        0.0%
                    5         4 (0.0%)        0.0%
                    6       126 (0.0%)        0.0%
                    7       339 (0.1%)        0.1%
                    8     1,171 (0.2%)        0.3%
                    9     4,141 (0.8%)        1.2%  █
                   10     3,875 (0.8%)        1.9%  █
                   11    19,479 (3.9%)        5.8%  ██████
                   12    43,434 (8.7%)       14.5%  ███████████████
                   13   91,898 (18.4%)       32.9%  ████████████████████████████████
                   14   71,491 (14.3%)       47.2%  █████████████████████████
                   15  139,842 (28.0%)       75.2%  ██████████████████████████████████████████████████
                   16   68,424 (13.7%)       88.9%  ████████████████████████
                   17    44,941 (9.0%)       97.8%  ████████████████
                   18    10,620 (2.1%)      100.0%  ███
                   19       178 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,977 │
│ Mean Length   │    3180.2 │
│ Std Dev       │   19540.9 │
│ Min Length    │         6 │
│ Max Length    │ 5,090,620 │
│ Median Length │    1324.0 │
│ P1            │        73 │
│ P5            │       169 │
│ P10           │       269 │
│ P25           │       565 │
│ P50           │     1,324 │
│ P75           │     3,048 │
│ P90           │     6,416 │
│ P95           │    10,178 │
│ P99           │    25,339 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [6, 269)   49,947 (10.0%)       10.0%  █████████████████████████████████████████████████
           [269, 461)   49,899 (10.0%)       20.0%  █████████████████████████████████████████████████
           [461, 680)   50,139 (10.0%)       30.0%  ██████████████████████████████████████████████████
           [680, 961)   49,875 (10.0%)       40.0%  █████████████████████████████████████████████████
         [961, 1,324)   50,094 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,324, 1,815)   49,952 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,815, 2,540)   50,059 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,540, 3,741)   49,995 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,741, 6,416)   50,013 (10.0%)       90.0%  █████████████████████████████████████████████████
   [6,416, 5,090,621)   50,004 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/Java
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 1.00/1.00 [01:51<00:00, 111s/ files]
Loaded 499,992 samples

Label Distribution:
                Label            Count  Percentile  Bar
                   -1         1 (0.0%)        0.0%
                    0         7 (0.0%)        0.0%
                    1         1 (0.0%)        0.0%
                    5         7 (0.0%)        0.0%
                    6       211 (0.0%)        0.0%
                    7       414 (0.1%)        0.1%
                    8     1,200 (0.2%)        0.4%
                    9     4,249 (0.8%)        1.2%  █
                   10     4,948 (1.0%)        2.2%  █
                   11    27,459 (5.5%)        7.7%  ██████████
                   12   62,200 (12.4%)       20.1%  ██████████████████████
                   13  136,470 (27.3%)       47.4%  ██████████████████████████████████████████████████
                   14   85,295 (17.1%)       64.5%  ███████████████████████████████
                   15  125,161 (25.0%)       89.5%  █████████████████████████████████████████████
                   16    37,092 (7.4%)       96.9%  █████████████
                   17    11,105 (2.2%)       99.2%  ████
                   18     4,128 (0.8%)      100.0%  █
                   19        44 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,992 │
│ Mean Length   │   3,278.1 │
│ Std Dev       │  20,262.2 │
│ Min Length    │         6 │
│ Max Length    │ 6,250,742 │
│ Median Length │   1,320.0 │
│ P1            │        85 │
│ P5            │       190 │
│ P10           │       291 │
│ P25           │       586 │
│ P50           │     1,320 │
│ P75           │     3,016 │
│ P90           │     6,585 │
│ P95           │    10,794 │
│ P99           │    28,422 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [6, 291)   49,962 (10.0%)       10.0%  █████████████████████████████████████████████████
           [291, 480)   50,025 (10.0%)       20.0%  █████████████████████████████████████████████████
           [480, 701)   49,973 (10.0%)       30.0%  █████████████████████████████████████████████████
           [701, 973)   49,931 (10.0%)       40.0%  █████████████████████████████████████████████████
         [973, 1,320)   50,085 (10.0%)       50.0%  ██████████████████████████████████████████████████
       [1,320, 1,796)   50,003 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,796, 2,505)   49,987 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,505, 3,723)   50,025 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,723, 6,585)   49,999 (10.0%)       90.0%  █████████████████████████████████████████████████
   [6,585, 6,250,743)   50,002 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
JavaScript
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 2.00/2.00 [01:50<00:00, 55.4s/ files]
Loaded 499,989 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         7 (0.0%)        0.0%
                    3         1 (0.0%)        0.0%
                    4         7 (0.0%)        0.0%
                    5        21 (0.0%)        0.0%
                    6       423 (0.1%)        0.1%
                    7       960 (0.2%)        0.3%
                    8     2,958 (0.6%)        0.9%
                    9     8,371 (1.7%)        2.5%  ██
                   10     7,931 (1.6%)        4.1%  ██
                   11    39,629 (7.9%)       12.1%  █████████████
                   12   72,230 (14.4%)       26.5%  ███████████████████████
                   13  108,843 (21.8%)       48.3%  ████████████████████████████████████
                   14   79,512 (15.9%)       64.2%  ██████████████████████████
                   15  150,961 (30.2%)       94.4%  ██████████████████████████████████████████████████
                   16    23,707 (4.7%)       99.1%  ███████
                   17     3,589 (0.7%)       99.8%  █
                   18       823 (0.2%)      100.0%
                   19        16 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,989 │
│ Mean Length   │   3,368.4 │
│ Std Dev       │  33,606.8 │
│ Min Length    │         3 │
│ Max Length    │ 4,073,474 │
│ Median Length │   1,039.0 │
│ P1            │        58 │
│ P5            │       143 │
│ P10           │       223 │
│ P25           │       454 │
│ P50           │     1,039 │
│ P75           │     2,444 │
│ P90           │     5,381 │
│ P95           │     8,878 │
│ P99           │    26,065 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [3, 223)   49,832 (10.0%)       10.0%  █████████████████████████████████████████████████
           [223, 373)   50,096 (10.0%)       20.0%  █████████████████████████████████████████████████
           [373, 543)   49,817 (10.0%)       29.9%  █████████████████████████████████████████████████
           [543, 755)   50,183 (10.0%)       40.0%  ██████████████████████████████████████████████████
         [755, 1,039)   50,018 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,039, 1,431)   49,966 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,431, 2,017)   50,052 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,017, 3,029)   50,012 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,029, 5,381)   50,004 (10.0%)       90.0%  █████████████████████████████████████████████████
   [5,381, 4,073,475)   50,009 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
Markdown
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 2.00/2.00 [01:35<00:00, 47.9s/ files]
Loaded 499,999 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    3        17 (0.0%)        0.0%
                    4       190 (0.0%)        0.0%
                    5       979 (0.2%)        0.2%
                    6    18,020 (3.6%)        3.8%  ███████
                    7    33,123 (6.6%)       10.5%  ██████████████
                    8   71,425 (14.3%)       24.8%  ██████████████████████████████
                    9  117,140 (23.4%)       48.2%  ██████████████████████████████████████████████████
                   10   59,020 (11.8%)       60.0%  █████████████████████████
                   11  114,894 (23.0%)       83.0%  █████████████████████████████████████████████████
                   12   53,041 (10.6%)       93.6%  ██████████████████████
                   13    19,211 (3.8%)       97.4%  ████████
                   14     5,450 (1.1%)       98.5%  ██
                   15     5,512 (1.1%)       99.6%  ██
                   16     1,609 (0.3%)       99.9%
                   17       326 (0.1%)      100.0%
                   18        36 (0.0%)      100.0%
                   19         6 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,999 │
│ Mean Length   │   2,839.2 │
│ Std Dev       │  31,630.1 │
│ Min Length    │         2 │
│ Max Length    │ 9,303,697 │
│ Median Length │     583.0 │
│ P1            │        10 │
│ P5            │        19 │
│ P10           │        31 │
│ P25           │       106 │
│ P50           │       583 │
│ P75           │     2,190 │
│ P90           │     5,744 │
│ P95           │    10,308 │
│ P99           │    27,406 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
              [2, 31)    49,272 (9.9%)        9.9%  ████████████████████████████████████████████████
             [31, 71)   49,834 (10.0%)       19.8%  █████████████████████████████████████████████████
            [71, 162)   50,580 (10.1%)       29.9%  ██████████████████████████████████████████████████
           [162, 316)   50,242 (10.0%)       40.0%  █████████████████████████████████████████████████
           [316, 583)   49,943 (10.0%)       50.0%  █████████████████████████████████████████████████
           [583, 999)   50,063 (10.0%)       60.0%  █████████████████████████████████████████████████
         [999, 1,681)   49,992 (10.0%)       70.0%  █████████████████████████████████████████████████
       [1,681, 2,888)   50,073 (10.0%)       80.0%  █████████████████████████████████████████████████
       [2,888, 5,744)   49,986 (10.0%)       90.0%  █████████████████████████████████████████████████
   [5,744, 9,303,698)   50,014 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
PHP
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 17.0/17.0 [00:17<00:00, 1.01s/ files]
Loaded 499,991 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         6 (0.0%)        0.0%
                    1         1 (0.0%)        0.0%
                    4         4 (0.0%)        0.0%
                    5        51 (0.0%)        0.0%
                    6     1,866 (0.4%)        0.4%  █
                    7     3,399 (0.7%)        1.1%  ██
                    8     9,575 (1.9%)        3.0%  █████
                    9    27,528 (5.5%)        8.5%  ████████████████
                   10    26,141 (5.2%)       13.7%  ███████████████
                   11   80,672 (16.1%)       29.8%  ████████████████████████████████████████████████
                   12   77,958 (15.6%)       45.4%  ██████████████████████████████████████████████
                   13   79,962 (16.0%)       61.4%  ███████████████████████████████████████████████
                   14    46,771 (9.4%)       70.8%  ███████████████████████████
                   15   83,579 (16.7%)       87.5%  ██████████████████████████████████████████████████
                   16    45,932 (9.2%)       96.7%  ███████████████████████████
                   17    11,247 (2.2%)       98.9%  ██████
                   18     3,834 (0.8%)       99.7%  ██
                   19     1,464 (0.3%)      100.0%
                  122         1 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,991 │
│ Mean Length   │   3,841.0 │
│ Std Dev       │  30,759.4 │
│ Min Length    │         4 │
│ Max Length    │ 9,567,173 │
│ Median Length │   1,400.0 │
│ P1            │        67 │
│ P5            │       164 │
│ P10           │       261 │
│ P25           │       597 │
│ P50           │     1,400 │
│ P75           │     3,405 │
│ P90           │     7,465 │
│ P95           │    12,177 │
│ P99           │    33,641 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [4, 261)   49,981 (10.0%)       10.0%  █████████████████████████████████████████████████
           [261, 478)   49,866 (10.0%)       20.0%  █████████████████████████████████████████████████
           [478, 721)   50,078 (10.0%)       30.0%  ██████████████████████████████████████████████████
         [721, 1,005)   50,023 (10.0%)       40.0%  █████████████████████████████████████████████████
       [1,005, 1,400)   50,015 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,400, 1,964)   49,970 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,964, 2,806)   50,048 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,806, 4,214)   49,984 (10.0%)       80.0%  █████████████████████████████████████████████████
       [4,214, 7,465)   50,021 (10.0%)       90.0%  █████████████████████████████████████████████████
   [7,465, 9,567,174)   50,005 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
Python
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 19.0/19.0 [00:16<00:00, 1.16 files/s]
Loaded 500,000 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         3 (0.0%)        0.0%
                    3         2 (0.0%)        0.0%
                    4         7 (0.0%)        0.0%
                    5        44 (0.0%)        0.0%
                    6     1,243 (0.2%)        0.3%
                    7     2,320 (0.5%)        0.7%  █
                    8     6,837 (1.4%)        2.1%  ██
                    9    16,012 (3.2%)        5.3%  ███████
                   10    13,958 (2.8%)        8.1%  ██████
                   11   60,265 (12.1%)       20.1%  ██████████████████████████
                   12   98,476 (19.7%)       39.8%  ███████████████████████████████████████████
                   13  114,092 (22.8%)       62.7%  ██████████████████████████████████████████████████
                   14   67,144 (13.4%)       76.1%  █████████████████████████████
                   15   83,389 (16.7%)       92.8%  ████████████████████████████████████
                   16    25,126 (5.0%)       97.8%  ███████████
                   17     7,129 (1.4%)       99.2%  ███
                   18     2,939 (0.6%)       99.8%  █
                   19     1,014 (0.2%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   500,000 │
│ Mean Length   │   3,556.1 │
│ Std Dev       │  22,910.8 │
│ Min Length    │         4 │
│ Max Length    │ 5,887,358 │
│ Median Length │   1,231.0 │
│ P1            │        47 │
│ P5            │       128 │
│ P10           │       214 │
│ P25           │       483 │
│ P50           │     1,231 │
│ P75           │     3,270 │
│ P90           │     7,516 │
│ P95           │    12,234 │
│ P99           │    31,582 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [4, 214)   49,879 (10.0%)       10.0%  █████████████████████████████████████████████████
           [214, 393)   49,815 (10.0%)       19.9%  █████████████████████████████████████████████████
           [393, 588)   50,173 (10.0%)       30.0%  ██████████████████████████████████████████████████
           [588, 855)   50,045 (10.0%)       40.0%  █████████████████████████████████████████████████
         [855, 1,231)   50,062 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,231, 1,777)   50,020 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,777, 2,641)   50,006 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,641, 4,135)   49,998 (10.0%)       80.0%  █████████████████████████████████████████████████
       [4,135, 7,516)   49,999 (10.0%)       90.0%  █████████████████████████████████████████████████
   [7,516, 5,887,359)   50,003 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
Ruby
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 10.0/10.0 [00:12<00:00, 1.28s/ files]
Loaded 499,994 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         3 (0.0%)        0.0%
                    4         2 (0.0%)        0.0%
                    5        13 (0.0%)        0.0%
                    6       309 (0.1%)        0.1%
                    7       829 (0.2%)        0.2%
                    8     3,122 (0.6%)        0.9%
                    9     8,536 (1.7%)        2.6%  ██
                   10     8,665 (1.7%)        4.3%  ██
                   11    37,611 (7.5%)       11.8%  ██████████
                   12   65,401 (13.1%)       24.9%  ███████████████████
                   13   96,420 (19.3%)       44.2%  ████████████████████████████
                   14   75,464 (15.1%)       59.3%  ██████████████████████
                   15  171,123 (34.2%)       93.5%  ██████████████████████████████████████████████████
                   16    27,317 (5.5%)       99.0%  ███████
                   17     4,230 (0.8%)       99.8%  █
                   18       944 (0.2%)      100.0%
                   19         5 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,994 │
│ Mean Length   │    1226.8 │
│ Std Dev       │   12258.3 │
│ Min Length    │         3 │
│ Max Length    │ 5,252,554 │
│ Median Length │     463.0 │
│ P1            │        35 │
│ P5            │        75 │
│ P10           │       111 │
│ P25           │       192 │
│ P50           │       463 │
│ P75           │     1,137 │
│ P90           │     2,456 │
│ P95           │     3,951 │
│ P99           │    10,526 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [3, 111)    49,518 (9.9%)        9.9%  █████████████████████████████████████████████████
           [111, 159)   49,866 (10.0%)       19.9%  █████████████████████████████████████████████████
           [159, 228)   50,302 (10.1%)       29.9%  ██████████████████████████████████████████████████
           [228, 323)   50,250 (10.1%)       40.0%  █████████████████████████████████████████████████
           [323, 463)   49,983 (10.0%)       50.0%  █████████████████████████████████████████████████
           [463, 649)   49,941 (10.0%)       60.0%  █████████████████████████████████████████████████
           [649, 934)   50,045 (10.0%)       70.0%  █████████████████████████████████████████████████
         [934, 1,409)   50,089 (10.0%)       80.0%  █████████████████████████████████████████████████
       [1,409, 2,456)   49,974 (10.0%)       90.0%  █████████████████████████████████████████████████
   [2,456, 5,252,555)   50,026 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/Rust
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 25.0/25.0 [00:19<00:00, 1.28 files/s]
Loaded 499,982 samples

Label Distribution:
                Label            Count  Percentile  Bar
                   -1         1 (0.0%)        0.0%
                    0        14 (0.0%)        0.0%
                    5         1 (0.0%)        0.0%
                    6        60 (0.0%)        0.0%
                    7       164 (0.0%)        0.0%
                    8       383 (0.1%)        0.1%
                    9     1,196 (0.2%)        0.4%
                   10     1,725 (0.3%)        0.7%
                   11    12,449 (2.5%)        3.2%  ████
                   12    34,521 (6.9%)       10.1%  ████████████
                   13   75,293 (15.1%)       25.2%  ██████████████████████████
                   14   66,475 (13.3%)       38.5%  ███████████████████████
                   15  140,200 (28.0%)       66.5%  ██████████████████████████████████████████████████
                   16   76,225 (15.2%)       81.7%  ███████████████████████████
                   17   59,793 (12.0%)       93.7%  █████████████████████
                   18    18,748 (3.7%)       97.5%  ██████
                   19    12,734 (2.5%)      100.0%  ████

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,982 │
│ Mean Length   │    6249.1 │
│ Std Dev       │   37083.4 │
│ Min Length    │         4 │
│ Max Length    │ 5,539,342 │
│ Median Length │    2035.0 │
│ P1            │        39 │
│ P5            │       131 │
│ P10           │       263 │
│ P25           │       756 │
│ P50           │     2,035 │
│ P75           │     5,191 │
│ P90           │    11,935 │
│ P95           │    19,748 │
│ P99           │    56,214 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [4, 263)   49,935 (10.0%)       10.0%  █████████████████████████████████████████████████
           [263, 576)   49,927 (10.0%)       20.0%  █████████████████████████████████████████████████
           [576, 950)   50,041 (10.0%)       30.0%  █████████████████████████████████████████████████
         [950, 1,418)   50,069 (10.0%)       40.0%  ██████████████████████████████████████████████████
       [1,418, 2,035)   49,985 (10.0%)       50.0%  █████████████████████████████████████████████████
       [2,035, 2,905)   50,007 (10.0%)       60.0%  █████████████████████████████████████████████████
       [2,905, 4,224)   49,992 (10.0%)       70.0%  █████████████████████████████████████████████████
       [4,224, 6,528)   50,021 (10.0%)       80.0%  █████████████████████████████████████████████████
      [6,528, 11,935)   50,003 (10.0%)       90.0%  █████████████████████████████████████████████████
  [11,935, 5,539,343)   50,002 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/Shell
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 11.0/11.0 [00:15<00:00, 1.43s/ files]
Loaded 497,498 samples

Label Distribution:
                Label            Count  Percentile  Bar
                   -1         1 (0.0%)        0.0%
                    0        11 (0.0%)        0.0%
                    3         2 (0.0%)        0.0%
                    4        48 (0.0%)        0.0%
                    5       255 (0.1%)        0.1%
                    6     4,530 (0.9%)        1.0%  █
                    7     7,025 (1.4%)        2.4%  ██
                    8    15,390 (3.1%)        5.5%  █████
                    9    33,582 (6.8%)       12.2%  ████████████
                   10    32,100 (6.5%)       18.7%  ████████████
                   11  127,285 (25.6%)       44.3%  ████████████████████████████████████████████████
                   12  130,312 (26.2%)       70.5%  ██████████████████████████████████████████████████
                   13   87,166 (17.5%)       88.0%  █████████████████████████████████
                   14    32,009 (6.4%)       94.4%  ████████████
                   15    24,087 (4.8%)       99.3%  █████████
                   16     3,198 (0.6%)       99.9%  █
                   17       424 (0.1%)      100.0%
                   18        73 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   497,498 │
│ Mean Length   │    1371.3 │
│ Std Dev       │   18264.2 │
│ Min Length    │         3 │
│ Max Length    │ 6,275,319 │
│ Median Length │     431.0 │
│ P1            │        26 │
│ P5            │        49 │
│ P10           │        73 │
│ P25           │       160 │
│ P50           │       431 │
│ P75           │     1,120 │
│ P90           │     2,609 │
│ P95           │     4,442 │
│ P99           │    12,521 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
              [3, 73)    49,200 (9.9%)        9.9%  █████████████████████████████████████████████████
            [73, 128)   49,721 (10.0%)       19.9%  █████████████████████████████████████████████████
           [128, 197)   50,017 (10.1%)       29.9%  █████████████████████████████████████████████████
           [197, 294)   49,595 (10.0%)       39.9%  █████████████████████████████████████████████████
           [294, 431)   50,054 (10.1%)       50.0%  ██████████████████████████████████████████████████
           [431, 627)   49,892 (10.0%)       60.0%  █████████████████████████████████████████████████
           [627, 912)   49,650 (10.0%)       70.0%  █████████████████████████████████████████████████
         [912, 1,410)   49,855 (10.0%)       80.0%  █████████████████████████████████████████████████
       [1,410, 2,609)   49,734 (10.0%)       90.0%  █████████████████████████████████████████████████
   [2,609, 6,275,320)   49,780 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/SQL
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 30.0/30.0 [00:23<00:00, 1.29 files/s]
Loaded 485,524 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         1 (0.0%)        0.0%
                    3         4 (0.0%)        0.0%
                    4        13 (0.0%)        0.0%
                    5        52 (0.0%)        0.0%
                    6     1,420 (0.3%)        0.3%
                    7     2,169 (0.4%)        0.8%
                    8     6,698 (1.4%)        2.1%  ██
                    9    20,133 (4.1%)        6.3%  ███████
                   10    20,938 (4.3%)       10.6%  ███████
                   11   80,606 (16.6%)       27.2%  ██████████████████████████████
                   12  118,079 (24.3%)       51.5%  ████████████████████████████████████████████
                   13  132,005 (27.2%)       78.7%  ██████████████████████████████████████████████████
                   14   58,571 (12.1%)       90.8%  ██████████████████████
                   15    40,272 (8.3%)       99.1%  ███████████████
                   16     4,333 (0.9%)      100.0%  █
                   17       206 (0.0%)      100.0%
                   18        24 (0.0%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   485,524 │
│ Mean Length   │    9638.5 │
│ Std Dev       │  128108.5 │
│ Min Length    │         2 │
│ Max Length    │ 9,444,770 │
│ Median Length │     826.5 │
│ P1            │        33 │
│ P5            │        67 │
│ P10           │       104 │
│ P25           │       256 │
│ P50           │       827 │
│ P75           │     2,782 │
│ P90           │     8,085 │
│ P95           │    16,213 │
│ P99           │    88,480 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [2, 104)   48,318 (10.0%)       10.0%  █████████████████████████████████████████████████
           [104, 198)   48,370 (10.0%)       19.9%  █████████████████████████████████████████████████
           [198, 322)   48,734 (10.0%)       30.0%  ██████████████████████████████████████████████████
           [322, 513)   48,665 (10.0%)       40.0%  █████████████████████████████████████████████████
           [513, 827)   48,675 (10.0%)       50.0%  █████████████████████████████████████████████████
         [827, 1,347)   48,548 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,347, 2,167)   48,518 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,167, 3,717)   48,575 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,717, 8,085)   48,567 (10.0%)       90.0%  █████████████████████████████████████████████████
   [8,085, 9,444,771)   48,554 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: /home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/Swift
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 14.0/14.0 [00:17<00:00, 1.22s/ files]
Loaded 488,917 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         7 (0.0%)        0.0%
                    6        58 (0.0%)        0.0%
                    7       250 (0.1%)        0.1%
                    8     1,100 (0.2%)        0.3%
                    9     3,536 (0.7%)        1.0%  █
                   10     3,868 (0.8%)        1.8%  █
                   11    23,179 (4.7%)        6.5%  ███████
                   12   49,738 (10.2%)       16.7%  █████████████████
                   13  108,490 (22.2%)       38.9%  █████████████████████████████████████
                   14   87,208 (17.8%)       56.7%  ██████████████████████████████
                   15  145,070 (29.7%)       86.4%  ██████████████████████████████████████████████████
                   16   50,694 (10.4%)       96.8%  █████████████████
                   17    12,602 (2.6%)       99.4%  ████
                   18     1,979 (0.4%)       99.8%
                   19     1,138 (0.2%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   488,917 │
│ Mean Length   │    3043.5 │
│ Std Dev       │   14336.5 │
│ Min Length    │         4 │
│ Max Length    │ 4,710,743 │
│ Median Length │    1411.0 │
│ P1            │       126 │
│ P5            │       244 │
│ P10           │       331 │
│ P25           │       627 │
│ P50           │     1,411 │
│ P75           │     3,255 │
│ P90           │     6,498 │
│ P95           │     9,815 │
│ P99           │    22,658 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range            Count  Percentile  Bar
             [4, 331)   48,754 (10.0%)       10.0%  █████████████████████████████████████████████████
           [331, 521)   48,840 (10.0%)       20.0%  █████████████████████████████████████████████████
           [521, 746)   48,936 (10.0%)       30.0%  █████████████████████████████████████████████████
         [746, 1,026)   49,028 (10.0%)       40.0%  ██████████████████████████████████████████████████
       [1,026, 1,411)   48,789 (10.0%)       50.0%  █████████████████████████████████████████████████
       [1,411, 1,945)   48,958 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,945, 2,717)   48,907 (10.0%)       70.0%  █████████████████████████████████████████████████
       [2,717, 3,959)   48,910 (10.0%)       80.0%  █████████████████████████████████████████████████
       [3,959, 6,498)   48,894 (10.0%)       90.0%  █████████████████████████████████████████████████
   [6,498, 4,710,744)   48,901 (10.0%)      100.0%  █████████████████████████████████████████████████


========================================
Processing: 
========================================

Label Distribution Analysis

Dataset: 
/home/lucas/ai2-llm/classifiers/code-quality/data/the-stack-v2/spring2code_v2/minhash_v2_annotated/sample_1GB/countup_criteria_v2/gpt-5-mini/10k_trimmed/
TypeScript
Label Expression: .countup_criteria_v2.score
Label Type: ordinal

Key Expression: .text


Loading files: 100%|██████████| 12.0/12.0 [00:16<00:00, 1.34s/ files]
Loaded 499,990 samples

Label Distribution:
                Label            Count  Percentile  Bar
                    0         3 (0.0%)        0.0%
                    3         1 (0.0%)        0.0%
                    5         4 (0.0%)        0.0%
                    6        50 (0.0%)        0.0%
                    7       131 (0.0%)        0.0%
                    8       426 (0.1%)        0.1%
                    9     1,490 (0.3%)        0.4%
                   10     2,071 (0.4%)        0.8%
                   11    13,859 (2.8%)        3.6%  ████
                   12    44,860 (9.0%)       12.6%  ██████████████
                   13  123,385 (24.7%)       37.3%  ████████████████████████████████████████
                   14   96,954 (19.4%)       56.6%  ███████████████████████████████
                   15  153,922 (30.8%)       87.4%  ██████████████████████████████████████████████████
                   16    43,727 (8.7%)       96.2%  ██████████████
                   17    14,320 (2.9%)       99.0%  ████
                   18     3,057 (0.6%)       99.7%
                   19     1,730 (0.3%)      100.0%

Key Length Statistics:
┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Statistic     ┃     Value ┃
┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Count         │   499,990 │
│ Mean Length   │   1,984.5 │
│ Std Dev       │  15,117.2 │
│ Min Length    │         5 │
│ Max Length    │ 3,869,768 │
│ Median Length │     797.0 │
│ P1            │        49 │
│ P5            │       106 │
│ P10           │       178 │
│ P25           │       387 │
│ P50           │       797 │
│ P75           │     1,770 │
│ P90           │     3,784 │
│ P95           │     6,171 │
│ P99           │    16,793 │
└───────────────┴───────────┘

Key Length Distribution (10 buckets):
                Range           Count  Percentile  Bar
             [5, 178)   49,731 (9.9%)        9.9%  █████████████████████████████████████████████████
           [178, 320)  50,216 (10.0%)       20.0%  ██████████████████████████████████████████████████
           [320, 464)  49,945 (10.0%)       30.0%  █████████████████████████████████████████████████
           [464, 636)  49,976 (10.0%)       40.0%  █████████████████████████████████████████████████
           [636, 797)  50,111 (10.0%)       50.0%  █████████████████████████████████████████████████
         [797, 1,069)  49,887 (10.0%)       60.0%  █████████████████████████████████████████████████
       [1,069, 1,478)  50,086 (10.0%)       70.0%  █████████████████████████████████████████████████
       [1,478, 2,171)  50,019 (10.0%)       80.0%  █████████████████████████████████████████████████
       [2,171, 3,784)  50,007 (10.0%)       90.0%  █████████████████████████████████████████████████
   [3,784, 3,869,769)  50,012 (10.0%)      100.0%  █████████████████████████████████████████████████


