Metadata-Version: 2.4
Name: unimax_sampling
Version: 1.0.0
Summary: Implementation of the UniMax sampling method for effective language sampling for multilingual pretraining
License-Expression: MIT
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE.md
Provides-Extra: count
Requires-Dist: datasets>=4.1.1; extra == "count"
Requires-Dist: polars>=1.33.1; extra == "count"
Dynamic: license-file

# UniMax

`unimax_sampling` implements the UniMax sampling method introduced by [Chung et al. (2023)](#references). This method aims to balance language representation in multilingual large language models by explicitly capping the number of repeats over each language's corpus, thereby mitigating overfitting on tail languages while delivering more uniform coverage of head languages.

## Installation

```bash
# UniMax algorithm only
pip install unimax_sampling
# Including optional dependencies for the count-characters sub-command
pip install 'unimax_sampling[count]'
```

## Programmatic Usage

```python
from unimax import unimax, count_characters

# (Optional) Count characters in each dataset; requires the optional
# dependencies installed via unimax_sampling[count]
character_counts = {}
for subset in ("swe_Latn", "fas_Arab", "ekk_Latn", "isl_Latn", "fao_Latn"):
    character_counts[subset.split("_")[0]] = count_characters("HuggingFaceFW/fineweb-2", subset)

# Compute the UniMax distribution from precomputed character counts per language
character_counts = {
    "swe": 179955884499,
    "fas": 184595788282,
    "ekk": 42541080893,
    "isl": 10027573389,
    "fao": 549707867,
}

distribution = unimax(
    character_counts,
    character_budget=250_000_000_000,
    max_epochs=4,
)
```

**Output:**

```python
UniMaxDistribution(
    budgets={
        "fao": 2198831468,
        "isl": 40110293556,
        "ekk": 69230291658.66667,
        "swe": 69230291658.66666,
        "fas": 69230291658.66666,
    },
    epochs={
        "fao": 4.0,
        "isl": 4.0,
        "ekk": 1.627375003300828,
        "swe": 0.3847070177860806,
        "fas": 0.37503722215431134,
    },
    probabilities={
        "fao": 0.008795325872,
        "isl": 0.160441174224,
        "ekk": 0.2769211666346667,
        "swe": 0.27692116663466665,
        "fas": 0.27692116663466665,
    },
)
```
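The allocation above can be sketched in a few lines. This is a minimal, illustrative re-implementation of the budget-allocation step, not the package's actual `unimax()` function, which may differ in details such as the exact fields of `UniMaxDistribution`: languages are visited from smallest corpus to largest, and each receives either its epoch cap or an equal share of the remaining budget, whichever is smaller.

```python
def unimax_budgets(character_counts, character_budget, max_epochs):
    """Sketch of UniMax budget allocation: cap each language at max_epochs
    passes over its corpus, visiting the smallest corpora first."""
    budgets = {}
    remaining_budget = character_budget
    ordered = sorted(character_counts.items(), key=lambda kv: kv[1])
    for i, (lang, count) in enumerate(ordered):
        # Equal share of what is left, split over the not-yet-allocated languages
        fair_share = remaining_budget / (len(ordered) - i)
        cap = max_epochs * count  # hard cap: at most max_epochs repeats
        budgets[lang] = min(cap, fair_share)
        remaining_budget -= budgets[lang]
    return budgets

counts = {
    "swe": 179_955_884_499,
    "fas": 184_595_788_282,
    "ekk": 42_541_080_893,
    "isl": 10_027_573_389,
    "fao": 549_707_867,
}
budgets = unimax_budgets(counts, character_budget=250_000_000_000, max_epochs=4)
```

With these inputs, `fao` and `isl` hit their four-epoch caps, and the leftover budget is split evenly across `ekk`, `swe`, and `fas`, matching the budgets shown in the example output.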

## Commandline Usage

For convenience, the package can also be executed as a command-line utility.

### Counting Characters

```bash
python -m unimax count-characters <dataset_path> <language_code> <output_json_file> [-c <dataset_configuration>] [-s <split>]
```

> [!NOTE]
> `count-characters` requires `unimax_sampling` to be installed via `pip install 'unimax_sampling[count]'`

**Example:**

```bash
python -m unimax count-characters HuggingFaceFW/fineweb-2 fra french.json -c fra_Latn
```

### Calculating the UniMax Distribution

```bash
python -m unimax unimax <character_count_files> -c <character_budget> -m <max_epochs> [-r <language_codes>] [-o <distribution_json_file>]
```

**Example:**

```bash
python -m unimax unimax character_counts.json -c 250000000000 -m 4 -o distribution.json
```
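The resulting probabilities can then drive a per-example language sampler. Below is a hedged sketch using the probabilities from the example output above; loading them from `distribution.json` is omitted here, since the exact JSON layout written by the CLI is not documented in this README.

```python
import random

# Probabilities taken from the example UniMaxDistribution output above
probabilities = {
    "fao": 0.008795325872,
    "isl": 0.160441174224,
    "ekk": 0.2769211666346667,
    "swe": 0.27692116663466665,
    "fas": 0.27692116663466665,
}

# Draw the source language for each training example according to the weights
languages = random.choices(
    population=list(probabilities),
    weights=list(probabilities.values()),
    k=10,
)
```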

## References

Hyung Won Chung, Xavier Garcia, Adam Roberts, Yi Tay, Orhan Firat, Sharan Narang, and Noah Constant. 2023. UniMax: Fairer and More Effective Language Sampling for Large-Scale Multilingual Pretraining. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda.
