Metadata-Version: 2.4
Name: asia-fertility
Version: 0.2.2
Summary: Tokenizer fertility, cost, and multi-turn context-budget analyzer for low-resource Asian languages.
Project-URL: Homepage, https://fertiscope.vercel.app
Project-URL: Repository, https://github.com/Helmo21/asia-fertility
Project-URL: Documentation, https://helmo21.github.io/asia-fertility/
Author: Antoine Pedretti
License: MIT
License-File: LICENSE
Keywords: asian-languages,fertility,llm,low-resource,multilingual,tokenizer
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: numpy>=1.26
Requires-Dist: pydantic-settings>=2.3
Requires-Dist: pydantic>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: typer>=0.12
Provides-Extra: api
Requires-Dist: anthropic>=0.40; extra == 'api'
Requires-Dist: google-genai>=0.3; extra == 'api'
Requires-Dist: httpx>=0.27; extra == 'api'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.112; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pre-commit>=3.8; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.26; extra == 'docs'
Requires-Dist: pymdown-extensions>=10; extra == 'docs'
Provides-Extra: hf
Requires-Dist: datasets>=2.21; extra == 'hf'
Requires-Dist: huggingface-hub>=0.25; extra == 'hf'
Requires-Dist: khmer-nltk>=1.6; extra == 'hf'
Requires-Dist: laonlp>=1.3; extra == 'hf'
Requires-Dist: pythainlp>=5.0; extra == 'hf'
Requires-Dist: sentencepiece>=0.2; extra == 'hf'
Requires-Dist: tokenizers>=0.20; extra == 'hf'
Requires-Dist: transformers>=4.44; extra == 'hf'
Provides-Extra: niah
Requires-Dist: httpx>=0.27; extra == 'niah'
Requires-Dist: tenacity>=8.5; extra == 'niah'
Provides-Extra: oai
Requires-Dist: tiktoken>=0.8; extra == 'oai'
Provides-Extra: viz
Requires-Dist: matplotlib>=3.9; extra == 'viz'
Requires-Dist: pandas>=2.2; extra == 'viz'
Requires-Dist: pyarrow>=17; extra == 'viz'
Description-Content-Type: text/markdown

# asia-fertility 🌏

**The hidden multilingual tax in your tokenizer — measured before you deploy.**

[![PyPI](https://img.shields.io/pypi/v/asia-fertility.svg?color=blue)](https://pypi.org/project/asia-fertility/)
[![CI](https://github.com/Helmo21/asia-fertility/actions/workflows/ci.yml/badge.svg)](https://github.com/Helmo21/asia-fertility/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue)](pyproject.toml)
[![HF Dataset](https://img.shields.io/badge/HF-Helmo21%2Fasia--fertility-ffd21e)](https://huggingface.co/datasets/Helmo21/asia-fertility)

`asia-fertility` measures the structural cost penalty that LLM tokenizers impose on lower-resource Asian languages. The same content can cost up to **11× more tokens in Burmese than in English** on a frontier tokenizer, which means silently inflated API bills, smaller usable context windows, and fewer in-context examples.

## Quickstart

```bash
pip install "asia-fertility[oai]"

# Measure your own text
asia-fertility measure --text "தமிழ் ஒரு செம்மொழி" --lang tam --tokenizer openai/o200k_base

# Compare cost across models (in your local currency)
asia-fertility cost --text "Xin chào" --lang vie \
  --models openai/gpt-4o,openai/gpt-3.5-turbo \
  --currencies USD,VND

# Reproduce the full 16-language × 9-tokenizer leaderboard
asia-fertility run --config configs/study_main.yaml
asia-fertility figures --run runs/main --out runs/main/figures
asia-fertility leaderboard --run runs/main --out runs/main/leaderboard.json
```

## What's inside

- **16 languages**: 15 lower-resource Asian languages (Vietnamese, Indonesian, Malay, Filipino, Thai, Hindi, Bengali, Sinhala, Tamil, Telugu, Kannada, Malayalam, Burmese, Khmer, Lao) plus an English baseline.
- **9 tokenizers measured**: OpenAI `o200k_base`/`cl100k_base`/`o200k_harmony`, Mistral Tekken, Qwen3, DeepSeek v3, BLOOM, Gemma-2, Aya Expanse. Three more (Llama-3.1, etc.) are registered but sit behind license walls.
- **5 metrics** with 95% bootstrap CIs: fertility, premium, same-content cost ratio, characters/token (CPT), and **bytes/token (BPT)** — the only cross-script-fair comparator.
- **NIAH benchmark**: script-native needle-in-haystack across gpt-4o-mini, gpt-3.5-turbo, llama-3.1-8b-instruct on Tamil/Hindi/Burmese/Lao haystacks.
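
The three per-text metrics in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: `TokenStats`, `token_stats`, and `bootstrap_ci` are hypothetical names, fertility is assumed to mean tokens per word, and the confidence interval is assumed to be a percentile bootstrap over per-sentence values. Premium and the same-content cost ratio are left out because their exact definitions are not given here.

```python
import random
from collections.abc import Callable, Sequence
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenStats:
    fertility: float  # tokens per word (assumed definition)
    cpt: float        # Unicode code points per token
    bpt: float        # UTF-8 bytes per token


def token_stats(
    text: str,
    encode: Callable[[str], Sequence[object]],
    n_words: int,
) -> TokenStats:
    """Compute fertility, CPT, and BPT for one text.

    n_words is supplied by the caller because Thai, Lao, Khmer, and Burmese
    are written without spaces, so text.split() would undercount words.
    """
    n_tokens = len(encode(text))
    if n_tokens == 0 or n_words <= 0:
        raise ValueError("need at least one token and one word")
    return TokenStats(
        fertility=n_tokens / n_words,
        cpt=len(text) / n_tokens,
        bpt=len(text.encode("utf-8")) / n_tokens,
    )


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-sentence values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Stand-in encoder: one token per UTF-8 byte. Swap in a real tokenizer's
# encode method to get meaningful numbers.
stats = token_stats("Xin chào", lambda s: list(s.encode("utf-8")), n_words=2)
print(stats)
```

With the `oai` extra installed, `tiktoken.get_encoding("o200k_base").encode` can be passed as `encode`. BPT divides by UTF-8 bytes rather than code points, so scripts that need three bytes per character are not made to look artificially compact.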

## Key findings (v0.2.0)

- Same content costs **7–12× more tokens** on cl100k_base for Brahmic-derived scripts (Tamil 7.61×, Burmese 11.66×).
- Switching to `o200k_base` cuts the penalty by a factor of 3–6 (Tamil → 1.98×, Burmese → 3.18×).
- **Gemma-2 is the best open-weight tokenizer for South Asian** workloads (Tamil 2.58×, Burmese 4.80×).
- **NIAH recall collapses to 0–7%** on Hindi/Tamil/Burmese/Lao with script-native markers, *even at 4k context* — across all three frontier models tested. See paper §4.4.
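
To make the ratios above concrete, the sketch below converts them into usable context. It assumes each ratio is a token count relative to English on the same tokenizer, and it uses a 128,000-token window that is a hypothetical figure chosen for illustration, not a number from the study. The function names are likewise illustrative.

```python
# Same-content token ratios relative to English, copied from the findings above.
RATIOS = {
    "cl100k_base": {"Tamil": 7.61, "Burmese": 11.66},
    "o200k_base": {"Tamil": 1.98, "Burmese": 3.18},
}

WINDOW = 128_000  # assumed context window in tokens (illustrative only)


def english_equivalent_tokens(context_window: int, ratio: float) -> int:
    """English-equivalent tokens of content that fit in the window."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return int(context_window / ratio)


def improvement(lang: str, old: str = "cl100k_base", new: str = "o200k_base") -> float:
    """Factor by which the penalty shrinks when moving from old to new."""
    return RATIOS[old][lang] / RATIOS[new][lang]


for tokenizer, by_lang in RATIOS.items():
    for lang, ratio in by_lang.items():
        fits = english_equivalent_tokens(WINDOW, ratio)
        print(f"{tokenizer:12s} {lang:8s} {ratio:5.2f}x -> ~{fits:,} English-equivalent tokens")

for lang in ("Tamil", "Burmese"):
    print(f"{lang}: penalty shrinks by {improvement(lang):.2f}x on o200k_base")
```

Under these assumptions, Burmese content on `cl100k_base` fills the window after roughly 11,000 English-equivalent tokens. Per-token API cost scales with the same ratio.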

## Paper

The full writeup is at [`paper/paper.pdf`](paper/paper.pdf) (11 pages). Cite as:

```bibtex
@misc{pedretti2026asianlanguagetax,
  title  = {The Asian Language Tax: Quantifying the Cost, Context, and Recall Penalty of Tokenizing Lower-Resource Asian Languages in Frontier LLMs},
  author = {Pedretti, Antoine},
  year   = {2026},
  url    = {https://github.com/Helmo21/asia-fertility},
}
```

## Data

The full leaderboard and NIAH benchmark results are published as a Hugging Face dataset:

```python
from datasets import load_dataset

ds   = load_dataset("Helmo21/asia-fertility", "leaderboard")  # 144 (lang × tokenizer) rows
niah = load_dataset("Helmo21/asia-fertility", "niah")          # 536 NIAH cells
```

## License

MIT © 2026 Antoine Pedretti. Bundled FLORES-200 data: CC-BY-SA 4.0 (Meta NLLB).
