Metadata-Version: 2.4
Name: cute-tokenizer
Version: 0.2.0
Summary: Compact Unicode Token Encoding — semantic-prior-guided contextual tokenization for code
Project-URL: Homepage, https://github.com/HusseinEid101/CUTE
Project-URL: Issues, https://github.com/HusseinEid101/CUTE/issues
Author-email: Hussein Eid <HusseinEid101@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: bpe,code,huggingface,llm,nlp,tokenizer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: orjson>=3.10
Requires-Dist: pyahocorasick>=2.1
Requires-Dist: regex>=2024.7.24
Requires-Dist: tokenizers<0.22,>=0.20
Requires-Dist: tqdm>=4.66
Requires-Dist: transformers>=4.45
Requires-Dist: xxhash>=3.4
Provides-Extra: baseline
Requires-Dist: tiktoken>=0.7; extra == 'baseline'
Provides-Extra: benchmarks
Requires-Dist: matplotlib>=3.8; extra == 'benchmarks'
Requires-Dist: tabulate>=0.9; extra == 'benchmarks'
Requires-Dist: tiktoken>=0.7; extra == 'benchmarks'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: tiktoken>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

<p align="center">
  <img src="assets/mascot.jpg" alt="CUTE Tokenizer Mascot" width="600"/>
</p>

<h1 align="center">🐭 CUTE Tokenizer</h1>
<h3 align="center"><em>Compact Unicode Token Encoding</em></h3>
<p align="center"><strong>✨ semantic-prior-guided contextual tokenization for code ✨</strong></p>

<p align="center">
  <a href="https://www.python.org/">
    <img src="https://img.shields.io/badge/python-3.10+-blue?style=flat-square" alt="Python 3.10+"/>
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/License-MIT-green?style=flat-square" alt="License: MIT"/>
  </a>
  <a href="https://huggingface.co/HusseinEid/cute-tokenizer">
    <img src="https://img.shields.io/badge/🤗-HuggingFace-ffd21e?style=flat-square" alt="HuggingFace"/>
  </a>
  <a href="https://pypi.org/project/cute-tokenizer/">
    <img src="https://img.shields.io/pypi/v/cute-tokenizer?style=flat-square&color=white&cb=20260508" alt="PyPI version"/>
  </a>
  <a href="https://github.com/HusseinEid101/CUTE/actions">
    <img src="https://img.shields.io/github/actions/workflow/status/HusseinEid101/CUTE/ci.yml?branch=main&style=flat-square" alt="CI"/>
  </a>
</p>

---

## ✨ Highlights

CUTE is a code-aware tokenizer that combines **explicit semantic anchors**
with **contextual subword merges** to produce compact, lossless token
sequences for Python, TypeScript, JavaScript, Rust, Go, and other
common programming languages.

The architecture has two stages:

- **Savings-based PUA mapping** — high-value words, operators, and
  identifier sub-parts are mapped to single Unicode Private-Use-Area
  characters, ranked by *expected token savings vs the cl100k baseline*
  (not raw frequency).
- **Contextual byte-level BPE** — the trainer sees PUA-substituted text,
  so it can learn merges around those anchors (e.g. whitespace + PUA),
  while a post-train safety filter forbids PUA + PUA pairs to keep the
  semantic units atomic.
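
As a rough illustration of how the two stages interact, here is a hypothetical substitution pass. The mapping below is invented for illustration; real assignments come from the savings ranking, and the real pre-tokenizer uses Aho-Corasick matching with identifier splitting rather than a naive string replace:

```python
# Hypothetical anchor mapping -- in a real build these assignments are
# produced by the savings-based selection stage, not written by hand.
anchors = {
    "return": "\U000F0000",  # supplementary-plane PUA codepoint
    "def": "\U000F0001",
}

source = "def total(xs): return sum(xs)"

# Stage 1: replace high-value tokens with single PUA characters
# (naive replace here; the real substitution respects token boundaries).
substituted = source
for word, pua in anchors.items():
    substituted = substituted.replace(word, pua)

# Stage 2: byte-level BPE is trained on text like `substituted`, so it can
# learn merges around the anchors (e.g. whitespace + PUA), while the
# post-train filter keeps PUA + PUA merges out so anchors stay atomic.
print(substituted)
```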

**The result:**

- 🪄 Beats every open-source code tokenizer we tested
- 🔁 **Byte-equal lossless round-trip** on arbitrary Unicode (verified on
  3,000 held-out files: Python + JS + TS + Rust + Go)
- 🔒 **Deterministic** within a fixed `(OS, python, tokenizers)` host triple
- 🤗 Drop-in `AutoTokenizer` compatibility via `trust_remote_code=True`

---

## 📊 Benchmarks

Numbers below are **measured**, not theoretical, on held-out code that
was never seen during training. Lower mean tokens = better compression;
higher bytes/token = better.
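
Per file, both columns reduce to simple arithmetic. A minimal sketch of how a single file would be scored, using the public API shown in the Quick Start below (the file path is a placeholder):

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
text = open("example.py", encoding="utf-8").read()  # placeholder path

ids = tok(text, add_special_tokens=False).input_ids
n_tokens = len(ids)                                      # averaged over files -> "mean tokens"
bytes_per_token = len(text.encode("utf-8")) / n_tokens   # higher is better
roundtrip_ok = tok.decode(ids, skip_special_tokens=True) == text
```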

### Python (1,500 held-out files from The Stack)

| Tokenizer | mean tokens | bytes/token | vocab | roundtrip |
|---|---:|---:|---:|---:|
| OpenAI cl100k_base | 1,874.1 | 4.17 | 100k | 1500/1500 |
| OpenAI o200k_base | 1,885.6 | 4.14 | 200k | 1500/1500 |
| **CUTE** | **2,009.3** | **3.89** | **150k** | **1500/1500** ✅ |
| StarCoder2 | 2,210.0 | 3.53 | 49k | 685/1500 ❌ |
| CodeLlama | 2,572.9 | 3.03 | 32k | 1493/1500 ⚠ |
| GPT-2 | 3,580.7 | 2.18 | 50k | 1500/1500 |

### Multi-language (1,500 held-out files: JS / TS / Rust / Go)

| Tokenizer | mean tokens | bytes/token | vocab | roundtrip |
|---|---:|---:|---:|---:|
| OpenAI cl100k_base | 1,966.0 | 3.91 | 100k | 1500/1500 |
| OpenAI o200k_base | 1,970.1 | 3.90 | 200k | 1500/1500 |
| **CUTE** | **2,078.0** | **3.70** | **150k** | **1500/1500** ✅ |
| StarCoder2 | 2,262.0 | 3.40 | 49k | 566/1500 ❌ |
| CodeLlama | 2,650.2 | 2.90 | 32k | 1500/1500 |
| GPT-2 | 3,365.4 | 2.28 | 50k | 1500/1500 |

### What this means

- **CUTE beats every open-source code tokenizer** we benchmarked
  (StarCoder2, CodeLlama, GPT-2) on both Python and the multi-lang
  holdout — by ~9–44% depending on the comparison.
- **OpenAI's cl100k still beats CUTE by ~5–7%** on this corpus. We're
  closing on it but not there yet.
- **CUTE is the only specialty code tokenizer with zero roundtrip
  failures** on the test set. StarCoder2 corrupts ~62% of multi-lang
  files and ~54% of Python files; CodeLlama fails to round-trip 7 Python
  files.
- **CUTE is slower than cl100k at encode time** — the Python-side PUA
  substitution adds overhead. Expect roughly an order of magnitude
  higher encode latency than cl100k. Decode latency is comparable.

This is the **first public release** and there is significant room for
improvement: bigger and more diverse training corpora, multi-language
training tuned for the deployment language, smarter PUA selection,
faster Python-side substitution, possibly a Rust pre-tokenizer.
**This is only the beginning.**

Reproduce these numbers locally:

```bash
python -m benchmarks.runner \
    --tokenizer ./model \
    --holdout ./your-holdout-corpus \
    --output reports/mine
```

---

## 🧀 Quick Start

```bash
pip install cute-tokenizer
```

The wheel ships a pretrained tokenizer. Use it immediately — no training
required:

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"  # always lossless
```

### Use via 🤗 HuggingFace

The same pretrained tokenizer is hosted on the HuggingFace Hub:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```

`trust_remote_code=True` is required because CUTE's wrapper class
(`CUTETokenizerFast`) does Python-side PUA substitution before delegating
to the underlying ByteLevel BPE.

### Train your own

```bash
# Drop a few repos into ./corpus/, then:
pip install 'cute-tokenizer[baseline]'  # pulls tiktoken for cl100k-aware ranking
cute build --corpus ./corpus --output ./output
```

```python
from cute_tokenizer import CUTETokenizerFast

tok = CUTETokenizerFast(
    tokenizer_file="./output/tokenizer.json",
    cute_mapping_file="./output/cute_mapping.json",
)
```

---

## 🔍 How It Works

1. **Corpus ingest** — stream files, dedup by content hash, scrub secrets
   (AWS / OpenAI / Anthropic / GitHub keys, JWTs, PEM private keys),
   optionally license-filter, write deterministic gzipped shards.
2. **Frequency mining** — parallel multiprocess token counter with
   identifier sub-part boosting (camelCase / snake_case / SCREAMING_CASE).
3. **Savings-based selection** — for each candidate token, compute
   `score = frequency × max(0, cl100k_count − 1)`. Tokens whose cl100k
   cost is 1 (single-byte ASCII like `(`, `,`) score zero — byte fallback
   already handles them optimally. Hashes / UUIDs / base64 blobs are
   filtered out by shape. (A minimal scoring sketch follows this list.)
4. **PUA assignment** — selected tokens get unique codepoints in the
   Unicode supplementary planes (U+F0000+). The Basic Multilingual Plane
   PUA range (U+E000–U+F8FF) is **deliberately skipped** because real
   source code occasionally contains literal BMP PUA chars (Asian fonts,
   Unicode mapping tables in TS/JS) and using them would cause decode-time
   collisions.
5. **Contextual BPE training** — the training stream is PUA-substituted
   *before* it reaches the trainer, so byte-level BPE actually sees PUA
   chars and can learn merges like `[Ġ][⟦return⟧]` (whitespace + anchor).
   PUA chars are also registered as `AddedToken`s so any anchor that
   wasn't picked up still has an atomic vocab id.
6. **Atomicity audit** — post-train, the `merge_policy` module walks the
   tokenizer JSON and (under `strict_pua_atomicity`) drops any
   PUA-PUA merges. Four invariants are asserted on every save: model is
   `BPE`, decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every
   mapping PUA char has a vocab id.
7. **Decode** — the byte-level decoder reconstructs the substituted
   string; reverse-substitution restores the original text.
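
To make the scoring in step 3 concrete, here is a minimal sketch; it assumes the `baseline` extra (tiktoken) is installed, and the candidate frequencies are made up:

```python
import tiktoken

cl100k = tiktoken.get_encoding("cl100k_base")

def savings_score(token: str, frequency: int) -> int:
    # A candidate only earns a PUA slot if cl100k needs more than one token
    # for it; single-byte ASCII like "(" scores zero because byte fallback
    # already handles it optimally.
    cl100k_count = len(cl100k.encode(token))
    return frequency * max(0, cl100k_count - 1)

# Made-up frequencies, purely for illustration.
for candidate, freq in [("get_user_by_id", 1_200), ("ClientResponseError", 300), ("(", 900_000)]:
    print(candidate, savings_score(candidate, freq))
```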

Round-trip is **byte-equal** for any input. We test this with Hypothesis
on arbitrary Unicode (incl. supplementary planes) plus a hand-curated
torture set: ZWJ family emoji, RTL+bidi controls, BOM, control chars,
NFC/NFD variants, mixed scripts, deep underscores. The same check runs on
3,000 held-out real-world code files.
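
The shipped property tests live in `tests/property`; an illustrative version of the core property (not copied from the suite) looks like this:

```python
from hypothesis import given, strategies as st

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()

@given(st.text())  # arbitrary Unicode text, including supplementary-plane characters
def test_roundtrip_byte_equal(text: str) -> None:
    ids = tok(text, add_special_tokens=False).input_ids
    assert tok.decode(ids, skip_special_tokens=True) == text
```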

---

## 📦 Project Layout

```text
src/cute_tokenizer/
  baseline.py       # Cl100kBaseline / NullBaseline (savings scoring)
  config.py         # CUTEConfig — all knobs in one place
  patterns.py       # token regex + identifier splitter (uses `regex` module)
  corpus.py         # streaming ingest, dedup, secret scrub, sharding
  frequency.py      # parallel multiprocess counting
  selection.py      # savings-based selection + tightened PUA filter
  pua.py            # Private-Use-Area codepoint allocator (skips BMP by default)
  pretokenizer.py   # PUA substitution (Aho-Corasick + identifier splitting)
  trainer.py        # build_cute() — pre-substituted BPE training
  merge_policy.py   # PUA atomicity audit + invariant assertions
  decode.py         # PUA-aware reverse substitution
  tokenizer.py      # CUTETokenizerFast (PreTrainedTokenizerFast)
  manifest.py       # build manifest for reproducibility
  cli.py            # `cute build`, `cute roundtrip-check`, `cute info`

tests/
  unit/             # ~180 unit tests
  property/         # Hypothesis round-trip + Unicode torture
  integration/      # full pipeline E2E + determinism + collision regressions

benchmarks/
  baselines.py      # cl100k / o200k / gpt2 / codellama / starcoder2 adapters
  runner.py         # research-grade compression + latency report
  compression.py    # legacy compression-only script
  latency.py        # standalone latency benchmark

scripts/
  download_stack_python.py  # download a Stack subset, train/holdout split
  find_roundtrip_failures.py  # diagnostic: find files that don't roundtrip
```

---

## ⚙️ Configuration

```python
from cute_tokenizer import CUTEConfig, Cl100kBaseline, build_cute

config = CUTEConfig(
    vocab_size=200_000,            # total token IDs
    pua_budget=50_000,             # max PUA-mapped tokens
    min_bpe_budget=130_000,        # minimum learnable BPE merges
    max_token_len=50,              # ignore tokens longer than this
    boost_weight=0.3,              # identifier sub-part boost
    seed=42,                       # determinism
    workers=0,                     # 0 = os.cpu_count()
    use_savings_selection=True,    # use cl100k-aware ranking (default)
    strict_pua_atomicity=True,     # forbid PUA+PUA merges (default)
    allow_supplementary_pua=True,  # use full 50k PUA budget
    pua_skip_bmp=True,             # avoid BMP collisions (production default)
    enable_secret_scrub=True,      # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config=config, baseline=Cl100kBaseline())
```

The vocab math (validated at construction time) is:

```text
byte_alphabet (256) + special_tokens + pua_budget + min_bpe_budget ≤ vocab_size
```
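
With the example config above, 256 + 50,000 + 130,000 = 180,256, so the constraint holds with roughly 19.7k token IDs to spare before counting special tokens.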

---

## 🧪 Testing

```bash
pip install -e .[dev]
pytest tests/unit          # fast unit tests
pytest tests/property      # Hypothesis round-trip + Unicode torture
pytest tests/integration   # full E2E build (slower)
pytest --cov=cute_tokenizer
```

---

## 🔐 Production Hardening

- **Determinism**: same `(OS, python, tokenizers, corpus_hash, seed)`
  → byte-identical `tokenizer.json`. Verified on Linux. Cross-platform
  byte-identity is explicitly *not* part of the contract.
- **Roundtrip integrity**: 1500/1500 on Python holdout, 1500/1500 on
  multi-language holdout — verified by the benchmark runner on every
  release. (A spot-check sketch for your own files follows this list.)
- **Atomicity invariants**: `merge_policy.assert_invariants` enforces
  `model.type=BPE`, `decoder.type=ByteLevel`, `pre_tokenizer.type=ByteLevel`,
  and that every mapping PUA char has a vocab id, after every save.
- **No BMP-PUA collisions**: literal BMP PUA chars in user source
  (TS Unicode tables, CJK fonts) roundtrip unchanged because we
  assign mappings only to supplementary-plane PUAs.
- **No special-token text collisions**: `<s>`, `</s>`, `<unk>`, `<pad>`
  are deliberately *not* in the default special-token list — they collide
  with natural text in code.
- **Secret scrubbing**: corpus files matching AWS / OpenAI / Anthropic /
  GitHub / Slack / Google API key patterns, JWTs, and PEM private keys
  are dropped before vocab construction.
- **Build manifest**: every build emits `build_manifest.json` recording
  config, baseline name, corpus hash, vocab hash, library versions,
  merge audit counts, ingest stats, and timing.
- **Lint clean**: `ruff check` and `ruff format`.
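
A library-level spot check of roundtrip integrity on your own files (as referenced above), using only the public API from the Quick Start; the directory and glob are placeholders, and the `cute roundtrip-check` CLI exists for the same purpose:

```python
from pathlib import Path

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()

failures = []
for path in Path("./my-code").rglob("*.py"):  # placeholder directory / glob
    text = path.read_text(encoding="utf-8")
    ids = tok(text, add_special_tokens=False).input_ids
    if tok.decode(ids, skip_special_tokens=True) != text:
        failures.append(path)

print(f"{len(failures)} file(s) failed to round-trip")
```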

---

## 🐭 Why a Mouse?

A mouse is small, fast, and nibbles things to size. CUTE quietly chews
through your tokenization while you focus on the model.

---

## 📜 License

MIT. See [LICENSE](LICENSE).
