Metadata-Version: 2.4
Name: cute-tokenizer
Version: 0.1.3
Summary: Compact Unicode Token Encoding — semantic-prior-guided contextual tokenization for code
Project-URL: Homepage, https://github.com/HusseinEid101/CUTE
Project-URL: Issues, https://github.com/HusseinEid101/CUTE/issues
Author-email: Hussein Eid <HusseinEid101@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: bpe,code,huggingface,llm,nlp,tokenizer
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Requires-Dist: orjson>=3.10
Requires-Dist: pyahocorasick>=2.1
Requires-Dist: regex>=2024.7.24
Requires-Dist: tokenizers<0.22,>=0.20
Requires-Dist: tqdm>=4.66
Requires-Dist: transformers>=4.45
Requires-Dist: xxhash>=3.4
Provides-Extra: baseline
Requires-Dist: tiktoken>=0.7; extra == 'baseline'
Provides-Extra: benchmarks
Requires-Dist: matplotlib>=3.8; extra == 'benchmarks'
Requires-Dist: tabulate>=0.9; extra == 'benchmarks'
Requires-Dist: tiktoken>=0.7; extra == 'benchmarks'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: mypy>=1.11; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: tiktoken>=0.7; extra == 'dev'
Description-Content-Type: text/markdown

<p align="center">
  <img src="assets/mascot.jpg" alt="CUTE Tokenizer Mascot" width="600"/>
</p>

<h1 align="center">🐭 CUTE Tokenizer</h1>
<h3 align="center"><em>Compact Unicode Token Encoding</em></h3>
<p align="center"><strong>✨ semantic-prior-guided contextual tokenization for code ✨</strong></p>

<p align="center">
  <a href="https://www.python.org/">
    <img src="https://img.shields.io/badge/python-3.10+-blue?style=flat-square" alt="Python 3.10+"/>
  </a>
  <a href="LICENSE">
    <img src="https://img.shields.io/badge/License-MIT-green?style=flat-square" alt="License: MIT"/>
  </a>
  <a href="https://huggingface.co/docs/tokenizers">
    <img src="https://img.shields.io/badge/🤗-HuggingFace-ffd21e?style=flat-square" alt="HuggingFace Compatible"/>
  </a>
  <a href="https://pypi.org/project/cute-tokenizer/">
    <img src="https://img.shields.io/pypi/v/cute-tokenizer?style=flat-square&color=white&cb=20260507" alt="PyPI version"/>
  </a>
  <a href="https://github.com/HusseinEid101/CUTE/actions">
    <img src="https://img.shields.io/github/actions/workflow/status/HusseinEid101/CUTE/ci.yml?branch=main&style=flat-square" alt="CI"/>
  </a>
</p>

---

## ✨ Highlights

CUTE is a code-aware tokenizer that combines **explicit semantic anchors**
with **contextual subword merges** to produce compact, lossless token
sequences for Python, TypeScript, JavaScript, Rust, Go, and other
common programming languages.

The architecture has two stages:

- **Savings-based PUA mapping** — high-value words, operators, and
  identifier sub-parts are mapped to single Unicode Private-Use-Area
  characters, ranked by *expected token savings vs the cl100k baseline*
  (not raw frequency).
- **Contextual byte-level BPE** — the trainer sees PUA-substituted text,
  so it can learn merges around those anchors (e.g. whitespace + PUA),
  while a post-train safety filter forbids PUA + PUA pairs to keep the
  semantic units atomic.

**The result:**

- 🪄 Shorter sequences on real code than vanilla byte-level BPE
- 🔁 **Byte-equal lossless round-trip** on arbitrary Unicode (Hypothesis-verified)
- 🔒 **Deterministic** within a fixed `(OS, python, tokenizers)` host triple
- 🤗 Drop-in `AutoTokenizer` compatibility via `trust_remote_code`

---

## 🧀 Quick Start

```bash
pip install cute-tokenizer
```

The wheel ships a pretrained tokenizer (v1 model, code corpus). Use it
immediately — no training required:

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"  # always lossless
```

Or train your own tokenizer and point at the resulting artifacts:

```bash
# Drop a few repos into ./corpus/, then:
pip install 'cute-tokenizer[baseline]'  # pulls tiktoken for cl100k-aware ranking
cute build --corpus ./corpus --output ./output
```

```python
from cute_tokenizer import CUTETokenizerFast

tok = CUTETokenizerFast(
    tokenizer_file="./output/tokenizer.json",
    cute_mapping_file="./output/cute_mapping.json",
)
```

Or via `AutoTokenizer` (after pushing to HF Hub):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("user/cute-py", trust_remote_code=True)
```

---

## 🔍 How It Works

1. **Corpus ingest** — stream files, dedup by content hash, scrub secrets
   (AWS / OpenAI / Anthropic / GitHub / private keys / JWTs), optionally
   license-filter, write deterministic gzipped shards.
2. **Frequency mining** — parallel multiprocess token counter with
   identifier sub-part boosting (camelCase / snake_case / SCREAMING_CASE).
3. **Savings-based selection** — for each candidate token, compute
   `score = frequency × max(0, cl100k_count − 1)`. Tokens whose cl100k
   cost is 1 (single-byte ASCII like `(`, `,`) score zero — byte fallback
   already handles them optimally. Hashes / UUIDs / base64 blobs are
   filtered out by shape.
4. **PUA assignment** — selected tokens get unique codepoints in the
   Private-Use-Area, BMP first (`U+E000` …) for the cheapest UTF-8
   encoding. Codepoints already present in the corpus are skipped.
5. **Contextual BPE training** — the training stream is PUA-substituted
   *before* it reaches the trainer, so byte-level BPE actually sees PUA
   chars and can learn merges like `[Ġ][⟦return⟧]` (whitespace + anchor).
   PUA chars are also registered as `AddedToken`s so any anchor that
   wasn't picked up still has an atomic vocab id.
6. **Atomicity audit** — post-train, the `merge_policy` module walks the
   tokenizer JSON and (under `strict_pua_atomicity`) drops any
   PUA-PUA merges. Four invariants are asserted on every save: model is
   `BPE`, decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every
   mapping PUA char has a vocab id.
7. **Decode** — the byte-level decoder reconstructs the substituted
   string; reverse-substitution restores the original text.
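
Steps 2-3 can be sketched in plain Python. This is a standalone illustration, not the library's actual `frequency`/`selection` code: the candidate table and cl100k counts below are toy values, and a real run would get counts from a tiktoken baseline.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Identifier sub-parts for camelCase / snake_case / SCREAMING_CASE (step 2)."""
    return [p for p in re.split(r"_+|(?<=[a-z0-9])(?=[A-Z])", name) if p]

def savings_score(token: str, frequency: int, cl100k_count: int) -> int:
    """Expected savings vs the baseline (step 3): mapping `token` to one
    PUA char saves (cl100k_count - 1) tokens per occurrence."""
    return frequency * max(0, cl100k_count - 1)

# Toy candidate table: (token, corpus frequency, cl100k token count).
candidates = [
    ("parse_config",      4_000,   4),  # multi-token under cl100k -> real savings
    ("snake_case_helper",   800,   5),
    ("x",                90_000,   1),  # already one token -> zero savings
    (",",               120_000,   1),  # single-byte ASCII -> byte fallback is optimal
]

ranked = sorted(
    ((tok, savings_score(tok, f, c)) for tok, f, c in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
```

Note how high raw frequency alone (`x`, `,`) buys nothing: the ranking is driven entirely by how expensive the baseline makes each token.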

Round-trip is **byte-equal** for any input. We test this with Hypothesis
on arbitrary Unicode (incl. supplementary planes) plus a hand-curated
torture set: ZWJ family emoji, RTL+bidi controls, BOM, control chars,
NFC/NFD variants, mixed scripts, deep underscores.
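
The substitute-then-reverse cycle behind that guarantee (steps 4 and 7) can be illustrated with a toy mapping. This is a deliberately simplified sketch: the real pipeline runs Aho-Corasick longest-match over thousands of anchors and skips any codepoint already present in the corpus, whereas this uses three anchors and plain string replacement.

```python
# Toy anchor table: semantic unit -> BMP Private-Use-Area codepoint.
mapping = {
    "return": "\ue000",
    "def":    "\ue001",
    "->":     "\ue002",
}
# Longest-first substitution so short anchors cannot shadow longer ones.
ordered = sorted(mapping, key=len, reverse=True)

def substitute(text: str) -> str:
    """Forward pass (step 4): inject PUA anchors into the training stream."""
    for word in ordered:
        text = text.replace(word, mapping[word])
    return text

def reverse_substitute(text: str) -> str:
    """Decode pass (step 7): restore the original text byte-for-byte.
    Safe because corpus-present PUA codepoints are never assigned."""
    for word, pua in mapping.items():
        text = text.replace(pua, word)
    return text

src = "def add(a, b) -> int: return a + b"
assert reverse_substitute(substitute(src)) == src  # byte-equal round trip
```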

---

## 📦 Project Layout

```text
src/cute_tokenizer/
  baseline.py       # Cl100kBaseline / NullBaseline (savings scoring)
  config.py         # CUTEConfig — all knobs in one place
  patterns.py       # token regex + identifier splitter (uses `regex` module)
  corpus.py         # streaming ingest, dedup, secret scrub, sharding
  frequency.py      # parallel multiprocess counting
  selection.py      # savings-based selection + tightened PUA filter
  pua.py            # Private-Use-Area codepoint allocator
  pretokenizer.py   # PUA substitution (Aho-Corasick + identifier splitting)
  trainer.py        # build_cute() — pre-substituted BPE training
  merge_policy.py   # PUA atomicity audit + invariant assertions
  decode.py         # PUA-aware reverse substitution
  tokenizer.py      # CUTETokenizerFast (PreTrainedTokenizerFast)
  manifest.py       # build manifest for reproducibility
  cli.py            # `cute build`, `cute roundtrip-check`, `cute info`

tests/
  unit/             # ~180 unit tests (incl. baseline + selection + merge_policy)
  property/         # Hypothesis round-trip + Unicode torture
  integration/      # full pipeline E2E + determinism

benchmarks/
  compression.py    # CUTE vs cl100k / GPT-2 / CodeLlama
  latency.py        # encode/decode μs per KB

plans/
  cute-refit.md     # 9-step blueprint for the full v2 production refit
```

---

## ⚙️ Configuration

```python
from cute_tokenizer import CUTEConfig, Cl100kBaseline, build_cute

config = CUTEConfig(
    vocab_size=120_000,            # total token IDs
    pua_budget=50_000,             # max PUA-mapped tokens
    min_bpe_budget=50_000,         # minimum learnable BPE merges
    max_token_len=50,              # ignore tokens longer than this
    boost_weight=0.3,              # identifier sub-part boost
    seed=42,                       # determinism
    workers=0,                     # 0 = os.cpu_count()
    use_savings_selection=True,    # use cl100k-aware ranking (default)
    strict_pua_atomicity=True,     # forbid PUA+PUA merges (default)
    allow_supplementary_pua=False, # cap budget at BMP (6,400) for byte efficiency
    enable_secret_scrub=True,      # drop files containing API keys etc.
)
build_cute("./corpus", "./output", config=config, baseline=Cl100kBaseline())
```

The vocab math (validated at construction time) is:

```text
byte_alphabet (256) + special_tokens + pua_budget + min_bpe_budget ≤ vocab_size
```

---

## 🧪 Testing

```bash
pip install -e .[dev]
pytest tests/unit          # fast unit tests
pytest tests/property      # Hypothesis round-trip + Unicode torture
pytest tests/integration   # full E2E build (slower)
pytest --cov=cute_tokenizer
```

The Hypothesis suite runs hundreds of generated cases per property,
plus a hand-picked torture set covering: empty strings, BOM, ZWJ family
emoji, RTL+bidi controls, combining marks, control chars, supplementary
planes, NFC vs NFD, mixed scripts.

---

## 🔐 Production Hardening

- **Determinism**: same `(OS, python, tokenizers, corpus_hash, seed)`
  → byte-identical `tokenizer.json`. Verified on Linux by
  `tests/integration/test_tokenizer_determinism.py`. Cross-platform
  byte-identity is explicitly *not* part of the contract.
- **Atomicity invariants**: `merge_policy.assert_invariants` enforces
  `model.type=BPE`, `decoder.type=ByteLevel`, `pre_tokenizer.type=ByteLevel`,
  and that every mapping PUA char has a vocab id, after every save.
- **Secret scrubbing**: corpus files matching AWS / OpenAI / Anthropic /
  GitHub / Slack / Google API key patterns, JWTs, and PEM private keys
  are dropped before vocab construction.
- **Build manifest**: every build emits `build_manifest.json` recording
  config, baseline name, corpus hash, vocab hash, library versions,
  merge audit counts, ingest stats, and timing.
- **PUA collision detection**: codepoints found in the corpus are
  skipped during assignment, so user content cannot be confused with
  our injection.
- **Lint clean**: `ruff check` and `ruff format`.
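
The four save-time invariants can be sketched as a check over a parsed `tokenizer.json` dict. This illustrates what `merge_policy.assert_invariants` enforces rather than reproducing its code, and the PUA lookup is simplified: in a real HF tokenizer JSON, vocab keys are byte-level encoded, so the actual check is more involved.

```python
def check_invariants(tok_json: dict, pua_chars: set[str]) -> None:
    assert tok_json["model"]["type"] == "BPE"
    assert tok_json["decoder"]["type"] == "ByteLevel"
    assert tok_json["pre_tokenizer"]["type"] == "ByteLevel"
    vocab = tok_json["model"]["vocab"]
    # Simplified: real vocab keys are byte-level encoded strings.
    missing = [c for c in pua_chars if c not in vocab]
    assert not missing, f"PUA chars without a vocab id: {missing!r}"

# Minimal well-formed example:
tok_json = {
    "model": {"type": "BPE", "vocab": {"\ue000": 300, "\ue001": 301}},
    "decoder": {"type": "ByteLevel"},
    "pre_tokenizer": {"type": "ByteLevel"},
}
check_invariants(tok_json, {"\ue000", "\ue001"})  # passes silently
```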

---

## 📊 Benchmarks

```bash
# Compare CUTE against cl100k (and other baselines if installed)
python -m benchmarks.compression --tokenizer ./output --holdout ./holdout
python -m benchmarks.latency --tokenizer ./output
```

The benchmark suite measures bytes-per-token, p50/p95/p99 sequence
lengths, encode and decode latency, and peak RSS, on a held-out corpus
that was never seen during training. Run it on your own corpus to see
numbers for your distribution.
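
The headline bytes-per-token metric reduces to one division per document; higher is better, since more source bytes ride on each token. The whitespace `encode` below is a hypothetical stand-in for any tokenizer, used only to make the sketch self-contained.

```python
def bytes_per_token(text: str, encode) -> float:
    """UTF-8 bytes divided by token count for one document."""
    ids = encode(text)
    return len(text.encode("utf-8")) / len(ids)

def ws_encode(s: str) -> list[str]:
    # Stand-in tokenizer: whitespace split (illustration only).
    return s.split()

sample = "def hello(): return 42"
print(bytes_per_token(sample, ws_encode))  # prints 5.5 (22 bytes / 4 tokens)
```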

A reproducible v2-model benchmark vs cl100k / GPT-2 / CodeLlama / StarCoder2
is the gate for unlocking quantitative compression claims in this README.

---

## 🐭 Why a Mouse?

A mouse is small, fast, and nibbles things to size. CUTE quietly chews
through your tokenization while you focus on the model.

---

## 📜 License

MIT. See [LICENSE](LICENSE).
