Metadata-Version: 2.4
Name: bpe-from-scratch
Version: 0.3.0
Summary: Byte-pair encoding tokenizer built from scratch
Project-URL: Homepage, https://github.com/hannody/bpe_from_scratch
Project-URL: Repository, https://github.com/hannody/bpe_from_scratch
License-Expression: MIT
Requires-Python: >=3.12
Requires-Dist: regex>=2026.2.28
Requires-Dist: tqdm>=4.67.3
Provides-Extra: dev
Requires-Dist: jupyter>=1.1.1; extra == 'dev'
Requires-Dist: pytest>=9.0.2; extra == 'dev'
Description-Content-Type: text/markdown

# BPE from Scratch

A ground-up Python implementation of **Byte-Level Byte Pair Encoding (BBPE)** — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, and other modern LLMs.

## What is Byte-Level BPE?

BPE was originally a data compression algorithm that replaces the most frequent pair of bytes in a sequence with a single unused symbol. Applied to NLP, it builds a **subword vocabulary** by iteratively merging the most frequent adjacent token pairs in a corpus.

The *byte-level* variant (introduced by OpenAI for GPT-2) operates directly on raw UTF-8 bytes rather than characters or words:

- **Base vocabulary of 256** — one token per possible byte value (0–255), no unknown tokens ever
- **Language-agnostic** — any Unicode text (code, math, emoji, CJK, ...) is representable without a special `<UNK>` token
- **Lossless** — encoding and decoding are exact roundtrips
- **Merges learned greedily** — at each step, the most frequent adjacent pair is merged and assigned a new token ID (256, 257, ...)

This is the same fundamental approach used by `tiktoken` (OpenAI) and Hugging Face tokenizers for GPT-style models.
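
The properties above follow directly from how UTF-8 works. A minimal illustration using only the standard library (no part of this package is involved):

```python
# Any Unicode string maps to a sequence of integers in [0, 255].
text = "h\u00e9llo \U0001F30D"       # 'e' with acute accent, plus a globe emoji
ids = list(text.encode("utf-8"))

print(len(text), len(ids))           # 7 code points become 11 byte tokens
assert all(0 <= i <= 255 for i in ids)

# Decoding the same bytes restores the text exactly (lossless roundtrip).
assert bytes(ids).decode("utf-8") == text
```

Because every byte value already has a token ID, no input can fall outside the vocabulary; merges only shorten the sequence.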

## Algorithm Phases

| Phase | Description                                                                          |
| ----- | ------------------------------------------------------------------------------------ |
| 1     | **UTF-8 encoding** — normalize and encode input text to bytes                        |
| 2     | **Byte → token conversion** — represent each byte as an integer token ID in [0, 255] |
| 3     | **Pair counting** — count all adjacent token pairs in the sequence                   |
| 4     | **Merge** — replace the most frequent pair with a new token ID                       |
| 5     | **Repeat** — iterate until the target vocabulary size is reached                     |
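
The five phases can be sketched in a few lines. This is a simplified illustration, not the package's implementation: `count_pairs`, `merge`, and `train_sketch` are hypothetical names, the sketch trains on a single string, and it omits pre-tokenization and any normalization.

```python
from collections import Counter


def count_pairs(ids):
    """Phase 3: count every adjacent pair of token IDs."""
    return Counter(zip(ids, ids[1:]))


def merge(ids, pair, new_id):
    """Phase 4: replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_sketch(text, vocab_size):
    """Phases 1-5 on one string; returns (merges, final token IDs)."""
    ids = list(text.encode("utf-8"))        # phases 1-2: text -> bytes -> IDs
    merges = {}                             # (left, right) -> new token ID
    for new_id in range(256, vocab_size):   # phase 5: repeat until full
        pairs = count_pairs(ids)
        if not pairs:                       # fewer than two tokens left
            break
        best = max(pairs, key=pairs.get)    # ties go to the pair seen first
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids


merges, ids = train_sketch("aaabdaaabac", vocab_size=259)
print(merges)  # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(ids)     # [258, 100, 258, 97, 99]
```

Three merges shrink the 11-byte input to 5 tokens. Recounting all pairs on every iteration is quadratic in the worst case, which is why optimized trainers update the counts incrementally.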

## Installation

```bash
pip install bpe-from-scratch
```

## Usage

### Train from scratch

```python
from bpe_from_scratch import ByteLevelBPE

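# `text` is a str holding the training corpus; the path below is a placeholder.
with open("corpus.txt", encoding="utf-8") as f:
    text = f.read()

bpe = ByteLevelBPE()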
bpe.train(text, vocab_size=1024)  # 1024 total tokens, 768 merge rules
bpe.save("my_model.json")
```

Or train directly from a folder of `.txt` files:

```python
from bpe_from_scratch.train import train_from_folder

bpe = train_from_folder(
    folder_path="data/corpus_A/",
    model_path="my_model.json",
    vocab_size=1024,
)
```

### Utilize all CPU cores

Pass `num_workers=os.cpu_count()` to parallelize pre-tokenization across all cores (useful for large corpora):

```python
import os
from bpe_from_scratch import ByteLevelBPE

bpe = ByteLevelBPE()
bpe.train(text, vocab_size=50_257, num_workers=os.cpu_count())
```

Or via the folder helper:

```python
import os
from bpe_from_scratch.train import train_from_folder

train_from_folder(
    folder_path="data/corpus/",
    model_path="my_model.json",
    vocab_size=50_257,
    num_workers=os.cpu_count(),
)
```

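> **Note:** Guard the call site with `if __name__ == "__main__":` whenever worker processes are started with `spawn` or `forkserver`. `spawn` is the default on Windows and macOS, and Python 3.14 made `forkserver` the default on Linux. Only Linux on Python 3.13 or earlier defaults to `fork`, where the guard is optional. Adding the guard everywhere is the portable choice.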

### Encode and decode

```python
tokens = bpe.encode("Hello, world!")  # list[int]
text   = bpe.decode(tokens)           # "Hello, world!"
```
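
This README does not show how `encode` and `decode` work internally, so the following is only a sketch of how they typically work in a byte-level BPE. It assumes the learned merges are stored as a mapping from a token pair to its new ID. `encode_sketch` and `decode_sketch` are illustrative names, not part of the package API, and pre-tokenization is omitted.

```python
def encode_sketch(text, merges):
    """Apply learned merges to the UTF-8 bytes of `text`, earliest merge first."""
    ids = list(text.encode("utf-8"))
    while len(ids) >= 2:
        # Among the adjacent pairs present, pick the one learned earliest.
        pair = min(zip(ids, ids[1:]), key=lambda p: merges.get(p, float("inf")))
        if pair not in merges:
            break                           # nothing left to merge
        new_id, out, i = merges[pair], [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return ids


def decode_sketch(ids, merges):
    """Expand every token ID back to its bytes, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in sorted(merges.items(), key=lambda kv: kv[1]):
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = {(97, 97): 256, (256, 97): 257, (257, 98): 258}
ids = encode_sketch("aaab!", merges)
print(ids)                          # [258, 33]
print(decode_sketch(ids, merges))   # aaab!
```

Decoding uses `errors="replace"` because an arbitrary token sequence, unlike one produced by the encoder, may not form valid UTF-8.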

### Continue training on new data

Load an existing model and extend the vocabulary without discarding what was already learned:

```python
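from bpe_from_scratch import ByteLevelBPE

# `new_text` is a str holding the additional training data.
bpe = ByteLevelBPE()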
bpe.load("my_model.json")
bpe.continue_train(new_text, new_vocab_size=1280)  # extend to 1280 total tokens
bpe.save("my_model.json")
```

All previously learned token IDs remain stable — documents encoded with the old model are still valid after the update.

Or use the folder helper:

```python
from bpe_from_scratch.train import continue_train_from_folder

bpe = continue_train_from_folder(
    folder_path="data/corpus_B/",
    model_path="my_model.json",
    new_vocab_size=1280,
)
```

> **Note:** `continue_train` uses a frozen-base approach — existing merges are replayed on the new text before new rules are learned. This keeps token IDs stable but is not equivalent to a full retrain on the combined corpus. See [TRAINING_GUIDE.md](TRAINING_GUIDE.md) for details and tradeoffs.
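
The frozen-base idea can be sketched as follows. This is an illustration under assumptions, not the package's code: `continue_train_sketch` and `merge` are hypothetical names, merges are assumed to be a pair-to-ID mapping with consecutive IDs starting at 256, and pre-tokenization is omitted.

```python
from collections import Counter


def merge(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def continue_train_sketch(new_text, old_merges, new_vocab_size):
    """Replay the frozen merges on new text, then learn additional ones."""
    merges = dict(old_merges)               # existing rules are never changed
    ids = list(new_text.encode("utf-8"))
    # Step 1: replay existing merges in the order they were learned.
    for pair, tok in sorted(old_merges.items(), key=lambda kv: kv[1]):
        ids = merge(ids, pair, tok)
    # Step 2: learn new merges, continuing from the next free token ID.
    for new_id in range(256 + len(merges), new_vocab_size):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges[best] = new_id
        ids = merge(ids, best, new_id)
    return merges, ids


old = {(97, 97): 256}                       # "aa" was learned on the old corpus
merges, ids = continue_train_sketch("abababaa", old, new_vocab_size=258)
print(merges)  # {(97, 97): 256, (97, 98): 257}
print(ids)     # [257, 257, 257, 256]
```

The old rule keeps ID 256 even though `ab` is far more frequent in the new text. A full retrain on the combined corpus could have ranked the pairs differently, which is the tradeoff the note describes.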

## Project Structure

```
src/bpe.py        # Core implementation
tests/test_bpe.py # Unit tests
tests/manual/     # Interactive notebooks for experimentation
```

## Running Tests

```bash
PYTHONPATH=src python3 -m unittest discover -s tests -v
```

## Acknowledgements

Inspired by Andrej Karpathy's [minbpe](https://github.com/karpathy/minbpe).

## References

- [minbpe](https://github.com/karpathy/minbpe) — Minimal BPE implementation by Andrej Karpathy
- [GPT-2 Paper — Language Models are Unsupervised Multitask Learners](https://openai.com/research/language-unsupervised) (Radford et al., 2019)
- [Byte-Pair Encoding tokenization — Hugging Face NLP Course](https://huggingface.co/learn/llm-course/chapter6/5)
- [Neural Machine Translation of Rare Words with Subword Units](https://aclanthology.org/P16-1162/) — original BPE-for-NLP paper (Sennrich et al., 2016)
- [Building a Fast BPE Tokenizer from Scratch — Jun Yu Tan](https://jytan.net/blog/2025/bpe/) — Stages 1–5 complexity analysis and benchmarks
- [From Hours to Seconds: Optimising BPE Tokeniser Training — Logan Thomson](https://medium.com/@logan_16888/from-hours-to-seconds-optimising-bpe-tokeniser-training-f4234300d03e) — Stages 1–7 including GC reduction and adaptive parallelism
- [Bypassing the GIL for Parallel Processing — Real Python](https://realpython.com/python-parallel-processing/) — multiprocessing vs threading guidance
