Metadata-Version: 2.4
Name: nanotok
Version: 0.1.0
Summary: High-performance BPE tokenizer. 20-60x faster than tiktoken.
Keywords: tokenizer,bpe,nlp,llm,tiktoken,huggingface
Author-Email: Ishaan <ishaan@example.com>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Project-URL: Homepage, https://github.com/ishaan/nanotok
Project-URL: Repository, https://github.com/ishaan/nanotok
Project-URL: Issues, https://github.com/ishaan/nanotok/issues
Requires-Python: >=3.10
Provides-Extra: hub
Requires-Dist: huggingface-hub>=0.20; extra == "hub"
Provides-Extra: chat
Requires-Dist: jinja2>=3.0; extra == "chat"
Provides-Extra: all
Requires-Dist: huggingface-hub>=0.20; extra == "all"
Requires-Dist: jinja2>=3.0; extra == "all"
Description-Content-Type: text/markdown

# nanotok

[![PyPI version](https://badge.fury.io/py/nanotok.svg)](https://badge.fury.io/py/nanotok)
[![Python](https://img.shields.io/pypi/pyversions/nanotok.svg)](https://pypi.org/project/nanotok/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A high-performance BPE tokenizer written in C++ with Python bindings. **20-60x faster than tiktoken.**

## Installation

```bash
pip install nanotok
```

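Optional extras: `hub` installs huggingface-hub (for `from_pretrained`), `chat` installs jinja2 (for chat templates), and `all` installs both: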

```bash
pip install "nanotok[all]"  # includes huggingface-hub and jinja2
```

## Quick Start

```python
from nanotok import Tokenizer

# Choose one loader; each assignment below replaces the previous one.
# Load from Hugging Face Hub (requires huggingface-hub)
tokenizer = Tokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Load from tiktoken encoding
tokenizer = Tokenizer.from_tiktoken("cl100k_base")

# Load from local file
tokenizer = Tokenizer.from_file("path/to/tokenizer.json")

# Encode/decode
ids = tokenizer.encode("Hello, world!")
text = tokenizer.decode(ids)

# Batch processing
batch_ids = tokenizer.encode_batch(["Hello", "World"])
batch_texts = tokenizer.decode_batch(batch_ids)

# Hugging Face-style API (return_tensors="pt" presumably needs PyTorch installed)
result = tokenizer("Hello, world!", padding=True, return_tensors="pt")
print(result["input_ids"], result["attention_mask"])

# Chat templates (requires jinja2)
messages = [{"role": "user", "content": "Hello!"}]
rendered = tokenizer.apply_chat_template(messages, tokenize=False)
```

## Features

- **Fast**: 20-60x faster than tiktoken, written in C++ with SIMD optimizations
- **Compatible**: Drop-in replacement for tiktoken and Hugging Face tokenizers
- **Batch processing**: Efficient batch encode/decode
- **Chat templates**: Support for Jinja2 chat templates
- **Special tokens**: Full support for special token handling
- **Cache**: Built-in encoding cache for repeated text
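
The speed-up you observe depends on hardware, input length, and the encoding loaded, so it is worth measuring on your own data. A minimal sketch of a throughput timer follows. `encode_throughput` is a hypothetical helper, not part of nanotok; it accepts any callable that maps a string to token IDs, such as `tokenizer.encode`.

```python
import time
from typing import Callable, Sequence


def encode_throughput(
    encode: Callable[[str], Sequence[int]],
    texts: Sequence[str],
    repeats: int = 5,
) -> float:
    """Return the best observed throughput in characters per second.

    Hypothetical helper (not part of nanotok): `encode` is any callable that
    turns a string into token IDs, e.g. `tokenizer.encode`.
    """
    total_chars = sum(len(t) for t in texts)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            encode(text)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer on tiny inputs.
    return total_chars / max(best, 1e-9)
```

Running it once with nanotok's `tokenizer.encode` and once with the `encode` method of a tiktoken encoding, on the same `texts`, gives a ratio to compare against the figure quoted here. Because the built-in cache may flatter repeated runs over identical text, calling `set_cache_enabled(False)` first should give a colder measurement.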

## API Reference

### `Tokenizer`

#### Class Methods
- `from_file(path)` - Load from a local `tokenizer.json` file
- `from_pretrained(repo_id)` - Load from Hugging Face Hub (requires huggingface-hub)
- `from_tiktoken(encoding_name)` - Load from a tiktoken encoding (gpt2, r50k_base, p50k_base, cl100k_base, o200k_base)
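
When the tokenizer source arrives as a single string (from a config file or CLI flag, say), the three loaders can be selected automatically. The sketch below is one way to do that. `load_tokenizer` and `TIKTOKEN_ENCODINGS` are hypothetical names, not part of nanotok, and the class is passed in as an argument so the dispatch logic can be exercised without the package installed.

```python
import os

# Encoding names listed in this README for `from_tiktoken`.
TIKTOKEN_ENCODINGS = {"gpt2", "r50k_base", "p50k_base", "cl100k_base", "o200k_base"}


def load_tokenizer(tokenizer_cls, source: str):
    """Pick a loader based on what `source` looks like.

    Hypothetical helper (not part of nanotok). `tokenizer_cls` is expected to
    expose `from_file`, `from_tiktoken`, and `from_pretrained`, as the class
    methods listed above do.
    """
    if os.path.isfile(source):
        return tokenizer_cls.from_file(source)
    if source in TIKTOKEN_ENCODINGS:
        return tokenizer_cls.from_tiktoken(source)
    # Anything else is treated as a Hugging Face Hub repo ID.
    return tokenizer_cls.from_pretrained(source)
```

With nanotok installed, `load_tokenizer(Tokenizer, "cl100k_base")` would go through `from_tiktoken`. An existing local file is checked first, so a file that happens to share a name with an encoding takes precedence.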

#### Methods
- `encode(text, allowed_special=None, add_special_tokens=False)` - Encode text to token IDs
- `decode(ids, skip_special_tokens=False)` - Decode token IDs to text
- `encode_batch(texts, ...)` - Batch encode
- `decode_batch(batch_ids, ...)` - Batch decode
- `token_to_id(token)` - Get ID for token
- `id_to_token(id)` - Get token for ID
- `apply_chat_template(messages, tokenize=True, add_generation_prompt=False)` - Apply chat template (requires jinja2); pass `tokenize=False` to get the rendered string
- `clear_cache()` - Clear encoding cache
- `set_cache_enabled(enabled)` - Enable/disable cache
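
Before swapping nanotok in for another tokenizer, a quick sanity check is to confirm that your texts survive an encode/decode round trip. The sketch below relies only on `encode_batch` and `decode_batch` as listed above. `find_roundtrip_failures` is a hypothetical helper, not part of nanotok, and whether every input round-trips exactly depends on the loaded tokenizer's normalization.

```python
from typing import List, Sequence


def find_roundtrip_failures(tokenizer, texts: Sequence[str]) -> List[str]:
    """Return the inputs that do not survive encode -> decode unchanged.

    Hypothetical helper (not part of nanotok). It only relies on
    `encode_batch(texts)` and `decode_batch(batch_ids)`.
    """
    texts = list(texts)
    batch_ids = tokenizer.encode_batch(texts)
    decoded = tokenizer.decode_batch(batch_ids)
    return [original for original, result in zip(texts, decoded) if original != result]
```

An empty result means every text was reproduced exactly; any returned strings are worth inspecting for normalization or special-token effects.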

#### Properties
- `vocab_size` - Vocabulary size
- `special_tokens` - Dict of special tokens
- `eos_token`, `bos_token`, `pad_token`, `unk_token` - Special token strings
- `eos_token_id`, `bos_token_id`, `pad_token_id`, `unk_token_id` - Special token IDs
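
The special token IDs are useful when you batch sequences yourself rather than relying on `padding=True`. The sketch below right-pads ID lists to a common length and builds matching attention masks. `pad_batch` is a hypothetical helper, not part of nanotok, and the right-padding with a 1/0 mask is an assumption borrowed from the common Hugging Face convention, not a description of nanotok's internals.

```python
from typing import Dict, List, Sequence


def pad_batch(batch_ids: Sequence[Sequence[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token ID lists to a common length and build attention masks.

    Hypothetical helper (not part of nanotok); right-padding with a 1/0 mask
    follows the common Hugging Face convention and is an assumption here.
    """
    max_len = max((len(ids) for ids in batch_ids), default=0)
    input_ids, attention_mask = [], []
    for ids in batch_ids:
        n_pad = max_len - len(ids)
        input_ids.append(list(ids) + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

A typical call would be `pad_batch(tokenizer.encode_batch(texts), tokenizer.pad_token_id)`. Some tokenizers define no pad token, in which case `pad_token_id` may be unset and another ID (often `eos_token_id`) has to be chosen explicitly.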

## Development

```bash
# Clone and install with uv
git clone https://github.com/ishaan/nanotok
cd nanotok
uv sync

# Run tests
uv run pytest

# Build wheel
uv build
```

## License

MIT
