Metadata-Version: 2.4
Name: turbotok
Version: 0.1.0
Summary: High-performance NumPy-based tokenizer library
Home-page: https://github.com/turbotok/turbotok
Download-URL: https://github.com/turbotok/turbotok/archive/refs/tags/v0.1.0.tar.gz
Author: TurboTok Team
Author-email: TurboTok Team <team@turbotok.dev>
Maintainer-email: TurboTok Team <team@turbotok.dev>
License: MIT
Project-URL: Homepage, https://github.com/turbotok/turbotok
Project-URL: Documentation, https://turbotok.readthedocs.io/
Project-URL: Repository, https://github.com/turbotok/turbotok
Project-URL: Bug Tracker, https://github.com/turbotok/turbotok/issues
Keywords: tokenizer,nlp,text-processing,numpy,performance,machine-learning
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20.0
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-cov>=2.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.800; extra == "dev"
Provides-Extra: benchmark
Requires-Dist: timeit; extra == "benchmark"
Dynamic: author
Dynamic: download-url
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# TurboTok 🚀

**High-performance NumPy-based tokenizer library**

TurboTok is a fast tokenizer written in pure Python on top of NumPy. It pushes per-character work into NumPy's vectorized C routines, which can use SIMD instructions where the platform supports them, and keeps Python-level loops out of the hot path.

## Features

- ⚡ **Ultra-fast**: Targets 1-10M tokens/sec depending on mode
- 🧠 **NumPy vectorization**: SIMD operations for maximum speed
- 🎯 **Multiple modes**: byte, char, word, and sentence tokenization
- 🐍 **Pure Python**: No external dependencies beyond NumPy
- 🌍 **Unicode support**: Full Unicode character handling
- 📦 **Batch processing**: Efficient tokenization of multiple texts
- 📊 **Performance stats**: Built-in benchmarking and statistics

## Installation

```bash
pip install turbotok
```

For development:
```bash
# Quote the extra so shells such as zsh do not treat the brackets as a glob
pip install "turbotok[dev]"
```

## Quick Start

```python
import turbotok

# Create tokenizer with desired mode
tok = turbotok.TurboTok(mode="word")

# Tokenize text
tokens = tok.tokenize("Hello world! 🚀 TurboTok is blazingly fast!")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀', 'TurboTok', 'is', 'blazingly', 'fast', '!']

# Get statistics
stats = tok.get_stats("Hello world! 🚀")
print(stats)
# Output: {'mode': 'word', 'token_count': 4, 'avg_token_length': 3.0, ...}
```

## Tokenization Modes

### 1. Byte Mode (Fastest)
Raw byte-level tokenization using NumPy vectorization:
```python
tok = turbotok.TurboTok(mode="byte")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
# Target throughput: 5-10M tokens/sec
```

### 2. Char Mode
Unicode character-level tokenization:
```python
tok = turbotok.TurboTok(mode="char")
tokens = tok.tokenize("Hello! 🚀")
print(tokens)
# Output: ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
# Target throughput: 3-5M tokens/sec
```

### 3. Word Mode (Default)
Word-level tokenization with regex:
```python
tok = turbotok.TurboTok(mode="word")
tokens = tok.tokenize("Hello world! 🚀")
print(tokens)
# Output: ['Hello', 'world', '!', '🚀']
# Target throughput: 2-4M tokens/sec
```

### 4. Sentence Mode
Sentence-level tokenization:
```python
tok = turbotok.TurboTok(mode="sentence")
tokens = tok.tokenize("Hello world! How are you? I am fine.")
print(tokens)
# Output: ['Hello world!', 'How are you?', 'I am fine.']
# Target throughput: 1-2M tokens/sec
```

## Batch Processing

Tokenize multiple texts efficiently:
```python
texts = ["Hello world!", "TurboTok 🚀 rocks!", "Fast tokenization!"]
tok = turbotok.TurboTok(mode="word")

# Batch tokenization
batch_tokens = tok.tokenize_batch(texts)
print(batch_tokens)
# Output: [['Hello', 'world', '!'], ['TurboTok', '🚀', 'rocks', '!'], ['Fast', 'tokenization', '!']]
```

## Performance Benchmarks

Run the built-in benchmark suite:
```python
from turbotok.benchmarks import run_benchmarks

results = run_benchmarks(text_size_mb=1.0, iterations=50)
```
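
To sanity-check throughput numbers independently of the built-in suite, a tokens/sec figure can be measured by hand. The sketch below is an illustration only: `tokens_per_second` is a hypothetical helper, not part of TurboTok's API, and `str.split` stands in for any tokenizer callable.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    tokenize: Callable[[str], Sequence],
    text: str,
    iterations: int = 50,
) -> float:
    """Hypothetical helper: average tokens/sec over repeated runs."""
    # One untimed call serves as a warm-up and gives the token count.
    n_tokens = len(tokenize(text))

    start = time.perf_counter()
    for _ in range(iterations):
        tokenize(text)
    elapsed = time.perf_counter() - start

    if elapsed <= 0:
        # Guard against a clock too coarse to measure the run.
        return float("inf")
    return n_tokens * iterations / elapsed


# str.split is only a stand-in; pass any callable that returns tokens.
sample = "Hello world! TurboTok is fast. " * 10_000
rate = tokens_per_second(str.split, sample, iterations=20)
print(f"{rate / 1e6:.2f}M tokens/sec")
```

Results vary with hardware, input size, and Python version, so compare modes on the same machine and text.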

### Target Performance Goals

| Mode | Target (tokens/sec) | Status |
|------|-------------------|---------|
| Byte | 5-10M | 🚀 |
| Char | 3-5M | 🚀 |
| Word | 2-4M | 🚀 |
| Sentence | 1-2M | 🚀 |

## API Reference

### TurboTok Class

#### Constructor
```python
TurboTok(mode: str = "word")
```

**Parameters:**
- `mode` (str): Tokenization mode ("byte", "char", "word", "sentence")

#### Methods

##### `tokenize(text: str) -> List[Union[int, str]]`
Tokenize a single text string.

**Parameters:**
- `text` (str): Input text to tokenize

**Returns:**
- `List[Union[int, str]]`: List of tokens (bytes as ints for byte mode, strings otherwise)

##### `tokenize_batch(texts: List[str]) -> List[List[Union[int, str]]]`
Tokenize multiple texts efficiently.

**Parameters:**
- `texts` (List[str]): List of input texts

**Returns:**
- `List[List[Union[int, str]]]`: List of token lists

##### `get_stats(text: str) -> dict`
Get tokenization statistics.

**Parameters:**
- `text` (str): Input text

**Returns:**
- `dict`: Statistics including token count, average length, compression ratio, etc.

## Performance Optimizations

### 1. NumPy Vectorization
- Byte mode uses `np.frombuffer()` for C-level speed
- No Python loops in critical paths
- SIMD operations under the hood
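
The byte-mode idea can be sketched in a few lines. This is a minimal illustration under the assumption that byte tokens are the UTF-8 bytes of the input; `byte_tokens` is a hypothetical name, not necessarily how TurboTok implements it.

```python
from typing import List

import numpy as np


def byte_tokens(text: str) -> List[int]:
    """Hypothetical sketch: UTF-8 bytes of `text` as a list of ints."""
    data = text.encode("utf-8")
    if not data:
        return []
    # frombuffer wraps the bytes object without copying; tolist() then
    # converts every element to a Python int in one C-level pass.
    return np.frombuffer(data, dtype=np.uint8).tolist()


print(byte_tokens("Hello! 🚀"))
# [72, 101, 108, 108, 111, 33, 32, 240, 159, 154, 128]
```

The emoji contributes four tokens because it occupies four bytes in UTF-8.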

### 2. Pre-compiled Regex
- Word and sentence patterns compiled once at initialization
- Avoids repeated regex compilation overhead
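
As an illustration of the compile-once approach, the sketch below builds two patterns at import time and reuses them on every call. The patterns and function names are assumptions chosen to reproduce the example outputs in this README; TurboTok's actual patterns may differ.

```python
import re
from typing import List

# Assumed patterns, compiled once at module import.
# Word: a run of word characters, or any single non-space symbol.
WORD_RE = re.compile(r"\w+|[^\w\s]")
# Sentence: split on whitespace that follows ., ! or ?
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def word_tokens(text: str) -> List[str]:
    """Hypothetical sketch of word mode."""
    return WORD_RE.findall(text)


def sentence_tokens(text: str) -> List[str]:
    """Hypothetical sketch of sentence mode."""
    return [s for s in SENTENCE_SPLIT_RE.split(text.strip()) if s]


print(word_tokens("Hello world! 🚀"))
# ['Hello', 'world', '!', '🚀']
print(sentence_tokens("Hello world! How are you? I am fine."))
# ['Hello world!', 'How are you?', 'I am fine.']
```

A split this simple will also break after abbreviations such as "e.g.", so it is a starting point rather than a robust sentence segmenter.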

### 3. Memory Views
- Uses `np.frombuffer` instead of string iteration
- Direct memory access for maximum performance
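
The same buffer trick extends to character mode. The sketch below assumes one token per Unicode code point and uses a UTF-32 encoding so that every character occupies exactly four bytes; `char_tokens` is a hypothetical name, not TurboTok's confirmed implementation.

```python
from typing import List

import numpy as np


def char_tokens(text: str) -> List[str]:
    """Hypothetical sketch: one token per Unicode code point."""
    if not text:
        return []
    # UTF-32-LE stores each code point as a 4-byte little-endian integer.
    codepoints = np.frombuffer(text.encode("utf-32-le"), dtype="<u4")
    # Reinterpret the same memory as 1-character NumPy strings (no copy).
    # Caveat: NumPy strips trailing NULs, so U+0000 becomes ''.
    return codepoints.view("<U1").tolist()


print(char_tokens("Hello! 🚀"))
# ['H', 'e', 'l', 'l', 'o', '!', ' ', '🚀']
```

The intermediate `codepoints` array is also useful on its own, for example to build boolean masks for whitespace or punctuation without iterating over the string.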

### 4. Batch Processing
- Vectorized operations for multiple texts
- Reduced function call overhead
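
One way to batch byte-mode work is to wrap all texts in a single array and split it at the text boundaries. This is a sketch of that idea only; `byte_tokenize_batch` is a hypothetical function and TurboTok's `tokenize_batch` may work differently.

```python
from typing import List

import numpy as np


def byte_tokenize_batch(texts: List[str]) -> List[List[int]]:
    """Hypothetical sketch: byte-tokenize many texts via one array."""
    encoded = [t.encode("utf-8") for t in texts]
    joined = b"".join(encoded)
    if not joined:
        # No texts, or only empty ones.
        return [[] for _ in texts]

    # Wrap all bytes in one array instead of one array per text.
    flat = np.frombuffer(joined, dtype=np.uint8)

    # Cumulative lengths give the index where each text ends.
    lengths = np.array([len(b) for b in encoded], dtype=np.int64)
    boundaries = np.cumsum(lengths)[:-1]

    return [chunk.tolist() for chunk in np.split(flat, boundaries)]


print(byte_tokenize_batch(["ab", "c"]))
# [[97, 98], [99]]
```

Whether this beats per-text calls depends on batch size and text length, so it is worth measuring before adopting.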

## Development

### Running Tests
```bash
pytest tests/
```

### Running Benchmarks
```bash
python -m turbotok.benchmarks
```

### Code Quality
```bash
# Format code
black turbotok/ tests/

# Lint code
flake8 turbotok/ tests/

# Type checking
mypy turbotok/
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Run the test suite
6. Submit a pull request

## License

MIT License - see LICENSE file for details.

## Performance Philosophy

TurboTok follows these performance principles:

1. **Exploit NumPy vectorization** - SIMD under the hood
2. **Minimize Python loops** - They kill speed
3. **Use memory views** - `np.frombuffer`, `np.char` ops
4. **Apply math-like thinking** - Treat text as arrays, not strings
5. **Pre-compile patterns** - Avoid repeated regex compilation
6. **Batch operations** - Process multiple texts efficiently

## Roadmap

- [ ] Parallel processing with multiprocessing
- [ ] Numba JIT compilation for even more speed
- [ ] Custom vocabulary support
- [ ] Subword tokenization modes
- [ ] Streaming tokenization for large files
- [ ] Integration with popular NLP frameworks

---

**TurboTok** - Because speed matters! 🚀⚡
