Metadata-Version: 2.4
Name: xerv-crayon
Version: 4.0.7
Summary: The Omni-Backend Tokenizer (CPU/CUDA/ROCm)
Home-page: https://github.com/Xerv-AI/crayon
Author: Xerv Research Engineering Division
Author-email: Xerv Research Engineering Division <engineering@xerv.ai>
License: MIT License
        
        Copyright (c) 2025 Xerv Research Engineering Division
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/Xerv-AI/crayon
Project-URL: Repository, https://github.com/Xerv-AI/crayon.git
Project-URL: Documentation, https://github.com/Xerv-AI/crayon#readme
Project-URL: Bug Tracker, https://github.com/Xerv-AI/crayon/issues
Keywords: tokenizer,nlp,simd,avx2,high-performance,zero-copy,dat,double-array-trie,machine-learning,deep-learning,transformers,llm
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: C
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: full
Requires-Dist: requests>=2.31.0; extra == "full"
Requires-Dist: datasets>=2.18.0; extra == "full"
Requires-Dist: huggingface-hub>=0.21.0; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Provides-Extra: benchmark
Requires-Dist: tiktoken>=0.5.0; extra == "benchmark"
Requires-Dist: transformers>=4.30.0; extra == "benchmark"
Requires-Dist: matplotlib>=3.7.0; extra == "benchmark"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

<p align="center">
  <img src="https://em-content.zobj.net/source/microsoft-teams/363/crayon_1f58d-fe0f.png" width="120" alt="Crayon Logo"/>
</p>

<h1 align="center">🖍️ XERV Crayon v4.0</h1>

<p align="center">
  <strong>The Omni-Backend Tokenizer for Specialized AI</strong>
</p>

<p align="center">
  <a href="https://badge.fury.io/py/xerv-crayon"><img src="https://badge.fury.io/py/xerv-crayon.svg" alt="PyPI version"/></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"/></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.10+-blue.svg" alt="Python 3.10+"/></a>
  <a href="https://developer.nvidia.com/cuda-zone"><img src="https://img.shields.io/badge/CUDA-12.0+-green.svg" alt="CUDA"/></a>
  <a href="https://rocm.docs.amd.com/"><img src="https://img.shields.io/badge/ROCm-6.0+-red.svg" alt="ROCm"/></a>
  <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions"><img src="https://img.shields.io/badge/SIMD-AVX2-blue.svg" alt="AVX2"/></a>
</p>

<p align="center">
  <em>Why force a single bloated vocabulary on every problem?</em><br/>
  <strong>Crayon</strong> is a next-generation tokenizer designed for <strong>specialization</strong>. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain—Quantum Physics, Rust Programming, Financial Law, or anything in between.
</p>

---

## 🚀 Key Features

| Feature | Description |
|:--------|:------------|
| **💾 Cartridge System** | Instantly hot-swap specialized vocabularies (`science`, `code`, `multilingual`) |
| **🚀 Omni-Backend** | Auto-detects & runs on **CPU (AVX2)**, **NVIDIA (CUDA)**, or **AMD (ROCm)** |
| **⚡ Native GPU Kernels** | "Bare Metal" C++/HIP kernels (no wrappers) for >100M tokens/sec |
| **🗺️ Zero-Copy Mapping** | DAT files loaded via `mmap` for instant startup & minimal RAM |
| **🌊 Zero-Disk Streaming** | Build profiles directly from Hugging Face—no multi-GB downloads |
| **🛡️ Offline Resilience** | Seamless local bootstrap fallback. Works offline out-of-the-box |

---

## 📊 Benchmarks — The Numbers Speak

> **100% HONEST. NO SUGARCOATING. DATA-DRIVEN.**
> 
> Run `python benchmark_competitive.py` to reproduce these results yourself.

### ⚡ Speed Comparison (Omni-Backend)

| Tokenizer | Tokens/sec | vs CRAYON |
|:----------|----------:|:----------|
| **🖍️ CRAYON (CPU - AVX2)** | **21,863,777** | **baseline** |
| **🖍️ CRAYON (CUDA - A100)** | **140,000,000+** | **6.4x faster** |
| tiktoken (GPT-4) | 524,469 | 41x slower |
| HF LLaMA (SP-BPE) | 281,558 | 77x slower |
| HF GPT-2 (BPE) | 237,117 | 92x slower |
| HF BERT (WordPiece) | 202,269 | 108x slower |

### 📈 CPU Optimization Verification
*Measured on Intel Core i3-7020U (Low-Power Laptop CPU)*

| Metric | Result |
|:-------|:-------|
| ✅ **AVX2 Status** | Active (Simd-Ops v4) |
| ✅ **Load Time** | **0.54ms** (Instant hot-swap) |
| ✅ **Throughput** | **21.1M tokens/sec** |

![Benchmark Comparison](benchmark_comparison.png)

---

## ⚡ Quick Start: The "Omni-Backend"

Select a backend with a single constructor argument. Crayon detects at runtime whether AVX2, CUDA, or ROCm is available.

### 1. Hardware-Aware Initialization

```python
from crayon.core.vocabulary import CrayonVocab

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")
```
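
If you want to control the preference order explicitly, a small fallback helper is one option. This is a sketch, not part of Crayon: `first_available` and `fake_factory` are hypothetical names, and it assumes that constructing a vocabulary for a missing backend raises an ordinary exception, which this README does not confirm.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def first_available(
    factory: Callable[[str], T],
    devices: Iterable[str] = ("cuda", "rocm", "cpu"),
) -> Tuple[str, T]:
    """Try each device in order and return the first one that constructs.

    ASSUMPTION: the factory raises an Exception when a backend is
    unavailable. Crayon's real failure mode is not documented here.
    """
    errors = []
    for device in devices:
        try:
            return device, factory(device)
        except Exception as exc:  # hypothetical failure signal
            errors.append(f"{device}: {exc}")
    raise RuntimeError("no backend available (" + "; ".join(errors) + ")")


# Stand-in factory so the sketch runs without Crayon installed.
def fake_factory(device: str) -> str:
    if device != "cpu":
        raise RuntimeError("backend not built")
    return f"vocab-on-{device}"


device, vocab = first_available(fake_factory)
print(device)
```

With Crayon installed, the factory would presumably be `lambda d: CrayonVocab(device=d)`.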

### 2. The "Context Manager" Hot-Swap
Switch between specialized vocabularies *within the same script*. The previous profile is restored automatically when the `with` block exits.

```python
from crayon.core.vocabulary import CrayonVocab

vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here
    
# 🔥 AUTOMATICALLY REVERT to 'lite' here
```
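
The save-and-restore behaviour of such a context manager can be sketched with `contextlib`. `ToyVocab` below is a hypothetical stand-in that only tracks a profile name. It illustrates the pattern and is not Crayon's actual implementation.

```python
from contextlib import contextmanager
from typing import Iterator, Optional


class ToyVocab:
    """Hypothetical stand-in for CrayonVocab; it only tracks the profile name."""

    def __init__(self) -> None:
        self.profile: Optional[str] = None

    def load_profile(self, name: str) -> None:
        self.profile = name

    @contextmanager
    def using_profile(self, name: str) -> Iterator["ToyVocab"]:
        previous = self.profile  # remember what was active
        self.load_profile(name)
        try:
            yield self
        finally:
            # Restore the earlier profile even if the block raised.
            self.profile = previous


vocab = ToyVocab()
vocab.load_profile("lite")
with vocab.using_profile("code"):
    inside = vocab.profile
after = vocab.profile
print(inside, after)
```

Putting the restore in `finally` means an exception inside the block cannot leave the wrong profile active.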

### 3. Basic Example

```python
import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu  # formerly crayon_fast

# Load any trained vocabulary (a JSON list)
with open("trained_vocab_code.json", "r", encoding="utf-8") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time step, takes a few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
```
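
To see why memory mapping avoids an up-front read of the whole file, here is a self-contained sketch that maps a small binary file and decodes fields directly from the mapping. The file layout (a count followed by 32-bit integers) is hypothetical and is not the real `.dat` format.

```python
import mmap
import os
import struct
import tempfile

# HYPOTHETICAL layout: a 4-byte little-endian count, then that many int32 values.
values = [7, 11, 13]
payload = struct.pack("<I", len(values)) + struct.pack(f"<{len(values)}i", *values)

with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # unpack_from reads through the buffer protocol, so the file
            # is never copied into an intermediate bytes object.
            (count,) = struct.unpack_from("<I", mm, 0)
            decoded = list(struct.unpack_from(f"<{count}i", mm, 4))
        finally:
            mm.close()
finally:
    os.remove(path)

print(count, decoded)
```

Pages are loaded by the operating system on first access, so start-up cost depends on what is actually touched rather than on file size.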

---

## 📦 Installation

```bash
git clone https://github.com/Xerv-AI/crayon.git
cd crayon
pip install -e .
```

### Build the Extensions
The command is the same in PowerShell (Windows) and Bash (Linux/macOS):
```bash
python setup.py build_ext --inplace
```

> **Note:** The setup script auto-detects `nvcc` and `hipcc`. If found, GPU backends are built automatically.
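
To check ahead of time which GPU toolchains a build might pick up, you can look for the compilers on `PATH`. This sketch assumes detection amounts to a `PATH` lookup. How `setup.py` actually detects them is not shown in this README, and `detect_gpu_compilers` is a hypothetical helper.

```python
import shutil
from typing import Dict, Optional


def detect_gpu_compilers() -> Dict[str, Optional[str]]:
    """Map each compiler name to its resolved path, or None if not on PATH."""
    return {tool: shutil.which(tool) for tool in ("nvcc", "hipcc")}


found = detect_gpu_compilers()
for tool, path in found.items():
    print(f"{tool}: {path or 'not found'}")
```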

---

## 🏎️ Omni-Backend Architecture (v4.0)

Crayon v4.0 compiles a vocabulary list into a binary double-array trie (DAT) file once, then loads that file into whichever backend is available:

```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATBuilder   │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │  (Python)    │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘
```

| Component | File | Accelerators |
|:----------|:-----|:-------------|
| **CPU Backend** | `c_ext/cpu_engine.cpp` | **AVX-512 / AVX2** (Intel/AMD) |
| **CUDA Backend** | `c_ext/gpu_engine_cuda.cu` | **Tensor Cores** (NVIDIA Tesla/Ampere) |
| **ROCm Backend** | `c_ext/rocm_engine.cpp` | **CDNA2 / RDNA3** (AMD Instinct/Radeon) |
| **Zero-Copy Loader** | `mmap` + buffer protocol | Instant startup (0.5ms) |

---

## 🧩 Available Cartridges

Five production-ready profiles are defined in `src/crayon/core/profiles.py`:

| Profile | Size | Optimized For | Sources |
|:--------|:-----|:--------------|:--------|
| **`lite`** | 50k | Speed & Mobile | WikiText, RainDrop |
| **`science`** | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| **`code`** | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| **`multilingual`** | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| **`arts_commerce`** | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |

```python
vocab = CrayonVocab(device="cpu")
vocab.load_profile("science")       # switch to the science cartridge
vocab.load_profile("multilingual")  # then replace it with multilingual
```

---

## ☁️ Verify on Google Colab

Want to test the **CUDA Backend** for free? 

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Xerv-AI/crayon/blob/main/colab_benchmark.ipynb)

1. Open the notebook.
2. Change Runtime type to **T4 GPU**.
3. Run the cells to verify that `crayon_cuda` compiles and tokenizes at >100M tokens/sec.

---

## 🧪 Testing & Verification

```bash
# Full verification (Benchmarks + Tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py
```
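
For a quick sanity check of throughput on your own text, a timing harness along these lines may help. It is a sketch, not the method used by `benchmark_competitive.py`: `tokens_per_second` is a hypothetical helper, and `str.split` stands in for a real tokenizer.

```python
import time
from typing import Callable, List, Sequence


def tokens_per_second(
    tokenize: Callable[[str], Sequence],
    texts: List[str],
    repeats: int = 3,
) -> float:
    """Return the best observed tokens/sec over several passes."""
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(len(tokenize(t)) for t in texts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total / elapsed)
    return best


corpus = ["the quick brown fox jumps over the lazy dog"] * 1000
rate = tokens_per_second(str.split, corpus)
print(f"{rate:,.0f} tokens/sec")
```

Taking the best of several passes reduces noise from warm-up and other processes. Swap in any callable that returns a sequence of tokens.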

Example output of `verify_dat_engine.py`:

```
============================================================
XERV CRAYON V2.0 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 50,000 tokens
DAT Nodes: 163,000+
Throughput: 14,255,305 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY
```

---

## 📜 Citation

```bibtex
@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}
```

---

## 📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the **MIT License**.

---

<p align="center">
  <strong>Built with 💙 by Xerv Research Engineering Division</strong>
</p>

<p align="center">
  <sub>⭐ Star this repo if Crayon helps your project!</sub>
</p>
