Metadata-Version: 2.4
Name: xerv-crayon
Version: 5.2.1
Summary: Omni-Backend Tokenizer - CPU (AVX2/512), CUDA (NVIDIA), ROCm (AMD) with automatic hardware detection
Author-email: Xerv Research Engineering Division <engineering@xerv.ai>
License: MIT License
        
        Copyright (c) 2025 Xerv Research Engineering Division
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
Project-URL: Homepage, https://github.com/Electroiscoding/CRAYON
Project-URL: Repository, https://github.com/Electroiscoding/CRAYON.git
Project-URL: Documentation, https://github.com/Electroiscoding/CRAYON#readme
Project-URL: Bug Tracker, https://github.com/Electroiscoding/CRAYON/issues
Keywords: tokenizer,nlp,simd,avx2,avx512,cuda,rocm,hip,gpu,high-performance,zero-copy,dat,double-array-trie,machine-learning,deep-learning,transformers,llm,nvcuda,amd,nvidia
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: C
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS
Classifier: Environment :: GPU :: NVIDIA CUDA
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: full
Requires-Dist: requests>=2.31.0; extra == "full"
Requires-Dist: datasets>=2.18.0; extra == "full"
Requires-Dist: huggingface-hub>=0.21.0; extra == "full"
Provides-Extra: cuda
Requires-Dist: torch>=2.0.0; extra == "cuda"
Requires-Dist: torchvision>=0.15.0; extra == "cuda"
Requires-Dist: torchaudio>=2.0.0; extra == "cuda"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: torch>=2.0.0; extra == "dev"
Provides-Extra: benchmark
Requires-Dist: tiktoken>=0.5.0; extra == "benchmark"
Requires-Dist: transformers>=4.30.0; extra == "benchmark"
Requires-Dist: matplotlib>=3.7.0; extra == "benchmark"
Dynamic: license-file
Dynamic: requires-python

<p align="center">
  <img src="https://em-content.zobj.net/source/microsoft-teams/363/crayon_1f58d-fe0f.png" width="120" alt="Crayon Logo"/>
</p>

<h1 align="center">🖍️ XERV Crayon v5.2.1</h1>

<p align="center">
  <strong>The Omni-Backend Tokenizer for Specialized AI</strong>
</p>

<p align="center">
  <a href="https://badge.fury.io/py/xerv-crayon"><img src="https://badge.fury.io/py/xerv-crayon.svg" alt="PyPI version"/></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT"/></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.12+-blue.svg" alt="Python 3.12+"/></a>
  <a href="https://developer.nvidia.com/cuda-zone"><img src="https://img.shields.io/badge/CUDA-12.0+-green.svg" alt="CUDA"/></a>
  <a href="https://rocm.docs.amd.com/"><img src="https://img.shields.io/badge/ROCm-6.0+-red.svg" alt="ROCm"/></a>
  <a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions"><img src="https://img.shields.io/badge/SIMD-AVX2-blue.svg" alt="AVX2"/></a>
</p>

<p align="center">
  <em>Why force a single bloated vocabulary on every problem?</em><br/>
  <strong>Crayon</strong> is a next-generation tokenizer designed for <strong>specialization</strong>. Hot-swap vocabulary profiles ("Cartridges") optimized for your domain—Quantum Physics, Rust Programming, Financial Law, or anything in between.
</p>

---

## 🚀 Key Features

| Feature | Description |
|:--------|:------------|
| **💾 Cartridge System** | Instantly hot-swap specialized vocabularies (`science`, `code`, `multilingual`) |
| **🚀 Omni-Backend** | Auto-detects & runs on **CPU (AVX2)**, **NVIDIA (CUDA)**, or **AMD (ROCm)** |
| **⚡ Hyper-Fast Trainer** | C++17 Linked-List BPE trains vocabularies in **seconds** (100x faster) |
| **⚡ Native GPU Kernels** | "Bare Metal" C++/CUDA/HIP kernels (no wrappers) for >10M tokens/sec |
| **🗺️ Zero-Copy Mapping** | DAT files loaded via `mmap` for instant startup & minimal RAM |
| **🌊 Zero-Disk Streaming** | Build profiles directly from Hugging Face—no multi-GB downloads |
| **🛡️ Offline Resilience** | Seamless local bootstrap fallback. Works offline out-of-the-box |

---

## 📊 Benchmarks — Production Results

> **DATA-DRIVEN. NO HYPE. 100% VERIFIED.**

### 🔥 CPU Performance (Intel i3-7020U AVX2)

Even on modest consumer hardware, Crayon's SIMD-accelerated engine is **67x to 209x** faster than the other tokenizers in the table below.

| Tokenizer | Tokens/Sec | Relative to CRAYON (Science) |
|:----------|-----------:|------------------:|
| **CRAYON (Science)** | **40,808,299** | **1.0x** (Baseline) |
| **CRAYON (Code)** | **34,742,588** | **1.2x** slower |
| Tiktoken (GPT-4) | 608,610 | **67.0x** slower |
| HF LLaMA | 343,282 | **118.8x** slower |
| HF GPT-2 | 307,563 | **132.6x** slower |
| HF BERT | 195,108 | **209.1x** slower |

### ⚡ GPU Performance (Tesla T4)

### ⚡ Installation Summary (T4 GPU Environment)

```
======================================================================
XERV CRAYON V4.1.9 INSTALLATION AND BENCHMARKS
======================================================================
[1/7] Checking environment...
      PyTorch: 2.9.0+cu126
      CUDA: 12.6 (Tesla T4)
      * Smart Build: Will compile ONLY for this GPU architecture
      NVCC: /usr/local/cuda/bin/nvcc

[2/7] Installing build dependencies...
      Done (ninja, packaging, wheel)

[3/7] Cleaning previous installations...

[4/7] Cloning source code...
      __version__ = "4.1.9"

[5/7] Compiling and Installing (Streaming Logs)...
----------------------------------------------------------------------
[CRAYON-BUILD] Detected GPU: SM 7.5 -> Compiling for sm_75 ONLY
[CRAYON-BUILD] Configuring CUDA extension (max_jobs=1)

building 'crayon.c_ext.crayon_cpu' extension
[1/1] c++ -O3 -march=native -mavx2 -fPIC -std=c++17
Successfully built crayon_cpu.so

building 'crayon.c_ext.crayon_cuda' extension
[1/1] nvcc -O3 -std=c++17 --expt-relaxed-constexpr -gencode=arch=compute_75,code=sm_75
Successfully built crayon_cuda.so

Successfully installed xerv-crayon-4.1.9
----------------------------------------------------------------------

[6/7] Verifying installation...
      Success! Installed version: 4.1.9
      Backends: {'cpu': True, 'cuda': True, 'rocm': False}
```

### 🔥 Performance Results (T4 GPU vs Tiktoken)

**CRAYON (CUDA Backend - Tesla T4):**
```
Active Device: CUDA
Backend: cuda_extension

Batch Throughput (XERV CRAYON):
     1,000 docs:      748,048 docs/sec |      9,724,621 tokens/sec
    10,000 docs:      639,239 docs/sec |      8,310,109 tokens/sec
    50,000 docs:      781,129 docs/sec |     10,154,678 tokens/sec
```

**Tiktoken (cl100k_base - CPU):**
```
Tiktoken Batch Throughput (cl100k_base encoding):
     1,000 docs:       87,307 docs/sec |        873,068 tokens/sec
    10,000 docs:       81,658 docs/sec |        816,576 tokens/sec
    50,000 docs:      107,583 docs/sec |      1,075,829 tokens/sec
```

### 📈 Performance Comparison Table

| Batch Size | CRAYON Docs/Sec | CRAYON Tokens/Sec | Tiktoken Docs/Sec | Tiktoken Tokens/Sec | **Speedup** |
|:-----------|----------------:|------------------:|------------------:|--------------------:|------------:|
| 1,000      | 748,048         | 9,724,621         | 87,307            | 873,068             | **11.1x** ✨ |
| 10,000     | 639,239         | 8,310,109         | 81,658            | 816,576             | **10.2x** ✨ |
| 50,000     | 781,129         | 10,154,678        | 107,583           | 1,075,829           | **9.4x** ✨ |

**Average speedup: 10.2x. CRAYON ran on the Tesla T4 (CUDA backend); tiktoken ran on CPU.**

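The speedup column and the average can be reproduced from the tokens/sec figures in the two logs. A minimal sketch, assuming (as the table appears to) that each ratio is rounded to one decimal place before averaging; averaging the unrounded ratios instead gives roughly 10.25:

```python
# Tokens/sec figures copied from the CRAYON and tiktoken logs above.
crayon = {1_000: 9_724_621, 10_000: 8_310_109, 50_000: 10_154_678}
tiktoken = {1_000: 873_068, 10_000: 816_576, 50_000: 1_075_829}

# Per-batch speedup, rounded to one decimal as in the table.
speedups = {n: round(crayon[n] / tiktoken[n], 1) for n in crayon}
for n, s in speedups.items():
    print(f"{n:>6,} docs: {s}x")

# Mean of the rounded per-batch values.
average = round(sum(speedups.values()) / len(speedups), 1)
print(f"average: {average}x")
```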
### 🎯 Key Achievements

- ✅ **Up to 10.15M tokens/sec** on a mid-tier GPU (Tesla T4, 50,000-document batch)
- ✅ **Smart compilation** - Only builds for detected GPU architecture
- ✅ **Zero-copy memory mapping** - Instant profile loading (<1ms)
- ✅ **Production-grade stability** - Handles 50K+ document batches
- ✅ **Consistent performance** - 8.3M to 10.2M tokens/sec across the three batch sizes

---

## ⚡ Quick Start: The "Omni-Backend"

Run on **any hardware** with a single line of code. Crayon automatically detects whether AVX2, CUDA, or ROCm is available.

### 1. Hardware-Aware Initialization

```python
from crayon.core.vocabulary import CrayonVocab

# 🔵 CPU (Intel/AMD) - AVX2/AVX-512 Native
vocab = CrayonVocab(device="cpu")

# 🟢 NVIDIA GPUs (All Tensor Core Architectures)
vocab = CrayonVocab(device="cuda")

# 🔴 AMD GPUs (Instinct/Radeon HIP/ROCm)
vocab = CrayonVocab(device="rocm")
```

### 2. The "Context Manager" Hot-Swap
Instantly switch between specialized vocabularies *within the same script* without reloading the model.

```python
vocab = CrayonVocab(device="cpu")
vocab.load_profile("lite")

# ... standard tokenization ...

# ⚡ TEMPORARY SWITCH to 'code' profile for a function block
with vocab.using_profile("code"):
    tokens = vocab.tokenize("def fast_inverse_sqrt(x):")
    # Uses the compact Code vocabulary here
    
# 🔥 AUTOMATICALLY REVERT to 'lite' here
```
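One common way to get the revert-on-exit behaviour shown above is `contextlib.contextmanager` with a `try`/`finally`. The class below is a hypothetical stand-in, not Crayon's `CrayonVocab`; only the method names `load_profile` and `using_profile` are taken from the example.

```python
from contextlib import contextmanager


class ProfileSwitcher:
    """Hypothetical stand-in for a tokenizer with one active profile."""

    def __init__(self, profile="lite"):
        self.profile = profile

    def load_profile(self, name):
        self.profile = name

    @contextmanager
    def using_profile(self, name):
        previous = self.profile      # remember what was active
        self.load_profile(name)      # switch for the duration of the block
        try:
            yield self
        finally:
            self.load_profile(previous)  # restore even if the block raises


switcher = ProfileSwitcher("lite")
with switcher.using_profile("code"):
    inside = switcher.profile
after = switcher.profile
print(inside, after)  # code lite
```

The `finally` clause is what makes the revert reliable: the previous profile comes back even when the code inside the `with` block raises.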

### 3. Basic Example

```python
import json
import mmap
from crayon.c_ext.dat_builder import DATBuilder
from crayon.c_ext import crayon_cpu # Auto-renamed from crayon_fast

# Load any trained vocabulary
with open("trained_vocab_code.json", "r") as f:
    vocab_list = json.load(f)

# Compile to DAT (one-time, few seconds)
builder = DATBuilder()
builder.build(vocab_list)
builder.save("vocab_code.dat")

# Load into C++ engine via memory mapping (instant, <1ms)
with open("vocab_code.dat", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    crayon_cpu.load_dat(mm)

# Ultra-fast tokenization 🚀
code = 'fn main() { println!("Hello, World!"); }'
tokens = crayon_cpu.tokenize(code)
print(f"Tokens: {tokens}")
```
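The `mmap` step above avoids reading the whole file into a Python object: the operating system pages data in on demand, and fields are read straight from the mapping. A small self-contained sketch of that idea using only the standard library; the file layout here (a count followed by 32-bit integers) is invented for the demo and is not the real `.dat` format.

```python
import mmap
import os
import struct
import tempfile

values = [7, 11, 13, 17]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.dat")
    # Demo layout (an assumption): uint32 count, then `count` int32 values.
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(values)))
        f.write(struct.pack(f"<{len(values)}i", *values))

    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Read individual fields at byte offsets directly from the
            # mapping, without first copying the file into a bytes object.
            (count,) = struct.unpack_from("<I", mm, 0)
            (third,) = struct.unpack_from("<i", mm, 4 + 2 * 4)

print(count, third)  # 4 13
```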

---

## 📦 Installation

```bash
pip install xerv-crayon
```

### Google Colab / Linux Installation
Because Crayon includes high-performance C++ extensions, `pip` compiles them for your environment during installation:

```python
# Run this in a Colab cell
!pip install xerv-crayon
```

### Build the Extensions
PowerShell (Windows):
```powershell
python setup.py build_ext --inplace
```
Bash (Linux/Mac):
```bash
python setup.py build_ext --inplace
```

> **Note:** The setup script auto-detects `nvcc` and `hipcc`. If found, GPU backends are built automatically.

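A toolchain check of this kind can be sketched with `shutil.which`, which looks up an executable on `PATH`. This illustrates the idea only and is not the logic in Crayon's `setup.py`; the function name is hypothetical, and the dictionary keys simply mirror the `Backends:` line in the installation log.

```python
import shutil


def detect_toolchains():
    """Report which GPU compilers are on PATH; the CPU build is always possible."""
    return {
        "cpu": True,
        "cuda": shutil.which("nvcc") is not None,   # NVIDIA CUDA compiler
        "rocm": shutil.which("hipcc") is not None,  # AMD HIP compiler
    }


backends = detect_toolchains()
print(backends)
```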
---

## 🏎️ Omni-Backend Architecture (v4.0)

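Crayon now uses a **"God Tier"** multi-backend implementation that compiles a vocabulary list into a binary DAT file and loads it into whichever engine matches the hardware: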

```
┌─────────────┐      ┌──────────────┐      ┌─────────────┐      ┌──────────────┐
│ vocab.json  │ ──▶  │ DATCompiler  │ ──▶  │  vocab.dat  │ ──▶  │ Omni-Engine  │
│   (List)    │      │ (C++ Fast)   │      │  (Binary)   │      │ CPU/CUDA/HIP │
└─────────────┘      └──────────────┘      └─────────────┘      └──────────────┘
```
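The `vocab.json` to `vocab.dat` step in the diagram compiles a token list into a double-array trie (DAT). The sketch below is a plain-Python illustration of that data structure, not Crayon's C++ compiler: the `base`/`check`/`value` array names, the byte-level alphabet, and the greedy longest-match lookup are all assumptions made for the example.

```python
from collections import deque


def build_dat(vocab):
    """Compile a list of token strings into base/check/value arrays."""
    # 1. Build an ordinary byte-level trie; token id = position in vocab.
    root = {"kids": {}, "id": -1}
    for token_id, token in enumerate(vocab):
        node = root
        for byte in token.encode("utf-8"):
            node = node["kids"].setdefault(byte, {"kids": {}, "id": -1})
        node["id"] = token_id

    # 2. Pack it into arrays. State 0 is the root; check[t] == s means
    #    "slot t is a child of state s", and -1 marks a free slot.
    base, check, value = [0], [0], [-1]

    def grow(size):
        while len(base) < size:
            base.append(0)
            check.append(-1)
            value.append(-1)

    queue = deque([(root, 0)])
    while queue:
        node, state = queue.popleft()
        if not node["kids"]:
            continue
        codes = [byte + 1 for byte in node["kids"]]  # codes start at 1
        offset = 1
        while True:  # first offset whose target slots are all free
            grow(offset + max(codes) + 1)
            if all(check[offset + c] == -1 for c in codes):
                break
            offset += 1
        base[state] = offset
        for byte, child in node["kids"].items():
            slot = offset + byte + 1
            check[slot] = state
            value[slot] = child["id"]
            queue.append((child, slot))
    return base, check, value


def tokenize(text, dat, unk=-1):
    """Greedy longest-match: one array lookup per input byte."""
    base, check, value = dat
    data = text.encode("utf-8")
    out, i = [], 0
    while i < len(data):
        state, best_id, best_len = 0, unk, 1
        for j in range(i, len(data)):
            slot = base[state] + data[j] + 1
            if slot >= len(check) or check[slot] != state:
                break  # no transition on this byte
            state = slot
            if value[state] != -1:
                best_id, best_len = value[state], j - i + 1
        out.append(best_id)
        i += best_len
    return out


dat = build_dat(["de", "def", " ", "f", "x"])
print(tokenize("def x", dat))  # [1, 2, 4]
```

Because every transition is an index computation plus one comparison, the three arrays can be written to disk as flat binary and used in place after memory mapping, which is what makes this layout a natural fit for a zero-copy loader.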

| Component | File | Accelerators |
|:----------|:-----|:-------------|
| **CPU Backend** | `c_ext/cpu_engine.cpp` | **AVX-512 / AVX2** (Intel/AMD) |
| **CUDA Backend** | `c_ext/gpu_engine_cuda.cu` | **Tensor Cores** (NVIDIA Tesla/Ampere) |
| **ROCm Backend** | `c_ext/rocm_engine.cpp` | **CDNA2 / RDNA3** (AMD Instinct/Radeon) |
| **Zero-Copy Loader** | `mmap` + buffer protocol | Instant startup (0.5ms) |

---

## 🧩 Available Cartridges

6 production-ready profiles defined in `src/crayon/core/profiles.py`:

| Profile | Size | Optimized For | Sources |
|:--------|:-----|:--------------|:--------|
| **`standard`** | 57k | **General English (V5 Default)** | Lite + Top 10k subwords |
| **`lite`** | 50k | Speed & Mobile | WikiText, RainDrop |
| **`science`** | 250k | Reasoning (LaTeX, Quantum, Grad Math) | GRAD, Physics-700 |
| **`code`** | 250k | Syntax (Python, Rust, C++, JS) | CodeParrot, The Stack |
| **`multilingual`** | 250k | Global (EU langs, Chinese, Hindi) | OSCAR, Wikipedia |
| **`arts_commerce`** | 250k | Business (Legal, Finance, Lit) | PG19, Fin Phrasebank |

```python
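from crayon import CrayonVocab
vocab = CrayonVocab(device="auto")
vocab.load_profile("science")       # instance method, as in the Quick Start
vocab.load_profile("multilingual")  # replaces the active profile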
```

---

## ☁️ Verify on Google Colab


### ✅ Quick Verify Snippet

```python
from crayon import CrayonVocab

# Initialize with Auto-Backend (AVX2/CUDA/ROCm)
tokenizer = CrayonVocab(device="auto")

# 1. Test Standard subword-heavy profile
tokenizer.load_profile("standard")
print(tokenizer.tokenize("that is a test for the standard profile"))

# 2. Test Code specialized profile
tokenizer.load_profile("code")
print(tokenizer.tokenize("def fast_inverse_sqrt(x):"))
```
---

## 🧪 Testing & Verification

```bash
# Full verification (Benchmarks + Tests)
python verify_dat_engine.py

# Benchmark all backends
python benchmark_competitive.py
```

```
============================================================
XERV CRAYON V4.1.9 - HYPER-PRODUCTION DAT ENGINE VERIFICATION
============================================================
Vocabulary Size: 250,000 tokens
DAT Nodes: 370,000+
Throughput: 40,808,299 tokens/sec
STATUS: ✅ HYPER-PRODUCTION READY
```
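A tokens/sec figure like the one above can be measured with `time.perf_counter`. The harness below is a generic sketch, not the contents of `verify_dat_engine.py` or `benchmark_competitive.py`; the function name is hypothetical and `str.split` is only a placeholder for a real tokenizer callable.

```python
import time


def tokens_per_second(tokenize, docs, repeats=3):
    """Best-of-N throughput for any callable mapping str -> list of tokens."""
    best = float("inf")
    total = 0
    for _ in range(repeats):
        start = time.perf_counter()
        total = sum(len(tokenize(doc)) for doc in docs)
        best = min(best, time.perf_counter() - start)
    return total / best


docs = ["def fast_inverse_sqrt(x):"] * 10_000
rate = tokens_per_second(str.split, docs)  # placeholder tokenizer
print(f"{rate:,.0f} tokens/sec")
```

Taking the best of several runs reduces noise from warm-up and background load; results depend on the machine, so no expected output is shown.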

---

## 📜 Citation

```bibtex
@techreport{xerv2026crayon,
  title={XERV Crayon: A First-Principles Analysis of Production-Grade Tokenization},
  author={Pal, Soham and Xerv Research},
  year={2026},
  institution={Xerv Research Engineering Division}
}
```

---

## 📄 License

Copyright (c) 2025-2026 Xerv Research. Released under the **MIT License**.

---

<p align="center">
  <strong>Built with 💙 by Xerv Research Engineering Division</strong>
</p>

<p align="center">
  <sub>⭐ Star this repo if Crayon helps your project!</sub>
</p>
