Metadata-Version: 2.4
Name: gearbx
Version: 1.0.1
Summary: Entropy-Routed Dynamic Quantization for LLM Inference
Author: Jpdz Labs
License: Gearbx Personal Use License
        Copyright (c) 2024-2026 Jpdz Labs. All rights reserved.
        
        1. Definitions
           "Software" means the Gearbx source code, binaries, models, documentation,
           and any associated materials in this repository.
           "Licensor" means Jpdz Labs.
           "You" means the individual end user accepting this license.
           "Personal Use" means use by a single natural person for private,
           non-commercial purposes, including local experimentation, learning,
           research, and personal projects that are not distributed for profit.
        
        2. Grant of License
           Subject to the terms below, Licensor grants You a worldwide, royalty-free,
           non-exclusive, non-transferable, revocable license to download, install,
           run, and modify the Software solely for Personal Use.
        
        3. Restrictions
           You may NOT, without prior written permission from Jpdz Labs:
           (a) use the Software, in whole or in part, for any commercial purpose,
               including but not limited to providing paid services, hosted offerings,
               SaaS, internal business operations, or revenue-generating workflows;
           (b) redistribute, sublicense, sell, rent, lease, or publish the Software
               or any derivative work, in source or binary form, to any third party;
           (c) remove, alter, or obscure any copyright, trademark, or attribution
               notices contained in the Software;
           (d) use the name "Gearbx" or "Jpdz Labs" to endorse or promote derivative
               works without prior written permission;
           (e) use the Software to train, fine-tune, or evaluate competing products
               offered commercially.
        
        4. Attribution
           Any public discussion, blog post, paper, or demonstration that includes
           results produced with the Software must credit "Gearbx by Jpdz Labs" and,
           where practical, link to the upstream repository.
        
        5. Commercial Licensing
           For any use outside the scope of Personal Use, including commercial,
           enterprise, or production deployments, You must obtain a separate
           commercial license from Jpdz Labs. Contact: licensing@jpdz.app
        
        6. Ownership
           The Software is licensed, not sold. Licensor retains all right, title,
           and interest in and to the Software, including all intellectual property
           rights. No rights are granted to You other than as expressly stated here.
        
        7. Termination
           This license terminates automatically if You breach any of its terms.
           Upon termination You must cease all use and destroy all copies of the
           Software in your possession.
        
        8. Warranty Disclaimer
           THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
           OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
           FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT.
        
        9. Limitation of Liability
           IN NO EVENT SHALL JPDZ LABS OR ITS CONTRIBUTORS BE LIABLE FOR ANY CLAIM,
           DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR
           OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR
           THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
        10. Governing Law
            This license shall be governed by and construed in accordance with the
            laws applicable to the principal place of business of Jpdz Labs, without
            regard to conflict-of-law principles.
        
        By downloading, installing, or using the Software, You agree to the terms
        of this Gearbx Personal Use License.
        
Project-URL: Homepage, https://gearbx.jpdz.app
Keywords: llm,quantization,inference,entropy,dynamic-precision
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: accelerate>=0.28.0
Provides-Extra: cuda
Requires-Dist: bitsandbytes>=0.43.0; extra == "cuda"
Provides-Extra: mlx
Requires-Dist: mlx>=0.12.0; extra == "mlx"
Requires-Dist: mlx-lm>=0.12.0; extra == "mlx"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: datasets>=2.18.0; extra == "dev"
Requires-Dist: matplotlib>=3.8.0; extra == "dev"
Requires-Dist: tqdm>=4.66.0; extra == "dev"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/ratnam1510/gearbx/main/gearbx-icon.svg" alt="Gearbx" width="96" height="96">
</p>

<h1 align="center">Gearbx: Entropy-Routed Dynamic Quantization</h1>

An entropy-routed dynamic quantization engine for LLM inference that adjusts weight precision on a per-token basis during generation.

**Website:** [gearbx.jpdz.app](https://gearbx.jpdz.app)

## How It Works

Not all tokens require the same computational fidelity. Producing "Hello, how can I help you?" demands almost no semantic reasoning, while the next token in a partial differential equation imposes high cognitive load on the model.

Gearbx monitors the model's own output entropy at each generation step and routes subsequent forward passes through pre-cached layer weights at the appropriate precision tier:

| Gear | Precision | Memory per Param | When |
|------|-----------|-----------------|------|
| `low` | 4-bit packed | 0.5 bytes (25% of fp16) | Filler tokens, greetings, boilerplate |
| `mid` | 8-bit packed | 1 byte (50% of fp16) | Standard reasoning, moderate difficulty |
| `high` | fp16 | 2 bytes (original) | Complex reasoning, math, proofs |

Autoregressive LLM inference is memory-bandwidth bound: every generated token requires reading the model weights from memory. Packed storage means fewer bytes transferred per token, so in the bandwidth-bound regime the achievable speedup for the managed layers scales roughly with the compression ratio.
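
As a rough illustration of the bandwidth argument, the sketch below computes the weight traffic per generated token from the bytes-per-parameter figures in the table. It assumes every parameter is read exactly once per token and ignores quantization scales, KV cache reads, and activations; the 7B parameter count is only an example, and the function names are hypothetical rather than part of the Gearbx API.

```python
# Back-of-the-envelope weight traffic per generated token.
# Assumption: each parameter is read once per token; scales, KV cache,
# and activation traffic are ignored.
BYTES_PER_PARAM = {"low": 0.5, "mid": 1.0, "high": 2.0}  # from the gear table


def weight_bytes_per_token(num_params: float, gear: str) -> float:
    """Bytes of weight data read to produce one token at the given gear."""
    return num_params * BYTES_PER_PARAM[gear]


def ideal_speedup_vs_fp16(gear: str) -> float:
    """Upper bound on speedup if inference were purely bandwidth bound."""
    return BYTES_PER_PARAM["high"] / BYTES_PER_PARAM[gear]


if __name__ == "__main__":
    num_params = 7e9  # illustrative 7B-parameter model
    for gear in ("low", "mid", "high"):
        gigabytes = weight_bytes_per_token(num_params, gear) / 1e9
        speedup = ideal_speedup_vs_fp16(gear)
        print(f"{gear}: {gigabytes:.1f} GB per token, ideal speedup {speedup:.1f}x")
```

These are upper bounds: only the layers that Gearbx actually manages change precision, so the end-to-end gain will be smaller than the per-layer ratio.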

## Architecture

```
Prompt → [PromptClassifier] → initial_gear
                │
    ┌───────────▼──────────────────────────┐
    │       GENERATION LOOP                │
    │                                      │
    │  input_ids → model.forward() → logits│
    │                    │                 │
    │          [EntropyMonitor]            │
    │      Shannon entropy → gear decision │
    │           (rolling avg + hysteresis) │
    │                    │                 │
    │         [PrecisionManager]           │
    │     lazy quantize + module swap      │
    │     (originals offloaded to CPU)     │
    │                    │                 │
    │          sample next_token           │
    └──────────────────────────────────────┘
                │
    generated_ids → decode → output_text
```
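
The entropy-to-gear step in the diagram can be sketched in plain Python as follows. This is a minimal illustration, not the code in `monitor.py`: the class name `ToyEntropyRouter`, the `window` and `margin` parameters, and the exact hysteresis rule are assumptions made for the example.

```python
import math
from collections import deque

GEARS = ["low", "mid", "high"]


def shannon_entropy_bits(logits):
    """Shannon entropy (base 2) of softmax(logits), computed stably."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return -sum((e / total) * math.log2(e / total) for e in exps if e > 0.0)


class ToyEntropyRouter:
    """Hypothetical stand-in for EntropyMonitor (names and defaults are assumptions)."""

    def __init__(self, low_thresh=1.8, high_thresh=3.5, window=4, margin=0.2):
        # bounds[i] separates GEARS[i] from GEARS[i + 1]
        self.bounds = [low_thresh, high_thresh]
        self.margin = margin
        self.history = deque(maxlen=window)  # rolling window of entropies
        self.gear = "mid"

    def update(self, logits):
        self.history.append(shannon_entropy_bits(logits))
        avg = sum(self.history) / len(self.history)
        idx = GEARS.index(self.gear)
        # Hysteresis: the rolling average must clear a boundary by `margin`
        # before the gear changes, which damps oscillation near thresholds.
        while idx < 2 and avg > self.bounds[idx] + self.margin:
            idx += 1
        while idx > 0 and avg < self.bounds[idx - 1] - self.margin:
            idx -= 1
        self.gear = GEARS[idx]
        return self.gear


if __name__ == "__main__":
    router = ToyEntropyRouter()
    uniform = [0.0] * 16           # maximally uncertain over 16 tokens
    peaked = [20.0] + [0.0] * 15   # one dominant token
    for logits in (uniform, peaked, peaked):
        print(router.update(logits))
```

A uniform distribution pushes the rolling average above the high threshold, while a run of confident predictions pulls it back down one gear at a time as the window fills.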

### Core Components

- **EntropyMonitor** (`monitor.py`): Computes Shannon entropy (in bits, log base 2) from logit distributions. A rolling-window average smooths gear transitions. Supports vocab-size auto-scaling (reference: 32768 tokens), hysteresis to prevent oscillation near boundaries, and auto-calibration from the observed entropy distribution (p40/p60 thresholds). Deferred pipeline mode eliminates the per-token GPU→CPU sync.

- **PrecisionManager** (`precision.py`): Lazy quantization with direct module-reference swaps. Discovers attention layers at init via `detect_attn_prefixes()` but defers quantization to the first `shift()` call. On shift, it quantizes from the originals, swaps the quantized module into the model tree, and offloads the originals to CPU. Only one quantized variant exists at any time, so memory goes down, not up.

- **QuantizedLinear Kernels** (`kernels.py`): Real packed integer storage: int8 (per-channel symmetric), int6, int4 (2 values per byte), int2 (4 values per byte, ternary {-1,0,1}). Dual-path forward: fused Triton kernels on CUDA (in-register dequant, no intermediate tensor), dequant-cache path on MPS/CPU (the first forward unpacks to fp16, subsequent forwards reuse the cached tensor).

- **Native Acceleration** (`native_kernels.py` + `csrc/`): Three tiers below Triton: MPS native Metal shaders via PyTorch's MetalShaderLibrary (zero-copy, threadgroup 256), legacy Metal command buffers, and CPU NEON vectorized matmul (ARM `-O3 -march=armv8.2-a+fp16`).

- **Fused Triton Kernels** (`triton_kernels.py`): CUDA-only fused quantized matmul. Multiplies activations directly with packed int8/int4/int2 weights without materializing fp16. Dequantization happens in registers, so only packed data traverses global memory.

- **GearbxModel** (`model.py`): Orchestrates everything. Loads the model via transformers, auto-detects the attention architecture, and passes `vocab_size` to EntropyMonitor for threshold scaling. Runs a manual autoregressive loop with KV caching and NaN guards for MPS stability.

- **StatisticalPromptClassifier** (`classifier.py`): Cold-start heuristic. Scores prompt complexity from math symbols, code keywords, length, and average word length to pick the initial gear before the entropy window fills.
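
To make "packed integer storage" concrete, here is a minimal pure-Python sketch of symmetric quantization followed by int4 packing at two values per byte. It is not the code in `kernels.py`: the function names, the low-nibble-first layout, and the [-7, 7] range are assumptions chosen for the example, and the real kernels operate on tensors rather than Python lists.

```python
def quantize_symmetric_int4(row):
    """Symmetric quantization of one row to integers in [-7, 7].

    Returns (quantized_values, scale) such that x is approximately q * scale.
    """
    max_abs = max(abs(x) for x in row) or 1.0  # avoid a zero scale
    scale = max_abs / 7.0
    q = [max(-7, min(7, round(x / scale))) for x in row]
    return q, scale


def pack_int4(q):
    """Pack signed 4-bit integers two per byte (low nibble first; assumed layout)."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    packed = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        packed.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(packed)


def unpack_int4(packed, count):
    """Inverse of pack_int4: recover `count` signed 4-bit integers."""
    values = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Reinterpret the 4-bit two's-complement pattern as a signed value.
            values.append(nibble - 16 if nibble >= 8 else nibble)
    return values[:count]


if __name__ == "__main__":
    weights = [0.9, -0.4, 0.0, 0.15, -0.9]
    q, scale = quantize_symmetric_int4(weights)
    packed = pack_int4(q)
    restored = [v * scale for v in unpack_int4(packed, len(q))]
    print(len(packed), "bytes for", len(weights), "weights")
    print(restored)
```

Five weights fit in three bytes here, which is where the 0.5 bytes per parameter figure for the `low` gear comes from (per-row scales add a small overhead on top).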

### Backends

| Backend | Module | Routing | Weight Access | Install |
|---------|--------|---------|---------------|---------|
| **Transformers** | `GearbxModel` | Per-token | Direct `nn.Linear` swap | `pip install -e .` |
| **MLX** | `MLXGearbxModel` | Per-token | `mlx.nn.QuantizedLinear` swap | `pip install -e ".[mlx]"` |
| **Ollama** | `OllamaGearbx` | Per-prompt | Black-box HTTP (any OpenAI-compat server) | `pip install -e .` |

**Transformers backend:** Full per-token gear shifting with real weight swaps. Supports MPS, CUDA, and CPU.

**MLX backend:** Native Apple Silicon via mlx-lm. Unified memory means no CPU offloading. Supports telemetry-only mode for pre-quantized models. Median-based calibration.

**Ollama backend:** Per-prompt routing via prompt classifier. Talks to Ollama, llama.cpp, vLLM, LM Studio, or any OpenAI-compatible local server over HTTP. Real per-token entropy telemetry via `top_logprobs`. Supports SSE streaming with `StreamChunk` per-token telemetry.

## Device Support

| Device | Loading Strategy | Base Precision | Acceleration | Notes |
|--------|-----------------|----------------|-------------|-------|
| **MPS** (Apple Silicon) | fp16 direct | float16 | MPS native Metal shaders | M1/M2/M3/M4, unified memory |
| **CUDA** (NVIDIA) | 4-bit NF4 via bitsandbytes | INT4 | Fused Triton kernels | Requires bitsandbytes |
| **MLX** (Apple Silicon) | mlx-lm native | fp16/bf16 | MLX graph compilation | Requires mlx, mlx-lm |
| **CPU** | fp32 direct | float32 | NEON vectorized (ARM) | Testing or lightweight models |

## Quick Start

### Transformers Backend

```python
from gearbx import GearbxModel

# Device auto-detected: MPS > CUDA > CPU
gbm = GearbxModel(
    'mistralai/Mistral-7B-Instruct-v0.3',
    num_attn_layers_to_manage=16,
    high_thresh=3.5,
    low_thresh=1.8,
)

r = gbm.generate('What is the capital of Japan?', max_new_tokens=30)
print(r.text)
print(r.gear_stats)  # e.g. {'low': 0.80, 'mid': 0.20}
```

### MLX Backend (Apple Silicon)

```python
from gearbx import MLXGearbxModel

gbm = MLXGearbxModel('mlx-community/Mistral-7B-Instruct-v0.3-4bit')
r = gbm.generate('Explain entropy in information theory.', max_new_tokens=200)
print(r.text)
print(r.gear_stats)
print(f'{r.tokens_per_sec:.1f} tok/s')
```

### Ollama Backend

```python
from gearbx import OllamaGearbx

gbm = OllamaGearbx(
    gear_models={
        'low':  'mistral:7b',
        'mid':  'mistral:7b',
        'high': 'mistral:7b',
    },
    low_thresh=0.4,
    high_thresh=1.2,
)

# Streaming with per-token telemetry
for chunk in gbm.generate_stream('Prove sqrt(2) is irrational.', max_new_tokens=200):
    print(chunk.token, end='', flush=True)
```

### GGUF via Ollama Cache

```bash
# Load GGUF from Ollama's local blob cache (no HF download)
python3 run_gguf.py llama3.2:1b
```

## Installation

### Apple Silicon (MPS)

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"

python3 -c "import torch; print('MPS:', torch.backends.mps.is_available())"
```

### Apple Silicon (MLX)

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[mlx,dev]"
```

### NVIDIA (CUDA)

```bash
python3 -m venv .venv && source .venv/bin/activate
pip install -e ".[cuda,dev]"

python3 -c "import torch; print('CUDA:', torch.cuda.is_available())"
```

### TUI (Terminal UI)

```bash
# From source
cd tui && make build

# Via npm (downloads prebuilt binary)
npm install -g gearbx
gearbx
```

### Requirements

- Python >= 3.10
- PyTorch >= 2.1.0
- transformers >= 4.40.0
- accelerate >= 0.28.0
- bitsandbytes >= 0.43.0 (CUDA only, optional)
- mlx >= 0.12.0, mlx-lm >= 0.12.0 (MLX backend, optional)

### Hardware Requirements

| Target Model | Memory Required | Apple Silicon | NVIDIA GPU |
|---|---|---|---|
| Phi-3 Mini 3.8B | 4-8 GB | Any M-series | RTX 3060 |
| Mistral 7B | 8-14 GB | M1 Pro+ / M2+ | RTX 3070 |
| LLaMA-3 8B | 10-16 GB | M1 Max+ / M2 Pro+ | RTX 3080 |
| LLaMA-3 8B (full dual) | 16-20 GB | M2 Max+ / M3 Pro+ | RTX 4070 |

## Running Tests

```bash
# All unit tests (no GPU needed)
pytest tests/ -v

# Individual test files
pytest tests/test_monitor.py -v
pytest tests/test_precision.py -v
pytest tests/test_integration.py -v
```

## Benchmarks

All benchmarks support `--unit` mode (synthetic data, CPU) and full mode (real model, MPS/CUDA).

```bash
# Unit mode, no GPU needed
python3 benchmarks/bench_entropy.py --unit
python3 benchmarks/bench_latency.py --unit
python3 benchmarks/bench_quality.py --unit

# Full mode, auto-detects MPS or CUDA
python3 benchmarks/bench_entropy.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_latency.py --model mistralai/Mistral-7B-Instruct-v0.3
python3 benchmarks/bench_quality.py --model mistralai/Mistral-7B-Instruct-v0.3 --n 200

# Production benchmark
python3 benchmarks/bench_production.py
```

### Benchmark Targets

| Benchmark | Metric | Target |
|---|---|---|
| Entropy Signal | Separation ratio (hard/trivial) | > 2.0x |
| Gear Shift Latency | Time per shift() call | < 0.5 ms |
| Monitor Throughput | Time per update() call | < 1.0 ms (GPU/MPS) |
| Hook Overhead | Added latency per forward pass | < 0.5 ms (GPU/MPS) |
| GSM8K Accuracy | vs. static 8-bit baseline | within 2pp |

## Threshold Tuning

Thresholds auto-scale by the factor `log2(vocab_size) / log2(32768)` when `vocab_size` is provided (`GearbxModel` does this automatically).
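
The scaling rule above amounts to a single multiplication. The helper below is a hypothetical sketch (the name `scale_thresholds` is not taken from the Gearbx API) that applies the factor to a pair of thresholds defined for the 32768-token reference vocabulary.

```python
import math

REFERENCE_VOCAB = 32768  # reference vocabulary size from the rule above


def scale_thresholds(low_thresh, high_thresh, vocab_size):
    """Scale reference thresholds by log2(vocab_size) / log2(32768)."""
    factor = math.log2(vocab_size) / math.log2(REFERENCE_VOCAB)
    return low_thresh * factor, high_thresh * factor


if __name__ == "__main__":
    for vocab in (32768, 128256, 256000):
        low, high = scale_thresholds(1.8, 3.5, vocab)
        print(f"vocab={vocab}: low={low:.2f}, high={high:.2f}")
```

Because the maximum possible entropy is `log2(vocab_size)` bits, scaling by this ratio keeps each threshold at the same fraction of the maximum as the vocabulary grows.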

**Auto-calibration**: `calibrate_from_logits()` computes entropy from a prefill pass and sets thresholds at p40/p60 of the distribution. Post-calibration caps prevent inflated thresholds: `high ≤ log2(vocab) * 0.18`, `low ≤ log2(vocab) * 0.06`.
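
A minimal sketch of this calibration step, under stated assumptions: it takes a list of per-position entropies (in bits) from the prefill pass, maps p40 to the low threshold and p60 to the high threshold, and then applies the caps quoted above. The function names and the linear-interpolation percentile are illustrative choices, not the actual `calibrate_from_logits()` implementation.

```python
import math


def percentile(values, pct):
    """Linear-interpolation percentile for pct in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    position = (len(ordered) - 1) * pct / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)


def calibrate_thresholds(entropies_bits, vocab_size):
    """Return (low, high) thresholds from observed entropies, with caps applied.

    Assumption: p40 feeds the low threshold and p60 the high threshold.
    """
    max_bits = math.log2(vocab_size)
    low = min(percentile(entropies_bits, 40), 0.06 * max_bits)
    high = min(percentile(entropies_bits, 60), 0.18 * max_bits)
    return low, high


if __name__ == "__main__":
    observed = [0.2, 0.5, 0.9, 1.4, 2.0, 2.7, 3.1, 4.4]  # made-up entropies
    print(calibrate_thresholds(observed, vocab_size=32768))
```

The caps matter when the prefill is unusually high-entropy: without them, a noisy prompt would push both thresholds up and keep generation in the low gear too often.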

### Suggested Thresholds by Model

| Model | Vocab Size | low_thresh | high_thresh |
|---|---|---|---|
| LLaMA-3 8B / 70B | 128,256 | 2.2 | 4.5 |
| Mistral 7B v0.3 | 32,768 | 1.8 | 3.5 |
| Qwen2 7B | 151,936 | 2.5 | 5.0 |
| Phi-3 Mini | 32,064 | 1.7 | 3.4 |
| Gemma 2 9B | 256,000 | 2.8 | 5.5 |

For the Ollama backend, entropy is computed from the top-k `top_logprobs` only, which gives it a narrower range, so use lower thresholds (e.g., `low=0.4, high=1.2`).
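
The narrower range follows from the truncation itself: an entropy computed over only k candidates cannot exceed `log2(k)` bits. The sketch below assumes the server returns natural-log probabilities for the top-k tokens and that they are renormalized before computing entropy; whether Gearbx renormalizes in exactly this way is an assumption, and the function name is hypothetical.

```python
import math


def topk_entropy_bits(top_logprobs):
    """Entropy in bits of the renormalized top-k distribution.

    top_logprobs: natural-log probabilities of the k most likely tokens.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0.0)


if __name__ == "__main__":
    confident = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.005)]
    uncertain = [math.log(0.25)] * 4
    print(f"confident: {topk_entropy_bits(confident):.3f} bits")
    print(f"uncertain: {topk_entropy_bits(uncertain):.3f} bits")
    print(f"upper bound for k=4: {math.log2(4):.3f} bits")
```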

## Known Limitations

1. **Single-Stream Only:** Designed for single-sequence local inference, not batched serving. `generate_batch()` runs prompts sequentially with per-sequence gear routing.

2. **KV Cache Precision Mismatch:** Mid-sequence gear shifts create mixed-precision KV cache entries. Use `min_gear_duration=4` to reduce noise, or `use_cache=False` during strict validation.

3. **MPS fp16 Base:** On Apple Silicon (transformers backend), the base model runs at fp16 since bitsandbytes doesn't support MPS. Memory usage is higher than CUDA 4-bit but entropy routing and quality benefits still apply.

4. **Double Quantization:** Loading GGUF→fp16→int4 compounds quantization error. PrecisionManager warns when detected. Consider higher bit-width for already-quantized source models.

5. **Ollama Coarse Routing:** Ollama backend routes per-prompt (not per-token) since the server is a black box. Entropy telemetry is still per-token.

6. **Catastrophic Entropy:** If entropy exceeds 90% of theoretical max for 2+ consecutive tokens, generation auto-falls-back to mid gear. Catches model failures from bad quantization.
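
The catastrophic-entropy fallback in item 6 can be sketched as a small streak counter. This is an illustration under assumptions, not the Gearbx implementation: the class name, the `ratio` and `patience` parameters, and the use of a strict `>` comparison are all hypothetical.

```python
import math


class CatastrophicEntropyGuard:
    """Hypothetical guard: fires after `patience` consecutive near-maximal entropies."""

    def __init__(self, vocab_size, ratio=0.9, patience=2):
        # Theoretical maximum entropy is log2(vocab_size) bits.
        self.limit = ratio * math.log2(vocab_size)
        self.patience = patience
        self.streak = 0

    def check(self, entropy_bits):
        """Return True when the caller should fall back to the mid gear."""
        if entropy_bits > self.limit:
            self.streak += 1
        else:
            self.streak = 0  # any normal token resets the streak
        return self.streak >= self.patience


if __name__ == "__main__":
    guard = CatastrophicEntropyGuard(vocab_size=32768)  # limit = 13.5 bits
    for entropy in (2.1, 14.0, 14.2, 3.0):
        print(entropy, guard.check(entropy))
```

Requiring consecutive hits avoids reacting to a single genuinely ambiguous token while still catching output that has degenerated into near-uniform noise.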

## License

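Proprietary: Gearbx Personal Use License. Copyright (c) 2024-2026 Jpdz Labs. All rights reserved. Use outside the scope of Personal Use requires a separate commercial license (contact: licensing@jpdz.app).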
