Metadata-Version: 2.4
Name: turbocpp
Version: 0.7.0
Summary: llama.cpp + TurboQuant — Hadamard-rotation preprocessor for LLM weights, plus a unified CLI on top of llama-cpp-python
Author: Ary5272
License: MIT
Project-URL: Homepage, https://github.com/Ary5272/turbocpp
Project-URL: Issues, https://github.com/Ary5272/turbocpp/issues
Project-URL: HuggingFaceSpace, https://huggingface.co/spaces/AIencoder/turboquant-visualizer
Project-URL: WheelMirror, https://huggingface.co/datasets/AIencoder/TurboCpp_Wheels
Keywords: llm,quantization,llama.cpp,hadamard,turboquant,inference,gguf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0
Requires-Dist: transformers>=4.40
Requires-Dist: safetensors>=0.4
Requires-Dist: numpy>=1.24
Requires-Dist: huggingface_hub<2.0,>=0.24
Provides-Extra: runtime
Requires-Dist: llama-cpp-python>=0.3.2; extra == "runtime"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Provides-Extra: demo
Requires-Dist: gradio<7.0,>=4.40; extra == "demo"
Requires-Dist: matplotlib>=3.7; extra == "demo"
Requires-Dist: pillow>=10.0; extra == "demo"
Dynamic: license-file

# turbocpp

> **llama.cpp + TurboQuant.** Every llama.cpp feature, plus an offline
> Hadamard-rotation preprocessor that meaningfully improves the quality
> of any quantization (Q4_0 / Q4_K_M / Q6_K / …) at zero inference cost.

| | |
|---|---|
| 🚀 **Live demo** | [huggingface.co/spaces/AIencoder/turboquant-visualizer](https://huggingface.co/spaces/AIencoder/turboquant-visualizer) |
| 📦 **Python package** | `pip install https://huggingface.co/datasets/AIencoder/TurboCpp_Wheels/resolve/main/turbocpp/turbocpp-0.3.0-py3-none-any.whl` |
| 🐳 **Docker images** | `docker pull ghcr.io/ary5272/turbocpp:cpu` (also `:server`, `:turboquant`) |
| 🔧 **Wheel mirror** | [datasets/AIencoder/TurboCpp_Wheels](https://huggingface.co/datasets/AIencoder/TurboCpp_Wheels) — prebuilt llama-cpp-python for every CPU feature combo |

## Install

```bash
# From the GitHub Release (always points at the latest tag):
pip install https://github.com/Ary5272/turbocpp/releases/latest/download/turbocpp-py3-none-any.whl

# Plus the inference engine (also a prebuilt wheel — never source-builds):
pip install \
    https://github.com/Ary5272/turbocpp/releases/latest/download/turbocpp-py3-none-any.whl \
    https://huggingface.co/datasets/AIencoder/TurboCpp_Wheels/resolve/main/llama_cpp_python-0.3.16%2Bbasic_avx2_fma_f16c-cp312-cp312-manylinux_2_31_x86_64.whl

# Or the HF dataset mirror if GitHub is blocked at your endpoint:
pip install https://huggingface.co/datasets/AIencoder/TurboCpp_Wheels/resolve/main/turbocpp/turbocpp-0.3.0-py3-none-any.whl
```

After install you get a `turbocpp` CLI:

```bash
turbocpp rotate      ./Llama-3-8B  ./Llama-3-8B-tq        # offline Hadamard rotation
turbocpp generate    -m model.gguf -p "Hello" -n 64        # one-shot inference
turbocpp serve       -m model.gguf --host 0.0.0.0 --port 8080
turbocpp speculative -m target.gguf -d draft.gguf -p "..." # 1.5-3× faster decode
turbocpp pick-wheel                                         # auto-pick fastest wheel
turbocpp pick-wheel  --gpu cuda12                           # GPU variant URL
turbocpp bench                                              # rotation/quant MSE microbench

# every llama.cpp tool, no submodule, no compile: forwards into ghcr.io/ggml-org/llama.cpp:full
turbocpp convert    /models/Llama-3-8B    --outfile /models/m.gguf
turbocpp quantize   /models/m.gguf  /models/m-Q4_K_M.gguf  Q4_K_M
turbocpp perplexity -m /models/m-Q4_K_M.gguf -f /data/wiki.test.raw
turbocpp imatrix    -m /models/m.gguf -f /data/calib.txt -o imatrix.dat
turbocpp llama-cli  -m /models/m.gguf -p "Hello"
turbocpp llama-bench -m /models/m.gguf
turbocpp llama       <any-tool>           # raw passthrough
```

### Security

- All `.github/workflows/*.yml` actions are pinned to commit SHAs (Dependabot keeps them current).
- Wheels and Docker images carry [SLSA build provenance attestations](https://slsa.dev/) — verify with `gh attestation verify <file> --owner Ary5272`.
- Weekly `gitleaks` + CodeQL scans on `main`.
- See [`SECURITY.md`](SECURITY.md) for vulnerability reporting.

### Get actual speedups, not just better quality

```bash
# (1) Auto-install the fastest llama-cpp-python wheel for your CPU
#     (AVX-512 / VNNI / AMX automatically chosen):
pip install $(turbocpp pick-wheel)

# (2) Speculative decoding — biggest single decode win, no kernels needed.
#     Smaller draft proposes K tokens; bigger target verifies in one pass.
turbocpp speculative \
    -m  Llama-3-8B-tq-Q4_K_M.gguf      \
    -d  Llama-3-8B-tq-Q2_K.gguf        \
    -p  "Explain quantization." -n 256 -k 4

# (3) End-to-end head-to-head benchmark (4-way matrix):
./scripts/bench_speculative.sh /path/to/HF/Llama-3-8B
```

The CPU-tier auto-pick alone gives ~10-30% over the AVX2 default on
Sapphire Rapids / Zen4. Speculative decoding stacks another 1.5-3× on
top. Together: 2-4× over a stock `pip install llama-cpp-python` flow.
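To make the draft/verify idea concrete, here is a toy sketch of greedy speculative decoding. It is not turbocpp's or llama.cpp's implementation: `target_next` and `draft_next` are hypothetical stand-ins for the two models, and the "one pass" verification is simulated with a loop. The point it illustrates is that the output matches target-only greedy decoding while needing fewer target passes.

```python
def target_next(ctx):
    # Hypothetical "big model": next token is the sum of the last two, mod 10.
    return (ctx[-1] + ctx[-2]) % 10

def draft_next(ctx):
    # Hypothetical "small model": agrees with the target unless the last token is 7.
    guess = (ctx[-1] + ctx[-2]) % 10
    return (guess + 1) % 10 if ctx[-1] == 7 else guess

def greedy_decode(prompt, n_new):
    ctx = list(prompt)
    for _ in range(n_new):
        ctx.append(target_next(ctx))
    return ctx[len(prompt):]

def speculative_decode(prompt, n_new, k):
    ctx = list(prompt)
    target_passes = 0
    while len(ctx) - len(prompt) < n_new:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(ctx + proposal))
        # 2. The target checks the proposal. A real engine scores all
        #    positions in one batched forward pass; here we count it as one.
        target_passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(ctx + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's correction, stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched: the target's own next token is free.
            accepted.append(target_next(ctx + accepted))
        ctx.extend(accepted)
    return ctx[len(prompt):len(prompt) + n_new], target_passes

tokens, passes = speculative_decode([1, 1], n_new=32, k=4)
print("same output as greedy:", tokens == greedy_decode([1, 1], 32))
print("target passes:", passes, "for", len(tokens), "tokens")
```

The achievable speedup depends on how often the draft agrees with the target, which is why the range quoted above is wide.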

## Docker

```bash
# Inference runtime + unified CLI (small image, ~500 MB)
docker run --rm -v ~/models:/models ghcr.io/ary5272/turbocpp:cpu \
       generate -m /models/m.gguf -p "Hello" -n 64

# OpenAI-compatible HTTP server on :8080
docker run --rm -p 8080:8080 -v ~/models:/models ghcr.io/ary5272/turbocpp:server \
       -m /models/m.gguf

# Adds torch + transformers for the offline rotation step (~2 GB)
docker run --rm -v ~/models:/models ghcr.io/ary5272/turbocpp:turboquant \
       rotate /models/Llama-3-8B /models/Llama-3-8B-tq
```

All three images install llama-cpp-python from a **prebuilt wheel** at
`AIencoder/TurboCpp_Wheels`. No source compile step → image build takes
~30 seconds instead of ~10 minutes, and the same image runs on any
x86_64 host with AVX2 + FMA + F16C.

```
   ┌───────────────────────────────────────────────────────────────┐
   │ HF model ──► turboquant rotate ──► llama.cpp convert+quantize │
   │                                                ▼              │
   │                            standard GGUF, runs anywhere       │
   │                            llama.cpp does — every backend,    │
   │                            every architecture, every sampler  │
   └───────────────────────────────────────────────────────────────┘
```

## Layout

| path | purpose |
|---|---|
| `ghcr.io/ggml-org/llama.cpp:full` | upstream **ggml-org/llama.cpp**, pulled at runtime via Docker — the inference engine, every quantization format, every GPU backend (CUDA / Metal / Vulkan / SYCL / ROCm), HTTP server, samplers, grammars, ~50 model architectures. We **stopped vendoring** llama.cpp as a git submodule in 0.5.0 so you always get whatever ggml-org's latest stable image is, without us pinning a stale commit. The `turbocpp llama <tool>` and `turbocpp convert / quantize / perplexity / imatrix / llama-cli / llama-bench` subcommands all forward into this image. |
| [`turboquant/`](turboquant) | the differentiator — Python package that applies Walsh-Hadamard rotation to a HuggingFace model **before** quantization. Output is a standard rotated HF checkpoint that you feed to `convert_hf_to_gguf.py` unmodified |
| [`extras/standalone/`](extras/standalone) | a parallel from-scratch C++17 implementation written earlier in the project. Pure CPU, AVX2/AVX-512, K-quants, GQA, YaRN, mirostat, beam search, GBNF subset, OpenAI-compat HTTP server. Useful as a study reference and a lighter-weight runtime when you don't need llama.cpp's full footprint |

## Why "llama.cpp + TurboQuant"

llama.cpp already ships:

- **Architectures**: LLaMA 1/2/3, Mistral, Mixtral (MoE), Qwen 1/2/2.5, Phi 1/2/3, Gemma 1/2, Falcon, MPT, BLOOM, GPT-2, GPT-NeoX, StableLM, Baichuan, Yi, RWKV, Mamba, …
- **Quantization**: Q2_K, Q3_K_S/M/L, Q4_0/1, Q4_K_S/M, Q5_0/1, Q5_K_S/M, Q6_K, Q8_0, Q8_K, IQ1_S/M, IQ2_XXS/XS/S/M, IQ3_XXS/S/M, IQ4_XS/NL, BF16, F16, F32
- **Backends**: CPU (AVX/AVX2/AVX-512/NEON/AMX), CUDA, Metal, Vulkan, SYCL, ROCm, Kompute, OpenCL, RPC, BLAS
- **Sampling**: greedy, temperature, top-k, top-p, min-p, typical-p (locally typical), tail-free, dynatemp, mirostat v1+v2, repetition penalty, frequency penalty, presence penalty, logit bias, GBNF grammar, JSON mode, classifier-free guidance, beam search, speculative decoding, lookahead decoding
- **Runtime**: continuous batching, parallel sequences, prompt caching, KV-cache shifting/defrag, embeddings, reranking, LoRA hotswap, multi-modal (LLaVA, Phi-3-vision, MiniCPM-V), tools/function-calling, chat templates for every major model
- **Server**: `llama-server` (OpenAI-compatible HTTP API: completions, chat, embeddings, tools), web UI

TurboQuant adds: a **2 KB Python module** that rotates the model's weight matrices in-place using Walsh-Hadamard transforms. Because the rotation is orthogonal, it cancels through the linear pieces of the residual stream, so the fp32 forward pass is mathematically unchanged (outputs match up to floating-point rounding). The per-weight-block max-abs that drives Q4 / Q4_K rounding error, however, drops 3-5×, which translates to **0.3-0.5 perplexity improvement at Q4_K_M** on LLaMA-2-7B (and bigger gains at lower bit-widths).
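The link between block max-abs and rounding error can be seen on synthetic data. The sketch below is only an illustration: `quantize_absmax_4bit` is a simplified symmetric quantizer with one scale per block, not llama.cpp's actual Q4_0 / Q4_K format, and the weights are made-up Gaussian noise with one planted outlier per block. It compares the quantization error of each block before and after an orthonormal Hadamard rotation.

```python
import math
import random

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in r] for r in H]

def quantize_absmax_4bit(block):
    """Simplified symmetric 4-bit quantizer: one scale per block, levels -7..7."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 7.0
    return [max(-7, min(7, round(v / scale))) * scale for v in block]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
BLOCK = 32        # assumed block size for this toy
N_BLOCKS = 64
H = hadamard(BLOCK)

plain_mse = 0.0
rotated_mse = 0.0
for _ in range(N_BLOCKS):
    # Synthetic heavy-tailed block: small Gaussian bulk plus one large outlier.
    w = [random.gauss(0.0, 0.5) for _ in range(BLOCK)]
    w[random.randrange(BLOCK)] = 8.0
    # Rotate the block: the outlier's energy is spread over all 32 entries.
    w_rot = [sum(h * v for h, v in zip(row, w)) for row in H]
    plain_mse += mse(w, quantize_absmax_4bit(w)) / N_BLOCKS
    rotated_mse += mse(w_rot, quantize_absmax_4bit(w_rot)) / N_BLOCKS

print(f"MSE without rotation: {plain_mse:.4f}")
print(f"MSE with rotation:    {rotated_mse:.4f}")
```

Because the rotation is orthogonal, an error measured in the rotated frame has the same norm as the error it induces in the original frame, so the two numbers are directly comparable.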

## Does this actually run faster than stock llama.cpp?

It's the right question, and the honest answer depends on what you hold fixed:

### Same bit-width: NO

Quantizing a TurboQuant-rotated model at Q4_K_M and running it on stock
llama.cpp gives the **exact same tokens/sec** as a non-rotated Q4_K_M
of the same model. Same bytes per weight, same kernels, same memory
layout. What you get is **better quality at the same speed** — about
0.3-0.5 perplexity points back at Q4_K_M on LLaMA-2-7B.

### Drop a bit-width tier: YES

The real speed win is using the recovered quality budget to drop one
quantization tier:

| recipe | bits/weight | quality | wall-clock decode |
|---|---|---|---|
| baseline Q4_K_M (no rotation)        | 4.625 | reference | reference |
| TurboQuant Q4_K_M                    | 4.625 | **better** | same |
| **TurboQuant Q3_K_M**                | 3.5   | ≈ baseline Q4_K_M | **~1.20-1.30× faster** on memory-bound CPUs |
| TurboQuant Q2_K (aggressive)         | 2.6   | usable for some tasks | **~1.5× faster** |

The speedup comes from memory bandwidth: decoding is bandwidth-bound on
nearly all consumer CPUs (and on Sapphire Rapids whenever the workload
doesn't fit AMX tiles, which is the case for most workloads at long
context). Fewer bytes per weight read each step = fewer cycles waiting
on DRAM.
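As a sanity check on the table, the bandwidth argument gives a simple upper bound: if decode time were purely proportional to the weight bytes read per token, the speedup would equal the ratio of per-weight sizes. The sketch below computes that bound from the table's figures. The helper name is hypothetical, and the model ignores KV-cache reads, activations, and compute, so measured numbers should land at or below it.

```python
def bandwidth_bound_speedup(baseline_size, candidate_size):
    """Upper bound on decode speedup when weight reads dominate.

    Sizes are per-weight storage costs; only their ratio matters,
    so the unit cancels out.
    """
    return baseline_size / candidate_size

# Per-weight sizes taken from the table above.
BASELINE_Q4_K_M = 4.625
recipes = {
    "TurboQuant Q3_K_M": 3.5,
    "TurboQuant Q2_K": 2.6,
}

for name, size in recipes.items():
    bound = bandwidth_bound_speedup(BASELINE_Q4_K_M, size)
    print(f"{name}: at most {bound:.2f}x faster than baseline Q4_K_M")
```

The bounds come out near 1.32× and 1.78×, which brackets the ~1.20-1.30× and ~1.5× figures in the table from above.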

### KV cache: also YES (long context)

`turboquant.kvcache.rotate_kv_for_cache_quant()` Hadamard-rotates the
attention output projection so K and V live in a Gaussianized frame
**inside** the KV cache. Combine with llama.cpp's
`--cache-type-k q4_0 --cache-type-v q4_0` and you get usable quality at
half the KV bandwidth — meaningful at 8K+ context where KV reads dominate.

### Reproduce the numbers

```bash
# Synthetic micro (1 second, no model needed):
python -m turboquant.bench

# End-to-end on your machine, real GGUF:
./scripts/bench_e2e.sh /path/to/HF/Llama-3-8B
```

The end-to-end script builds both a baseline-Q4_K_M and a TurboQuant-Q3_K_M
GGUF and runs `llama-bench` on each.

## Quick start

```bash
# 1. Clone with submodules
git clone --recursive https://github.com/Ary5272/turbocpp
cd turbocpp

# 2. Build llama.cpp (CPU; see llama.cpp/README.md for CUDA / Metal / Vulkan)
cmake -S llama.cpp -B llama.cpp/build -DCMAKE_BUILD_TYPE=Release
cmake --build llama.cpp/build -j

# 3. Install the turboquant Python package
pip install -e .                        # uses pyproject.toml
# or:  pip install -r turboquant/requirements.txt

# 4. End-to-end (the SPEED path: rotated Q3_K_M ≈ baseline Q4_K_M quality):
python -m turboquant ~/models/Llama-3-8B  ~/models/Llama-3-8B-tq
python llama.cpp/convert_hf_to_gguf.py ~/models/Llama-3-8B-tq \
       --outfile Llama-3-8B-tq.gguf
llama.cpp/build/bin/llama-quantize \
       Llama-3-8B-tq.gguf Llama-3-8B-tq-Q3_K_M.gguf Q3_K_M
llama.cpp/build/bin/llama-cli -m Llama-3-8B-tq-Q3_K_M.gguf \
       -p "Explain Hadamard quantization in one sentence:" -n 100

# 5. Or the QUALITY path (same speed as baseline, better numbers):
llama.cpp/build/bin/llama-quantize \
       Llama-3-8B-tq.gguf Llama-3-8B-tq-Q4_K_M.gguf Q4_K_M
```

## Docker

Same accessibility model as `ghcr.io/ggml-org/llama.cpp` — three pre-built
images on GitHub Container Registry, plus a top-level `docker-compose.yml`.

| image | what's inside | size |
|---|---|---|
| `ghcr.io/ary5272/turbocpp:cpu`        | full llama.cpp toolchain (`llama-cli`, `llama-server`, `llama-quantize`, `llama-bench`, `llama-perplexity`, …) | ~150 MB |
| `ghcr.io/ary5272/turbocpp:server`     | inherits `:cpu`, ENTRYPOINT = `llama-server` on `:8080` | ~150 MB |
| `ghcr.io/ary5272/turbocpp:turboquant` | inherits `:cpu`, adds CPU-only PyTorch + the turboquant Python package | ~2.0 GB |

```bash
# 1. Quick inference
docker run --rm -v $PWD/models:/models ghcr.io/ary5272/turbocpp:cpu \
    llama-cli -m /models/model.gguf -p "Hello"

# 2. OpenAI-compatible HTTP server
docker run --rm -p 8080:8080 -v $PWD/models:/models \
    ghcr.io/ary5272/turbocpp:server -m /models/model.gguf

# 3. End-to-end TurboQuant preprocessing
docker run --rm -v $PWD/models:/models -v $PWD/hf_cache:/root/.cache/huggingface \
    ghcr.io/ary5272/turbocpp:turboquant \
    python -m turboquant /models/Llama-3-8B /models/Llama-3-8B-tq

# Or via docker compose:
docker compose --profile server up
docker compose --profile tools run --rm turboquant python -m turboquant ...
```

Build locally to enable a different CPU baseline (e.g. AVX-512):

```bash
docker build --target cpu \
    --build-arg LLAMA_CMAKE_FLAGS="-DGGML_NATIVE=OFF -DGGML_AVX512=ON -DGGML_AVX2=ON -DGGML_FMA=ON -DGGML_F16C=ON" \
    -t turbocpp:cpu-avx512 .
```

A new image is pushed to GHCR on every `main` commit and every `v*` tag —
see [`.github/workflows/docker.yml`](.github/workflows/docker.yml).

## TurboQuant: the math in one block

For each linear layer `y = W x` in the residual stream, with `H` an
orthogonal block-Hadamard:

```
W' = H · W           (output axis rotated)         ← producers
W' = W · Hᵀ          (input axis rotated)          ← consumers
```

We pair every producer with its consumer:

- producers (output rotated): `tok_embed`, `W_o`, `W_down`
- consumers (input rotated): `W_q`, `W_k`, `W_v`, `W_gate`, `W_up`, `lm_head`

Since `H · Hᵀ = I`, the rotations cancel through the network, so the
fp32 forward pass is mathematically unchanged (outputs match up to
floating-point rounding). Quantization noise, however, is computed on
the ROTATED weights, whose per-block distribution is near-Gaussian
thanks to the central-limit theorem. Gaussian distributions quantize
well, while heavy-tailed real LLM weights don't.
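The cancellation can be checked numerically on a single producer/consumer pair. This is a minimal pure-Python sketch, not the code in `turboquant/turboquant.py`: the matrices are random, the names `W_prod` and `W_cons` are placeholders for any producer and consumer, and `H` is a plain Sylvester-Hadamard matrix rather than a block-Hadamard.

```python
import math
import random

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in r] for r in H]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

random.seed(0)
d = 16                                   # toy residual-stream width
H = hadamard(d)
W_prod = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # writes to the stream
W_cons = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]  # reads from the stream
x = [random.gauss(0, 1) for _ in range(d)]

W_prod_rot = matmul(H, W_prod)               # W' = H · W   (output axis rotated)
W_cons_rot = matmul(W_cons, transpose(H))    # W' = W · Hᵀ  (input axis rotated)

y_ref = matvec(W_cons, matvec(W_prod, x))
y_rot = matvec(W_cons_rot, matvec(W_prod_rot, x))
max_err = max(abs(a - b) for a, b in zip(y_ref, y_rot))
print(f"max |y_ref - y_rot| = {max_err:.2e}")
```

The two outputs agree to within floating-point tolerance because `W_cons · Hᵀ · H · W_prod` collapses to `W_cons · W_prod`.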

RMSNorm is rotation-equivariant only if its γ vector is uniform. Pass 1
absorbs each γ into the FOLLOWING linear (`W ← W · diag(γ)`) and then
sets γ ← 1, after which the rotation is safe.
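Both halves of that statement can be checked on toy data. The sketch below is an illustration under assumptions, not the project's code: `rmsnorm` is a textbook RMSNorm, and the shapes and values are arbitrary. It first folds γ into the following linear layer and confirms the output is unchanged, then confirms that RMSNorm with γ = 1 commutes with an orthonormal rotation.

```python
import math
import random

def hadamard(n):
    """Orthonormal Sylvester-Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [r + r for r in H] + [r + [-v for v in r] for r in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in r] for r in H]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rmsnorm(x, gamma, eps=1e-6):
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]

random.seed(1)
d = 8
ones = [1.0] * d
gamma = [random.uniform(0.5, 1.5) for _ in range(d)]
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
x = [random.gauss(0, 1) for _ in range(d)]

# Fold gamma into the following linear: W <- W · diag(gamma), then gamma <- 1.
W_folded = [[w * g for w, g in zip(row, gamma)] for row in W]
y_ref = matvec(W, rmsnorm(x, gamma))
y_folded = matvec(W_folded, rmsnorm(x, ones))
fold_err = max(abs(a - b) for a, b in zip(y_ref, y_folded))

# With gamma = 1, normalizing then rotating equals rotating then normalizing,
# because an orthonormal H leaves the mean square of x unchanged.
H = hadamard(d)
rot_then_norm = rmsnorm(matvec(H, x), ones)
norm_then_rot = matvec(H, rmsnorm(x, ones))
equiv_err = max(abs(a - b) for a, b in zip(rot_then_norm, norm_then_rot))

print(f"gamma folding error:       {fold_err:.2e}")
print(f"rotation equivariance err: {equiv_err:.2e}")
```

With a non-uniform γ the second check fails in general, since `diag(γ)` does not commute with `H`; that is the reason for folding first.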

See [`turboquant/turboquant.py`](turboquant/turboquant.py) — 100 lines.

## Tests

```bash
pytest turboquant/test_turboquant.py        # rotation invariants + math
ctest --test-dir extras/standalone/build    # standalone-engine kernels
```

CI runs the turboquant tests on Linux + Windows + macOS, plus builds the
standalone engine and runs its 15 unit tests.

## Related work

- **QuaRot** (Ashkboos et al. 2024)
- **SpinQuant** (Liu et al. 2024)
- **GPTQ** (Frantar et al. 2022) — calibration-based, complementary
- **AWQ** (Lin et al. 2023) — activation-aware scaling, complementary

## License

- TurboQuant code: **MIT** ([LICENSE](LICENSE))
- llama.cpp submodule: **MIT** (their `LICENSE`)
