Metadata-Version: 2.4
Name: ultracompress
Version: 0.4.0
Summary: Extreme compression for large language models. Download pre-compressed models from Hugging Face Hub; self-compress support coming soon.
Project-URL: Homepage, https://sipsalabs.com
Project-URL: Documentation, https://github.com/sipsalabs/ultracompress#readme
Project-URL: Repository, https://github.com/sipsalabs/ultracompress
Project-URL: Issues, https://github.com/sipsalabs/ultracompress/issues
Author-email: Sipsa Labs <founder@sipsalabs.com>
License: Apache-2.0
License-File: LICENSE
Keywords: compression,edge-ai,inference,llm,quantization,transformer
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: huggingface-hub>=0.24.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: tqdm>=4.66.0
Provides-Extra: bench
Requires-Dist: lm-eval>=0.4.5; extra == 'bench'
Requires-Dist: torch>=2.0.0; extra == 'bench'
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: torch
Requires-Dist: torch>=2.0.0; extra == 'torch'
Description-Content-Type: text/markdown

# UltraCompress

> Extreme compression for large language models. Patent pending — USPTO 64/049,511 + 64/049,517

[![PyPI](https://img.shields.io/pypi/v/ultracompress.svg)](https://pypi.org/project/ultracompress/)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE)

**Run language models on less hardware than they were supposed to need.**

UltraCompress is the patent-pending compression infrastructure for transformer language models. The Track A method targets sub-3 bits per weight — **~30% smaller than bitsandbytes NF4** with **zero catastrophic failures** on a 6-model head-to-head cohort in our internal benchmark. The CLI is shipped on PyPI today; pre-compressed reference models roll out on Hugging Face Hub through April–May 2026.

### Who this is for

- Engineers hitting the **4-bit-per-weight cliff**, where public methods (bitsandbytes, GPTQ, AWQ, HQQ) break down below 4 bpw
- Product teams targeting **on-device deployment** — phones, cars, robots, embedded systems
- Inference platforms whose margins are **GPU-memory-bound** at scale
- Hardware partners (chip vendors, OEMs) evaluating compression infrastructure for licensing

> **v0.1 alpha**: pre-compressed reference models are uploading to Hugging Face Hub throughout April–May 2026. Run `uc list` for the live catalog. Examples below show expected post-launch usage.

---

## Latest — Streaming compression: full Qwen scaling curve, 72B on a single GPU (2026-05-04)

Per-layer streaming compression is now validated end-to-end across 8B → 72B, with peak VRAM bounded by roughly one transformer layer regardless of total model depth. Quality is production-grade (PPL ratio ≤ 1.05) at every scale; **Qwen2.5-72B compressed to 8.98 GB peak VRAM on a single RTX 5090** with 1.6% PPL drift.

| Model | Layers | Baseline PPL | Compressed PPL | **PPL ratio** | **Peak VRAM** | Status |
|---|---:|---:|---:|---:|---:|:---|
| Qwen3-8B | 36 | 16.79 | 17.26 | **1.0278×** | **2.26 GB** | PROD |
| Qwen3-14B | 40 | 15.44 | 15.61 | **1.0111×** | **3.37 GB** | PROD (best) |
| Qwen3-32B | 64 | 13.77 | 14.27 | **1.0367×** | **4.85 GB** | PROD |
| **Qwen2.5-72B** | **80** | **8.92** | **9.07** | **1.0162×** | **8.98 GB** | **PROD (headline)** |

Recipe: GSQ scalar 5 bpw + per-block (B=64) absmax + V18-C rank-32 low-rank correction overlay + 200-step KL distillation per layer. Process: load layer fp16 weights via `safetensors` lazy load → cache teacher hidden output → quantize → fit V18-C against cache → save → free → next layer. Compression time scales linearly: ~1 min/layer overhead.
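
For the mechanics behind the recipe string, here is a minimal Python sketch of one streaming step: per-block absmax quantization of a single layer followed by a rank-limited residual correction. The helper names, the truncated-SVD correction, and the file layout are assumptions for illustration; the actual V18-C overlay and the per-layer KL distillation loop are not published in this repository.

```python
# Illustrative sketch only. Per-block absmax quantization of one layer plus a
# truncated-SVD residual correction standing in for the (unpublished) V18-C
# rank-32 overlay; the KL-distillation fit against cached teacher activations
# is omitted. Paths and tensor names are hypothetical.
import torch
from safetensors import safe_open


def absmax_quantize(w: torch.Tensor, bits: int = 5, block: int = 64) -> torch.Tensor:
    """Per-block absmax scalar quantization of a 2-D weight matrix (cols % block == 0)."""
    rows, cols = w.shape
    blocks = w.reshape(rows, cols // block, block)
    scale = blocks.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8)  # one scale per block
    qmax = 2 ** (bits - 1) - 1
    q = torch.clamp(torch.round(blocks / scale * qmax), -qmax, qmax)
    return (q / qmax * scale).reshape(rows, cols)                    # dequantized approximation


def lowrank_correction(w: torch.Tensor, w_hat: torch.Tensor, rank: int = 32) -> torch.Tensor:
    """Rank-limited correction of the quantization residual via truncated SVD."""
    u, s, vh = torch.linalg.svd(w - w_hat, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vh[:rank, :]


def compress_layer_streaming(ckpt_path: str, tensor_name: str) -> torch.Tensor:
    """Lazily load one layer, quantize it, and return the corrected weights.

    In the streaming runner the result is written out and freed before the next
    layer is touched, which is what keeps peak VRAM near a single layer.
    """
    with safe_open(ckpt_path, framework="pt", device="cpu") as f:  # or "cuda" on GPU
        w = f.get_tensor(tensor_name).float()                      # only this layer is resident
    w_hat = absmax_quantize(w, bits=5, block=64)
    w_hat = w_hat + lowrank_correction(w, w_hat, rank=32)
    return w_hat.half()
```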

Bigger models compress at least as well as smaller ones, a pattern that holds empirically across the Qwen family. Because peak VRAM stays bounded near a single transformer layer regardless of total model depth, the 100T-on-1-GPU trajectory is now a math problem, not a prayer.

Reproduce on Qwen3-8B (~9 min on a 5090):

```bash
python scripts/overlay/streaming_compression_runner.py \
    --model qwen3-8b --bpw 5 --block_size 64 --rank 32 \
    --train_steps 200 --n_calib 100 --n_eval 50
```

Result JSONs live under `scripts/overlay/artifacts/streaming_compression_{8b,14b,32b,72b}_smoke.json`. A patent supplement covering the streaming-compression mechanism was filed in May 2026.

---

## Install

```bash
pip install ultracompress
```

## Quickstart

```bash
# Today: scripted demo (no Hub artifacts required)
uc demo

# Today: query the live HF Hub catalog (returns "No pre-compressed models
# published yet" until the first rolling-release artifact lands)
uc list

# Post-artifact example usage (works once an artifact is on the Hub):
uc pull sipsalabs/<model-id>
uc info ./models/<model-id>
uc bench ./models/<model-id> --tasks hellaswag --limit 500
```

## What's available today (v0.1 — alpha)

The CLI itself is shipped on PyPI. The Hugging Face Hub catalog is rolling out through April–May 2026; until the first reference compressed model lands, `uc list` against the live Hub returns "No pre-compressed models published yet."

- `uc demo` — scripted CLI demo for screen recording (works without any Hub artifacts).
- `uc list` — query the live `sipsalabs` collection on the Hugging Face Hub. Returns the actual current catalog; expect "no models published yet" until the first rolling-release artifact lands. A programmatic equivalent is sketched after this list.
- `uc pull <model-id>` — download a pre-compressed model when one is available on the Hub.
- `uc info <path>` — inspect the compression metadata of an already-downloaded artifact.
- `uc bench <path> --tasks <list>` — run downstream benchmarks via `lm-eval-harness` on a downloaded artifact.
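
For checking the catalog without the CLI, here is a minimal sketch using `huggingface_hub` directly. It assumes the artifacts are published as model repos under the `sipsalabs` namespace (as the `uc pull sipsalabs/<model-id>` example suggests) and does not reproduce the CLI's own filtering or output formatting.

```python
# Minimal sketch: list model repos under the sipsalabs namespace on the Hub.
# Approximates what `uc list` reports; the CLI's own filtering is not reproduced.
from huggingface_hub import list_models

repos = list(list_models(author="sipsalabs"))
if not repos:
    print("No pre-compressed models published yet")
else:
    for repo in repos:
        print(repo.id)
```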

## What's coming (v0.2 — Q3 2026)

- `uc compress <hf-model-id> --bpw 2.8` — self-compression (gated on patent prosecution timeline).
- `uc serve <path>` — inference server with OpenAI-compatible API.
- `uc export --format gguf` — export to llama.cpp GGUF format.
- `uc export --format coreml` — export to Apple CoreML for on-device inference.

## Why UltraCompress

### The 4-bit-per-weight cliff

The widely used public LLM compression methods (bitsandbytes, GPTQ, AWQ, HQQ) are stable at and above 4 bits per weight. Below 4 bpw, model quality falls off a cliff: most methods produce models whose downstream-task accuracy collapses to near-random. We flag these collapses with a `T_cat` threshold; on a 6-model cohort, public sub-3-bpw methods produce **catastrophic failures on the majority of the cohort.**

UltraCompress doesn't.

### Track A — post-training row-overlay quantization (USPTO 64/049,511) — shipping now

On a 6-model × 8-method × 500-sample head-to-head benchmark:

| Method | Bits per weight | Cohort median T1 retention | Catastrophic failures |
|---|---:|---:|---:|
| bitsandbytes int8 | 8.000 | 99.75% | 0/6 |
| bitsandbytes nf4 | 4.000 | 98.31% | 0/6 |
| HQQ 4-bit g64 | 4.500 | 97.72% | 0/6 |
| **UltraCompress 2.8 bpw** | **2.798** | **95.63%** | **0/6** |
| HQQ 3-bit g64 | 3.500 | 72.46% | 1/6 |
| HQQ 2-bit g64 | 2.500 | 3.46% | 6/6 |

Top-k retention curves (top-1, top-10, top-32, top-64, top-128, top-256) will ship in the per-model card on each artifact's Hugging Face Hub repository as the reference compressed models roll out through April–May 2026. T1 alone is the wrong metric for autocomplete, candidate generation, or RAG re-ranking — most customer use cases care about top-k structure.
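
The exact retention definition will ship with the per-model cards. As a rough illustration only, and assuming top-k retention means the overlap between the baseline and compressed models' top-k next-token sets on the same positions, the metric looks roughly like this:

```python
# Illustrative only: one plausible definition of top-k retention, the mean
# overlap between the baseline and compressed models' top-k next-token sets.
# The published metric may differ. Inputs are assumed to be logit tensors of
# shape [num_positions, vocab_size] collected on the same evaluation prompts.
import torch


def topk_retention(baseline_logits: torch.Tensor,
                   compressed_logits: torch.Tensor,
                   k: int = 1) -> float:
    base = baseline_logits.topk(k, dim=-1).indices       # [N, k] baseline top-k token ids
    comp = compressed_logits.topk(k, dim=-1).indices     # [N, k] compressed top-k token ids
    # For each baseline top-k token, check whether it survives in the compressed top-k.
    kept = (base.unsqueeze(-1) == comp.unsqueeze(-2)).any(dim=-1).float()
    return kept.mean().item()
```

Under this reading, the T1 column in the table above is `k=1`, and the per-model curves would sweep `k` over 1, 10, 32, 64, 128, 256.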

### Track B — Fractal Residual Recursion (USPTO 64/049,517) — v0.2 (Q3 2026)

Architectural compression beyond published academic ratios for transformer language models. Combined with Track A on the v0.2 stack, it yields the strongest end-to-end ratio we have measured for transformer language model architectures in our cohort. Release is gated on patent prosecution timing.

Track B evidence is separate from Track A shipping artifacts; see [docs/evidence/matrix.md](docs/evidence/matrix.md) for Track B detail. Do not combine retention numbers across tracks as a single quality curve.

## Patent status

The UltraCompress compression methods are the subject of pending U.S. patent applications. Pre-compressed models are distributed under a separate licensing arrangement described in [LICENSE](LICENSE). The CLI code in this repository is Apache-2.0.

## Reporting issues, security, and commercial inquiries

- Bugs and feature requests: open an issue.
- Security vulnerabilities: see [SECURITY.md](SECURITY.md) — report privately to `security@sipsalabs.com`.
- Commercial / design-partner / pilot inquiries: `founder@sipsalabs.com`.
- Patent / licensing: `legal@sipsalabs.com`.

Contributing: see [CONTRIBUTING.md](CONTRIBUTING.md). Changes that touch packaging, CI, docs, and the public CLI surface are very welcome. Pull requests adding the proprietary compression methods will be closed.

## Citation

```bibtex
@misc{sipsalabs2026ultracompress,
  title        = {UltraCompress: Extreme Compression for Large Language Models},
  author       = {{Sipsa Labs, Inc.}},
  year         = {2026},
  note         = {U.S.\ patent applications 64/049,511 and 64/049,517, patent pending},
  howpublished = {\url{https://sipsalabs.com}}
}
```

## About

UltraCompress is built by [Sipsa Labs](https://sipsalabs.com) — a research lab spanning Systems · Intelligence · Precision.

Patent pending — USPTO 64/049,511 + 64/049,517.
