Metadata-Version: 2.4
Name: bigsmall
Version: 4.0.0
Summary: Lossless AI model compression — ~34% smaller with bit-identical weights; the autopilot profiles your machine, picks the highest fidelity that runs, and streams models bigger than your RAM
Home-page: https://github.com/wpferrell/Bigsmall
Author: Will Ferrell
Author-email: wpferrell@gmail.com
License: Elastic-2.0
Project-URL: Homepage, https://github.com/wpferrell/Bigsmall
Project-URL: Documentation, https://github.com/wpferrell/Bigsmall/tree/main/docs
Project-URL: Changelog, https://github.com/wpferrell/Bigsmall/blob/main/CHANGELOG.md
Project-URL: Models (HuggingFace), https://huggingface.co/wpferrell
Project-URL: Paper, https://doi.org/10.5281/zenodo.20279247
Project-URL: Bug Tracker, https://github.com/wpferrell/Bigsmall/issues
Keywords: neural network,compression,lossless,machine learning,model compression,pytorch,huggingface,transformers,bfloat16,bf16,delta compression,fine-tuning,inference,vram,llm,ai,weights,safetensors,arithmetic coding,entropy coding,autopilot,model streaming,fp8,int4,dual fidelity,ferrell duo,kv cache
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Archiving :: Compression
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: numba>=0.61
Requires-Dist: constriction>=0.4
Requires-Dist: zstandard>=0.21
Requires-Dist: blosc2>=2.0
Requires-Dist: safetensors>=0.4
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: tqdm>=4.0
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: hf
Requires-Dist: transformers>=4.30; extra == "hf"
Requires-Dist: huggingface-hub>=0.20; extra == "hf"
Provides-Extra: diffusion
Requires-Dist: diffusers>=0.20; extra == "diffusion"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4; extra == "vllm"
Provides-Extra: ecc
Requires-Dist: reedsolo>=1.7; extra == "ecc"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.30; extra == "all"
Requires-Dist: diffusers>=0.20; extra == "all"
Requires-Dist: huggingface-hub>=0.20; extra == "all"
Requires-Dist: reedsolo>=1.7; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

[![PyPI version](https://img.shields.io/pypi/v/bigsmall.svg)](https://pypi.org/project/bigsmall/)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20279247.svg)](https://doi.org/10.5281/zenodo.20279247)
[![License](https://img.shields.io/badge/license-Elastic--2.0-blue.svg)](https://github.com/wpferrell/Bigsmall/blob/main/LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/bigsmall.svg)](https://pypi.org/project/bigsmall/)
[![Downloads](https://static.pepy.tech/badge/bigsmall)](https://pepy.tech/project/bigsmall)

# BigSmall — Lossless AI Model Compression

**Make any AI model ~34% smaller with bit-identical weights, and let BigSmall decide how to run it on *your* machine. Your machine runs bigger models than you think, and BigSmall never lies to you about it.**

```bash
pip install bigsmall          # CLI + compression/decompression
pip install "bigsmall[torch]" # add this for model loading (from_pretrained)
```

A 14.2 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a ~5 GB patch on top of its 14.2 GB base. A 14.2 GB model runs in ~2.5 GB of working memory through the streaming executor, which reads weights from disk layer by layer instead of holding the whole model. In the decompressed model, **every weight is bit-for-bit identical** to the original: each tensor's md5 checksum is verified on decompress.

---

## 60-second quickstart

Three commands. No settings to learn.

```bash
bigsmall profile          # once: ~10s hardware probe (saved, never asked again)
bigsmall plan model.bs    # one sentence: what would run, where, how faithfully
bigsmall run model.bs     # do it
```

Real output from the reference machine (8-core CPU, RTX A4500 busy with another job):

```
$ bigsmall plan qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.

$ bigsmall run qwen2.5-0.5b.bsd
Running perfect mode (bit-exact, receipt verified) in CPU RAM at full CPU speed while the GPU is busy — fast mode (lossy INT4) available with --mode fast.
loaded 290 tensors into host RAM in 27.0s [mode=perfect] (bit-exact receipt honoured)
```

The planner picks **the highest fidelity that runs at usable speed** on your hardware, and one rule is enforced by the test suite itself: **anything below bit-exact is announced before it happens, never silently.** Full walkthrough: [docs/quickstart.md](docs/quickstart.md).

---

## The improvement, in two pictures

![Bar chart: loading the bit-exact original took 150.9 s (Qwen2.5-0.5B) and 411.2 s (Llama-3.2-1B) in the first 4.0 build; 4.0.0 ships 25.3 s and 56.2 s — 6.0x and 7.3x faster](https://raw.githubusercontent.com/wpferrell/Bigsmall/main/docs/assets/c1_perfect_loads.png)

![Bar chart: one download, three sizes — the original is 100%, the lossless .bs is about 66%, and the Ferrell Duo fast member reads about 21.5%; the Duo file holds both for about 3% more than lossless alone](https://raw.githubusercontent.com/wpferrell/Bigsmall/main/docs/assets/c2_three_sizes.png)

---

## What's in 4.0

| Feature | What you get | Command |
|---|---|---|
| **Autopilot** | No decisions: profile once, then one-sentence picks with receipts | `bigsmall profile` / `plan` / `run` — [docs](docs/features/autopilot.md) |
| **Ferrell Duo (`.bsd`)** | One file. Two models: the fast one (INT4, reads 21.4% of raw) and the real one (bit-exact, proven) — for ~2.9 pp over lossless-only | `bigsmall dual` — [docs](docs/features/dual-fidelity.md) |
| **Streaming executor** | Run models bigger than your RAM: bounded resident set, promise checked against measurement every run | `bigsmall run --stream`, `serve-stream` — [docs](docs/features/streaming.md) |
| **FP8-native lossless** | Models released in fp8 compress to 0.829 of their fp8 bytes, bit-exact | `bigsmall compress` (automatic) — [docs](docs/features/fp8.md) |
| **Capacity math** | What fits on *your* card, computed honestly (13B-class lossless resident on a 20 GB GPU) | `bigsmall plan` — [docs](docs/features/capacity.md) |
| **Integrity tooling** | Bit-exact receipts in the file, checkpoint forensics, an untrained-model trap | `bigsmall xray`, `verify` — [docs](docs/features/integrity.md) |

Everything from 3.x is still here: delta patches for fine-tunes, KV-cache compression (entry v3 + FP8 + 32k chunked + GPU codec), `xray`, reshard, resume, ECC. [Full changelog →](CHANGELOG.md)
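
The capacity row in the table above can be sanity-checked by hand. The sketch below is a back-of-the-envelope estimate, not BigSmall's planner: it assumes BF16 weights at 2 bytes per parameter and a compressed size of about 66% of raw, and it ignores activations, KV cache, and decode buffers. The function names are hypothetical.

```python
def lossless_resident_gb(params_billion, bytes_per_param=2.0, compressed_fraction=0.66):
    """Estimate the size in GB of losslessly compressed weights.

    Assumptions (illustrative only): BF16 at 2 bytes per parameter and a
    compressed size of ~66% of raw. Nothing else is counted.
    """
    raw_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes per GB
    return raw_gb * compressed_fraction


def fits(params_billion, gpu_gb, **kwargs):
    """Return (fits, headroom_gb) for the weights alone on a card with gpu_gb of memory."""
    need = lossless_resident_gb(params_billion, **kwargs)
    return need <= gpu_gb, gpu_gb - need


ok, headroom = fits(13, 20)
print(f"13B on a 20 GB card: fits={ok}, headroom={headroom:.1f} GB")
```

Under these assumptions a 13B model needs roughly 17 GB for its weights, which is why it can stay resident on a 20 GB card while the 26 GB raw BF16 form cannot. `bigsmall plan` reports the measured answer for your own hardware.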

---

## What BigSmall does

### 1. Make any model smaller

```bash
bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/
```

**Before:** 14.2 GB of safetensors. **After:** 9.3 GB `.bs` file. **Saved:** 4.9 GB (34%).

Every weight is bit-for-bit identical. Works on any safetensors model — LLMs, diffusion, audio, vision. BF16, F16, F32, and FP8 weights are all native.

### 2. Store fine-tunes as patches

```bash
bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/
```

**Before:** 14.2 GB Qwen2.5-7B-Instruct. **After:** ~5 GB patch. If your users already have the base model, they only download what changed. Delta size is **pair-dependent** — measured from under 1% of full size (best ≥7B SFT pairs) to ~61% (small-model full tunes); the ≥7B official-instruct class measures 34–50%. The engine measures both codings per tensor and never ships a delta larger than standalone. Full table: [docs/delta-compression.md](docs/delta-compression.md).
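
The "measure both codings, keep the smaller" rule can be sketched in a few lines. This is an illustration under loud assumptions: `zlib` stands in for BigSmall's codec, a bytewise XOR stands in for its delta transform, and the function names are hypothetical.

```python
import random
import zlib


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers; XOR is its own inverse."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same byte length")
    return bytes(x ^ y for x, y in zip(a, b))


def encode_tensor(tuned: bytes, base: bytes):
    """Code a tensor both ways and keep the smaller payload."""
    standalone = zlib.compress(tuned, 9)
    delta = zlib.compress(xor_bytes(tuned, base), 9)
    # Ship the delta only when it is strictly smaller than standalone.
    if len(delta) < len(standalone):
        return "delta", delta
    return "standalone", standalone


def decode_tensor(kind: str, payload: bytes, base: bytes) -> bytes:
    """Invert encode_tensor; the base is needed only for delta payloads."""
    raw = zlib.decompress(payload)
    return xor_bytes(raw, base) if kind == "delta" else raw


rng = random.Random(0)
base = rng.randbytes(4096)        # stand-in for a base-model tensor
light = bytearray(base)
light[100] ^= 0x01                # a light fine-tune: one byte changed
light = bytes(light)
heavy = bytes(4096)               # unrelated content that compresses well on its own

for name, tuned in (("light", light), ("heavy", heavy)):
    kind, payload = encode_tensor(tuned, base)
    assert decode_tensor(kind, payload, base) == tuned  # bit-exact round trip
    print(name, kind, len(payload))
```

The lightly changed tensor is stored as a delta and the unrelated one falls back to standalone, which is the per-tensor behaviour described above: a patch is never larger than coding the tensor by itself.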

### 3. One file. Two models: the fast one and the real one — the Ferrell Duo

```bash
bigsmall dual qwen2.5-0.5b/model.safetensors -o qwen2.5-0.5b.bsd
bigsmall run qwen2.5-0.5b.bsd                # perfect: bit-exact, verified
bigsmall run qwen2.5-0.5b.bsd --mode fast    # fast: INT4, reads 21.4% of raw
```

A `.bsd` is a **Ferrell Duo**: one file holding two members. The fast one is a lossy INT4 model that loads by reading about a fifth of the raw bytes (21.4%), meant for use while you work. The real one is the exact original: the same file carries the lossless residual that reconstructs every original bit, receipt included. Measured cost: a `.bsd` is ~2.9 percentage points of raw size larger than the lossless-only `.bs` ([details](docs/features/dual-fidelity.md)).
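
The idea of a lossy member plus a lossless residual can be shown with a toy example. This is not the `.bsd` format: it assumes float32 values, a single symmetric 4-bit scale, and an XOR-of-bit-patterns residual, all chosen for brevity. Every function name is hypothetical.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x stored as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Inverse of f32_bits."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def quantize_int4(values, scale):
    """Symmetric 4-bit quantization: integer codes clamped to [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]


def dequantize(codes, scale):
    # Round-trip through float32 so encoder and decoder see the same bits.
    return [bits_f32(f32_bits(c * scale)) for c in codes]


def make_duo(values, scale):
    """Return (fast_codes, residual) where residual = original bits XOR approximation bits."""
    codes = quantize_int4(values, scale)
    approx = dequantize(codes, scale)
    residual = [f32_bits(v) ^ f32_bits(a) for v, a in zip(values, approx)]
    return codes, residual


def restore_exact(codes, residual, scale):
    """Rebuild the original bit patterns from the fast member and the residual."""
    approx = dequantize(codes, scale)
    return [bits_f32(f32_bits(a) ^ r) for a, r in zip(approx, residual)]


scale = 0.01
weights = [bits_f32(f32_bits(w)) for w in (0.0131, -0.0274, 0.0502, -0.0009)]
codes, residual = make_duo(weights, scale)
fast = dequantize(codes, scale)                 # lossy: close to the weights, not equal
exact = restore_exact(codes, residual, scale)   # bit-identical to the weights
print(codes)
print([f32_bits(v) for v in exact] == [f32_bits(v) for v in weights])
```

Fast mode needs only `codes`, while perfect mode reads `codes` and `residual` together. A real format also has to make the residual compress well, which this sketch does not attempt.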

### 4. Run models bigger than your RAM

```bash
bigsmall run model.bsd --stream
```

The layer-streaming executor keeps the resident set bounded and tells you the bound before it starts. Measured receipt, Qwen2.5-7B (14.2 GB raw): streamed forward pass **bit-exact** against the fully-loaded model (identical logits sha256) in ~2.5 GB of hot state. Slow is stated plainly where it is slow — see [docs/features/streaming.md](docs/features/streaming.md) for the measured rates.
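
A minimal sketch of the layer-streaming idea, assuming a toy model where each layer is a list of floats and "loading" is a copy from an in-memory store. It is not the BigSmall executor; it only shows why holding one layer at a time gives the same output with a much smaller peak footprint.

```python
import hashlib
import struct


def load_layer(store, index):
    """Stand-in for reading and decoding one layer's weights from disk."""
    return list(store[index])  # a copy models a fresh, temporary buffer


def apply_layer(x, weights):
    # Toy layer: elementwise multiply-add, enough to make order and values matter.
    return [xi * w + 1.0 for xi, w in zip(x, weights)]


def run_full(store, x):
    """Load every layer first, then run. Peak resident weights = whole model."""
    layers = [load_layer(store, i) for i in range(len(store))]
    peak = sum(len(w) for w in layers)
    for w in layers:
        x = apply_layer(x, w)
    return x, peak


def run_streamed(store, x):
    """Load, apply, and drop one layer at a time. Peak = largest single layer."""
    peak = 0
    for i in range(len(store)):
        w = load_layer(store, i)
        peak = max(peak, len(w))
        x = apply_layer(x, w)
        del w  # release the buffer before the next load
    return x, peak


def digest(values):
    """sha256 over the float64 bit patterns of the outputs."""
    return hashlib.sha256(struct.pack(f"<{len(values)}d", *values)).hexdigest()


store = [[0.5 + 0.01 * (i + j) for j in range(4)] for i in range(6)]  # 6 layers of 4 weights
x0 = [1.0, 2.0, 3.0, 4.0]
full_out, full_peak = run_full(store, x0)
stream_out, stream_peak = run_streamed(store, x0)
print(digest(full_out) == digest(stream_out), full_peak, stream_peak)  # True 24 4
```

Both paths perform the same operations in the same order, so the output digests match exactly. Only the peak number of resident weights differs, and the price is a reload of every layer on every pass.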

### 5. Download smaller, use instantly

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)
```

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. **25+ pre-compressed models** ready to use ([browse them all](https://huggingface.co/wpferrell)).

---

## Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

| Model | Original | BigSmall | Saved |
|---|---:|---:|---:|
| [Qwen2.5-14B-Instruct](https://huggingface.co/wpferrell/qwen2.5-14b-instruct-bigsmall) | 29.5 GB | 19.5 GB | 34% |
| [Gemma-3-12B-it](https://huggingface.co/wpferrell/gemma-3-12b-it-bigsmall) | 22.7 GB | 14.8 GB | 35% |
| [Gemma-2-9B-it](https://huggingface.co/wpferrell/gemma-2-9b-it-bigsmall) | 17.2 GB | 11.3 GB | 34% |
| [Llama-3.1-8B-Instruct](https://huggingface.co/wpferrell/llama-3.1-8b-instruct-bigsmall) | 15.0 GB | 9.7 GB | 35% |
| [Llama-3-8B-Instruct](https://huggingface.co/wpferrell/llama-3-8b-instruct-bigsmall) | 15.0 GB | 9.8 GB | 34% |
| [Qwen3-8B](https://huggingface.co/wpferrell/qwen3-8b-bigsmall) | 15.3 GB | 10.1 GB | 34% |
| [Mistral-7B-Instruct v0.3](https://huggingface.co/wpferrell/mistral-7b-instruct-bigsmall) | 14.2 GB | 8.9 GB | 37% |
| [Mistral-7B-Instruct v0.2](https://huggingface.co/wpferrell/mistral-7b-instruct-v0.2-bigsmall) | 14.2 GB | 8.9 GB | 37% |
| [Qwen2.5-7B-Instruct](https://huggingface.co/wpferrell/qwen2.5-7b-instruct-bigsmall) | 14.2 GB | 9.4 GB | 34% |
| [Phi-3.5-mini-instruct](https://huggingface.co/wpferrell/phi-3.5-mini-instruct-bigsmall) | 7.1 GB | 4.7 GB | 34% |
| [Gemma-3-4B-it](https://huggingface.co/wpferrell/gemma-3-4b-it-bigsmall) | 8.0 GB | 5.2 GB | 35% |
| [Qwen3-4B-Instruct](https://huggingface.co/wpferrell/qwen3-4b-instruct-bigsmall) | 7.5 GB | 5.0 GB | 34% |
| [Llama-3.2-3B-Instruct](https://huggingface.co/wpferrell/llama-3.2-3b-instruct-bigsmall) | 6.4 GB | 3.9 GB | 39% |
| [Gemma-2-2B-it](https://huggingface.co/wpferrell/gemma-2-2b-it-bigsmall) | 4.9 GB | 3.2 GB | 34% |
| [Qwen2.5-3B-Instruct](https://huggingface.co/wpferrell/qwen2.5-3b-instruct-bigsmall) | 5.7 GB | 3.8 GB | 34% |
| [Qwen2.5-1.5B-Instruct](https://huggingface.co/wpferrell/qwen2.5-1.5b-instruct-bigsmall) | 2.9 GB | 1.9 GB | 34% |
| [Llama-3.2-1B-Instruct](https://huggingface.co/wpferrell/llama-3.2-1b-instruct-bigsmall) | 2.3 GB | 1.5 GB | 34% |
| [Gemma-3-1B-it](https://huggingface.co/wpferrell/gemma-3-1b-it-bigsmall) | 1.9 GB | 1.2 GB | 35% |
| [Qwen2.5-0.5B-Instruct](https://huggingface.co/wpferrell/qwen2.5-0.5b-instruct-bigsmall) | 920 MB | 610 MB | 34% |
| [GPT-2 (117M)](https://huggingface.co/wpferrell/gpt2-bigsmall) | 548 MB | 414 MB | 24% |
| [Gemma-3-270M-it](https://huggingface.co/wpferrell/gemma-3-270m-it-bigsmall) | 500 MB | 330 MB | 34% |
| [Gemma-3-270M](https://huggingface.co/wpferrell/gemma-3-270m-bigsmall) | 500 MB | 330 MB | 34% |
| [Gemma-2-2B](https://huggingface.co/wpferrell/gemma-2-2b-bigsmall) | 9.7 GB | 8.1 GB | 17% |

[Browse all 25+ models on HuggingFace →](https://huggingface.co/wpferrell)

v4-line measurements on the reference machine (2026-06 campaign): Qwen2.5-7B compressed to **65.95%** of raw and ran **bit-exact** through the streaming executor; a real fp8 release (Qwen3-0.6B-FP8) compressed to **0.829 of its fp8 weight bytes**, 507/507 tensors bit-exact.
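
For reference, the "Saved" column and the "% of raw" figures are two views of the same ratio. A small helper, with hypothetical function names, reproduces them from the rounded sizes in the table. Because the published sizes are rounded, a recomputed percentage can differ from the table by about a point.

```python
def saved_percent(original, compressed):
    """Percentage of the original size removed by compression (same units for both)."""
    if original <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (original - compressed) / original


def percent_of_raw(original, compressed):
    """Compressed size as a percentage of the original size."""
    return 100.0 - saved_percent(original, compressed)


rows = {
    "Mistral-7B-Instruct v0.3 (GB)": (14.2, 8.9),
    "Qwen2.5-0.5B-Instruct (MB)": (920, 610),
    "GPT-2 117M (MB)": (548, 414),
}
for name, (orig, comp) in rows.items():
    print(f"{name}: saved {saved_percent(orig, comp):.1f}%, "
          f"{percent_of_raw(orig, comp):.1f}% of raw")
```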

---

## What "lossless" actually means

Every weight in the model is **mathematically identical** to the original — same bit pattern, same floating-point value, same gradient, same output.

- **Not quantization.** Quantization rounds weights to fewer bits and the model's behaviour changes. (The `.bsd` fast member *is* quantization — and is labelled lossy every time it is picked, which is the point.)
- **Not pruning.** Pruning deletes weights.
- **Not approximation.** No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit pattern of neural weights and stores it more compactly — the same idea as ZIP for text, but tuned for BF16 floating-point distributions. **md5 is verified on every tensor** at decompression. If a single bit differs, verify fails.
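
The per-tensor check can be sketched with the standard library. This is an illustration only: `zlib` stands in for BigSmall's codec, the archive is a plain dictionary rather than a `.bs` file, and `pack` and `unpack_verified` are hypothetical names.

```python
import hashlib
import zlib


def pack(tensors):
    """Compress each tensor's raw bytes and record its md5 alongside the payload."""
    return {
        name: {"md5": hashlib.md5(raw).hexdigest(), "payload": zlib.compress(raw)}
        for name, raw in tensors.items()
    }


def unpack_verified(archive):
    """Decompress every tensor and fail loudly if any checksum differs."""
    restored = {}
    for name, entry in archive.items():
        raw = zlib.decompress(entry["payload"])
        if hashlib.md5(raw).hexdigest() != entry["md5"]:
            raise ValueError(f"checksum mismatch in tensor {name!r}")
        restored[name] = raw
    return restored


tensors = {"layer.0.weight": bytes(range(256)) * 4, "layer.0.bias": b"\x00" * 64}
archive = pack(tensors)
assert unpack_verified(archive) == tensors  # bit-exact round trip

# Swap in a payload that differs from the original by a single bit.
tampered = bytearray(tensors["layer.0.bias"])
tampered[0] ^= 0x01
archive["layer.0.bias"]["payload"] = zlib.compress(bytes(tampered))
try:
    unpack_verified(archive)
except ValueError as err:
    print(err)
```

The checksum is taken over the original bytes before compression, so any decoder bug or storage corruption that changes even one bit shows up as a failed verification instead of a silently different model.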

The honesty rules are code, not policy: any pick below a file's best fidelity carries a mandatory announcement, the announcement is printed before the load, and a test sweep across every profile × file × override combination fails the suite if a downgrade could ever happen silently. Details: [docs/features/integrity.md](docs/features/integrity.md).
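
The shape of such a rule and its sweep can be sketched as follows. This is a toy planner, not BigSmall's: the mode names mirror the `--mode` values shown earlier, while `plan`, `sweep`, and the way profiles and files are modelled as sets of modes are all assumptions made for illustration.

```python
from itertools import product

FIDELITY_ORDER = ["perfect", "fast"]  # best fidelity first


def best_mode(file_modes):
    """Highest fidelity the file carries."""
    return next(m for m in FIDELITY_ORDER if m in file_modes)


def plan(file_modes, runnable_modes, override=None):
    """Pick a mode and return (mode, announcement); announcement is None at best fidelity."""
    if override is not None:
        if override not in file_modes:
            raise ValueError(f"file has no {override!r} member")
        choice = override
    else:
        candidates = [m for m in FIDELITY_ORDER if m in file_modes and m in runnable_modes]
        if not candidates:
            raise RuntimeError("no mode in this file runs on this machine")
        choice = candidates[0]
    best = best_mode(file_modes)
    announcement = None
    if choice != best:
        announcement = f"Running {choice} mode (lossy) instead of {best}"
    return choice, announcement


def sweep():
    """Check every profile x file x override combination; return how many were checked."""
    files = [{"perfect"}, {"perfect", "fast"}]
    profiles = [{"perfect", "fast"}, {"fast"}]
    overrides = [None, "perfect", "fast"]
    checked = 0
    for file_modes, runnable, override in product(files, profiles, overrides):
        try:
            choice, announcement = plan(file_modes, runnable, override)
        except (ValueError, RuntimeError):
            continue  # refusing to run is not a silent downgrade
        # The rule: a pick below the file's best fidelity must carry an announcement.
        assert choice == best_mode(file_modes) or announcement, (file_modes, runnable, override)
        checked += 1
    return checked


print(plan({"perfect", "fast"}, {"fast"}))
print(sweep(), "combinations checked")
```

Putting the invariant in an exhaustive test rather than in documentation means a future change to the picking logic cannot introduce a silent downgrade without failing the suite.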

---

## CLI

```
bigsmall profile                            one-time hardware probe (~10s)
bigsmall plan SRC [--mode M]                the decision, one sentence, no execution
bigsmall run SRC [--mode M] [--stream]      pick, announce, load (or stream)
bigsmall dual SRC [-o OUT.bsd]              create/inspect a dual-fidelity .bsd
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall transcode SRC DST.bsr [--mode M]   re-encode for decode speed
bigsmall serve-stream SRC [--prompt ...]    tiered weight-streaming inference
bigsmall xray SRC                           checkpoint forensics
bigsmall info | scan | stat | verify | diff | apply | repair | benchmark
bigsmall migrate | status | pipeline run | reshard
```

Every command has `--help`. See [docs/cli-reference.md](docs/cli-reference.md) for full examples with real output.

---

## Python API

```python
import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12x less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)
```

---

## Research

The lossless compression ceiling for BF16 neural weights has been measured. It is a compressed size of **~62% of raw BF16 for any model**, or **~34% for ≥7B instruct fine-tunes** stored with delta compression. We ran 300+ experiments across every known mathematical approach (entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more) and proved that there is no further compression available within the strict bit-identity contract.

The floor is **measured at the wall, not extrapolated**: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: [docs/research.md](docs/research.md).
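
A rough way to see where the compressible bits sit is to measure the entropy of each BF16 field separately. The sketch below uses synthetic Gaussian values rather than real weights, truncates float32 to BF16 instead of rounding, and sums per-field entropies, which ignores any dependence between fields. It illustrates the idea and does not reproduce the paper's measurements.

```python
import math
import random
import struct
from collections import Counter


def bf16_fields(x: float):
    """Split x into BF16 (sign, exponent, mantissa) by keeping the top 16 bits of float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0] >> 16
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
samples = [bf16_fields(rng.gauss(0.0, 0.02)) for _ in range(20000)]  # synthetic "weights"
sign_h = entropy_bits(s for s, _, _ in samples)
exp_h = entropy_bits(e for _, e, _ in samples)
man_h = entropy_bits(m for _, _, m in samples)
total = sign_h + exp_h + man_h
print(f"sign {sign_h:.2f}  exponent {exp_h:.2f}  mantissa {man_h:.2f}  "
      f"sum {total:.2f} of 16 bits ({total / 16:.0%})")
```

On this synthetic data the sign and mantissa fields come out close to their full 1 and 7 bits, while the 8-bit exponent field carries far less than 8 bits. That is the pattern described above: the exponent is where an entropy coder finds its savings.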

Curious how BigSmall relates to DFloat11, ZipNN, GGUF quants, and the rest of the landscape? One honest reference page: [docs/comparison.md](docs/comparison.md).

The Ferrell Duo format has its own paper — how carrying the bit-exact original next to an INT4 fast member went from +13.4 to +2.9 points of file size, every number receipted: **[10.5281/zenodo.20673133](https://doi.org/10.5281/zenodo.20673133)** ([markdown](papers/FERRELL_DUO_PAPER.md), [plain-English page](papers/FERRELL_DUO_PLAIN_ENGLISH.md)).

Full findings, all experiments, all dead-ends: **[10.5281/zenodo.20279247](https://doi.org/10.5281/zenodo.20279247)**. Plain-English summary: [docs/research.md](docs/research.md).

---

## Install

```bash
pip install bigsmall                  # core
pip install "bigsmall[torch]"         # + model loading (from_pretrained)
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # everything
```

**Requires** Python 3.9+. Works on Linux, macOS, and Windows, and runs on CPU, NVIDIA, AMD, and Apple Silicon hardware.

New here? Start with [docs/quickstart.md](docs/quickstart.md), or find your situation in [docs/how-it-helps.md](docs/how-it-helps.md).

---

## License

Code: [Elastic License 2.0](LICENSE). Free for personal, research, and commercial use. SaaS providers should see [LICENSING.md](LICENSING.md).

Model weights distributed in `.bs` format keep the license of the original model.

---

## Links

- **PyPI** — https://pypi.org/project/bigsmall/
- **GitHub** — https://github.com/wpferrell/Bigsmall
- **HuggingFace** — https://huggingface.co/wpferrell
- **Paper / DOI** — https://doi.org/10.5281/zenodo.20279247 (always resolves to the latest version)
- **Paper (PDF)** — https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf
- **Docs** — [docs/](docs/)
- **Changelog** — [CHANGELOG.md](CHANGELOG.md)
- **Contact** — wpferrell@gmail.com

---

## Feedback & Community

Did BigSmall work for your model? We'd love to know.

- **Open a [Discussion](https://github.com/wpferrell/Bigsmall/discussions)** — share your compression results, ask questions, or suggest improvements
- **File an [Issue](https://github.com/wpferrell/Bigsmall/issues)** — if something didn't work, tell us exactly what happened
- **HuggingFace** — all compressed models are at [huggingface.co/wpferrell](https://huggingface.co/wpferrell)

We especially want to hear:
- Which model you compressed and what ratio you got
- Any errors or unexpected behaviour
- Use cases we haven't thought of
