Metadata-Version: 2.4
Name: bigsmall
Version: 3.15.0
Summary: Lossless AI model compression — make any model 34% smaller with bit-identical weights, drop-in replacement for HuggingFace from_pretrained
Home-page: https://github.com/wpferrell/Bigsmall
Author: Will Ferrell
Author-email: wpferrell@gmail.com
License: Elastic-2.0
Project-URL: Homepage, https://github.com/wpferrell/Bigsmall
Project-URL: Documentation, https://github.com/wpferrell/Bigsmall/tree/main/docs
Project-URL: Changelog, https://github.com/wpferrell/Bigsmall/blob/main/CHANGELOG.md
Project-URL: Models (HuggingFace), https://huggingface.co/wpferrell
Project-URL: Paper, https://doi.org/10.5281/zenodo.20279247
Project-URL: Bug Tracker, https://github.com/wpferrell/Bigsmall/issues
Keywords: neural network,compression,lossless,machine learning,model compression,pytorch,huggingface,transformers,bfloat16,bf16,delta compression,fine-tuning,inference,vram,llm,ai,weights,safetensors,arithmetic coding,entropy coding
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: System :: Archiving :: Compression
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: numba>=0.61
Requires-Dist: constriction>=0.4
Requires-Dist: zstandard>=0.21
Requires-Dist: blosc2>=2.0
Requires-Dist: safetensors>=0.4
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: tqdm>=4.0
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: hf
Requires-Dist: transformers>=4.30; extra == "hf"
Requires-Dist: huggingface-hub>=0.20; extra == "hf"
Provides-Extra: diffusion
Requires-Dist: diffusers>=0.20; extra == "diffusion"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4; extra == "vllm"
Provides-Extra: ecc
Requires-Dist: reedsolo>=1.7; extra == "ecc"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.30; extra == "all"
Requires-Dist: diffusers>=0.20; extra == "all"
Requires-Dist: huggingface-hub>=0.20; extra == "all"
Requires-Dist: reedsolo>=1.7; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

[![PyPI version](https://img.shields.io/pypi/v/bigsmall.svg)](https://pypi.org/project/bigsmall/)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20279247.svg)](https://doi.org/10.5281/zenodo.20279247)
[![License](https://img.shields.io/badge/license-Elastic--2.0-blue.svg)](https://github.com/wpferrell/Bigsmall/blob/main/LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/bigsmall.svg)](https://pypi.org/project/bigsmall/)
[![Downloads](https://static.pepy.tech/badge/bigsmall)](https://pepy.tech/project/bigsmall)

# BigSmall — Lossless AI Model Compression

**Make any AI model ~34% smaller. Bit-identical weights. Drop-in replacement for `from_pretrained`.**

```bash
pip install bigsmall            # CLI + compression/decompression
pip install "bigsmall[torch]"   # add this for model loading (from_pretrained)
```

A 14 GB Mistral-7B becomes 9.3 GB. A fine-tuned model becomes a 5 GB patch on top of its 14 GB base. The decompressed model is **every weight bit-for-bit identical** to the original — each tensor's md5 is verified on decompress. (Verification is tensor-level, not file-level: safetensors re-serializes the container wrapper, so the file's md5 changes, but every weight value is bit-for-bit identical.)

| **~34%** smaller | **~65%** smaller as a delta patch | **25+** ready-to-use models |
|:---:|:---:|:---:|
| any BF16 LLM | ≥7B instruct fine-tunes vs their base ([pair-dependent](docs/delta-compression.md)) | [on HuggingFace](https://huggingface.co/wpferrell) |

---

## What BigSmall does

Three use cases. Pick the one that fits.

### 1. Make any model smaller

```bash
bigsmall compress mistral-7b/ -o mistral-7b.bs
bigsmall decompress mistral-7b.bs -o mistral-7b-restored/
```

**Before:** 14.2 GB of safetensors. **After:** 9.3 GB `.bs` file. **Saved:** 4.9 GB (34%).

Every weight is bit-for-bit identical. Every calculation the model does is identical to the original. Works on any safetensors model — LLMs, diffusion, audio, vision, anything.

### 2. Store fine-tunes as tiny patches

```bash
bigsmall compress qwen-instruct/ --delta-from qwen-base/ -o instruct.bs
bigsmall apply qwen-base/ instruct.bs -o qwen-instruct-restored/
```

**Before:** 14.2 GB Qwen2.5-7B-Instruct. **After:** ~5 GB patch. **Saved:** 9 GB (65%).

If your users already have the public base model, they only need to download what *changed*. This is the biggest win in BigSmall. Use it for any fine-tune: instruction tuning, DPO, RLHF, domain adaptation, LoRA-merged checkpoints.

How much a delta saves is **pair-dependent** — measured from under 1% of full size (the best ≥7B SFT pairs) to ~61% (small-model full tunes, barely under standalone). The 65% saving above is the ≥7B official-instruct class, where patches measure 34–50% of full size. Full measured table: [docs/delta-compression.md](docs/delta-compression.md). Since 3.15 the engine measures both codings per tensor and never ships a delta larger than standalone.
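The per-tensor measure-then-choose step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BigSmall's codec: `zlib` stands in for the real entropy coder, the BF16 conversion simply truncates float32, and `encode_tensor` / `decode_tensor` are hypothetical names.

```python
import random
import struct
import zlib


def to_bf16_bytes(values):
    """Truncate float32 values to BF16 by keeping the top 16 bits of each."""
    return b"".join(struct.pack(">f", v)[:2] for v in values)


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length buffers."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_tensor(tuned, base):
    """Code the tensor both ways and keep whichever blob is smaller.

    zlib is only a stand-in for a weight-aware coder (assumption).
    """
    standalone = zlib.compress(tuned, 9)
    delta = zlib.compress(xor_bytes(tuned, base), 9)
    if len(delta) < len(standalone):
        return "delta", delta
    return "standalone", standalone


def decode_tensor(mode, blob, base):
    """Invert encode_tensor; XOR is its own inverse, so this is lossless."""
    raw = zlib.decompress(blob)
    return xor_bytes(raw, base) if mode == "delta" else raw


rng = random.Random(0)
base_vals = [rng.gauss(0.0, 0.02) for _ in range(4096)]
base = to_bf16_bytes(base_vals)

# A light fine-tune: about 5% of the weights move a little.
light = to_bf16_bytes(
    [v * 1.01 if rng.random() < 0.05 else v for v in base_vals]
)
# A heavy retrain: every weight is redrawn, so the base predicts nothing.
heavy = to_bf16_bytes([rng.gauss(0.0, 0.02) for _ in range(4096)])

for name, tuned in [("light", light), ("heavy", heavy)]:
    mode, blob = encode_tensor(tuned, base)
    assert decode_tensor(mode, blob, base) == tuned  # bit-exact round trip
    print(name, mode, len(blob), "of", len(tuned), "bytes")
```

Because the encoder always keeps the smaller of the two blobs, the output for a tensor can never exceed its standalone size, whatever the base looks like.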

### 3. Download smaller, use instantly

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall"
)
```

Works exactly like a normal HuggingFace model — BigSmall decompresses transparently on load. **25+ pre-compressed models** ready to use ([browse them all](https://huggingface.co/wpferrell)).

Prefer the CLI? `bigsmall decompress` works on local `.bs` files — download first, then decompress:

```bash
hf download wpferrell/phi-3.5-mini-instruct-bigsmall --local-dir phi-3.5-mini-bs
bigsmall decompress phi-3.5-mini-bs/model-00001-of-00002.bs -o model.safetensors
```

(On older `huggingface_hub` the equivalent command is `huggingface-cli download …`; the `huggingface-cli` entrypoint is deprecated in `huggingface_hub >= 1.0` in favour of `hf`.)

---

## Compression numbers (every published model)

Every row is a real measurement. Click a model to download it.

| Model | Original | BigSmall | Saved |
|---|---:|---:|---:|
| [Qwen2.5-14B-Instruct](https://huggingface.co/wpferrell/qwen2.5-14b-instruct-bigsmall) | 29.5 GB | 19.5 GB | 34% |
| [Gemma-3-12B-it](https://huggingface.co/wpferrell/gemma-3-12b-it-bigsmall) | 22.7 GB | 14.8 GB | 35% |
| [Gemma-2-9B-it](https://huggingface.co/wpferrell/gemma-2-9b-it-bigsmall) | 17.2 GB | 11.3 GB | 34% |
| [Llama-3.1-8B-Instruct](https://huggingface.co/wpferrell/llama-3.1-8b-instruct-bigsmall) | 15.0 GB | 9.7 GB | 35% |
| [Llama-3-8B-Instruct](https://huggingface.co/wpferrell/llama-3-8b-instruct-bigsmall) | 15.0 GB | 9.8 GB | 34% |
| [Qwen3-8B](https://huggingface.co/wpferrell/qwen3-8b-bigsmall) | 15.3 GB | 10.1 GB | 34% |
| [Mistral-7B-Instruct v0.3](https://huggingface.co/wpferrell/mistral-7b-instruct-bigsmall) | 14.2 GB | 8.9 GB | 37% |
| [Mistral-7B-Instruct v0.2](https://huggingface.co/wpferrell/mistral-7b-instruct-v0.2-bigsmall) | 14.2 GB | 8.9 GB | 37% |
| [Qwen2.5-7B-Instruct](https://huggingface.co/wpferrell/qwen2.5-7b-instruct-bigsmall) | 14.2 GB | 9.4 GB | 34% |
| [Phi-3.5-mini-instruct](https://huggingface.co/wpferrell/phi-3.5-mini-instruct-bigsmall) | 7.1 GB | 4.7 GB | 34% |
| [Gemma-3-4B-it](https://huggingface.co/wpferrell/gemma-3-4b-it-bigsmall) | 8.0 GB | 5.2 GB | 35% |
| [Qwen3-4B-Instruct](https://huggingface.co/wpferrell/qwen3-4b-instruct-bigsmall) | 7.5 GB | 5.0 GB | 34% |
| [Llama-3.2-3B-Instruct](https://huggingface.co/wpferrell/llama-3.2-3b-instruct-bigsmall) | 6.4 GB | 3.9 GB | 39% |
| [Gemma-2-2B-it](https://huggingface.co/wpferrell/gemma-2-2b-it-bigsmall) | 4.9 GB | 3.2 GB | 34% |
| [Qwen2.5-3B-Instruct](https://huggingface.co/wpferrell/qwen2.5-3b-instruct-bigsmall) | 5.7 GB | 3.8 GB | 34% |
| [Qwen2.5-1.5B-Instruct](https://huggingface.co/wpferrell/qwen2.5-1.5b-instruct-bigsmall) | 2.9 GB | 1.9 GB | 34% |
| [Llama-3.2-1B-Instruct](https://huggingface.co/wpferrell/llama-3.2-1b-instruct-bigsmall) | 2.3 GB | 1.5 GB | 34% |
| [Gemma-3-1B-it](https://huggingface.co/wpferrell/gemma-3-1b-it-bigsmall) | 1.9 GB | 1.2 GB | 35% |
| [Qwen2.5-0.5B-Instruct](https://huggingface.co/wpferrell/qwen2.5-0.5b-instruct-bigsmall) | 920 MB | 610 MB | 34% |
| [GPT-2 (117M)](https://huggingface.co/wpferrell/gpt2-bigsmall) | 548 MB | 414 MB | 24% |
| [Gemma-3-270M-it](https://huggingface.co/wpferrell/gemma-3-270m-it-bigsmall) | 500 MB | 330 MB | 34% |
| [Gemma-3-270M](https://huggingface.co/wpferrell/gemma-3-270m-bigsmall) | 500 MB | 330 MB | 34% |
| [Gemma-2-2B](https://huggingface.co/wpferrell/gemma-2-2b-bigsmall) | 9.7 GB | 8.1 GB | 17% |

[Browse all 25+ models on HuggingFace →](https://huggingface.co/wpferrell)

---

## What "lossless" actually means

Every weight in the model is **mathematically identical** to the original — same bit pattern, same floating-point value, same gradient, same output.

- **Not quantization.** Quantization rounds weights to fewer bits and the model's behaviour changes.
- **Not pruning.** Pruning deletes weights.
- **Not approximation.** No tricks, no calibration data, no quality drop.

BigSmall finds redundancy in the bit patterns of neural weights and stores them more compactly: the same idea as ZIP for text, but tuned for BF16 floating-point distributions. **md5 is verified on every tensor** at decompression. If a single bit differs, verification fails.
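To see why the check is made per tensor rather than per file, here is a minimal sketch. It assumes a toy container (a length-prefixed JSON header followed by raw data) that only mimics the general shape of a safetensors file; `pack_container` and `tensor_md5s` are hypothetical helpers, not part of BigSmall.

```python
import hashlib
import json
import struct


def pack_container(tensors, key_order):
    """Serialize as: 8-byte header length, JSON header, raw tensor data.

    A toy layout for illustration only, not the real safetensors format.
    """
    header, data, offset = {}, b"", 0
    for name in key_order:
        blob = tensors[name]
        header[name] = {"offsets": [offset, offset + len(blob)]}
        data += blob
        offset += len(blob)
    head = json.dumps(header).encode()
    return struct.pack("<Q", len(head)) + head + data


def unpack_container(buf):
    """Return {name: raw tensor bytes} from a packed container."""
    (n,) = struct.unpack("<Q", buf[:8])
    header = json.loads(buf[8:8 + n])
    data = buf[8 + n:]
    return {k: data[v["offsets"][0]:v["offsets"][1]] for k, v in header.items()}


def tensor_md5s(buf):
    """md5 of each tensor's raw bytes, independent of container layout."""
    return {k: hashlib.md5(v).hexdigest() for k, v in unpack_container(buf).items()}


tensors = {"wte": bytes(range(64)), "ln_f": bytes(reversed(range(32)))}
original = pack_container(tensors, ["wte", "ln_f"])
restored = pack_container(tensors, ["ln_f", "wte"])  # re-serialized in a new order

file_hashes_match = hashlib.md5(original).digest() == hashlib.md5(restored).digest()
tensor_hashes_match = tensor_md5s(original) == tensor_md5s(restored)
print("file md5 equal:", file_hashes_match)      # the containers differ
print("tensor md5 equal:", tensor_hashes_match)  # the weights do not
```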

---

## How it compares

| Approach | Lossless? | Typical reduction | Behaviour change |
|---|:---:|:---:|:---:|
| **BigSmall** | **Yes — bit-identical** | **~34%** (65% as a delta, ≥7B instruct class) | **None** |
| Quantization (GPTQ / AWQ / bitsandbytes) | No | 50–75% | Yes — weights are rounded |
| DFloat11 (entropy-coded BF16) | Yes — lossless | ~30% (format-fixed) | None |
| ZipNN (entropy-coded BF16) | Yes — lossless | up to ~33% (authors' reported numbers) | None |
| ZIP / gzip on safetensors | Yes | ~1–3% | None (but not model-aware) |

Three of these are lossless weight-aware formats: BigSmall, DFloat11, and ZipNN. Head-to-head under the same accounting, BigSmall codes below DFloat11's bound on **every layer type of every model measured** (+0.45–0.55 pp model-level, +12–18 pp on norm scales — [docs/dfloat11.md](docs/dfloat11.md)); ZipNN has not been independently measured by us, so its row carries its authors' numbers. BigSmall is also the only one of the three with delta patches, BF16-native-F32 detection, and streaming surfaces. Quantization compresses further but changes the model; generic ZIP keeps fidelity but barely shrinks BF16 weights. See [docs/comparison.md](docs/comparison.md) for the full breakdown.

---

## CLI reference

```
bigsmall compress SRC [-o OUT] [--delta-from BASE] [--auto-delta] [--resume] [--ecc]
bigsmall decompress SRC [-o OUT] [--base BASE]
bigsmall info SRC.bs                       size, ratio, codecs used
bigsmall scan SRC                          analyse before compressing
bigsmall verify SRC.bs [--fast|--sample N] integrity check
bigsmall diff A.bs B.bs [--patch P.bs]     compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT        reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT]            recover via Reed-Solomon ECC sidecar
bigsmall benchmark SRC                     encode/decode throughput
bigsmall migrate SRC.bs                    re-encode with current codecs
bigsmall status                            list your BigSmall HF repos
bigsmall pipeline run SRC DST              resumable download → compress → upload
bigsmall reshard SRC --output-dir DIR [--size-gb N|--shards N|--join]  reshard .bs by layer
bigsmall xray SRC [--json OUT]             checkpoint forensics report
```

Every command has `--help`. See [docs/cli-reference.md](docs/cli-reference.md) for full examples.
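If you drive the CLI from a script, a thin wrapper that builds the argument lists is usually enough. The sketch below uses only the `compress` and `verify` forms listed above; the helper names are hypothetical, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess


def compress_and_verify_cmds(src, out):
    """Build argv lists for `bigsmall compress` followed by `bigsmall verify`."""
    return [
        ["bigsmall", "compress", src, "-o", out],
        ["bigsmall", "verify", out],
    ]


def run_all(cmds, dry_run=True):
    """Print each command, or run it and stop at the first non-zero exit."""
    for cmd in cmds:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


cmds = compress_and_verify_cmds("mistral-7b/", "mistral-7b.bs")
run_all(cmds)  # dry run: prints the two commands
```

Call `run_all(cmds, dry_run=False)` once the `bigsmall` executable is on your `PATH`.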

---

## Python API

```python
import bigsmall

# Round-trip a model
bigsmall.compress("model/", "model.bs")
bigsmall.decompress("model.bs", "model_back/")

# Fine-tune as a delta patch
bigsmall.compress("finetune/", "patch.bs", delta_from="base/")
bigsmall.apply("base/", "patch.bs", "finetune_back/")

# Inspect before compressing
bigsmall.detect_bf16_native("model/")
bigsmall.scan_model("model/")

# Low-VRAM streaming inference (~12× less VRAM than from_pretrained)
from bigsmall import BigSmallStreamingModel
model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/phi-3.5-mini-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,
)

# Stream-decompress straight from the HF CDN — no .bs written to disk (V10)
state_dict = bigsmall.stream_from_hub("wpferrell/gpt2-bigsmall", device="cpu")

# Reshard .bs files along layer boundaries, no re-encoding (V11)
bigsmall.reshard(["model.bs"], "resharded/", target_shard_size_gb=2.0)
```
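The `lru_max_vram_gb` argument suggests a least-recently-used budget over resident weights. As a rough mental model only (this is a generic byte-budgeted LRU, not BigSmall's implementation, and `ByteBudgetLRU` and `fake_load` are hypothetical names), such a cache could behave like this:

```python
from collections import OrderedDict


class ByteBudgetLRU:
    """Keep recently used items resident until a byte budget is exceeded."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self._items = OrderedDict()  # key -> (value, size_in_bytes)

    def get(self, key, load):
        """Return the cached value, calling load(key) on a miss."""
        if key in self._items:
            self._items.move_to_end(key)  # mark as most recently used
            return self._items[key][0]
        value, size = load(key)
        self._items[key] = (value, size)
        self.used += size
        # Evict least recently used entries, but never the one just loaded.
        while self.used > self.max_bytes and len(self._items) > 1:
            _, (_, freed) = self._items.popitem(last=False)
            self.used -= freed
        return value

    def resident(self):
        """Keys currently held, least recently used first."""
        return list(self._items)


def fake_load(layer_index):
    """Stand-in for decompressing one layer; every layer is 0.75 GB here."""
    return f"layer-{layer_index}-weights", 0.75 * 1024**3


cache = ByteBudgetLRU(max_bytes=2.0 * 1024**3)
for layer in [0, 1, 2, 0, 3]:
    cache.get(layer, fake_load)
print(cache.resident())
```

With a 2 GB budget and 0.75 GB layers, at most two layers stay resident at once, so revisiting an evicted layer costs another load.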

---

## What's new in v3.15.0

- **Delta compression is now fail-safe — measure-then-choose.** `compress_delta` encodes every matched tensor both ways (XOR-delta and standalone) and keeps the smaller, so a delta file can never come out larger than standalone compression. A pre-compression gate warns when >30% of matched bytes changed — the measured delta-doesn't-pay regime (`--force-delta` silences it).
- **KV cache entry format v3 — per-depth sequential dispatcher.** Plain vs sequential-exponent coding is measured per tensor at encode time, the smaller kept, and the winning blob verified bit-exact before it returns. End-to-end wins over the shipped v1 format grow with context: 0.27% (128 tokens) → 0.58% (2048 tokens), up to +7.7% on early-layer K tensors. v1/v2 entries decode forever.
- **FP8 KV entries** (e4m3 / e5m2) and **chunked streaming KV** (32k+ contexts encode and decode chunk-by-chunk, nothing materialized whole; default 4096 tokens/chunk) in the same v3 entry format.
- **`AutoKVCache`** — `get_kv_cache(mode="auto")` routes each call by device: CUDA tensors through the GPU-resident lossless codec, CPU tensors through the entry codec.
- **`bigsmall xray` — checkpoint forensics.** Per-tensor substream entropies vs a matched-random control, lineage and anomaly flags (`mantissa_carved`, `sign_imbalance`, `exp_near_random`, …), and a model-level "looks untrained" detector that catches silently-randomized loads:
  ```bash
  bigsmall xray model_dir/ --json report.json
  ```
- **Role-group stream packing** (opt-in `--group-streams`): small same-role tensors (norm/bias chains, GQA k/v projections, MoE routers) packed into one coded stream per role when measured smaller. Grouped files need ≥3.15 readers.
- **`bf16_native_f32_v2`** — near-gate F32 tensors pick the smallest verified codec per substream; 0.6–2.7% smaller on the affected class.
- **Measured DFloat11 comparison** — BigSmall codes below DFloat11's bound on every layer type of every model measured: [docs/dfloat11.md](docs/dfloat11.md).
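The pre-compression gate from the first item in this list can be pictured as a simple ratio test. The sketch below is an assumption about what "matched bytes changed" means (bytewise inequality at the same offset); `changed_bytes` and `delta_gate` are hypothetical names, and only the 30% threshold comes from the note above.

```python
def changed_bytes(base, tuned):
    """Count byte positions that differ between two equal-length buffers."""
    if len(base) != len(tuned):
        raise ValueError("matched tensors must have the same byte length")
    return sum(1 for a, b in zip(base, tuned) if a != b)


def delta_gate(pairs, threshold=0.30):
    """Return (fraction_changed, should_warn) over all matched tensor pairs."""
    total = sum(len(base) for base, _ in pairs)
    changed = sum(changed_bytes(base, tuned) for base, tuned in pairs)
    fraction = changed / total if total else 0.0
    return fraction, fraction > threshold


base = bytes(100)
light = bytes([1] * 10 + [0] * 90)  # 10% of bytes changed
heavy = bytes([1] * 50 + [0] * 50)  # 50% of bytes changed

print(delta_gate([(base, light)]))  # below the threshold: no warning
print(delta_gate([(base, heavy)]))  # above the threshold: warn
```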

---

## What's new in v3.14

- **GPU-resident KV cache (V9)** — `GPUCompressedKVCache` keeps the compressed cache and the encode/decode passes entirely on the CUDA device, with no CPU round-trip. ~47× faster than the CPU KV codec on the reference shape, bit-identical round-trip. `get_kv_cache(device, mode)` auto-picks the GPU backend when CUDA is available. V9B adds fused Triton pack/unpack kernels.
- **Progressive HTTP streaming (V10)** — `stream_from_hub(repo_id)` decompresses a model directly from the HuggingFace CDN over HTTP byte-range requests. With the default `cache=False`, **zero `.bs` bytes are written to disk**.
- **Reshard (V11)** — `bigsmall reshard` splits, joins, or rebalances `.bs` shards along transformer-layer boundaries with no re-encoding. Every output tensor is md5-verified.
- **numba is now a hard dependency** (`numba>=0.61`) — guarantees the JIT codec path runs everywhere instead of a silent slow fallback.
- **CI green across the full matrix** — Ubuntu / Windows / macOS × Python 3.10 / 3.11 / 3.12.
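Resharding along layer boundaries, as in the V11 item above, amounts to packing whole layers into shards without ever splitting one. A minimal planning sketch follows. It assumes HuggingFace-style tensor names containing `layers.N.`, and `plan_shards` is a hypothetical helper, not the logic behind `bigsmall reshard`.

```python
import re

LAYER_RE = re.compile(r"\blayers\.(\d+)\.")


def group_by_layer(tensor_sizes):
    """Group tensor names by transformer layer, preserving input order.

    Tensors outside any layer (embeddings, final norm) form their own groups.
    """
    groups = {}
    for name, size in tensor_sizes.items():
        match = LAYER_RE.search(name)
        key = f"layer{int(match.group(1))}" if match else name
        groups.setdefault(key, []).append((name, size))
    return groups


def plan_shards(tensor_sizes, target_bytes):
    """Greedily pack whole layer groups into shards of about target_bytes."""
    shards, current, current_size = [], [], 0
    for members in group_by_layer(tensor_sizes).values():
        group_size = sum(size for _, size in members)
        # Start a new shard rather than split a layer across two shards.
        if current and current_size + group_size > target_bytes:
            shards.append(current)
            current, current_size = [], 0
        current.extend(name for name, _ in members)
        current_size += group_size
    if current:
        shards.append(current)
    return shards


sizes = {
    "model.embed_tokens.weight": 40,
    "model.layers.0.self_attn.q_proj.weight": 30,
    "model.layers.0.mlp.up_proj.weight": 30,
    "model.layers.1.self_attn.q_proj.weight": 30,
    "model.layers.1.mlp.up_proj.weight": 30,
    "model.norm.weight": 5,
}
shards = plan_shards(sizes, target_bytes=100)
for index, shard in enumerate(shards):
    print(index, shard)
```

A layer larger than the target still gets a shard of its own, so the target size is a goal rather than a hard limit in this sketch.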

Earlier highlights that still apply: **delta compression** (≥7B instruct fine-tunes as patches of roughly 34–50% of full size), `--auto-delta` base detection, BF16-native F32 auto-routing (Whisper-class), `--resume`, `verify --fast`/`--sample`, mmap decode, Reed-Solomon `--ecc` + `repair`, and `BigSmallStreamingModel(lru_max_vram_gb=…)`.

[Full changelog →](CHANGELOG.md)

---

## Research

The lossless compression ceiling for BF16 neural weights has been measured. It is **~62% of raw BF16 for any model**, **~34% for ≥7B instruct fine-tunes** with delta compression. We ran 300+ experiments across every known mathematical approach — entropy coding, cross-tensor prediction, learned translators, persistent homology, optimal transport, quantum-inspired methods, and more — and proved that there is no further compression available within the strict bit-identity contract.

The floor is **measured at the wall, not extrapolated**: across 4,143 weight matrices in 8 architectures the per-tensor entropy floor is flat (coefficient of variation ≈ 0), and trained mantissa/sign bits are coder-equivalent to matched random controls in every family tested — training only writes the exponent. The floor exists at initialization and never moves during training. Details: [docs/research.md](docs/research.md).
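The split between exponent and mantissa/sign bits can be explored with a small entropy measurement. The sketch below uses synthetic Gaussian values as a stand-in for trained weights (an assumption; real checkpoints will differ), truncates float32 to BF16, and reports the empirical Shannon entropy of each field. The sum of the three per-field entropies is only an upper bound on the joint entropy.

```python
import math
import random
import struct
from collections import Counter


def bf16_fields(value):
    """Split a float into BF16 (sign, exponent, mantissa) by truncating float32."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", value))
    bits16 = bits32 >> 16            # keep the top 16 bits
    sign = bits16 >> 15              # 1 sign bit
    exponent = (bits16 >> 7) & 0xFF  # 8 exponent bits
    mantissa = bits16 & 0x7F         # 7 mantissa bits
    return sign, exponent, mantissa


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]  # synthetic, not trained
signs, exps, mants = zip(*(bf16_fields(w) for w in weights))

h_sign = entropy_bits(signs)
h_exp = entropy_bits(exps)
h_mant = entropy_bits(mants)
print(f"sign     {h_sign:.2f} of 1 bit")
print(f"exponent {h_exp:.2f} of 8 bits")
print(f"mantissa {h_mant:.2f} of 7 bits")
print(f"sum      {h_sign + h_exp + h_mant:.2f} of 16 bits")
```

On data like this the sign and mantissa fields sit close to their full width while the exponent field falls well below 8 bits, which is where a lossless coder finds its savings.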

Full findings, all experiments, all dead-ends: **[10.5281/zenodo.20279247](https://doi.org/10.5281/zenodo.20279247)**. Plain-English summary: [docs/research.md](docs/research.md).

---

## Install

```bash
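pip install bigsmall                  # core: CLI + compression/decompression
pip install "bigsmall[torch]"         # + PyTorch, needed for model loading
pip install "bigsmall[hf]"            # + HuggingFace integration
pip install "bigsmall[diffusion]"     # + diffusers
pip install "bigsmall[vllm]"          # + vLLM
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # torch, hf, diffusion, and ecc (vllm is separate)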
```

**Requires** Python 3.9+. Works on Linux, macOS, and Windows. CPU, NVIDIA, AMD, and Apple Silicon.

---

## License

Code: [Elastic License 2.0](LICENSE). Free for personal, research, and commercial use. SaaS providers should see [LICENSING.md](LICENSING.md).

Model weights distributed in `.bs` format keep the license of the original model.

---

## Links

- **PyPI** — https://pypi.org/project/bigsmall/
- **GitHub** — https://github.com/wpferrell/Bigsmall
- **HuggingFace** — https://huggingface.co/wpferrell
- **Paper / DOI** — https://doi.org/10.5281/zenodo.20279247 (always resolves to the latest version)
- **Paper (PDF)** — https://github.com/wpferrell/Bigsmall/blob/main/paper.pdf
- **Docs** — [docs/](docs/)
- **Changelog** — [CHANGELOG.md](CHANGELOG.md)
- **Contact** — wpferrell@gmail.com

---

## Feedback & Community

Did BigSmall work for your model? We'd love to know.

- **Open a [Discussion](https://github.com/wpferrell/Bigsmall/discussions)** — share your compression results, ask questions, or suggest improvements
- **File an [Issue](https://github.com/wpferrell/Bigsmall/issues)** — if something didn't work, tell us exactly what happened
- **HuggingFace** — all compressed models are at [huggingface.co/wpferrell](https://huggingface.co/wpferrell)

We especially want to hear:
- Which model you compressed and what ratio you got
- Any errors or unexpected behaviour
- Use cases we haven't thought of
