Metadata-Version: 2.4
Name: bigsmall
Version: 3.13.1
Summary: Lossless neural network weight compression - run any model, no compromises
Home-page: https://github.com/wpferrell/Bigsmall
Author: Will Ferrell
Author-email: wpferrell@gmail.com
License: Elastic-2.0
Project-URL: Paper, https://doi.org/10.5281/zenodo.20279248
Project-URL: Bug Tracker, https://github.com/wpferrell/Bigsmall/issues
Keywords: machine learning,compression,lossless,neural networks,LLM,transformers
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Archiving :: Compression
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: constriction>=0.4
Requires-Dist: zstandard>=0.21
Requires-Dist: blosc2>=2.0
Requires-Dist: safetensors>=0.4
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: tqdm>=4.0
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == "torch"
Provides-Extra: hf
Requires-Dist: transformers>=4.30; extra == "hf"
Requires-Dist: huggingface-hub>=0.20; extra == "hf"
Provides-Extra: diffusion
Requires-Dist: diffusers>=0.20; extra == "diffusion"
Provides-Extra: vllm
Requires-Dist: vllm>=0.4; extra == "vllm"
Provides-Extra: ecc
Requires-Dist: reedsolo>=1.7; extra == "ecc"
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == "all"
Requires-Dist: transformers>=4.30; extra == "all"
Requires-Dist: diffusers>=0.20; extra == "all"
Requires-Dist: huggingface-hub>=0.20; extra == "all"
Requires-Dist: reedsolo>=1.7; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

[![PyPI version](https://img.shields.io/pypi/v/bigsmall.svg)](https://pypi.org/project/bigsmall/)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.20279248.svg)](https://doi.org/10.5281/zenodo.20279248)
[![License](https://img.shields.io/badge/license-Elastic--2.0-blue.svg)](https://github.com/wpferrell/Bigsmall/blob/main/LICENSE)
[![Python](https://img.shields.io/pypi/pyversions/bigsmall.svg)](https://pypi.org/project/bigsmall/)

# BigSmall — Make AI Models Smaller, Instantly

**Lossless compression for neural network weights. Same model, smaller files. Bit-identical weights, md5-verified.**

```bash
pip install bigsmall
```

A 14 GB Mistral 7B becomes 9 GB. A fine-tuned model becomes a small "patch" on top of its base — often less than 35% of the full size. Drop-in compatible with HuggingFace `from_pretrained`.

---

## What it does

Three things, in plain English:

### 1. Compress any model

```bash
bigsmall compress model.safetensors -o model.bs
bigsmall decompress model.bs -o reconstructed.safetensors
```

**Before:** 15 GB safetensors.
**After:**  10 GB .bs file.
**Quality:** every weight bit-for-bit identical to the original.

### 2. Compress a fine-tuned model as a "patch"

If you have the base model already, store only what changed:

```bash
bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
bigsmall apply base.safetensors patch.bs -o reconstructed.safetensors
```

**Before:** 15 GB fine-tuned model.
**After:**  ~5 GB patch (depends on how much was fine-tuned).
**Quality:** every weight bit-for-bit identical to the original.

This is the biggest user win. If you're publishing a fine-tune of a public base, your users can store the base once and download patches.
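
To make the patch idea concrete, here is a minimal sketch of one way a lossless delta can work: XOR the fine-tuned bytes against the base, compress the mostly-zero result, and XOR again to restore. This is an illustration only. The source does not say which delta scheme BigSmall uses, and `make_patch`, `apply_patch`, and the use of `zlib` are assumptions made for this example.

```python
import zlib


def make_patch(base: bytes, fine_tuned: bytes) -> bytes:
    """XOR fine-tuned bytes against the base and compress the result."""
    if len(base) != len(fine_tuned):
        raise ValueError("this sketch assumes both inputs have the same length")
    delta = bytes(a ^ b for a, b in zip(base, fine_tuned))
    return zlib.compress(delta, 9)  # zlib is a stand-in, not BigSmall's codec


def apply_patch(base: bytes, patch: bytes) -> bytes:
    """Rebuild the fine-tuned bytes from the base and a patch."""
    delta = zlib.decompress(patch)
    if len(delta) != len(base):
        raise ValueError("patch was not made against a base of this size")
    return bytes(a ^ b for a, b in zip(base, delta))


# Stand-in "weights": the fine-tune differs from the base in two bytes only.
base = bytes(range(256)) * 64
tuned = bytearray(base)
tuned[100] ^= 0x01
tuned[5000] ^= 0x80
fine_tuned = bytes(tuned)

patch = make_patch(base, fine_tuned)
restored = apply_patch(base, patch)
print(len(fine_tuned), len(patch), restored == fine_tuned)
```

The fewer bytes the fine-tune changes, the more zeros the delta contains and the smaller the patch, which is why patch size depends on how much was fine-tuned.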

### 3. Use a pre-compressed model from HuggingFace

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall"
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
```

Works exactly like any other HuggingFace model. BigSmall transparently decompresses in the background.

---

## How much smaller?

| Model                              | Original | BigSmall | Saved |
|------------------------------------|---------:|---------:|------:|
| GPT-2 (117M, FP32)                 | 548 MB   | 414 MB   |  24% |
| Llama 3.2-1B-Instruct              | 2.5 GB   | 1.5 GB   |  40% |
| Llama 3.2-3B-Instruct              | 6.4 GB   | 3.9 GB   |  39% |
| Mistral 7B Instruct v0.3           | 14.2 GB  | 9.3 GB   |  34% |
| Qwen 2.5-7B-Instruct               | 15.2 GB  | 10.0 GB  |  34% |
| Llama 3-8B-Instruct                | 16.1 GB  | 9.8 GB   |  39% |
| Qwen 2.5-14B-Instruct              | 29.5 GB  | 19.5 GB  |  34% |
| Gemma 2-9B-it                      | 18.5 GB  | 11.3 GB  |  39% |
| Gemma 3-1B-it                      | 2.0 GB   | 1.2 GB   |  40% |
| Stable Diffusion 1.5 UNet (FP16)   | 1.72 GB  | 1.48 GB  |  14% |
| **Fine-tune patch** (Instruct vs base) | 14 GB | ~5 GB    |  ~65% |

[Browse all pre-compressed models →](https://huggingface.co/wpferrell)
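
The "Saved" column is one minus the ratio of compressed size to original size. The helper below (`saved_percent` is a name made up for this example) recomputes it from the rounded sizes in the table, so a result can differ from the published figure by a point of rounding.

```python
def saved_percent(original: float, compressed: float) -> float:
    """Percentage of the original size removed by compression."""
    if original <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (1.0 - compressed / original)


# Sizes in GB from the Llama 3-8B-Instruct row of the table.
llama_saved = saved_percent(16.1, 9.8)
print(f"{llama_saved:.1f}%")  # 39.1%
```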

---

## Why lossless matters

- **Exact same model.** Bit-identical weights. Every floating-point value is mathematically identical to the original.
- **Not quantization.** Quantization (INT8, INT4) changes weight values — model behaviour changes too, even if just slightly.
- **Not pruning.** Pruning removes parts of the model.
- **Not approximation.** No tricks, no calibration data, no quality loss.
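
The difference from quantization can be shown with a short sketch. It uses a simple symmetric INT8 scheme, chosen for illustration and not tied to any particular library, and shows that a quantize/dequantize round trip does not return the original bits.

```python
import struct


def quantize_int8(values, scale):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(quantized, scale):
    """Map INT8 codes back to floats."""
    return [q * scale for q in quantized]


weights = [0.0123, -0.4567, 0.25, 0.9999]
scale = max(abs(w) for w in weights) / 127
restored = dequantize_int8(quantize_int8(weights, scale), scale)

# Compare the float32 bit patterns, not just the approximate values.
original_bits = struct.pack("<4f", *weights)
restored_bits = struct.pack("<4f", *restored)
print(original_bits == restored_bits)  # False
```

The restored values are close to the originals but not bit-identical, which is the property a lossless format guarantees.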

BigSmall compresses neural network weights the same way ZIP compresses text files: it finds redundancy in the bit pattern and stores it more compactly. The output decodes back to the *exact* same bits. md5 verified on every tensor.
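
As a rough illustration of what "lossless, MD5-verified" means in practice, the sketch below round-trips float32 bytes through a general-purpose compressor and compares checksums. `zlib` stands in for BigSmall's codecs here (an assumption for the example), so the sizes it prints say nothing about BigSmall's real ratios.

```python
import hashlib
import random
import struct
import zlib

# Stand-in weights: small Gaussian values packed as little-endian float32.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(10_000)]
raw = struct.pack(f"<{len(weights)}f", *weights)

compressed = zlib.compress(raw, 9)
restored = zlib.decompress(compressed)

# Matching digests, together with the byte comparison, confirm the round trip.
md5_before = hashlib.md5(raw).hexdigest()
md5_after = hashlib.md5(restored).hexdigest()
print(len(raw), len(compressed), md5_before == md5_after)
```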

---

## Install

```bash
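pip install bigsmall                  # core
pip install "bigsmall[torch]"         # + PyTorch
pip install "bigsmall[hf]"            # + transformers (HuggingFace from_pretrained)
pip install "bigsmall[diffusion]"     # + diffusers
pip install "bigsmall[vllm]"          # + vLLM
pip install "bigsmall[ecc]"           # + Reed-Solomon error recovery
pip install "bigsmall[all]"           # torch, hf, diffusion, and ecc (vllm is separate)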
```

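**Requirements:** Python 3.9+. The core dependencies (NumPy, constriction, zstandard, blosc2, safetensors, huggingface-hub, tqdm) are installed automatically.
PyTorch is not a core dependency. Install it with the `torch` extra for HuggingFace round-trips and for using compressed models in inference.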

Works on Linux, macOS, and Windows. CPU + NVIDIA + AMD + Apple Silicon.

---

## What's new in v3.13.0

- **Delta compression (the big one).** Compress a fine-tune as a patch on its base model. `bigsmall compress fine_tuned/ --delta-from base/ -o patch.bs`. Often <35% of the full model size, fully lossless.
- **Auto-detect the base model.** `bigsmall compress --auto-delta` scans known-base fingerprints and suggests the right base. Header embeds a fingerprint of the base used, so decompression warns on mismatch.
- **Resumable compression.** `bigsmall compress --resume` picks up exactly where it left off if the run was interrupted. Tensor-level checkpointing.
- **mmap-backed decode.** Large `.bs` files (>256 MB) are now mmap'd instead of fully read into RAM. Lower peak memory, faster start.
- **GPU INT8 KV cache.** `LossyKVCacheGPU` — opt-in lossy compression for runtime KV cache. ~50% VRAM saving for streaming inference, max error ~0.04 in BF16.
- **Streaming LRU layer cache.** `BigSmallStreamingModel(lru_max_vram_gb=2.0)` keeps the most-recently-used decoded layers in VRAM.
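- **Reed-Solomon ECC.** `bigsmall compress --ecc` writes a parity sidecar that can recover from up to ~16 corrupted bytes per 223-byte data block, which matches the correction capacity of a standard RS(255,223) code with 32 parity bytes. `bigsmall repair` uses it.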
- **Fast probabilistic verify.** `bigsmall verify --sample 0.001` decodes 0.1% of weights and verifies their md5 — catches in-blob corruption without the cost of a full verify.
- **Three new CLI commands.** `bigsmall scan` (analyse before compressing), `bigsmall apply` (delta + base → original), `bigsmall repair` (ECC recovery).
- **V8 codec opt-in.** Layer-type-aware codec for attention / embedding tensors. Negligible average gain (~0.07%), available via `--use-v8-codec` for users who want the option.
- **`bigsmall.detect_bf16_native`** — detects F32 models that are really BF16 upcast and compresses them as BF16 (44% of raw F32 instead of 83%).
- **`bigsmall.download_delta(repo_id, base_dir, output_dir)`** — pull a delta repo from HuggingFace and reconstruct the fine-tune.
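
The BF16-native detection in the list above rests on a simple property: a float32 value that was upcast from BF16 has its low 16 bits set to zero. The sketch below checks for that pattern. It is an assumption about the kind of test involved, not the implementation of `bigsmall.detect_bf16_native`, and both function names are hypothetical.

```python
import struct


def looks_like_bf16_upcast(raw_f32: bytes) -> bool:
    """True if every little-endian float32 word has its low 16 bits zero."""
    if len(raw_f32) % 4:
        raise ValueError("input length must be a multiple of 4 bytes")
    words = struct.unpack(f"<{len(raw_f32) // 4}I", raw_f32)
    return bool(words) and all(w & 0xFFFF == 0 for w in words)


def f32_to_bf16_bytes(raw_f32: bytes) -> bytes:
    """Keep the high 16 bits of each word; exact only when the check passes."""
    words = struct.unpack(f"<{len(raw_f32) // 4}I", raw_f32)
    return struct.pack(f"<{len(words)}H", *(w >> 16 for w in words))


upcast = struct.pack("<3f", 1.0, 0.5, -2.0)  # all exactly representable in BF16
native = struct.pack("<2f", 0.1, 1.0)        # 0.1 needs the low mantissa bits

print(looks_like_bf16_upcast(upcast))  # True
print(looks_like_bf16_upcast(native))  # False
print(len(upcast), len(f32_to_bf16_bytes(upcast)))  # 12 6
```

When the check passes, dropping the zero half of each word halves the data before any entropy coding, without losing information.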

[See CHANGELOG.md for full details](CHANGELOG.md).

---

## CLI reference

```
bigsmall compress SRC [-o OUTPUT] [--delta-from BASE] [--auto-delta]
                       [--resume] [--ecc] [--storage|--balanced|--inference]
bigsmall decompress SRC [-o OUTPUT] [--base BASE]
bigsmall info SRC                       # size, ratio, codecs used
bigsmall scan SRC                       # analyse before compressing
bigsmall stat SRC [--tensor X]          # per-tensor table
bigsmall verify SRC [--fast|--sample N] # integrity check
bigsmall diff A.bs B.bs [--patch P.bs]  # compare or write a delta
bigsmall apply BASE PATCH.bs -o OUT     # reconstruct from base + patch
bigsmall repair SRC.bs [-o OUT]         # recover using .ecc sidecar
bigsmall benchmark SRC                  # encode/decode speed
bigsmall migrate SRC                    # re-encode with current codecs
bigsmall status                         # list your BigSmall HF repos
bigsmall pipeline run SRC DST           # resumable download → compress → upload
```

Each command has `--help` for details. See `docs/cli-reference.md` for examples.
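
To show the idea behind `bigsmall verify --sample N`, here is a minimal sketch of sampled integrity checking: pick a random fraction of items and compare their MD5 digests against stored values. Sampling at tensor granularity, the `sample_verify` name, and the dictionary layout are all assumptions for this example; the real command may sample differently.

```python
import hashlib
import random


def sample_verify(tensors, expected_md5, fraction, seed=0):
    """Check a random subset of tensors; return the names that mismatch."""
    names = sorted(tensors)
    if not names:
        return []
    count = min(len(names), max(1, round(len(names) * fraction)))
    rng = random.Random(seed)
    mismatched = []
    for name in rng.sample(names, count):
        digest = hashlib.md5(tensors[name]).hexdigest()
        if digest != expected_md5[name]:
            mismatched.append(name)
    return mismatched


# Stand-in tensors: 100 small byte strings with their recorded digests.
tensors = {f"layer{i}": bytes([i]) * 64 for i in range(100)}
expected = {name: hashlib.md5(data).hexdigest() for name, data in tensors.items()}

print(sample_verify(tensors, expected, fraction=0.1))  # []
```

A sampled check is probabilistic: it only detects corruption in the items it happens to pick, so a full verify is still needed for a guarantee.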

---

## Common workflows

### Compress and upload a model to HuggingFace

```bash
python -c "
import bigsmall
bigsmall.compress_for_hub('mistralai/Mistral-7B-Instruct-v0.3', './mistral_bs/')
bigsmall.upload_to_hub('./mistral_bs/', repo_id='wpferrell/mistral-7b-bigsmall')
"
```

### Use a compressed model on a low-VRAM GPU

```python
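from transformers import AutoTokenizer
from bigsmall import BigSmallStreamingModel

model = BigSmallStreamingModel.from_pretrained(
    "wpferrell/mistral-7b-instruct-bigsmall",
    device="cuda",
    lru_max_vram_gb=2.0,     # cache 2 GB of decoded layers
)

# Build input_ids with the original model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
prompt = "Explain lossless compression in one sentence."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
out = model.generate(input_ids, max_new_tokens=100)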
```

Uses ~12× less VRAM than standard loading by streaming layers on demand.
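
The `lru_max_vram_gb` budget can be pictured as a size-bounded least-recently-used cache of decoded layers. The sketch below is a simplified model of that idea, not BigSmall's implementation: `LayerLRU` and `decode_layer` are hypothetical names, and plain `bytes` objects stand in for decoded layers held in VRAM.

```python
from collections import OrderedDict


class LayerLRU:
    """Keep recently used decoded layers within a byte budget."""

    def __init__(self, max_bytes, decode_layer):
        self.max_bytes = max_bytes
        self.decode_layer = decode_layer  # callable: layer index -> bytes
        self._cache = OrderedDict()
        self._used = 0

    def get(self, index):
        if index in self._cache:
            self._cache.move_to_end(index)  # mark as most recently used
            return self._cache[index]
        layer = self.decode_layer(index)
        self._cache[index] = layer
        self._used += len(layer)
        # Evict least recently used layers, always keeping the newest one.
        while self._used > self.max_bytes and len(self._cache) > 1:
            _, evicted = self._cache.popitem(last=False)
            self._used -= len(evicted)
        return layer


decoded = []


def fake_decode(index):
    decoded.append(index)  # record every real decode
    return bytes(100)      # each stand-in layer is 100 bytes


cache = LayerLRU(max_bytes=250, decode_layer=fake_decode)
for index in (0, 1, 0, 2):
    cache.get(index)
print(decoded)  # [0, 1, 2]
```

The second request for layer 0 is served from the cache, and loading layer 2 evicts layer 1 because it was used least recently.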

### Distribute a fine-tune as a small patch

```bash
# As the publisher:
bigsmall compress fine_tuned.safetensors --delta-from base.safetensors -o patch.bs
# upload patch.bs to your HF repo

# As a user:
python -c "
import bigsmall
bigsmall.download_delta(
    'wpferrell/my-finetune-bigsmall-delta',
    base_dir='~/.cache/huggingface/.../Mistral-7B-Instruct-v0.3',
    output_dir='./reconstructed',
)
"
```

---

## Research

BigSmall ships from a multi-month research arc that established the per-tensor lossless ceiling for BF16 transformer weights. We measured every meaningful direction — column-major rescan, 2D context coding, head-cluster dedup, QKV split, delta encoding, BF16-native F32 detection — and report what works and what doesn't.

**Bottom-line findings:**

- The per-tensor lossless floor for BF16 transformer weights is ~65-66% of raw. Established empirically by the V4-V8 experiments (300+ tested combinations, see `research/`).
- The biggest meaningful gain available today is **delta compression** for fine-tuned models — ~34% of raw BF16.
- Every other intra-tensor approach we tested failed to beat that floor.

Cite the BigSmall paper: [Zenodo DOI 10.5281/zenodo.20279248](https://doi.org/10.5281/zenodo.20279248)

See `docs/research.md` for a plain-English summary of what was learned.

---

## License

Code: [Elastic License 2.0](LICENSE).
Free for personal, research, and commercial use, subject to the license's limitations. For example, you may not offer the software to third parties as a hosted or managed service.
See [LICENSING.md](LICENSING.md) for commercial licensing.

Model weights distributed via BigSmall format keep the license of the original model.

---

## Links

- **PyPI**: https://pypi.org/project/bigsmall/
- **GitHub**: https://github.com/wpferrell/Bigsmall
- **HuggingFace**: https://huggingface.co/wpferrell
- **Paper**: https://doi.org/10.5281/zenodo.20279248
- **Docs**: [docs/](docs/)
