Metadata-Version: 2.4
Name: sempress
Version: 0.3.1
Summary: Semantic compression for tabular data and images using vector quantization
Author-email: Keaton Anderson <research@sempress.net>
License-Expression: MIT
Project-URL: Homepage, https://sempress.net
Project-URL: Repository, https://github.com/jalyper/sempress
Project-URL: Documentation, https://sempress.net
Project-URL: Paper, https://sempress.net/paper.pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: System :: Archiving :: Compression
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: click>=8.1
Requires-Dist: msgpack>=1.0
Requires-Dist: zstandard>=0.20
Requires-Dist: Pillow>=10.0
Provides-Extra: api
Requires-Dist: fastapi>=0.110; extra == "api"
Requires-Dist: uvicorn>=0.24; extra == "api"
Provides-Extra: image
Requires-Dist: scikit-image>=0.22; extra == "image"
Requires-Dist: scipy>=1.11; extra == "image"
Provides-Extra: audio
Requires-Dist: librosa>=0.10; extra == "audio"
Requires-Dist: soundfile>=0.12; extra == "audio"
Requires-Dist: scipy>=1.11; extra == "audio"
Provides-Extra: all
Requires-Dist: sempress[api,audio,image]; extra == "all"
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Requires-Dist: pytest-cov>=5.0; extra == "test"
Requires-Dist: httpx>=0.27; extra == "test"
Provides-Extra: dev
Requires-Dist: sempress[all,test]; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Dynamic: license-file

# Sempress

**Semantic compression for tabular data and images using vector quantization**

[![PyPI](https://img.shields.io/pypi/v/sempress)](https://pypi.org/project/sempress/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/downloads/)

Sempress achieves **5-15x better compression than gzip** on numeric-heavy datasets by learning per-column codebooks with K-Means vector quantization. String and ID columns are preserved losslessly; precision-critical columns can store exact residuals.

## Installation

```bash
pip install sempress
```

Optional extras:

```bash
pip install "sempress[image]"   # PSNR/SSIM metrics (scikit-image, scipy)
pip install "sempress[audio]"   # Audio compression (librosa, soundfile, scipy)
pip install "sempress[api]"     # FastAPI server (fastapi, uvicorn)
pip install "sempress[all]"     # Everything (api, audio, image)
```

## CLI Usage

```bash
# Compress CSV to .smp format
sempress encode --in data.csv --out data.smp --lock-cols id,timestamp --k 64

# Decompress back to CSV
sempress decode --in data.smp --out restored.csv

# Evaluate reconstruction quality
sempress eval --original data.csv --recon restored.csv --lock-cols id,timestamp
```

**Options:**
- `--lock-cols`: Columns preserved losslessly (strings, IDs, timestamps)
- `--residual-cols`: Columns whose exact quantization error (residual) is stored alongside the codebook index, for precision-critical data (financial, scientific)
- `--k`: Codebook size per column (default: 64, range: 16-256)
- `--uncert-thresh`: Flag cells with relative error above threshold (default: 0.2)
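
To illustrate what `--uncert-thresh` is checking, here is a minimal sketch of flagging cells by relative error. The function name `flag_uncertain` and the exact error formula (`|original - recon| / |original|`, with a small `eps` guard) are assumptions for illustration; Sempress may compute this differently.

```python
import numpy as np

def flag_uncertain(original, recon, thresh=0.2, eps=1e-12):
    """Return a boolean mask of cells whose relative error exceeds `thresh`.

    Assumption: relative error is |original - recon| / max(|original|, eps).
    This is a hypothetical helper, not part of the Sempress API.
    """
    original = np.asarray(original, dtype=float)
    recon = np.asarray(recon, dtype=float)
    rel_err = np.abs(original - recon) / np.maximum(np.abs(original), eps)
    return rel_err > thresh

# Toy column: only the third cell is off by more than 20%.
original = np.array([10.0, 20.0, 0.5, 100.0])
recon = np.array([10.5, 19.0, 0.8, 100.0])
mask = flag_uncertain(original, recon)
```

Lowering the threshold flags more cells; columns that must never be flagged are better candidates for `--lock-cols` or `--residual-cols`.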

## Python API

```python
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

config = EncodeConfig(
    lock_cols=["id", "timestamp"],
    residual_cols=["amount"],
    k=64,
    uncertainty_thresh=0.2,
)

# Compress
blob = encode_csv("data.csv", config)
with open("data.smp", "wb") as f:
    f.write(blob)

# Decompress
decode_to_csv(blob, "restored.csv")
```

## How It Works

1. **Column analysis** - auto-detects numeric vs categorical columns
2. **Learn codebooks** - K-Means learns k centroids per numeric column
3. **Encode to indices** - replaces each value with the index of its nearest centroid (stored as uint16)
4. **Add residuals** (optional) - stores exact errors for high-precision columns
5. **Package** - msgpack + zstd container (`.smp` / SEMZ1 format) with schema and metadata
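
Steps 2 to 4 can be sketched for a single numeric column as follows. This is a simplified illustration, not the library's implementation: it uses plain Lloyd's iterations in NumPy as a stand-in for scikit-learn's K-Means, and the helper names (`learn_codebook`, `encode_column`, `decode_column`) are hypothetical.

```python
import numpy as np

def learn_codebook(values, k=64, iters=20, seed=0):
    """Learn up to k centroids for one column with 1-D Lloyd's iterations."""
    values = np.asarray(values, dtype=np.float64)
    uniques = np.unique(values)
    k = min(k, uniques.size)  # cannot have more centroids than distinct values
    rng = np.random.default_rng(seed)
    centroids = rng.choice(uniques, size=k, replace=False)
    for _ in range(iters):
        # Assign every value to its nearest centroid.
        idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters).
        for j in range(k):
            members = values[idx == j]
            if members.size:
                centroids[j] = members.mean()
    return np.sort(centroids)

def encode_column(values, centroids):
    """Return uint16 centroid indices and the per-cell residuals."""
    values = np.asarray(values, dtype=np.float64)
    idx = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    idx = idx.astype(np.uint16)
    residuals = values - centroids[idx]
    return idx, residuals

def decode_column(idx, centroids, residuals=None):
    """Look up centroids; add residuals back if they were stored."""
    recon = centroids[idx]
    if residuals is not None:
        recon = recon + residuals
    return recon

# Synthetic sensor-like column (made-up data for the example).
rng = np.random.default_rng(42)
temps = rng.normal(21.0, 2.0, size=1000)
codebook = learn_codebook(temps, k=64)
idx, residuals = encode_column(temps, codebook)
lossy = decode_column(idx, codebook)             # centroid lookup only
exact = decode_column(idx, codebook, residuals)  # with residuals restored
```

Without residuals, each cell is approximated by its centroid, which is where the size savings come from. With residuals, the original values are recovered up to floating-point rounding at the cost of storing one extra number per cell.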

## Benchmarks

Tested on 10,000 rows of IoT sensor data (1.4 MB):

| Metric | Sempress | gzip | Improvement |
|--------|----------|------|-------------|
| Compression Ratio | **15.72x** | 2.48x | +533% |
| Final Size | 93 KB | 603 KB | 84% smaller |
| Data Fidelity | 97.5% | 100% (lossless) | Configurable |

Sempress excels on numeric-heavy data (IoT, ML features, financial). For text-heavy or very small tables, plain gzip may be the simpler choice.
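
To reproduce this kind of comparison on your own data, the arithmetic behind the table can be sketched with the standard library alone. The helper names below are hypothetical, and the sample bytes are made up; only the 603 KB and 93 KB figures come from the table above.

```python
import gzip

def compression_ratio(original_size, compressed_size):
    """Ratio of original to compressed size (higher is better)."""
    return original_size / compressed_size

def space_saving(baseline_size, new_size):
    """Fraction by which new_size is smaller than baseline_size."""
    return 1.0 - new_size / baseline_size

def gzip_ratio(data: bytes) -> float:
    """Compression ratio gzip achieves on the given bytes."""
    return compression_ratio(len(data), len(gzip.compress(data)))

# Sizes from the table above, in KB: Sempress output relative to gzip output.
saving_vs_gzip = space_saving(603, 93)  # roughly 0.846

# gzip baseline on a small, highly repetitive CSV-like payload (made-up data).
sample = b"sensor_id,temp\n" + b"42,21.5\n" * 1000
baseline = gzip_ratio(sample)
```

For a real comparison, read the CSV with `open(path, "rb").read()` and compare `gzip_ratio` against the original file size divided by the size of the `.smp` output.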

## Git LFS Integration

For automatic compression in Git repositories, see the companion plugin:
[git-lfs-sempress](https://github.com/jalyper/git-lfs-sempress)

## Research Paper

[sempress.net/paper.pdf](https://sempress.net/paper.pdf)

```bibtex
@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}
```

## License

MIT License - see [LICENSE](LICENSE) for details.
