Metadata-Version: 2.4
Name: git-lfs-sempress
Version: 0.1.0
Summary: Git LFS filter for semantic compression of CSV files
Home-page: https://github.com/jalyper/git-lfs-sempress
Author: Keaton Anderson
Author-email: research@sempress.net
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Version Control :: Git
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sempress>=0.3.0
Requires-Dist: click>=8.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: msgpack>=1.0
Requires-Dist: zstandard>=0.20
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# git-lfs-sempress

**Automatic semantic compression for CSV files in Git repositories**

[![PyPI](https://img.shields.io/pypi/v/git-lfs-sempress)](https://pypi.org/project/git-lfs-sempress/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

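A Git LFS clean/smudge filter that compresses CSV files, typically 6-12x, using [sempress](https://pypi.org/project/sempress/) vector quantization. Numeric columns are reconstructed within a small error tolerance rather than byte-for-byte (see Quality Assurance). No workflow changes -- just `git add` and `git commit` as usual.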

## Quick Start

```bash
# Install Git LFS first
git lfs install

# Install the plugin
pip install git-lfs-sempress

# Initialize in your repo
git lfs-sempress init

# Track CSV files
echo "*.csv filter=lfs-sempress diff=lfs merge=lfs -text" >> .gitattributes

# Use Git normally - compression happens automatically
git add data.csv
git commit -m "Add training data"
```

## How It Works

1. **`git add`** -- Sempress compresses CSV to `.smp` format (clean filter)
2. **Git LFS** -- stores the compressed blob
3. **`git checkout`** -- Sempress decompresses back to CSV (smudge filter)
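4. **You see** -- a regular CSV file in your working tree: string/ID and residual columns are exact, other numeric columns are within the error tolerance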
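
The clean/smudge round trip above can be sketched in a few lines of Python. This is only an illustration of the filter contract, not the plugin's implementation: the `clean` and `smudge` functions and the `MAGIC` header are hypothetical names, `zlib` stands in for sempress (so this sketch is lossless, unlike vector quantization), and Git LFS pointer handling is left out. A real filter would read the file content from standard input and write the result to standard output.

```python
import zlib

# Hypothetical header so smudge can recognize cleaned blobs.
# This is NOT the real .smp format.
MAGIC = b"SKETCH1\n"


def clean(working_tree_bytes: bytes) -> bytes:
    """Clean side: runs on `git add`, turns file content into the stored blob."""
    return MAGIC + zlib.compress(working_tree_bytes, level=9)


def smudge(stored_bytes: bytes) -> bytes:
    """Smudge side: runs on `git checkout`, restores the working-tree content."""
    if not stored_bytes.startswith(MAGIC):
        # Pass through content that was never cleaned.
        return stored_bytes
    return zlib.decompress(stored_bytes[len(MAGIC):])


csv_bytes = b"id,price\n1,9.99\n2,9.99\n3,10.01\n" * 200
blob = clean(csv_bytes)
restored = smudge(blob)
print(len(csv_bytes), "->", len(blob), "bytes; round trip ok:", restored == csv_bytes)
```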

## Compression Results

```
$ git add training_data.csv
[sempress] Compressed: 4.0MB -> 471KB (8.5x ratio)
```

Typical ratios on real data:
- IoT sensor data: **11.8x**
- Financial OHLC: **8.5x**
- ML feature vectors: **6-10x**
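
The ratio in the log line above is simply original size divided by compressed size. A minimal sketch of the arithmetic, assuming decimal units (1 MB = 1,000,000 bytes); `compression_ratio` is a name made up for this example:

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """Return how many times smaller the compressed file is."""
    if compressed_bytes <= 0:
        raise ValueError("compressed size must be positive")
    return original_bytes / compressed_bytes


# 4.0 MB -> 471 KB, as in the example output above.
ratio = compression_ratio(4_000_000, 471_000)
saved = 1 - 1 / ratio  # fraction of storage saved
print(f"{ratio:.1f}x ratio, {saved:.0%} smaller")  # prints "8.5x ratio, 88% smaller"
```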

## Configuration

Create `.sempress.yml` in your repository root:

```yaml
version: 1

compression:
  k: 64
  uncertainty_threshold: 0.2
  auto_lock: true
  lock_cols:
    - id
    - timestamp
  residual_cols:
    - amount
    - price

thresholds:
  min_size_mb: 1
  min_compression_ratio: 1.5
```
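
One plausible reading of the `thresholds` block is that small files and poorly compressing files are left uncompressed. The sketch below shows that reading only; the exact semantics of `min_size_mb` and `min_compression_ratio` are an assumption here, and `should_store_compressed` is a hypothetical helper, not part of the plugin.

```python
# Mirrors the `thresholds` block of the example .sempress.yml as a plain dict.
THRESHOLDS = {"min_size_mb": 1, "min_compression_ratio": 1.5}


def should_store_compressed(size_bytes: int, achieved_ratio: float,
                            thresholds: dict = THRESHOLDS) -> bool:
    """Assumed policy: compress only large files that compress well enough."""
    min_bytes = thresholds["min_size_mb"] * 1_000_000  # assumes decimal MB
    if size_bytes < min_bytes:
        return False  # too small to be worth compressing
    return achieved_ratio >= thresholds["min_compression_ratio"]


print(should_store_compressed(4_000_000, 8.5))  # large file, good ratio
print(should_store_compressed(200_000, 8.5))    # below the size threshold
print(should_store_compressed(4_000_000, 1.2))  # ratio too low
```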

## Commands

```bash
git lfs-sempress init              # Set up filter in current repo
git lfs-sempress track "*.csv"     # Add tracking pattern
git lfs-sempress analyze           # Estimate savings for existing files
git lfs-sempress stats             # Show compression stats for repo
git lfs-sempress quality a.csv b.csv  # Compare original vs reconstructed
```

## Quality Assurance

- **String/ID columns**: 100% exact match (automatically locked)
- **Numeric columns**: < 0.1% relative error by default
- **Residual columns**: bit-perfect reconstruction

If a column needs higher precision, add it to `residual_cols` in `.sempress.yml`.
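
To decide whether a column needs to move to `residual_cols`, it can help to measure the relative error yourself. The sketch below computes the worst-case relative error between an original and a reconstructed numeric column; `max_relative_error` is a hypothetical helper, and whether `git lfs-sempress quality` reports this exact metric is an assumption.

```python
def max_relative_error(original, reconstructed, eps=1e-12):
    """Largest |original - reconstructed| / |original| over all values."""
    if len(original) != len(reconstructed):
        raise ValueError("columns must have the same length")
    worst = 0.0
    for o, r in zip(original, reconstructed):
        denom = max(abs(o), eps)  # guard against division by zero
        worst = max(worst, abs(o - r) / denom)
    return worst


# Made-up values for illustration only.
original = [100.0, 250.0, 80.0]
reconstructed = [100.05, 249.9, 80.0]

err = max_relative_error(original, reconstructed)
print(f"max relative error: {err:.4%}")
print("within 0.1% tolerance:", err < 0.001)
```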

## Installation Notes

**Windows**: If `git lfs-sempress` isn't recognized, use:
```powershell
python -m git_lfs_sempress.cli init
```

## Links

- [sempress library](https://pypi.org/project/sempress/) -- the underlying compression engine
- [Research paper](https://sempress.net/paper.pdf) -- technical details

## License

MIT License - see [LICENSE](LICENSE)
