Metadata-Version: 2.4
Name: sempress
Version: 0.3.0
Summary: Semantic compression for tabular data and images using vector quantization
Author-email: Keaton Anderson <research@sempress.net>
License: MIT
Project-URL: Homepage, https://sempress.net
Project-URL: Repository, https://github.com/jalyper/sempress
Project-URL: Documentation, https://sempress.net
Project-URL: Paper, https://sempress.net/paper.pdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: System :: Archiving :: Compression
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: click>=8.1
Requires-Dist: msgpack>=1.0
Requires-Dist: zstandard>=0.20
Requires-Dist: Pillow>=10.0
Provides-Extra: api
Requires-Dist: fastapi>=0.110; extra == "api"
Requires-Dist: uvicorn>=0.24; extra == "api"
Provides-Extra: image
Requires-Dist: scikit-image>=0.22; extra == "image"
Requires-Dist: scipy>=1.11; extra == "image"
Provides-Extra: audio
Requires-Dist: librosa>=0.10; extra == "audio"
Requires-Dist: soundfile>=0.12; extra == "audio"
Requires-Dist: scipy>=1.11; extra == "audio"
Provides-Extra: all
Requires-Dist: sempress[api,audio,image]; extra == "all"
Provides-Extra: test
Requires-Dist: pytest>=8.0; extra == "test"
Requires-Dist: pytest-cov>=5.0; extra == "test"
Requires-Dist: httpx>=0.27; extra == "test"
Provides-Extra: dev
Requires-Dist: sempress[all,test]; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Dynamic: license-file

# Sempress

**Semantic Compression API - Reduce Cloud Storage Costs by 90%**

[![Website](https://img.shields.io/badge/Website-sempress.net-blue)](https://sempress.net)
[![API Docs](https://img.shields.io/badge/API-docs.sempress.net-green)](https://sempress.net)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/downloads/)

Sempress is a **compression API service** that achieves **5-15× better compression than gzip** on numeric-heavy datasets through learned vector quantization. Perfect for IoT telemetry, ML feature stores, and financial data.

**Proven Results**: 15.72× compression ratio on real IoT data (vs gzip's 2.48×) = **533% improvement**

---

## 🚀 Quick Start

### Python Client (Recommended)

```python
# Install (run in a shell, not in Python):
#   pip install sempress-client

# Compress
from sempress import SempressClient

client = SempressClient(api_key="sk_live_...")
result = client.compress_file("data.csv")

print(f"Compression: {result.ratio}×")
print(f"Saved: {result.space_saved_pct}%")
print(f"AWS Cost Savings: ${result.monthly_savings}/mo")

# Download compressed file
result.save("data.smp")
```

### REST API (Any Language)

```bash
# Compress
curl -X POST https://api.sempress.net/v1/compress \
  -H "Authorization: Bearer sk_live_..." \
  -F "file=@data.csv"

# Example response (JSON):
# {
#   "job_id": "job_abc123",
#   "compression_ratio": 12.5,
#   "space_saved_pct": 92.0,
#   "aws_savings_monthly": 45.50
# }
```

---

## 💰 Pricing

### Free Tier
- **100 MB/month** compression
- Full API access
- Web interface
- Community support

### Pro Tier - $29/month
- **10 GB/month** (100× more)
- Priority processing (2× faster)
- Email support
- Advanced features

### Enterprise - Custom Pricing
- Unlimited usage
- S3 direct integration
- On-premise deployment
- 24/7 support

**[→ Sign Up Free](https://sempress.net/signup.html)**

---

## 📊 Performance

**Tested on 10,000 rows of IoT data (1.4 MB):**

| Metric | Sempress | gzip | Improvement |
|--------|----------|------|-------------|
| Compression Ratio | **15.72×** | 2.48× | **+533%** |
| Final Size | 93 KB | 603 KB | **84% smaller** |
| Space Saved | 93.64% | 59.73% | **+57%** |
| Data Fidelity | 97.5% | 100% (lossless) | Configurable |
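
The derived figures in the table follow from the raw ratios and sizes. A minimal sketch of that arithmetic is shown here; the helper names are hypothetical and are not part of the Sempress API.

```python
def space_saved_pct(ratio: float) -> float:
    """Percent of the original size removed at a given compression ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


def relative_gain_pct(new: float, baseline: float) -> float:
    """Percent by which `new` exceeds `baseline`."""
    return (new / baseline - 1.0) * 100.0


def size_reduction_pct(new_size: float, baseline_size: float) -> float:
    """Percent by which `new_size` is smaller than `baseline_size`."""
    return (1.0 - new_size / baseline_size) * 100.0


# Inputs taken from the table above.
ratio_gain = relative_gain_pct(15.72, 2.48)    # compression-ratio gain over gzip
sempress_saved = space_saved_pct(15.72)        # space saved at a 15.72x ratio
size_cut = size_reduction_pct(93, 603)         # 93 KB versus 603 KB
saved_gain = relative_gain_pct(93.64, 59.73)   # gain in space saved

print(f"{ratio_gain:.2f}% {sempress_saved:.2f}% {size_cut:.2f}% {saved_gain:.2f}%")
```

The computed values agree with the table to within rounding; the table's +533% and 84% look truncated rather than rounded.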

---

## 🎯 Use Cases

### IoT & Telemetry
Compress sensor data streams by 90%+. Perfect for:
- Industrial IoT monitoring
- Smart city deployments
- Fleet management systems

### ML Feature Stores
Reduce S3 costs for training data:
- High-dimensional feature vectors
- Time-series embeddings
- Model training datasets

### Financial Data
Archive tick data with precision:
- High-frequency trading data
- Market microstructure
- Historical financial time series

---

## 🔧 Core Library (Open Source)

This repository contains **sempress-core**, the open-source compression library.

### Installation

```bash
pip install git+https://github.com/jalyper/sempress-core.git
```

### CLI Usage

```bash
# Compress
sempress encode --in data.csv --out data.smp --k 64

# Decompress
sempress decode --in data.smp --out restored.csv
```

### Python API

```python
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Compress
config = EncodeConfig(
    lock_cols=["id", "timestamp"],  # Lossless
    residual_cols=["price"],         # Perfect precision
    k=64                             # Codebook size
)
compressed = encode_csv("data.csv", config)

# Save
with open("data.smp", "wb") as f:
    f.write(compressed)

# Decompress
decode_to_csv(compressed, "restored.csv")
```

---

## 🌐 Sempress.net Service

**Live Service**: [https://sempress.net](https://sempress.net)

The commercial Sempress service provides:
- ✅ REST API for any language
- ✅ Python, JavaScript, Go clients
- ✅ Web interface with analytics
- ✅ Job tracking & metrics
- ✅ Usage-based pricing
- ✅ Enterprise features (S3 integration, batch processing)

**Service Code**: See `/vercel-deploy/` for the production deployment

---

## 📖 Documentation

- **[Customer Lifecycle Strategy](CUSTOMER_LIFECYCLE_STRATEGY.md)** - Complete product roadmap
- **[Deployment Guide](DEPLOYMENT_NEXT_STEPS.md)** - Launch & scaling plan
- **[Research Paper](https://sempress.net/paper.pdf)** - Technical details
- **[Image Compression](docs/image_compression.md)** - Image compression features (experimental)

---

## 🛠️ Development

### For Commercial Service (sempress.net)
See `/vercel-deploy/` directory for:
- Production website code
- API backend implementation
- Authentication system
- Payment integration

### For Core Library Development

```bash
# Clone
git clone https://github.com/jalyper/sempress-core.git
cd sempress-core

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run benchmarks
python scripts/run_benchmarks.py
```

---

## 🤝 Contributing

We welcome contributions to the core library!

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

---

## 📄 Research Paper

**Sempress: Semantic Compression for Tabular Data via Learned Vector Quantization**

- **PDF**: [sempress.net/paper.pdf](https://sempress.net/paper.pdf)
- **Published**: January 2025
- **Version**: 0.2.0
- **Author**: Keaton Anderson
- **License**: MIT (Open Source)

---

## 🗺️ Roadmap

### Core Library (Open Source)
- [x] CSV compression with vector quantization
- [x] CLI tool
- [x] Python API
- [x] Git LFS plugin
- [x] Image compression (experimental)
- [ ] Parquet support
- [ ] Arrow support
- [ ] Streaming compression

### Commercial Service (sempress.net)
- [x] REST API with authentication
- [x] Web interface with analytics
- [x] Free & Pro tiers
- [ ] Python client library (`sempress-client`)
- [ ] JavaScript client library
- [ ] S3 direct integration
- [ ] Batch processing API
- [ ] Enterprise on-premise deployment

See [CUSTOMER_LIFECYCLE_STRATEGY.md](CUSTOMER_LIFECYCLE_STRATEGY.md) for detailed roadmap.

---

## 📧 Contact

- **Website**: [sempress.net](https://sempress.net)
- **Email**: hello@sempress.net
- **GitHub**: [@jalyper](https://github.com/jalyper)
- **Issues**: [GitHub Issues](https://github.com/jalyper/sempress-core/issues)

---

## 📜 License

MIT License - See [LICENSE](LICENSE) for details.

**Note**: The core compression library is open source. The commercial API service at sempress.net is a hosted offering with additional features.

---

## 🙏 Citation

If you use Sempress in your research, please cite:

```bibtex
@software{sempress2025,
  title={Sempress: Semantic Compression for Tabular Data},
  author={Anderson, Keaton},
  year={2025},
  url={https://sempress.net}
}
```

---

**Built with ❤️ for the data science community**

---

## 💡 Key Features

- **Semantic Compression**: Learns column-wise patterns using K-Means vector quantization
- **Lossless Locked Columns**: Automatically preserves strings, categoricals, and IDs with 100% fidelity
- **Optional Residuals**: Achieve near-zero error on precision-critical columns (financial, scientific)
- **Uncertainty Tracking**: Flags cells with high quantization error for quality monitoring
- **Fast Decode**: Competitive with gzip+CSV parse (0.9-1.5× overhead)

---

## 📖 How It Works

Sempress applies **per-column K-Means vector quantization** to numeric data:

1. **Column Analysis**: Auto-detects numeric vs categorical columns
2. **Learn Codebooks**: K-Means learns k=64 centroids per numeric column
3. **Encode to Indices**: Replace values with nearest centroid index (uint16)
4. **Add Residuals** (optional): Store exact errors for high-precision columns
5. **Package**: Msgpack + Zstd container with schema and metadata

**Result:** Exploit semantic patterns in numeric data instead of treating tables as byte streams.
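
Steps 2 to 4 can be sketched for a single numeric column as follows. This is an illustration under stated assumptions, not the library's implementation: Sempress depends on scikit-learn, whereas this sketch uses a plain-Python Lloyd's loop to stay dependency-free. All function names are hypothetical.

```python
import random


def nearest(centroids, value):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - value))


def learn_codebook(values, k, iters=20, seed=0):
    """1-D K-Means (Lloyd's algorithm) over one column; returns the centroids."""
    distinct = sorted(set(values))
    rng = random.Random(seed)
    # Initialise from distinct observed values; never ask for more than exist.
    centroids = rng.sample(distinct, min(k, len(distinct)))
    for _ in range(iters):
        # Assignment step: group each value under its nearest centroid.
        buckets = [[] for _ in centroids]
        for v in values:
            buckets[nearest(centroids, v)].append(v)
        # Update step: move each centroid to its bucket mean (keep it if empty).
        centroids = [sum(b) / len(b) if b else c for b, c in zip(buckets, centroids)]
    return centroids


def encode_column(values, k, keep_residuals=False):
    """Return (codebook, indices, residuals); residuals is None unless requested."""
    codebook = learn_codebook(values, k)
    indices = [nearest(codebook, v) for v in values]
    residuals = None
    if keep_residuals:
        # Residual = original value minus the centroid that replaced it.
        residuals = [v - codebook[i] for v, i in zip(values, indices)]
    return codebook, indices, residuals


def decode_column(codebook, indices, residuals=None):
    """Look up each index in the codebook, adding residuals back if present."""
    recon = [codebook[i] for i in indices]
    if residuals is not None:
        recon = [c + e for c, e in zip(recon, residuals)]
    return recon


readings = [20.1, 20.3, 19.9, 35.2, 35.0, 34.8, 50.5, 50.1]
codebook, idx, _ = encode_column(readings, k=3)
approx = decode_column(codebook, idx)            # lossy reconstruction

cb2, idx2, res2 = encode_column(readings, k=3, keep_residuals=True)
exact = decode_column(cb2, idx2, res2)           # matches input up to float rounding
```

Without residuals, each cell is stored as a small integer index plus a shared codebook, which is where the size reduction comes from. Residuals add a per-cell correction to that, trading size for precision.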

---

## 🛠️ Installation

### Requirements
- Python 3.10+
- pandas, numpy, scikit-learn, click, msgpack, zstandard, Pillow

### Install from Source

```bash
git clone https://github.com/jalyper/sempress.git
cd sempress
pip install -e .
```

### Dependencies

```bash
pip install pandas numpy scikit-learn click msgpack zstandard Pillow
```

### Interactive Web Platform

Run the full web platform locally with file upload, baseline comparisons, and downloads:

```bash
# Clone and start
git clone https://github.com/jalyper/sempress.git
cd sempress
./scripts/start_dev_server.sh
```

Then open: `http://localhost:3000`

**Features:**
- 📤 Upload CSV files (up to 50MB)
- 📊 Real-time compression with Sempress
- ⚖️ Compare against GZIP, BZ2, LZMA, ZSTD
- 💾 Download .smp and reconstructed CSV
- 📈 Quality analysis with per-column metrics

**Deployment Options:**
- Local (development)
- Railway/Render (production)
- Docker/Docker Compose
- See [docs/deployment_guide.md](docs/deployment_guide.md) for details

---

## 📚 Usage Guide

### Basic Compression

```bash
# Encode CSV to .smp format
sempress encode \
  --in data.csv \
  --out data.smp \
  --lock-cols user_id,timestamp \
  --k 64
```

**Options:**
- `--lock-cols`: Columns to preserve losslessly (comma-separated)
- `--residual-cols`: High-precision columns (store exact errors)
- `--k`: Codebook size (default: 64, range: 16-256)
- `--uncert-thresh`: Flag cells with >X relative error (default: 0.2)
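
To illustrate what `--uncert-thresh` controls, here is a minimal sketch of a relative-error flag. It assumes the error is normalised by the absolute original value. The README does not state the exact definition Sempress uses, so treat the formula and the function name as assumptions.

```python
def flag_uncertain(original, reconstructed, thresh=0.2, eps=1e-12):
    """Return one boolean per cell: True where relative error exceeds `thresh`."""
    flags = []
    for o, r in zip(original, reconstructed):
        # eps guards against division by zero when the original value is 0.
        rel_err = abs(o - r) / max(abs(o), eps)
        flags.append(rel_err > thresh)
    return flags


original = [10.0, 100.0, 0.5]
reconstructed = [10.5, 70.0, 0.52]
flags = flag_uncertain(original, reconstructed)
print(flags)  # [False, True, False]
```

Only the second cell is flagged: its relative error is 0.3, above the default threshold of 0.2.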

### Decompression

```bash
# Decode .smp back to CSV
sempress decode \
  --in data.smp \
  --out data_reconstructed.csv
```

### Quality Evaluation

```bash
# Compare original vs reconstructed
sempress eval \
  --original data.csv \
  --recon data_reconstructed.csv \
  --lock-cols user_id,timestamp
```

**Metrics:**
- **Locked columns**: Exact match rate (should be 100%)
- **Numeric columns**: RMSE, MAPE, KS-distance
- **Uncertainty**: % of cells flagged
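
The metrics above have standard definitions. The sketch below computes them in plain Python for one column. It is not the code behind `sempress eval`, and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def exact_match_rate(original, recon):
    """Fraction of cells that are identical (expected to be 1.0 for locked columns)."""
    return sum(o == r for o, r in zip(original, recon)) / len(original)


def rmse(original, recon):
    """Root mean squared error."""
    return math.sqrt(sum((o - r) ** 2 for o, r in zip(original, recon)) / len(original))


def mape(original, recon):
    """Mean absolute percentage error; cells whose original value is 0 are skipped."""
    terms = [abs((o - r) / o) for o, r in zip(original, recon) if o != 0]
    return 100.0 * sum(terms) / len(terms) if terms else float("nan")


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


original = [1.0, 2.0, 4.0]
recon = [1.0, 2.5, 3.0]
print(rmse(original, recon), mape(original, recon), ks_distance(original, recon))
```

RMSE and MAPE measure per-cell error, while the KS distance compares the overall value distributions, so a column can score well on one and poorly on the other.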

---

## 🐍 Python API

```python
from sempress import encode_csv, decode_to_csv
from sempress.table_encoder import EncodeConfig

# Configure encoder
config = EncodeConfig(
    lock_cols=['user_id', 'timestamp'],
    residual_cols=['amount'],
    k=64,
    uncertainty_thresh=0.2
)

# Encode
compressed_blob = encode_csv('data.csv', config)

# Save to file
with open('data.smp', 'wb') as f:
    f.write(compressed_blob)

# Decode
decode_to_csv(compressed_blob, 'reconstructed.csv')
```

---

## 📊 Benchmarking

Run comprehensive benchmarks on your data:

```bash
# Generate synthetic datasets
python scripts/generate_datasets.py --rows 100000

# Run benchmarks
python scripts/comprehensive_benchmark.py --out results.json

# Generate figures
python scripts/generate_figures.py
```

**Included datasets:**
- IoT Telemetry (sensor readings)
- ML Features (user behavior)
- Financial (stock market OHLC)
- Sensor Physics (accelerometer, magnetometer)

---

## 🔗 Integrations

### Git LFS Plugin

**Automatic compression for Git repositories** - Perfect for ML teams!

- **Repository**: [github.com/jalyper/git-lfs-sempress](https://github.com/jalyper/git-lfs-sempress)
- **Features**:
  - Zero workflow changes (works with git add/commit)
  - 8-12× compression on CSV files
  - Intelligent quality monitoring
  - 15 automated tests, all passing
- **Installation**: `pip install git+https://github.com/jalyper/git-lfs-sempress.git`

**Use Cases**:
- ML training datasets in Git repos
- Data science notebooks with large CSVs
- IoT data collection repositories
- Collaborative data projects

---

## 🎯 When to Use Sempress

### ✅ Sempress Excels On:

- **High numeric density** (>60% numeric columns)
- **IoT/sensor data** (temperature, pressure, acceleration)
- **ML feature stores** (continuous features for training)
- **Financial data** (tick data, OHLC prices)
- **Large datasets** (>10K rows)

### ⚠️ Use Gzip Instead For:

- **Text-heavy tables** (<50% numeric)
- **Small tables** (<5K rows)
- **Real-time streaming** (Sempress has higher encode overhead)
- **High categorical cardinality**
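
These rules of thumb can be written as a small decision helper. The thresholds come from the lists above. The function name, the "benchmark both" fallback for the range the lists leave open, and the omission of categorical cardinality are assumptions of this sketch.

```python
def recommend_codec(n_rows, numeric_fraction, streaming=False):
    """Suggest a codec from row count and the fraction of numeric columns (0.0-1.0)."""
    # Cases where the guidance above points to gzip.
    if streaming or numeric_fraction < 0.50 or n_rows < 5_000:
        return "gzip"
    # Cases where Sempress is expected to do well.
    if numeric_fraction > 0.60 and n_rows > 10_000:
        return "sempress"
    # 50-60% numeric or 5K-10K rows: no guidance is given, so measure.
    return "benchmark both"


print(recommend_codec(100_000, 0.9))   # sempress
print(recommend_codec(1_000, 0.9))     # gzip
print(recommend_codec(8_000, 0.55))    # benchmark both
```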

---

## 📄 Research Paper

**Full paper:** [https://sempress.net/paper.pdf](https://sempress.net/paper.pdf)

**Citation:**
```bibtex
@article{sempress2025,
  title={Sempress: Semantic Compression for Numeric Tabular Data via Learned Vector Quantization},
  author={Anderson, Keaton},
  year={2025},
  note={Independent research with implementation assistance from AI coding agents},
  url={https://sempress.net}
}
```

---

## 🤝 Contributing

We welcome contributions! Areas for improvement:

- **Streaming ingestion** (chunked encoding for >100GB files)
- **Learned entropy coding** (autoregressive priors on index sequences)
- **Time-series VQ** (segment-wise codebooks for temporal data)
- **Database integrations** (PostgreSQL extension, ClickHouse codec)
- **Text compression** (LLM-based semantic tokens for mixed data)

**How to contribute:**
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

---

## 📁 Repository Structure

```
sempress/
├── src/sempress/           # Core library
│   ├── table_encoder.py    # K-Means VQ encoder
│   ├── table_decoder.py    # Decoder with uncertainty
│   ├── container.py        # Msgpack + Zstd packaging
│   └── cli.py              # Command-line interface
├── scripts/                # Benchmarking & datasets
│   ├── generate_datasets.py
│   ├── comprehensive_benchmark.py
│   └── generate_figures.py
├── data/                   # Sample datasets
├── tests/                  # Unit tests
├── docs/                   # Documentation & paper
└── README.md               # This file
```

---

## 🧪 Running Tests

```bash
# Install test dependencies
pip install pytest pytest-cov

# Run tests
pytest tests/

# With coverage
pytest --cov=sempress tests/
```

---

## 📊 Reproducing Paper Results

```bash
# Generate datasets
python scripts/generate_datasets.py

# Run all benchmarks (takes ~10 minutes)
python scripts/comprehensive_benchmark.py

# Generate paper figures
python scripts/generate_figures.py

# Results saved to logs/ and docs/assets/
```

---

## 📈 Performance Benchmarks

**Encode time (100K rows):**
- Telemetry: 5.83s
- ML Features: 11.20s
- Financial: 9.08s

**Decode time (100K rows):**
- Telemetry: 0.28s (1.47× gzip+parse)
- ML Features: 0.55s (1.28× gzip+parse)
- Financial: 0.28s (1.17× gzip+parse)

**Memory usage:**
- Peak during encode: 2-3× original file size
- Peak during decode: 1.5-2× original file size
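
Because processing is in-memory, these multipliers give a rough pre-flight check before encoding a large file. The sketch below applies them directly; the function names are hypothetical and the estimate is only as good as the ranges quoted above.

```python
def peak_memory_range(file_size_bytes, operation="encode"):
    """Return (low, high) peak-memory estimates in bytes for the given operation."""
    multipliers = {"encode": (2.0, 3.0), "decode": (1.5, 2.0)}
    low, high = multipliers[operation]
    return file_size_bytes * low, file_size_bytes * high


def fits_in_ram(file_size_bytes, available_bytes, operation="encode"):
    """Conservative check: compare the upper estimate against available memory."""
    return peak_memory_range(file_size_bytes, operation)[1] <= available_bytes


GIB = 1024 ** 3
print(peak_memory_range(4 * GIB))        # estimated range for encoding a 4 GiB CSV
print(fits_in_ram(4 * GIB, 8 * GIB))     # False: the upper estimate is 12 GiB
```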

---

## 🐛 Known Issues

- **In-memory processing**: Files must fit in RAM (working on streaming)
- **Fixed k per column**: No adaptive sizing yet
- **CSV-only**: Parquet/Arrow support coming soon

---

## 📜 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---

## 🌟 Star History

If you find Sempress useful, please star the repository! ⭐

---

## 📞 Contact

- **Website:** [https://sempress.net](https://sempress.net)
- **Paper:** [https://sempress.net/paper.pdf](https://sempress.net/paper.pdf)
- **Issues:** [GitHub Issues](https://github.com/jalyper/sempress/issues)
- **Email:** research@sempress.net

---

## 🙏 Acknowledgments

Independent research (no external funding).

Built with: Python, pandas, numpy, scikit-learn, msgpack, zstandard

---

**Made with ❤️ for the data compression community**
