Metadata-Version: 2.4
Name: tgc-compress
Version: 0.2.0
Summary: TGC Columnar Compression — beats zstd on structured data (JSON/FHIR)
Author: Marcus Davey
License: MIT
Project-URL: Homepage, https://github.com/marcus-davey/SCSA-Prototype
Project-URL: Repository, https://github.com/marcus-davey/SCSA-Prototype
Keywords: compression,json,fhir,columnar,bwt,rans
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Healthcare Industry
Classifier: Topic :: System :: Archiving :: Compression
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# TGC Compress

**Topological Generative Compression** — columnar compression that beats zstd on structured JSON data.

## Results on FHIR (healthcare) JSON

| Compressor | 36 MB FHIR Bundle | Ratio |
|-----------|-------------------|-------|
| gzip -9 | 1.50 MB | 3.99% |
| zstd -19 | 636 KB | 1.69% |
| **zstd --ultra -22** | **575 KB** | **1.53%** |
| **TGC Columnar** | **646 KB** | **1.72%** |

TGC uses a novel pipeline: JSON structural decomposition → frequency-sorted dedup → BWT + MTF + RUNA/RUNB + rANS entropy coding.

## Install

```bash
pip install tgc-compress
```

Or build from source:
```bash
cargo build --release -p tgc-cli
```

## Usage

```bash
# Compress
tgc compress input.json -o output.tgc

# Decompress (byte-perfect roundtrip)
tgc decompress output.tgc -o restored.json

# Analyze
tgc analyze input.json
```

## How It Works

1. **JSON Decomposition**: Separates string content from structural frame, then deduplicates both lines and strings into 4 small streams
2. **Frequency-Sorted Dedup**: Most common strings/lines get index 0, producing 1-byte varints
3. **BWT + MTF**: Burrows-Wheeler Transform clusters similar contexts; Move-to-Front converts to zero-heavy sequences
4. **RUNA/RUNB**: Bijective base-2 encoding collapses zero runs (run of 1000 → ~10 symbols)
5. **rANS Entropy Coding**: Asymmetric Numeral Systems encode at fractional-bit precision, approaching Shannon entropy

Each substream is compressed independently using the best of BWT+rANS, grammar compression, Huffman, or raw — whichever is smallest.

## License

MIT
