Metadata-Version: 2.4
Name: xlcompress
Version: 0.3.2
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Summary: Compress Excel spreadsheets for LLM context windows. Rust-powered, 80-95% token reduction.
Keywords: excel,llm,compression,spreadsheet,tokens
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

<p align="center">
  <h1 align="center">XLcompress</h1>
  <p align="center">
    <em>Compress Excel spreadsheets for LLM context windows. Written in Rust.</em>
  </p>
</p>

<p align="center">
  <a href="https://pypi.org/project/xlcompress/"><img src="https://img.shields.io/pypi/v/xlcompress?color=%2334D058&label=pypi" alt="PyPI version"></a>
  <a href="https://pypi.org/project/xlcompress/"><img src="https://img.shields.io/pypi/pyversions/xlcompress" alt="Python versions"></a>
  <a href="https://github.com/JustinStrik/xlcompress/actions"><img src="https://img.shields.io/github/actions/workflow/status/JustinStrik/xlcompress/release.yml?label=CI" alt="CI status"></a>
  <a href="https://github.com/JustinStrik/xlcompress/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-blue" alt="License"></a>
</p>

<p align="center">
  <a href="https://pypi.org/project/xlcompress/">PyPI</a> &middot;
  <a href="https://arxiv.org/abs/2407.09025">Paper</a> &middot;
  <a href="https://github.com/JustinStrik/xlcompress">Source</a>
</p>

---

Excel is one of the most common data formats, but feeding spreadsheets to LLMs wastes tokens on empty cells, repetitive formatting, and verbose address-value pairs. **xlcompress** applies the [SpreadsheetLLM](https://arxiv.org/abs/2407.09025) compression pipeline to reduce spreadsheet token usage by **80-95%** on typical tabular sheets while preserving the structural information LLMs need.

### Compression results

| Sheet type | Original tokens | Compressed tokens | Reduction |
|---|--:|--:|--:|
| Financial model (200 rows) | 5,240 | 799 | **84.8%** |
| Lookup table (100 rows) | 1,922 | 186 | **90.3%** |
| Sparse sheet (few values) | 106 | 106 | 0% (auto-fallback) |

> Sparse sheets with scattered values automatically fall back to raw (vanilla) encoding when compression would increase size.
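
The fallback rule can be pictured as a plain size comparison. The sketch below is illustrative only: `choose_encoding` and the whitespace-based `count_tokens` are hypothetical stand-ins, not part of the xlcompress API, and the library's actual tokenizer and decision logic may differ.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated chunks."""
    return len(text.split())


def choose_encoding(raw, compressed):
    """Return the compressed form only when it is strictly smaller than raw."""
    if count_tokens(compressed) < count_tokens(raw):
        return compressed
    return raw


# Two isolated cells: wrapping each in a typed range costs more than it saves.
raw = "|A1, Total| |D7, 42|"
inflated = "(Text | A1 : A1) , (IntNum | D7 : D7)"
print(choose_encoding(raw, inflated) == raw)
```

With this rule the output is never larger than the raw encoding, which is why the sparse sheet in the table shows a 0% reduction.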

## Install

```bash
pip install xlcompress
```

Pre-built wheels are available for **Linux** (x86_64, aarch64), **macOS** (x86_64, ARM), and **Windows** (x86_64). No Python dependencies. No Rust toolchain required.

## Quickstart

```python
import xlcompress

# One-liner: compress and get a prompt-ready string
prompt = xlcompress.compress_to_string("financials.xlsx")

# Per-sheet results with token counts
results = xlcompress.compress("financials.xlsx")
for sheet in results:
    print(f"{sheet.name}: {sheet.original_tokens} -> {sheet.compressed_tokens} tokens")
```

## Usage

### Excel files

```python
import xlcompress

# All sheets
results = xlcompress.compress("report.xlsx")

# Specific sheets
results = xlcompress.compress("data.xlsb", sheets=["Q1", "Q2"])

# From bytes (S3, HTTP responses, file uploads)
results = xlcompress.compress(file_bytes, format="xlsx")

# List sheets without compressing
names = xlcompress.list_sheets("workbook.xlsx")
```

### CSV

```python
import xlcompress

# From a CSV file
result = xlcompress.compress_csv("data.csv")

# From a CSV string
csv_text = "Name,Age,City\nAlice,30,NYC\nBob,25,LA"
result = xlcompress.compress_csv(csv_text)

# Custom delimiter
result = xlcompress.compress_csv("data.tsv", delimiter="\t")
```

### Text and grids

```python
import xlcompress

# Tab-delimited text (e.g. pasted from a spreadsheet)
pasted = "Region\tQ1\tQ2\nNorth\t1000\t2000\nSouth\t800\t1200"
result = xlcompress.compress_text(pasted)

# Raw 2D grid
grid = [["Name", "Score"], ["Alice", "95"], ["Bob", "87"]]
result = xlcompress.compress_grid(grid)
```

### Output modes

```python
import xlcompress

# Aggregated (default): best compression, type-labeled ranges
xlcompress.compress("file.xlsx", mode="aggregated")
# each sheet's .compressed -> "(IntNum|A1:B10),(DateData|C1:C5),..."

# Vanilla: preserves all cell values, no compression
xlcompress.compress("file.xlsx", mode="vanilla")
# each sheet's .compressed -> "|A1, Revenue|\n|B1, 1000|\n..."

# Inverted index: value-to-address mapping
xlcompress.compress("file.xlsx", mode="inverted")
# each sheet's .compressed -> {"Revenue": ["A1"], "1000": ["B1", "C3"], ...}
```
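
To show what a value-to-address mapping means in practice, here is a minimal pure-Python sketch that builds one from a 2D grid. The helpers `column_letter` and `inverted_index` are hypothetical names for illustration; they are not the library's implementation, and the exact output format of `mode="inverted"` may differ.

```python
from collections import defaultdict


def column_letter(index):
    """Convert a 0-based column index to an Excel label (0 -> A, 26 -> AA)."""
    label = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def inverted_index(grid):
    """Map each non-empty cell value to the addresses where it occurs."""
    index = defaultdict(list)
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value != "":
                index[value].append(f"{column_letter(c)}{r + 1}")
    return dict(index)


grid = [
    ["Revenue", "1000", ""],
    ["Cost", "400", ""],
    ["Margin", "600", "1000"],
]
print(inverted_index(grid))
```

Repeated values are stored once with a list of addresses, and empty cells are dropped entirely, which is where the savings come from on sheets with many duplicates.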

### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `source` | `str`, `PathLike`, `bytes` | required | File path or raw file bytes |
| `format` | `str \| None` | `None` | `"xlsx"` or `"xlsb"`. Required for bytes. Auto-detected for paths. |
| `sheets` | `list[str] \| None` | `None` | Filter to specific sheets. `None` = all. |
| `mode` | `str` | `"aggregated"` | `"aggregated"`, `"vanilla"`, or `"inverted"` |
| `structural` | `bool` | `True` | Run table boundary detection |
| `structural_k` | `int` | `4` | Rows/cols to keep around detected boundaries |
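
To make `structural_k` concrete, the sketch below keeps only the rows that lie within `k` rows of a boundary. `rows_to_keep` is a hypothetical helper and the boundary rows are supplied by hand; the library's own detection and filtering are not shown here and may differ in detail.

```python
def rows_to_keep(boundaries, k, n_rows):
    """Return sorted 0-based row indices within k rows of any boundary."""
    keep = set()
    for b in boundaries:
        lo = max(0, b - k)
        hi = min(n_rows - 1, b + k)
        keep.update(range(lo, hi + 1))
    return sorted(keep)


# A 200-row table whose top and bottom edges sit at rows 0 and 199.
kept = rows_to_keep(boundaries=[0, 199], k=4, n_rows=200)
print(len(kept), kept[:5], kept[-5:])
```

The same windowing idea applies to columns. A larger `k` keeps more context around each table edge at the cost of more tokens.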

### Return type

```python
class SheetResult:
    name: str              # Sheet name
    original: str          # Raw |Address, Value| representation
    compressed: str        # Compressed output
    original_tokens: int   # Token count before compression
    compressed_tokens: int # Token count after compression
```
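
The two token counts are enough to compute a reduction figure per sheet. The snippet below is a small sketch using a stand-in dataclass that mirrors the documented fields, so it runs without the package installed; `reduction_percent` is a hypothetical helper, not a library function.

```python
from dataclasses import dataclass


@dataclass
class SheetResult:  # stand-in mirroring the documented fields
    name: str
    original: str
    compressed: str
    original_tokens: int
    compressed_tokens: int


def reduction_percent(sheet):
    """Percentage of tokens saved; 0.0 when the sheet has no tokens."""
    if sheet.original_tokens == 0:
        return 0.0
    saved = sheet.original_tokens - sheet.compressed_tokens
    return 100.0 * saved / sheet.original_tokens


lookup = SheetResult("Lookup", "", "", original_tokens=1922, compressed_tokens=186)
print(f"{lookup.name}: {reduction_percent(lookup):.1f}% fewer tokens")
```

With real results, the same function can be applied to each item returned by `xlcompress.compress(...)`.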

## How it works

The compression pipeline runs in three stages:

**1. Structural compression** — Detects table boundaries using heuristic analysis of cell patterns, then keeps only rows and columns near detected tables. Large empty regions are removed entirely.

**2. Data-format aggregation** — Groups contiguous cells of the same type (integers, dates, text, currencies, etc.) into labeled ranges. A column of 100 integers becomes `(IntNum|A1:A100)` instead of 100 separate entries.

**3. Smart output selection** — Renders the result in the chosen mode. For sparse sheets where aggregation would inflate the output, automatically falls back to vanilla encoding.
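
As a rough illustration of stage 2, the sketch below collapses consecutive same-type cells in a single column into labeled ranges. It is a deliberate simplification: it works on one column with `itertools.groupby` rather than a BFS flood-fill over the full grid, the regex rules are assumptions, and only the `IntNum` and `DateData` labels come from this README (`Text` is used here as a catch-all).

```python
import re
from itertools import groupby

# Hypothetical type rules, checked in order; anything else is labeled "Text".
PATTERNS = [
    ("IntNum", re.compile(r"-?\d+")),
    ("DateData", re.compile(r"\d{4}-\d{2}-\d{2}")),
]


def cell_type(value):
    """Return the first label whose pattern matches the whole value."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(value):
            return label
    return "Text"


def aggregate_column(col, values):
    """Collapse consecutive same-type cells in one column into labeled ranges."""
    parts = []
    row = 1
    for label, run in groupby(values, key=cell_type):
        length = len(list(run))
        parts.append(f"({label}|{col}{row}:{col}{row + length - 1})")
        row += length
    return ",".join(parts)


integers = [str(n) for n in range(100)]
print(aggregate_column("A", integers))
print(aggregate_column("B", ["Date", "2024-01-05", "2024-01-06", "17"]))
```

A column of 100 integers collapses to a single range, while a mixed column produces one range per run of the same type.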

### Why Rust?

The pipeline involves BFS flood-fill over cell grids, regex-based type detection, and boundary analysis — all CPU-bound work that benefits from compiled performance. The Python interface is a thin PyO3 wrapper over the Rust implementation, with no Python dependencies.

## Pipeline

Based on the SheetCompressor pipeline from:

> Tian, Y., et al. "SpreadsheetLLM: Encoding Spreadsheets for Large Language Models." arXiv:2407.09025, 2024.

## Architecture

| Crate | Role |
|---|---|
| `xlcompress` | Python bindings (PyO3), Excel parsing (calamine), pipeline orchestration |
| `compress_aggregation` | Data-format aggregation via BFS flood-fill |
| `compress_structure` | Structural compression with boundary-based filtering |
| `heuristic_detection` | Table boundary detection (TableSense hybrid algorithm) |
| `compress_excel` | Standalone CLI for JSONL input |
| `xlsb_to_xlsx` | .xlsb to .xlsx converter |
| `wasm_api` | WebAssembly bindings for browser use |

## Supported inputs

| Input | Function | Notes |
|---|---|---|
| `.xlsx` | `compress()` | Excel 2007+ XML format |
| `.xlsb` | `compress()` | Excel binary format |
| Bytes | `compress()` | Raw file bytes with `format="xlsx"` or `format="xlsb"` |
| CSV | `compress_csv()` | File path, string, or file-like object |
| Text | `compress_text()` | Tab-delimited or custom delimiter |
| 2D grid | `compress_grid()` | `list[list[str]]` directly |

## Browser UI

A drag-and-drop browser interface is included in `docs/`. Serve it locally or deploy to GitHub Pages.

## License

MIT

