Metadata-Version: 2.4
Name: sas-parquet
Version: 0.1.0
Summary: Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.
Project-URL: Homepage, https://github.com/kulbshar/sas-parquet
Project-URL: Documentation, https://github.com/kulbshar/sas-parquet#readme
Project-URL: Repository, https://github.com/kulbshar/sas-parquet
Project-URL: Issues, https://github.com/kulbshar/sas-parquet/issues
Project-URL: Changelog, https://github.com/kulbshar/sas-parquet/blob/main/CHANGELOG.md
Author-email: Kulbhushan Sharma <sharma.kul.1982@gmail.com>
License: MIT
License-File: LICENSE
Keywords: csv,data-comparison,data-engineering,data-validation,etl,parquet,polars,pyarrow,sas,sas7bdat
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: pyreadstat>=1.2.0
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: ui
Requires-Dist: streamlit>=1.30.0; extra == 'ui'
Description-Content-Type: text/markdown

# sas-parquet

[![PyPI](https://img.shields.io/pypi/v/sas-parquet.svg)](https://pypi.org/project/sas-parquet/)
[![Python](https://img.shields.io/pypi/pyversions/sas-parquet.svg)](https://pypi.org/project/sas-parquet/)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
[![Tests](https://img.shields.io/badge/tests-105%20passing-brightgreen.svg)](#development)

Convert SAS/CSV files to Parquet, detect unique keys, and compare datasets — chunk-by-chunk, memory-safe.

Built for working with large SAS datasets (`.sas7bdat`) that don't fit in memory. Streams through 100+ GB files on a 16 GB laptop without breaking a sweat.

---

## Why

If you've ever tried to:

- Convert a 50 GB `.sas7bdat` file to Parquet and watched your machine OOM
- Compare a SAS-exported Parquet against a "ground truth" Parquet and not know which column is the primary key
- Validate a SAS-to-Python pipeline migration row-by-row without loading both datasets into memory

…this tool is for you.

## What's inside

Three composable command-line tools and a Streamlit web UI:

| Command | What it does |
|---|---|
| `sas-parquet convert` | Stream a `.sas7bdat`, `.csv`, `.csv.gz`, or `.parquet` file into one or more Parquet chunks. Smart chunk sizing aims for a target MB per file. |
| `sas-parquet detect-keys` | Find the smallest combinations of columns whose values are unique. Useful when you don't know the primary key of a dataset. |
| `sas-parquet compare` | Compare two Parquet datasets chunk-by-chunk on a composite key. Reports row counts, schema differences, per-column value mismatches with sample rows. |
| `sas-parquet ui` | Launch the four-tab Streamlit UI: Convert · Detect Keys · Compare · Full Pipeline. |

## Install

```bash
pip install sas-parquet              # core CLIs
pip install "sas-parquet[ui]"        # adds the Streamlit web UI
pip install "sas-parquet[ui,dev]"    # everything (ruff, pytest, etc.)
```

Requires Python 3.10+.

## Quick start

### Convert SAS → Parquet

```bash
# SAS → chunked Parquet (~256 MB per chunk)
sas-parquet convert data/big.sas7bdat --mb-size 256

# SAS → single Parquet file
sas-parquet convert data/big.sas7bdat --single-file

# CSV → chunked Parquet
sas-parquet convert data/big.csv --mb-size 256

# Pick specific columns
sas-parquet convert data/big.sas7bdat --columns loan_id period balance
```

### Detect unique composite keys

Don't know which columns form the primary key of your dataset? Let the tool find them:

```bash
sas-parquet detect-keys ./data/loans_parquet/ --first-only
```

```
[detect] Total rows: 4,562,891
[detect] Auto-selected 12 candidate(s):
   - loan_id, period, balance, status, origination_date, ...
[key-detect] Searching up to size 4 across 12 candidate column(s)...
--- size 1: 12 combo(s)
  ['loan_id'] -> 1,140,723 dup keys
--- size 2: 66 combo(s)
  ['loan_id', 'period'] -> UNIQUE
[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']
```

### Compare two Parquet datasets row-by-row

```bash
sas-parquet compare \
    --source-dir ./data/sas_chunks/ \
    --target-path ./data/prod.parquet \
    --keys loan_id period \
    --output-dir ./reports/
```

A report file is written with:

- Source and target row counts (from Parquet metadata — milliseconds even on TB-scale files)
- Columns present only on one side
- Data type differences
- Per-column value mismatch counts, sorted by severity
- Sample mismatched rows with key + source value + target value
- Per-chunk timing

### Launch the web UI

```bash
sas-parquet ui
# Opens http://localhost:8501
```

Four tabs let you run any single step or chain them end-to-end:

- **Convert** — point at a `.sas7bdat` / `.csv` / `.csv.gz` / `.parquet` file, set a target MB per chunk, click run.
- **Detect Keys** — point at a Parquet file or directory, see the minimal unique composite key(s).
- **Compare** — pick source and target Parquet, multiselect the composite key, get a per-column mismatch report with downloadable text output.
- **Full Pipeline** — chain all three: SAS → Parquet → key detection → comparison, with an optional key override if you'd rather skip auto-detection.

## Runnable examples

The three examples below are self-contained: each one generates its own data, runs the relevant command(s), and shows the expected output. All you need is `pip install sas-parquet`.

### Example 1 — Convert CSV → Parquet

```bash
mkdir -p /tmp/sp_demo && cd /tmp/sp_demo

# Generate a 1,000-row CSV
python -c "
import polars as pl
pl.DataFrame({
    'loan_id': list(range(1000)),
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(i * 1.5, 2) for i in range(1000)],
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5
```

Expected output:

```
[convert] Input  : /tmp/sp_demo/loans.csv  (csv)
[convert] Output : /tmp/sp_demo/loans  (chunked)
[convert] Sampling for chunk-size estimate...
[convert] Chunk size: ~500,000 rows per chunk
  chunk 000 -> loans_000.parquet (1,000 rows, 0.0 MB)
[convert] Done. 1,000 rows in 1 chunk(s) -> 0.0 MB in 0.2s
```

### Example 2 — Detect a composite key

Generate a dataset where neither `loan_id` nor `period` alone is unique, but the pair `(loan_id, period)` is. The tool should figure that out without hints.

```bash
mkdir -p /tmp/sp_demo2 && cd /tmp/sp_demo2

python -c "
import polars as pl
pl.DataFrame({
    'loan_id': [i // 2 for i in range(1000)],                                  # each loan appears in 2 periods
    'period':  ['Q1' if i % 2 == 0 else 'Q2' for i in range(1000)],
    'balance': [round(((i // 2) % 100 + 1) * 100.0, 2) for i in range(1000)],  # cycles every 100 loans
    'status':  ['ACTIVE' if i % 3 else 'CLOSED' for i in range(1000)],
}).write_csv('loans.csv')
"

sas-parquet convert loans.csv --mb-size 5
sas-parquet detect-keys loans/
```

Expected `detect-keys` output (abridged):

```
[detect] Auto-selected 4 candidate(s):
   - loan_id, balance, period, status

[key-detect] Searching up to size 4 across 4 candidate column(s)...
--- size 1: 4 combo(s)
  ['loan_id'] -> 500 dup keys
  ['balance'] -> 100 dup keys
  ['period']  -> 2 dup keys
  ['status']  -> 2 dup keys
--- size 2: 6 combo(s)
  ['loan_id', 'period'] -> UNIQUE
  ...

[detect] Found 1 minimal unique key(s):
   1. ['loan_id', 'period']
```

### Example 3 — Compare two datasets with known mismatches

Mutate three rows in the target and verify the tool catches exactly those mismatches.

```bash
cd /tmp/sp_demo2   # reusing the dataset from example 2

# Build a target with 6 deliberate mismatches (3 loans × 2 periods)
python -c "
import polars as pl
df = pl.read_parquet('loans/loans_000.parquet')
df = df.with_columns(
    pl.when(pl.col('loan_id').is_in([10, 20, 30]))
      .then(pl.col('balance') + 0.99)
      .otherwise(pl.col('balance'))
      .alias('balance')
)
df.write_parquet('target.parquet')
"

sas-parquet compare \
    --source-dir loans/ \
    --target-path target.parquet \
    --keys loan_id --keys period \
    --output-dir reports/
```

Expected output:

```
[compare] Source chunks : 1 (1,000 rows total)
[compare] Target rows   : 1,000
[compare] Columns to compare : 2
[compare] Chunk 1/1: loans_000.parquet
   rows=1,000  matched=1,000  source_only=0  (0.2s)
[compare] Matched: 1,000 / 1,000
[compare] 1 column(s) have value mismatches.
[compare] Report: /tmp/sp_demo2/reports/compare_loans_<timestamp>.txt
```

The report file shows the 6 mismatched rows with their key, source value, and target value:

```
[balance] — first 6 mismatch(es):
   {'loan_id': 10, 'period': 'Q1', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 10, 'period': 'Q2', 'balance__source': 1100.0, 'balance__target': 1100.99}
   {'loan_id': 20, 'period': 'Q1', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 20, 'period': 'Q2', 'balance__source': 2100.0, 'balance__target': 2100.99}
   {'loan_id': 30, 'period': 'Q1', 'balance__source': 3100.0, 'balance__target': 3100.99}
   {'loan_id': 30, 'period': 'Q2', 'balance__source': 3100.0, 'balance__target': 3100.99}
```

## Design notes

- **Smart chunk sizing** — samples 10K rows, writes a probe Parquet, measures compressed bytes/row, then derives a chunk size targeting your `--mb-size` (default 50 MB).
- **Streaming everywhere** — uses `pyreadstat.read_file_in_chunks`, `polars.scan_*` lazy frames, and `pyarrow.ParquetWriter` for incremental writes. No data ever fully materializes in memory.
- **Chunk-by-chunk comparison** — loads one source chunk, semi-joins target down to matching keys (streaming), compares with null-safe equality, frees memory, repeats. Works on multi-TB datasets.
- **SAS empty-string handling** — SAS exports missing character values as `""`. The compare tool normalizes these to `null` before comparing strings, so they don't show up as false mismatches against true `null` in Parquet.
- **Anti-join source-only detection** — keys present in source but missing from target are counted and surfaced separately.
- **Metadata-only row counts** — `pyarrow.parquet.read_metadata` gives exact row counts in milliseconds, even on TB-scale datasets, without scanning data.
- **Schema-drift-aware reading** — chunked datasets with column dtype drift (one chunk has a column as `Null`, another as `Float64`) are unified at read time.

## What this is not

- **A full SAS replacement.** It reads `.sas7bdat`. It does not run PROC SQL, macros, or formats. Use [SAS Viya](https://www.sas.com/) or [pysas](https://pypi.org/project/sas7bdat/) if you need those.
- **Spark or Dask.** Single-machine tool optimized for "fits on a beefy laptop or VM" workloads. If you have a cluster and 10+ TB of data, use Spark.
- **A diff tool for arbitrary files.** Source and target must share a composite key. Schema-drift handling exists but is intentionally minimal.

## Limitations and known issues

- **SAS numerics are Float64.** SAS stores all numbers as 8-byte floats. After conversion, columns that are conceptually integers (IDs, counts) arrive as `f64`. A future release will auto-downcast int-like floats.
- **Float equality is exact.** Comparisons use null-safe `==`. No fuzzy tolerance for floats yet — `1.0` vs `1.0000001` is a mismatch. `--rel-tol` is planned for 0.2.0.
- **`pyreadstat` quirks on Windows.** Some `.sas7bdat` files have out-of-range file timestamps that crash `pyreadstat` 1.3.4's internal `datetime.fromtimestamp` call. We patch around this transparently in the SAS reader.
- **Conversion is single-threaded.** Polars parallelizes the column-level work, but the chunk loop itself is serial. For 100+ GB files this is usually I/O bound anyway.

## Development

```bash
git clone https://github.com/kulbshar/sas-parquet.git
cd sas-parquet
python -m venv .venv
source .venv/bin/activate           # or  .venv\Scripts\activate  on Windows
pip install -e ".[ui,dev]"

pytest                               # 105 tests
ruff check src tests
```

The test suite covers every module end-to-end with synthetic Parquet fixtures, including the Streamlit UI (driven via `streamlit.testing.v1.AppTest`).

## Changelog

See [CHANGELOG.md](CHANGELOG.md).

## License

MIT — see [LICENSE](LICENSE).
