Metadata-Version: 2.4
Name: proccompy
Version: 0.4.2
Summary: Structured comparison reports for tabular data in Python — modeled on SAS PROC COMPARE.
Project-URL: Homepage, https://github.com/kulbshar/proccompy
Project-URL: Repository, https://github.com/kulbshar/proccompy
Project-URL: Issues, https://github.com/kulbshar/proccompy/issues
Project-URL: Changelog, https://github.com/kulbshar/proccompy/blob/main/CHANGELOG.md
Project-URL: Documentation, https://github.com/kulbshar/proccompy#readme
Author-email: Kulbhushan Sharma <sharma.kul.1982@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: comparison,data-engineering,dataframe,duckdb,etl,pandas,polars,proc-compare,sas,validation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: duckdb>=0.10.0
Requires-Dist: polars>=1.0.0
Provides-Extra: pandas
Requires-Dist: pandas>=2.0.0; extra == 'pandas'
Provides-Extra: test
Requires-Dist: pandas>=2.0.0; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Description-Content-Type: text/markdown

# proccompy

**Structured comparison reports for tabular data in Python.**

If you've ever needed to prove that two DataFrames hold the same data — after a pipeline rewrite, between source and ETL output, across a database migration, or in a daily regression check — you've probably found that `pandas.DataFrame.compare()` shows you cell-level diffs, `assert_frame_equal` gives you pass/fail, and neither tells you the thing you actually want to know: **where do they differ, by how much, in which direction, and does it matter?**

proccompy produces a structured diagnostic report that answers those questions in the order you'd ask them. It's modeled on SAS PROC COMPARE — a 40-year-old procedure that data analysts trust because it makes "what changed?" a 30-second scan instead of an afternoon of investigation.

```python
import polars as pl
import proccompy as pc
from proccompy import T

base    = pl.read_parquet("production.parquet")
rewrite = pl.read_parquet("new_implementation.parquet")

result = pc.compare(
    base, rewrite,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # rounding-level diffs are OK
        "rate":          T.percent(0.001),   # but >0.1% relative is not
    },
)

result.to_lst("validation.lst")     # SAS-style listing for archival / review
result.assert_matches()              # raises AssertionError if they don't — for CI
```

## Why this exists

The Python ecosystem has many ways to *find* differences between two DataFrames. What it doesn't have is a structured report that helps you *interpret* them.

- `pandas.DataFrame.compare()` returns a multi-index DataFrame of cell-level diffs. No summary, no row-presence breakdown, no statistics.
- `pandas.testing.assert_frame_equal` raises on first difference. Tells you something is wrong, not what.
- [`datacompy`](https://github.com/capitalone/datacompy) has a report, but it's a flat text dump optimized for a single screen. No per-column difference statistics, no SAS conventions, no `.lst` output, no streaming/out-of-core support.
- Schema-validation tools (Great Expectations, Pandera) check one frame against rules. They don't compare two frames.

The result: every team building a data pipeline ends up writing their own ad-hoc comparison script. That script's report is always a little worse than PROC COMPARE's because PROC COMPARE has had four decades of refinement.

proccompy brings that report to Python natively — running on polars, pandas, or DuckDB; pluggable into pytest; producing `.lst` files SAS analysts already know how to read.

## Who is this for

**Data engineers doing pipeline validation.** You're rewriting an implementation, migrating a database, or proving that yesterday's output matches today's. You want tolerance per column (because floating-point drift isn't "wrong"), the differences summarized and broken down, and a structured object you can assert on in CI.

**ETL teams comparing output to source.** You need a recurring check that your pipeline faithfully reproduces source data, with the diagnostic depth to tell rounding drift from real corruption.

**SAS analysts who miss the PROC COMPARE listing.** You moved to Python (willingly or not) and want a comparison report your boss will accept — `.lst` format, with the same sections you've read for 20 years.

**Testing frameworks for data pipelines.** You're building (or using) dbt tests, custom assertion harnesses, regression suites. proccompy gives you a clean programmatic `CompareResult` to embed: structured fields, pytest-friendly.

## What the report tells you

A real PROC COMPARE listing isn't arbitrary — it's a diagnostic protocol. Each section answers a question, in the order you'd ask it:

1. **Dataset Summary** — are we comparing what I think we're comparing?
2. **Variables Summary** — do the schemas match?
3. **Observation Summary** — do the same keys exist on both sides?
4. **Values Comparison Summary** — how broken is this, in one glance?
5. **Variables with Unequal Values** — which columns are responsible?
6. **Statistics for Numeric Differences** — bias direction, spread, magnitude
7. **Comparison Results for Observations** — show me the actual numbers

You can scan a 50-page `.lst` in 30 seconds and decide whether to investigate or close the ticket. That's the value.

## Install

```bash
pip install proccompy
```

Requires Python 3.10+. Pulls in `duckdb` and `polars`. Add `pandas` only if you need it as input.

## Usage

### The basic comparison

```python
import polars as pl
import proccompy as pc

base    = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.0, 30.0]})
compare = pl.DataFrame({"id": [1, 2, 3], "x": [10.0, 20.5, 30.0]})

result = pc.compare(base, compare, id_columns="id")

result.matches             # → False
result.summary()           # → per-column polars DataFrame
print(result.report())     # → terminal-style report
```

Inputs can be polars DataFrames, pandas DataFrames, or DuckDB relations.

### Tolerances per column

```python
from proccompy import T

result = pc.compare(
    base, compare,
    id_columns="account_id",
    tolerances={
        "dollar_amount": T.absolute(0.01),   # within 1 cent → equal
        "rate":          T.percent(0.001),   # within 0.1% relative → equal
        "category":      T.exact(),          # exact match required (default)
    },
)
```

Three tolerance types: `T.exact()`, `T.absolute(value)`, `T.percent(value)`. Per-column. Defaults to exact for columns you don't specify.

### Multiple ID columns

```python
result = pc.compare(
    base, compare,
    id_columns=["sector", "naics_code", "year"],
)
```

### SAS-style .lst output

```python
result.to_lst("validation.lst")
```

Produces a paginated, fixed-width, ASCII-only `.lst` file with form-feed page breaks and SAS-style headers. Looks like real PROC COMPARE output. Open it in any text editor; attach it to a ticket; share it with anyone who used to write SAS.

### Assertion for CI

```python
def test_pipeline_rewrite_matches_production():
    base = pl.read_parquet("expected.parquet")
    actual = run_pipeline()
    pc.compare(base, actual, id_columns="id").assert_matches()
```

Raises `AssertionError` with structured per-column failure detail. Drops into any pytest suite.

### The structured result

`CompareResult` exposes everything as DataFrames you can post-process:

```python
result.summary()             # per-column summary
result.unequal_rows()        # both-side rows with at least one diff
result.base_only_rows()      # rows only in base
result.compare_only_rows()   # rows only in compare
result.diff_dataset()        # SAS OUT= equivalent: BASE / COMPARE / DIF / PERCENT
result.joined()              # full joined frame for custom analysis
```

### Command line

proccompy installs a `proccompy` console script for shell-level use — no Python required:

```bash
# Simple: compare two parquet files, report to stdout, exit 0 or 1
proccompy expected.parquet actual.parquet --id account_id

# With tolerances and multiple ID columns
proccompy old.csv new.csv --id "sector,year" \
    --tolerance "amount=abs:0.01" \
    --tolerance "rate=pct:0.001"

# Pipe everything to disk for archival / CI artifacts
proccompy a.parquet b.parquet --id id \
    --lst report.lst \
    --summary-csv summary.csv \
    --diff-parquet diffs.parquet \
    --quiet
```

Exit codes: **0** if the datasets match (within any specified tolerances), **1** if they differ in any compared value, row presence, or column overlap. Use as a CI gate:

```bash
proccompy expected.parquet actual.parquet --id id --quiet && deploy.sh
```

Supports parquet, CSV, and TSV input; format is inferred from extension or set with `--format`. Run `proccompy --help` for the full reference.

### Per-column difference statistics

For numeric columns, the summary includes diagnostic statistics computed over matched non-null pairs (SAS PROC COMPARE STATS semantics):

```python
summary = result.summary()
# Columns: column, base_dtype, compare_dtype, n_compared, n_unequal,
#          n_null_mismatch, max_abs_diff, mean_diff, std_diff, min_diff,
#          n_base_greater, n_compare_greater
```

These let you distinguish three failure modes that `max_abs_diff` alone can't:

| mean_diff | std_diff | n_base_gt | n_compare_gt | Diagnosis |
|---|---|---|---|---|
| `+2.0` | `0.0` | `0` | `195` | Constant positive offset (unit conversion?) |
| `0.0` | `5.1` | `10` | `10` | No bias but high spread (real drift) |
| `+0.1` | `0.0` | `0` | `195` | Small uniform drift, within tolerance |

## Alternatives — when to use what

| | Best for | Limitation |
|---|---|---|
| `pandas.DataFrame.compare()` | Quick cell-level diff in a notebook | No summary, no report, no tolerance |
| `pandas.testing.assert_frame_equal` | Strict pass/fail in tests | Stops at first diff, no diagnostic depth |
| [`datacompy`](https://github.com/capitalone/datacompy) | Flat text report for one-screen review | Memory-bound, global-only tolerance, no `.lst`, no per-column diff stats |
| Great Expectations, Pandera | Schema/distribution validation on one frame | Don't compare two frames |
| **proccompy** | Structured diagnostic report for pipeline validation | Out-of-core via DuckDB engine only (Spark engine not yet implemented) |

If you need a one-line cell diff, use `pandas.compare()`. If you need a structured report you can attach to a PR, a Jira ticket, or a SAS analyst's email, use proccompy.

## What's under the hood

- **DuckDB engine.** The comparison is a `FULL OUTER JOIN` keyed on `id_columns` with per-column unequal expressions. DuckDB streams this on disk, so you're not memory-bound even for hundreds of millions of rows.
- **Polars I/O.** Frames register zero-copy via Arrow; results come back as polars DataFrames.
- **Null-safe.** Uses SQL `IS NOT DISTINCT FROM` semantics. No sentinel-string hacks.
- **Strict duplicate-key detection.** If your ID columns aren't unique, you'll get a clear error with counts, not silent wrong answers.
- **Apache 2.0 licensed.**

## Roadmap

- **v0.4** — BY groups (stratified comparison summaries)
- **v0.5** — `result.to_html()` self-contained HTML report
- **v1.0** — pytest plugin, full documentation site, PyPI release

Stable: top-level `pc.compare()`, `T.exact/absolute/percent`, `CompareResult` accessor methods, `.lst` format, CLI flags and exit codes.

Subject to change: `_joined` internal structure, exact `.lst` column widths.

## Contributing

Issues and PRs welcome. Run the test suite with `pytest tests/`. Currently 92 tests; please add one for any new behavior.

## License

Apache 2.0. See `LICENSE`.

---

Built by [Kulbhushan Sharma](https://github.com/kulbshar) because the PROC COMPARE report was the SAS feature I missed most after switching to Python.
