Metadata-Version: 2.4
Name: fastrecon
Version: 0.1.0
Summary: High-performance reconciliation engine for SQL tables, queries, CSV, and Parquet using DuckDB, Polars, and Arrow.
Project-URL: Homepage, https://github.com/yourname/fastrecon
Project-URL: Issues, https://github.com/yourname/fastrecon/issues
Author: fastrecon contributors
License: MIT
License-File: LICENSE
Keywords: arrow,data-quality,diff,duckdb,etl,polars,reconciliation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.9
Requires-Dist: duckdb>=1.0.0
Requires-Dist: orjson>=3.0.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pyarrow>=16.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: tenacity>=8.0.0
Provides-Extra: mssql
Requires-Dist: pyodbc>=5.0; extra == 'mssql'
Provides-Extra: mysql
Requires-Dist: pymysql>=1.1; extra == 'mysql'
Provides-Extra: postgres
Requires-Dist: psycopg[binary]>=3.1; extra == 'postgres'
Provides-Extra: test
Requires-Dist: pytest>=8.0.0; extra == 'test'
Description-Content-Type: text/markdown

# fastrecon

**A focused, high-performance reconciliation engine** for comparing SQL tables, SQL queries, CSV files, and Parquet files at scale. Built on **DuckDB**, **Polars**, and **Apache Arrow**.

> fastrecon is not a pandas replacement. It is a *reconciliation engine* — built specifically for proving that two datasets are (or aren't) the same.

## Why fastrecon

Most data teams hand-roll reconciliation with pandas, ad-hoc SQL, or shell scripts, none of which scales well. fastrecon gives you one consistent API across every common combination:

| Left           | Right         |
| -------------- | ------------- |
| SQL table      | SQL table     |
| SQL table      | SQL query     |
| SQL query      | SQL query     |
| SQL table/query| CSV / Parquet |
| CSV / Parquet  | CSV / Parquet |

Everything is normalized into a single internal **relation** (a DuckDB view), then compared with pushdown-friendly SQL — no whole-dataset materialization in Python.

## Install

```bash
pip install fastrecon                 # core
pip install "fastrecon[postgres]"     # + psycopg
pip install "fastrecon[mysql]"        # + pymysql
pip install "fastrecon[mssql]"        # + pyodbc
```

Requires Python 3.9+.

## Quick start

```python
from fastrecon import compare, SqlTable, ParquetFile

result = compare(
    left=SqlTable(conn="postgresql://user:pw@host/db", table="public.orders"),
    right=ParquetFile(path="orders.parquet"),
    keys=["order_id"],
    compare_mode="keyed",
    exclude_columns=["load_ts"],
    tolerances={"amount": 0.01},
)

print(result.summary())
print(result.to_json(indent=True))
```

Sample output:

```
status               : MISMATCH
compare_mode         : keyed
row_count_left       : 1,000,001
row_count_right      : 1,000,000
schema_match         : True
data_match           : False
missing_in_left      : 0
missing_in_right     : 1
changed_rows         : 4
duplicate_keys_left  : 0
duplicate_keys_right : 0
elapsed_sec          : 1.842
engine               : duckdb+polars
```

## Compare modes

| Mode       | What it does                                                    |
| ---------- | --------------------------------------------------------------- |
| `schema`   | Column names, types, missing/extra columns                      |
| `rowcount` | Schema + row counts on both sides                               |
| `keyed`    | Schema + counts + key-based diff (missing / changed / dup keys) |
| `profile`  | Schema + counts + per-column null/distinct/min/max              |
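
To make the `schema` mode concrete, here is a rough pure-Python sketch of the kind of check it performs (the column names and type strings are invented for illustration; the real engine reads types from DuckDB):

```python
# Hypothetical schemas for the two sides, as {column: type} mappings
left_schema = {"order_id": "BIGINT", "amount": "DOUBLE", "load_ts": "TIMESTAMP"}
right_schema = {"order_id": "BIGINT", "amount": "DECIMAL(10,2)"}

# Columns present on only one side
missing_in_right = sorted(left_schema.keys() - right_schema.keys())
missing_in_left = sorted(right_schema.keys() - left_schema.keys())

# Shared columns whose declared types disagree
type_mismatches = {
    col: (left_schema[col], right_schema[col])
    for col in left_schema.keys() & right_schema.keys()
    if left_schema[col] != right_schema[col]
}

print(missing_in_right)  # ['load_ts']
print(type_mismatches)   # {'amount': ('DOUBLE', 'DECIMAL(10,2)')}
```

Each mode builds on the previous one, so `keyed` and `profile` always include this schema check first.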

## Configuration & normalization

Reconciliation is mostly about handling the messy reality of "the same" data:

```python
from fastrecon import ReconConfig, compare

cfg = ReconConfig(
    trim_strings=True,
    case_sensitive=False,
    null_equals_empty=True,
    decimal_scale=2,
    timestamp_tz="UTC",
    column_mapping={"orderId": "order_id"},   # left -> right rename
    exclude_columns=["load_ts", "etl_batch"],
    tolerances={"amount": 0.01, "tax": 0.01},
    sample_limit=200,
)

result = compare(left, right, keys=["order_id"], config=cfg)
```
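
The knobs above boil down to simple value transforms applied before cells are compared. The following is an illustrative pure-Python sketch of what `trim_strings`, `case_sensitive=False`, `null_equals_empty`, and a per-column tolerance mean for a single cell; the function names are invented here, and the real engine expresses the same logic as SQL inside DuckDB:

```python
def normalize(value, *, trim=True, casefold=True, null_equals_empty=True):
    """Canonicalize one cell before comparison (illustrative only)."""
    if value is None:
        return "" if null_equals_empty else None
    if isinstance(value, str):
        if trim:
            value = value.strip()
        if casefold:
            value = value.casefold()
    return value

def equal_with_tolerance(a, b, tol=0.0):
    """Numeric cells match if within tol; everything else must be exact."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return abs(a - b) <= tol
    return a == b

print(normalize("  ACME  ") == normalize("acme"))     # True
print(equal_with_tolerance(10.004, 10.01, tol=0.01))  # True
print(equal_with_tolerance(10.0, 10.02, tol=0.01))    # False
```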

## Result object

`compare()` returns a `ReconResult` with:

- `status` — `MATCH` / `MISMATCH` / `ERROR`
- `row_count_left`, `row_count_right`
- `schema_match`, `data_match`, `schema_diff`
- `missing_in_left`, `missing_in_right`, `changed_rows`
- `duplicate_keys_left`, `duplicate_keys_right`
- `sample_mismatches` — sample rows for each mismatch class
- `column_stats` — populated in `profile` mode
- `execution_metrics` — `elapsed_sec`, `engine`

Use `result.summary()` for a printable report or `result.to_json()` / `result.to_dict()` to ship it to a logger, dashboard, or CI gate.
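
As an illustration of the CI-gate use case, here is a sketch that consumes a JSON payload shaped like the fields listed above and fails the build on mismatch. The payload here is hand-written for the example, not real fastrecon output:

```python
import json

# Hypothetical payload shaped like ReconResult.to_json() output
payload = json.loads(
    '{"status": "MISMATCH", "missing_in_right": 1, "changed_rows": 4}'
)

gate_passed = payload["status"] == "MATCH"
if not gate_passed:
    print(f"reconciliation failed: {payload['missing_in_right']} missing, "
          f"{payload['changed_rows']} changed rows")
    # in a CI script you would then raise SystemExit(1)
```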

## Sources

```python
from fastrecon import SqlTable, SqlQuery, CsvFile, ParquetFile

SqlTable(conn="postgresql://...", table="schema.orders")
SqlQuery(conn="postgresql://...", query="SELECT * FROM orders WHERE dt >= '2026-01-01'")
CsvFile("/path/to/orders.csv", options={"delim": ","})
ParquetFile("/path/to/orders.parquet")        # also supports DuckDB globs: 'data/*.parquet'
```

## Architecture

```
fastrecon/
├── api.py                  # public compare()
├── config.py               # ReconConfig
├── sources/                # SqlTable / SqlQuery / CsvFile / ParquetFile
├── engines/                # DuckDB execution engine
├── compare/                # schema / rowcount / keyed / profile
├── output/                 # ReconResult (summary, to_dict, to_json)
└── utils/                  # normalization, logging
```

Internally:

1. Each source is registered into an in-memory DuckDB connection as a view (zero-copy from Arrow when possible).
2. Schema is read with `DESCRIBE`.
3. Row counts, anti-joins, and inner joins run in DuckDB — no full Python materialization.
4. Mismatch samples are pulled lazily, capped by `sample_limit`.

## Roadmap

- ✅ MVP: package, sources, schema/rowcount/keyed/profile compare, JSON result, tests
- ⏳ Partition-wise compare (date / id / hash buckets)
- ⏳ HTML and JSON report generators
- ⏳ Rust extension (PyO3) for hashing / normalization hot path
- ⏳ Distributed mode (S3 + Spark connector)

## License

MIT
