Metadata-Version: 2.4
Name: rowflow
Version: 0.1.0
Summary: Catch silent row-count corruption in pandas pipelines at runtime, and see it as a flow diagram.
Project-URL: Homepage, https://github.com/Tommasoaiello13/rowflow
Project-URL: Issues, https://github.com/Tommasoaiello13/rowflow/issues
Project-URL: Changelog, https://github.com/Tommasoaiello13/rowflow/blob/main/CHANGELOG.md
Author: Tommaso Aiello
License: MIT License
        
        Copyright (c) 2026 Tommaso Aiello
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: data-engineering,data-lineage,data-quality,join,merge,pandas,row-explosion,visualization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Visualization
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Requires-Dist: pandas>=1.5
Provides-Extra: dev
Requires-Dist: matplotlib>=3.7; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: numpy; extra == 'dev'
Requires-Dist: plotly>=5; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: rich>=13; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: rich
Requires-Dist: rich>=13; extra == 'rich'
Provides-Extra: static
Requires-Dist: matplotlib>=3.7; extra == 'static'
Provides-Extra: viz
Requires-Dist: plotly>=5; extra == 'viz'
Description-Content-Type: text/markdown

<h1 align="center">rowflow</h1>

<p align="center">
  <img src="https://raw.githubusercontent.com/Tommasoaiello13/rowflow/main/assets/logo.png" alt="rowflow" width="92">
</p>

<p align="center">
  <a href="LICENSE"><img alt="License: MIT" src="https://img.shields.io/badge/license-MIT-blue.svg"></a>
  <img alt="Python 3.10+" src="https://img.shields.io/badge/python-3.10%2B-blue.svg">
  <img alt="Built with pandas" src="https://img.shields.io/badge/built%20with-pandas-blue.svg">
  <a href="https://github.com/astral-sh/ruff"><img alt="Code style: Ruff" src="https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json"></a>
  <a href="https://github.com/Tommasoaiello13/rowflow/actions/workflows/ci.yml"><img alt="CI" src="https://github.com/Tommasoaiello13/rowflow/actions/workflows/ci.yml/badge.svg"></a>
</p>

<p align="center">
  <strong>Catch silent row-count corruption in pandas pipelines at runtime — in one import.</strong><br>
  When a join on a "unique" key that isn't actually unique silently multiplies your rows, rowflow
  flags it at the exact line and draws the flow.
</p>

<p align="center">
  <a href="#install">Install</a> ·
  <a href="#quickstart">Quickstart</a> ·
  <a href="#when-should-you-use-it">When to use it</a> ·
  <a href="#how-it-compares">Comparison</a> ·
  <a href="#benchmarks--analysis">Benchmarks</a> ·
  <a href="#what-it-does-not-detect-and-why">Limitations</a>
</p>

---

## Have you ever…

- **joined a dimension table you assumed was unique** — and quietly turned 1,000 rows into 4,700,
  so every total downstream was wrong?
- **watched a revenue number, a row count, or a mean come out too high** and only later traced it to
  a `merge` that fanned out on a duplicated key?
- **reached for pandas' `validate=`** — and realised you'd have to remember to add it, with the right
  cardinality, to every single join?

That's a **many-to-many join explosion**: a key duplicated on *both* sides multiplies rows. It's
quiet — the code runs fine, the DataFrame looks plausible — and you usually find out after a wrong
report has already gone out. rowflow catches it while your code is still running, and shows you the
flow.
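
To make the mechanism concrete, here is a minimal sketch in plain pandas, with no rowflow involved. The frames and column names are invented for illustration: `region_id` 1 appears twice on both sides, so every matching order is emitted twice and the total inflates without any error.

```python
import pandas as pd

# Hypothetical toy data: region_id 1 is duplicated on BOTH sides.
orders = pd.DataFrame({"region_id": [1, 1, 2], "amount": [100, 50, 25]})
regions = pd.DataFrame(
    {"region_id": [1, 1, 2], "region": ["North", "North (dup)", "South"]}
)

merged = orders.merge(regions, on="region_id")  # runs without error or warning

honest_total = int(orders["amount"].sum())    # total before the join
inflated_total = int(merged["amount"].sum())  # region 1 orders are counted twice

print(len(orders), "->", len(merged), "rows")
print(honest_total, "->", inflated_total, "total amount")
```

Rows grow from 3 to 5 and the total from 175 to 325, because the two orders for region 1 each match two region rows.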

> **A bit of context.** This is one of a handful of small tools I'm putting out — each one a problem
> I ran into on my own data work, and the fix I wish I'd had on hand. I wrote rowflow in an evening;
> it isn't trying to be everything. But a silent join explosion has cost me real hours of *"why is
> this total wrong?"* more than once, so here it is in case it saves you some. It's narrow on
> purpose, and honest about where it stops
> ([the limits are spelled out below](#what-it-does-not-detect-and-why)).

<p align="center">
  <img src="https://raw.githubusercontent.com/Tommasoaiello13/rowflow/main/assets/hero_revenue.png" alt="A duplicated key inflated a revenue total from 140 to 240" width="720">
</p>

In [`examples/one_import.py`](examples/one_import.py), one duplicated key in a region table inflates
a revenue total from an honest **140** to a wrong **240**. rowflow flags the offending `merge` and
the exact line; pandas raises nothing.

## Install

```bash
pip install rowflow              # core (pandas only)
pip install "rowflow[viz]"       # interactive Sankey diagram (plotly)
pip install "rowflow[rich]"      # prettier terminal reports
```

Requires Python ≥ 3.10. To try it straight from a clone without installing rowflow (pandas itself must already be installed), run from the source tree:

```bash
git clone https://github.com/Tommasoaiello13/rowflow && cd rowflow
PYTHONPATH=src python examples/one_import.py
```

## Quickstart

One line at the top of any script or notebook watches the whole run and writes `rowflow.html`
(the Sankey) when it ends:

```python
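import pandas as pd
import rowflow.auto  # this import is what switches auto mode on for the whole run

# Toy frames for illustration: region_id 10 is duplicated in both orders and regions.
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2], "region_id": [10, 10, 20], "amount": [50, 40, 50]})
regions = pd.DataFrame({"region_id": [10, 10, 20], "region": ["North", "North (dup)", "South"]})

orders   = orders.merge(customers, on="customer_id")   # customer_id unique in customers: clean many-to-one, fine
enriched = orders.merge(regions,   on="region_id")     # region_id duplicated on both sides -> EXPLOSION, flagged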
```

> The Sankey lands in `rowflow.html` in the working directory — set `ROWFLOW_HTML_PATH` to move it,
> or `ROWFLOW_DISABLE=1` to switch auto mode off. It's a generated file, so add it to your
> `.gitignore`.

Or scope it to a block and get the findings back:

```python
import rowflow

with rowflow.guard() as run:
    out = customers.merge(orders, on="customer_id")
run.render("flow.html")          # the Sankey of what happened
```

**Gate your CI.** In tests, the bundled fixture fails the build on any explosion:

```python
def test_pipeline(no_row_explosion):   # provided fixture
    build_report()
```

Or, in a plain pipeline script (no pytest), make an explosion a hard error:

```python
import rowflow
rowflow.install()
rowflow.configure(policy="raise")      # raises RowExplosionError on the first explosion
run_my_pipeline()
```

If a many-to-many join is **intentional**, declare it the idiomatic pandas way and rowflow stays
silent:

```python
left.merge(right, on="k", validate="many_to_many")   # intent declared -> no warning
```
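
For contrast, here is what pandas' own `validate=` does when used on its own, with no rowflow involved. This is a minimal sketch with invented frames: declaring `many_to_many` performs no check, while declaring a stricter cardinality raises `pandas.errors.MergeError` when the key turns out to be duplicated.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy data: key 1 is duplicated on both sides.
left = pd.DataFrame({"k": [1, 1, 2], "x": [10, 20, 30]})
right = pd.DataFrame({"k": [1, 1, 2], "y": ["a", "b", "c"]})

# "many_to_many" is accepted by pandas without any uniqueness check.
expanded = left.merge(right, on="k", validate="many_to_many")

# "many_to_one" asserts that `k` is unique on the right, which it is not here.
try:
    left.merge(right, on="k", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

print(len(expanded), raised)
```

The catch the README describes is visible here: the check only exists on the joins where someone remembered to write it.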

## When should you use it?

| Use it for… | Why |
|---|---|
| **ETL / reporting pipelines** with several joins | the silent fan-out that corrupts totals is exactly what it catches, at the line |
| **Notebooks** doing ad-hoc joins | one import, a flow diagram at the end, no scaffolding |
| **A CI gate** on a data pipeline | the `no_row_explosion` fixture fails the build if a join explodes |
| **Onboarding / teaching** | shows *where* rows multiplied, with a one-line fix |

And when **not** to bother: if every join in your codebase already passes an explicit `validate=`,
rowflow has nothing to add — it just stays quiet (zero false positives). What it adds is catching the
joins where someone *forgot* to, with no per-call ceremony and a picture of the run.

## How it works

rowflow wraps pandas' `merge` / `DataFrame.merge` / `DataFrame.join` at runtime and, for each call,
records how many rows flowed in and out, plus the **exact call site**. A real explosion is a
*many-to-many* fan-out — a key value duplicated on **both** sides — and it is confirmed in two cheap,
sound stages:

1. an **O(1) gate** — only a join whose output exceeds its larger input is a candidate, so ordinary
   1:1 / 1:many / many:1 joins cost nothing beyond a length comparison;
2. a **key-cardinality check** confirms a true many-to-many, ruling out a legitimate one-to-many join
   (duplicates on one side only) and a disjoint-key outer union (rows grow, no shared duplicate).
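
The two stages above can be sketched in plain pandas as follows. This illustrates the idea and is not rowflow's actual implementation: the function name `looks_like_explosion` and its signature are hypothetical, and it assumes a single key column with the same name on both sides.

```python
import pandas as pd


def looks_like_explosion(left, right, out, on):
    """Hypothetical two-stage check for a merge of `left` and `right` that produced `out`."""
    # Stage 1: length-only gate. An output no larger than the larger input is never a candidate.
    if len(out) <= max(len(left), len(right)):
        return False
    # Stage 2: key-cardinality check. Look for a key value duplicated on both sides.
    dup_left = set(left.loc[left[on].duplicated(), on])
    dup_right = set(right.loc[right[on].duplicated(), on])
    return bool(dup_left & dup_right)


many = pd.DataFrame({"k": [1, 1, 2]})
also_many = pd.DataFrame({"k": [1, 1, 2], "v": ["a", "b", "c"]})
unique = pd.DataFrame({"k": [1, 2, 3]})
disjoint = pd.DataFrame({"k": [7, 8, 9], "v": ["x", "y", "z"]})

# Many-to-many: rows grow and key 1 is duplicated on both sides.
m2m = looks_like_explosion(many, also_many, many.merge(also_many, on="k"), "k")
# One-to-many left join: rows grow, but duplicates sit on the right side only.
one_to_many_left = looks_like_explosion(
    unique, also_many, unique.merge(also_many, on="k", how="left"), "k"
)
# Disjoint outer union: rows grow, but no key is duplicated anywhere.
outer_union = looks_like_explosion(
    unique, disjoint, unique.merge(disjoint, on="k", how="outer"), "k"
)
```

Only the first case is reported; the other two pass the gate because their output grew, and are then cleared by the key check.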

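It never mutates your data, never changes a return value, and by default never raises out of its own
hooks, so instrumented code behaves identically to uninstrumented code. The one deliberate exception
is `policy="raise"`, which you opt into to turn an explosion into a `RowExplosionError`. Non-pandas
backends (Modin, cuDF, Polars) are left untouched.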
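
The wrapping idea itself can be sketched in a few lines. This is a minimal illustration that assumes nothing about rowflow's internals: `record_merges` is a hypothetical helper that temporarily replaces `DataFrame.merge`, logs rows in and out with the caller's file and line, returns the pandas result unchanged, and swallows any error in its own bookkeeping.

```python
import contextlib
import functools
import inspect

import pandas as pd


@contextlib.contextmanager
def record_merges(log):
    """Hypothetical helper: temporarily wrap DataFrame.merge and log rows in/out."""
    original = pd.DataFrame.merge

    @functools.wraps(original)
    def wrapper(self, right, *args, **kwargs):
        result = original(self, right, *args, **kwargs)  # pandas does the real work
        try:
            caller = inspect.currentframe().f_back  # frame that called .merge()
            log.append({
                "rows_left": len(self),
                "rows_right": len(right),
                "rows_out": len(result),
                "site": f"{caller.f_code.co_filename}:{caller.f_lineno}",
            })
        except Exception:
            pass  # bookkeeping failures must never reach the caller
        return result  # returned unchanged

    pd.DataFrame.merge = wrapper
    try:
        yield log
    finally:
        pd.DataFrame.merge = original  # always restore the original method


left = pd.DataFrame({"k": [1, 1, 2]})
right = pd.DataFrame({"k": [1, 1, 2], "v": ["a", "b", "c"]})
with record_merges([]) as log:
    out = left.merge(right, on="k")

print(log[0]["rows_left"], log[0]["rows_right"], "->", log[0]["rows_out"])
```

Restoring the original method in `finally` keeps the patch scoped to the block, which is the same shape as the `rowflow.guard()` usage shown in the Quickstart.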

## How it compares

rowflow watches the **join itself** at runtime — the in-vs-out cardinality of the operation. Schema
validators inspect a frame **in isolation**; lineage tools track **where columns came from**; the
`validate=` argument is a per-call opt-in. None of them is a zero-config runtime guard with a flow
picture.

| | rowflow | pandas `validate=` | [pandera](https://github.com/unionai-oss/pandera) / [Great Expectations](https://github.com/great-expectations/great_expectations) | [dbt](https://github.com/dbt-labs/dbt-core) tests | [datalineagepy](https://pypi.org/project/datalineagepy/) |
|---|---|---|---|---|---|
| Axis | join cardinality (correctness) | join cardinality | frame schema/values | warehouse data tests | column provenance |
| Setup | **1 import** | a kwarg on every join | author a schema | a dbt project | wrap your frames |
| Catches silent join fan-out | ✅ | ✅ *if you remember it* | ❌ (frame looks valid) | ⚠️ post-hoc | ❌ |
| Compares rows in vs out | ✅ | n/a | ❌ | ❌ | ❌ |
| Zero-config, runtime | ✅ | ❌ (opt-in) | ❌ | ❌ | ⚠️ |
| Points at the exact line | ✅ | ✅ (raises) | n/a | ❌ | ❌ |
| Visual flow diagram | ✅ | ❌ | ❌ | ❌ | ✅ (lineage, not correctness) |

Where rowflow wins: one import, it runs live, it points at the exact line, and it draws the flow.
Where it doesn't: it isn't a schema validator (pandera/GE check column types and value ranges, which
rowflow has no opinion on), and it isn't a lineage or governance tool. Treat it as **complementary**
to these tools, not a replacement.

## Benchmarks & analysis

All figures are produced from live runs by [`tools/make_figures.py`](tools/make_figures.py) and the
KPIs by [`validation/kpi.py`](validation/kpi.py) — no hardcoded numbers
([full results](validation/RESULTS.md)).

**Accuracy.** On a randomized corpus: **100% recall** on realistic explosions, **0% false positives**
across benign join shapes, including those a naive row-count rule gets wrong (1:1, 1:many, many:1,
1:many **left** join, disjoint **outer** union), and **100%** suppression when `validate=` declares
intent.

**See the explosion.** rowflow renders the run as a Sankey (interactive HTML via `rowflow[viz]`); the
static view marks the offending step in red:

<p align="center"><img src="https://raw.githubusercontent.com/Tommasoaiello13/rowflow/main/assets/example_flow.png" alt="rowflow flow: the second merge explodes" width="680"></p>

**Cost.** The O(1) gate keeps the key check off the happy path, so the overhead is the wrapper
bookkeeping alone. That is **sub-millisecond per merge**, a single-digit percentage of merge time at
100k rows, and a smaller share as inputs grow:

<p align="center"><img src="https://raw.githubusercontent.com/Tommasoaiello13/rowflow/main/assets/overhead.png" alt="rowflow overhead vs input size" width="640"></p>

## What it does NOT detect, and why

rowflow flags fan-outs that **actually inflate** the result. Stated plainly:

| Not detected | Why | Use instead |
|---|---|---|
| A many-to-many **masked by row loss** (net rows don't grow) | it stays below the O(1) gate; rowflow targets fan-outs that corrupt totals | an explicit `validate=` on that join |
| Row changes outside `merge` / `join` (`concat`, `dropna`, filtering) | not wrapped yet — usually intended | an assertion on the row count |
| The **intent** behind a many-to-many (a deliberate cross/expand join) | rowflow can't read your intent, so the join is flagged by default | pass `validate="many_to_many"`, or `rowflow.configure(min_fanout_ratio=…)` |
| Non-pandas backends (Modin, cuDF, Polars) | only pandas is patched (a safe no-op elsewhere) | — |
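
The first row of the table is easy to reproduce in plain pandas. In this minimal sketch with invented frames, key 1 is duplicated on both sides, but the inner join also drops four unmatched rows, so the output is smaller than the larger input and a length-only gate has nothing to see. An explicit `validate=` on that join still catches it.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical toy data: key 1 is duplicated on both sides; keys 2..5 have no match.
left = pd.DataFrame({"k": [1, 1, 2, 3, 4, 5], "x": range(6)})
right = pd.DataFrame({"k": [1, 1], "y": ["a", "b"]})

out = left.merge(right, on="k")  # inner join: 2 x 2 rows for key 1, the rest dropped

# Net rows did not grow past the larger input, so the fan-out is masked.
grew = len(out) > max(len(left), len(right))

# Declaring the expected cardinality exposes the duplicated key on the right.
try:
    left.merge(right, on="k", validate="many_to_one")
    caught = False
except MergeError:
    caught = True

print(len(left), "->", len(out), "rows; grew:", grew, "; validate caught it:", caught)
```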

Silent inner-join row **loss** is also detectable, opt-in via `rowflow.configure(detect_loss=True)`
(off by default to keep zero false positives). rowflow is a **coverage-bounded detector of
materialised row-count corruption** — like a passing test, not a proof.
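
For a manual look at inner-join row loss, plain pandas offers `indicator=True`. The sketch below uses invented frames and says nothing about how `detect_loss` is implemented: it counts the rows an inner join drops and uses a left join with an indicator column to show which ones they were.

```python
import pandas as pd

# Hypothetical toy data: customers 3 and 4 are missing from the customers table.
orders = pd.DataFrame({"customer_id": [1, 2, 3, 4], "amount": [10, 20, 30, 40]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Bob"]})

inner = orders.merge(customers, on="customer_id")  # silently drops unmatched orders
rows_lost = len(orders) - len(inner)

# A left join with indicator=True adds a `_merge` column naming each row's origin.
audit = orders.merge(customers, on="customer_id", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]

print(rows_lost, "rows lost; unmatched customer_ids:", unmatched["customer_id"].tolist())
```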

## References

- pandas #2690 — *combinatorial explosion when merging dataframes.*
  <https://github.com/pandas-dev/pandas/issues/2690>
- pandas — `merge` `validate=` parameter.
  <https://pandas.pydata.org/docs/reference/api/pandas.merge.html>
- *Merge, join, concatenate and compare* (pandas user guide).
  <https://pandas.pydata.org/docs/user_guide/merging.html>
- datalineagepy — column-level pandas lineage (a different axis: provenance, not correctness).
  <https://pypi.org/project/datalineagepy/>

## Contributing & contact

Issues and pull requests are very welcome — start with [CONTRIBUTING.md](CONTRIBUTING.md) and the
[Code of Conduct](CODE_OF_CONDUCT.md). Good places to start are observers for `concat` / `dropna` /
filtering, an opt-in strict key-scan mode (to catch the masked-fan-out boundary), or richer Sankey
rendering. And if rowflow ever misses an explosion it should have caught — or fires on a join that's
actually fine — please open an issue with a small reproducer; those are the reports I value most. You
can also reach me on **LinkedIn**.

## License

[MIT](LICENSE) © 2026 Tommaso Aiello — free to use, modify, and distribute (including commercially);
keep the copyright notice; provided "as is", without warranty.
