Metadata-Version: 2.4
Name: pandas_diff
Version: 2.0.0
Summary: Generate event logs of changes between two pandas DataFrames.
Project-URL: Homepage, https://github.com/jaimevalero/pandas_diff
Project-URL: Repository, https://github.com/jaimevalero/pandas_diff
Project-URL: Issues, https://github.com/jaimevalero/pandas_diff/issues
Author-email: Jaime Valero <jaimevalero78@gmail.com>
License-Expression: MIT
License-File: AUTHORS.rst
License-File: LICENSE
Keywords: changelog,dataframe,diff,event-log,pandas
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Requires-Dist: click>=7.0
Requires-Dist: pandas>=1.5
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Provides-Extra: parquet
Requires-Dist: pyarrow>=10.0; extra == 'parquet'
Description-Content-Type: text/markdown

# pandas_diff

[![CI](https://github.com/jaimevalero/pandas_diff/actions/workflows/python-test.yml/badge.svg)](https://github.com/jaimevalero/pandas_diff/actions/workflows/python-test.yml)
[![PyPI](https://img.shields.io/pypi/v/pandas_diff)](https://pypi.org/project/pandas_diff/)
[![Python](https://img.shields.io/pypi/pyversions/pandas_diff)](https://pypi.org/project/pandas_diff/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Generate event logs of row-level changes between two pandas DataFrames.

pandas_diff is not a statistical comparison tool. It tells you **what changed**: which rows were created, deleted, or modified, and exactly which fields changed.

## Installation

```bash
pip install pandas_diff

# With Parquet support
pip install pandas_diff[parquet]
```

## Quick start

```python
import pandas as pd
from pandas_diff import get_diffs

before = pd.DataFrame([
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
])
after = pd.DataFrame([
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
])

df = get_diffs(before, after, keys="hero")
```

The resulting `df` (the `object_json` column is omitted here):

| operation | object_keys | object_values  | attribute_changed | old_value | new_value |
|-----------|-------------|----------------|-------------------|-----------|-----------|
| create    | [hero]      | captain marvel |                   |           |           |
| delete    | [hero]      | black_widow    |                   |           |           |
| modify    | [hero]      | hulk           | power             | strength  | smart     |
| modify    | [hero]      | thor           | hammers           | 0         | 2         |

## CLI

```bash
pandas_diff before.csv after.csv --keys id
pandas_diff old.parquet new.parquet --keys name,date --format json
pandas_diff a.csv b.csv --keys id --ignore updated_at -o diff.csv
```

Supported file formats: CSV, JSON (flat records), Parquet.

## Use cases

- **Batch to event-driven migration**: Detect changes between pipeline runs and stream them to Kafka.
- **Audit event logs**: Track how resources change over time.
- **Data reconciliation**: Compare a CMDB against the real state of infrastructure.
- **Environment sync**: Propagate changes between production and disaster recovery.
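
As a rough sketch of the first use case, each diff row can be serialized into a JSON message before it is handed to a producer. The helper name `to_event_messages` is hypothetical and not part of pandas_diff. The `records` input is assumed to be the list of dicts you would get from calling pandas' `df.to_dict("records")` on the `get_diffs` result, here written by hand to match two rows of the Quick start table (without `object_json`). The producer itself is left out.

```python
import json
from typing import Iterable, Iterator


def to_event_messages(records: Iterable[dict]) -> Iterator[str]:
    """Yield one JSON string per diff row (hypothetical helper, not part of pandas_diff)."""
    for row in records:
        # Drop fields that are None or empty so create/delete events stay compact.
        event = {k: v for k, v in row.items() if v not in (None, "")}
        # default=str keeps values that json cannot encode natively (e.g. timestamps) serializable.
        yield json.dumps(event, sort_keys=True, default=str)


# Hand-written rows shaped like the Quick start output (assumption: no object_json column).
records = [
    {"operation": "create", "object_keys": ["hero"], "object_values": "captain marvel",
     "attribute_changed": None, "old_value": None, "new_value": None},
    {"operation": "modify", "object_keys": ["hero"], "object_values": "hulk",
     "attribute_changed": "power", "old_value": "strength", "new_value": "smart"},
]

messages = list(to_event_messages(records))
for message in messages:
    print(message)  # hand each message to your producer here instead of printing
```

Keeping serialization separate from transport makes the same messages usable for an audit log file or a message broker.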

## API

```python
get_diffs(
    before: pd.DataFrame,      # Previous state
    after: pd.DataFrame,        # Current state
    keys: list[str] | str,      # Column(s) identifying each row
    ignore_columns: list[str],  # Columns to skip (optional)
) -> pd.DataFrame
```

Returns a DataFrame with columns: `operation`, `object_keys`, `object_values`, `object_json`, `attribute_changed`, `old_value`, `new_value`.
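
To make these columns concrete, the following pure-Python sketch mimics the create/delete/modify semantics on lists of dicts. It is an illustration under assumptions, not the library's implementation: `sketch_diffs` is a hypothetical name, it handles a single key column only, and it leaves out `object_json`, whose exact contents are not described here.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def sketch_diffs(before: List[Row], after: List[Row], key: str) -> List[Row]:
    """Illustrative row-level diff on a single key column (not pandas_diff itself)."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    events: List[Row] = []

    def event(operation, value, attribute=None, old_value=None, new_value=None):
        return {
            "operation": operation,
            "object_keys": [key],
            "object_values": value,
            "attribute_changed": attribute,
            "old_value": old_value,
            "new_value": new_value,
        }

    # Keys present only in `after` are creations.
    for value in sorted(new.keys() - old.keys()):
        events.append(event("create", value))
    # Keys present only in `before` are deletions.
    for value in sorted(old.keys() - new.keys()):
        events.append(event("delete", value))
    # Keys present in both: emit one event per field whose value differs.
    for value in sorted(old.keys() & new.keys()):
        fields = sorted((set(old[value]) | set(new[value])) - {key})
        for field in fields:
            was, now = old[value].get(field), new[value].get(field)
            if was != now:
                events.append(event("modify", value, field, was, now))
    return events


before = [
    {"hero": "hulk", "power": "strength"},
    {"hero": "black_widow", "power": "spy"},
    {"hero": "thor", "hammers": 0},
]
after = [
    {"hero": "hulk", "power": "smart"},
    {"hero": "captain marvel", "power": "strength"},
    {"hero": "thor", "hammers": 2},
]

events = sketch_diffs(before, after, key="hero")
for e in events:
    print(e["operation"], e["object_values"], e["attribute_changed"], e["old_value"], e["new_value"])
```

On the Quick start data this produces one create, one delete, and two modify events, one per changed field.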

## License

MIT
