Metadata-Version: 2.4
Name: dift-cli
Version: 0.2.1
Summary: Git diff for datasets: compare datasets and understand what changed.
Author-email: Reginald Erzoah <reginalderzoah10@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/ReginaldErzoah/Dift
Project-URL: Repository, https://github.com/ReginaldErzoah/Dift
Project-URL: Issues, https://github.com/ReginaldErzoah/Dift/issues
Keywords: data,dataset,diff,data-quality,validation,etl,mlops,analytics,data-engineering,data-diff,data-validation,dataset-comparison
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.2.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.7.0
Requires-Dist: pydantic>=2.7.0
Requires-Dist: fastexcel>=0.12.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Requires-Dist: mkdocs>=1.6.0; extra == "dev"
Requires-Dist: pre-commit>=3.7.0; extra == "dev"
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Dynamic: license-file

# Dift

<p align="left">
  <img src="assets/dift-logo.png" width="400" alt="Dift Logo">
</p>

Dift is an open-source CLI tool that helps data professionals compare two datasets and instantly understand:

- what changed  
- why it matters  
- whether the new data is safe to trust  

---

## What's New in v0.2.1

Dift v0.2.1 introduces a more polished CLI experience and broader file support.

### Improvements

- Better console formatting
- Rich terminal colors
- Cleaner summary tables
- Risk level highlighting
- Percentage row change display
- Better missing file error messages
- JSON dataset support
- JSON example datasets
- Excel example datasets
- Parquet example datasets
- Improved installation instructions

---

## Why Dift?

Bad data breaks:

- dashboards  
- reports  
- ETL pipelines  
- analytics workflows  
- ML models  
- business decisions  

Dift helps teams catch risky data changes **before they cause damage**.

---

## Features (v0.2.1)

Compare two datasets in seconds.

### Supported Formats

- CSV
- Parquet
- Excel (`.xlsx`, `.xls`)
- JSON

### Detect Changes

- Schema diff
- Row count diff
- Added rows
- Removed rows
- Changed rows (with key column)
- Column type changes
- Null spikes
- Duplicate increases
- Numeric stats diff
- Categorical value changes
- Risk scoring (`low`, `medium`, `high`)
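
To make the schema and row checks above concrete, here is a minimal sketch of a schema diff and a keyed row diff in plain Python. It illustrates the idea only and is not Dift's actual implementation; the helper names `schema_diff` and `keyed_row_diff` and the sample rows are assumptions made for this example.

```python
from typing import Any

Row = dict[str, Any]


def schema_diff(old_cols: list[str], new_cols: list[str]) -> dict[str, list[str]]:
    """Columns present on only one side, reported in their original order."""
    return {
        "added": [c for c in new_cols if c not in old_cols],
        "removed": [c for c in old_cols if c not in new_cols],
    }


def keyed_row_diff(old: list[Row], new: list[Row], key: str) -> dict[str, list[Any]]:
    """Classify rows as added, removed, or changed by matching on `key`."""
    # Index each dataset by its key; assumes key values are unique per dataset.
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    return {
        "added": [k for k in new_by_key if k not in old_by_key],
        "removed": [k for k in old_by_key if k not in new_by_key],
        "changed": [
            k for k in old_by_key
            if k in new_by_key and old_by_key[k] != new_by_key[k]
        ],
    }


# Hypothetical sample data, keyed on customer_id.
old_rows = [
    {"customer_id": 1, "revenue": 100.0},
    {"customer_id": 2, "revenue": 250.0},
    {"customer_id": 3, "revenue": 80.0},
]
new_rows = [
    {"customer_id": 1, "revenue": 100.0},
    {"customer_id": 2, "revenue": 300.0},
    {"customer_id": 4, "revenue": 40.0},
]

print(keyed_row_diff(old_rows, new_rows, key="customer_id"))
# {'added': [4], 'removed': [3], 'changed': [2]}
```

Without a key column there is no reliable way to pair an old row with its new version, which is why changed-row detection is listed as requiring a key.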

### Output

- Rich CLI report
- JSON report export

---

## Requirements

- Python 3.10+

---

## Quick Install

```bash
pip install dift-cli
```

Then run:

```bash
dift --help
```

---

## Cross-Platform Setup

### Windows (Git Bash)

```bash
python -m venv .venv
source .venv/Scripts/activate
pip install dift-cli
```

### Windows (PowerShell)

```powershell
python -m venv .venv
.venv\Scripts\Activate.ps1
pip install dift-cli
```

### Mac / Linux

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install dift-cli
```

### pipx (Recommended for CLI Tools)

```bash
pipx install dift-cli
```

If `pipx` is not installed:

```bash
python -m pip install pipx
python -m pipx ensurepath
```

---

## Verify Install

```bash
dift --help
```

or

```bash
python -m dift.cli --help
```

---

## Upgrade Later

```bash
pip install --upgrade dift-cli
```

---

## If Command Not Found

Restart your terminal so it picks up the updated `PATH`, or run Dift as a module:

```bash
python -m dift.cli --help
```

---

## Quick Start

### Compare CSV Files

```bash
dift examples/old.csv examples/new.csv --key customer_id
```

### Compare Parquet Files

```bash
dift examples/old.parquet examples/new.parquet --key customer_id
```

### Compare Excel Files

```bash
dift examples/old.xlsx examples/new.xlsx --key customer_id
```

### Compare JSON Files

```bash
dift examples/old.json examples/new.json --key customer_id
```

### Generate JSON Report

```bash
dift examples/old.csv examples/new.csv --key customer_id --report json --output report.json
```

---

## Example Output

```text
╭─────────────────────────╮
│ Dift Dataset Comparison │
│ Risk Level: HIGH        │
╰─────────────────────────╯

Summary
Rows old: 10
Rows new: 11
Row delta: +1
Row change %: +10.00%

Warnings:
Nulls increased in revenue by 9.09%
```
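
The figures in this sample can be reproduced with simple arithmetic. The sketch below shows one reading that is consistent with the numbers: a row change of (11 - 10) / 10 and a null rate that rises from 0 of 10 rows to 1 of 11. How Dift actually computes the null figure is not documented here, so treat the formulas, the helper names, and the sample columns as assumptions.

```python
def pct_change(old_count: int, new_count: int) -> float:
    """Relative change from old_count to new_count, in percent."""
    # Assumes old_count is non-zero; an empty old dataset needs special handling.
    return (new_count - old_count) / old_count * 100


def null_rate(values: list) -> float:
    """Share of entries that are None, in percent (assumes a non-empty list)."""
    return sum(v is None for v in values) / len(values) * 100


# Hypothetical revenue columns sized to match the summary above.
old_revenue = [100.0] * 10           # 10 rows, no nulls
new_revenue = [100.0] * 10 + [None]  # 11 rows, one null

row_change = pct_change(len(old_revenue), len(new_revenue))
null_increase = null_rate(new_revenue) - null_rate(old_revenue)

print(f"Row change %: {row_change:+.2f}%")
print(f"Null rate increase: {null_increase:.2f} percentage points")
```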

---

## Example Files Included

```text
examples/
├── old.csv
├── new.csv
├── old.parquet
├── new.parquet
├── old.xlsx
├── new.xlsx
├── old.json
└── new.json
```

Use them to test instantly.

---

## Example Use Cases

### ETL Validation

```bash
dift before.csv after.csv
```

### Daily Snapshot Checks

```bash
dift yesterday.parquet today.parquet
```

### Excel File Audits

```bash
dift old.xlsx new.xlsx --key id
```

### JSON API Export Checks

```bash
dift old.json new.json --key id
```

### Production vs Staging

```bash
dift prod.csv staging.csv --key id
```

### ML Dataset Drift Checks

```bash
dift train_v1.csv train_v2.csv
```
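
For pipelines written in Python, the commands above can be assembled and launched with the standard library. This is a sketch under assumptions: `build_dift_command` and `run_dift` are hypothetical helpers, not part of Dift, and they use only the flags shown in this README (`--key`, `--report`, `--output`). Nothing here assumes any particular exit-code behavior from `dift`.

```python
import subprocess
from typing import Optional


def build_dift_command(
    old: str,
    new: str,
    key: Optional[str] = None,
    report: Optional[str] = None,
    output: Optional[str] = None,
) -> list[str]:
    """Build the argument list for a `dift` invocation."""
    cmd = ["dift", old, new]
    if key is not None:
        cmd += ["--key", key]
    if report is not None:
        cmd += ["--report", report]
    if output is not None:
        cmd += ["--output", output]
    return cmd


def run_dift(old: str, new: str, **options: Optional[str]) -> subprocess.CompletedProcess:
    """Run dift and capture its console output (requires dift on PATH)."""
    return subprocess.run(
        build_dift_command(old, new, **options),
        capture_output=True,
        text=True,
    )


# Building the command does not require dift to be installed.
print(build_dift_command("prod.csv", "staging.csv", key="id"))
# ['dift', 'prod.csv', 'staging.csv', '--key', 'id']
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.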

---

## Project Structure

```text
dift/
├── cli.py
├── core/
│   ├── comparator.py
│   ├── schema_diff.py
│   ├── row_diff.py
│   ├── quality_diff.py
│   ├── risk.py
│   └── stats_diff.py
├── io/
│   └── readers.py
├── reports/
│   ├── console_report.py
│   ├── json_report.py
│   └── models.py
└── utils/

tests/
examples/
```

---

## Run Tests

```bash
pytest
```

Lint code:

```bash
ruff check .
```

---

## Roadmap

### v0.3.0: Report Exports

* HTML report export
* CSV summary export
* Excel report export
* Better JSON report structure
* Report templates
* `--output-dir`

### v0.4.0

* Improve null spike detection
* Improve duplicate detection

### v0.5.0

* Outlier detection
* Numeric drift thresholds
* Categorical shift warnings
* Better risk scoring

### v0.6.0

* SQL database support
* Postgres connector

### v0.7.0

* Snowflake connector
* BigQuery connector

### v0.8.0

* CI/CD fail checks
* dbt integration

### v0.9.0

* Drift alerts
* Python API
* Plugin system

### v1.0.0

* Stable CLI
* Stable Python API
* Full test coverage
* Full docs site
* Benchmarks
* Security review
* Production-ready install

---

## Contributing

Contributions are welcome. Please read `CONTRIBUTING.md` first.

Ways to help:

* Fix bugs
* Improve docs
* Add tests
* Improve performance
* Add connectors
* Improve CLI UX

---

## License

MIT License

---

## Vision

Dift aims to become the standard open-source tool for dataset comparison and trust checks.

**If Git has `git diff`, data teams should have `dift`.**
