Metadata-Version: 2.4
Name: dfdoctor
Version: 0.3.0
Summary: Audit messy DataFrames, auto-fix issues, and run five-method correlation analysis — zero dependencies beyond pandas.
Author-email: Ajay Ramineni <ajayvarmaramineni1128@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/ajayvarmaramineni/dfdoctor
Project-URL: Repository, https://github.com/ajayvarmaramineni/dfdoctor
Project-URL: Issues, https://github.com/ajayvarmaramineni/dfdoctor/issues
Project-URL: Changelog, https://github.com/ajayvarmaramineni/dfdoctor/releases
Keywords: data,pandas,dataframe,data-quality,data-cleaning,eda,audit,profiling,correlation,outlier,visualization,missing-values
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == "excel"
Dynamic: license-file

# dfdoctor 🩺

> **The data quality library that doesn't just tell you what's wrong — it tells you what to do next, fixes it for you, and shows you the full picture.**

[![PyPI version](https://img.shields.io/pypi/v/dfdoctor.svg)](https://pypi.org/project/dfdoctor/)
[![Python](https://img.shields.io/pypi/pyversions/dfdoctor.svg)](https://pypi.org/project/dfdoctor/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Tests](https://github.com/ajayvarmaramineni/dfdoctor/actions/workflows/tests.yml/badge.svg)](https://github.com/ajayvarmaramineni/dfdoctor/actions)

---

## What is dfdoctor?

`dfdoctor` is a lightweight, **zero-dependency** Python library (beyond pandas) that audits a messy DataFrame and gives you:

- **Plain-English explanations** of every issue found
- **Priority scores** so you know what to fix first
- **Automatic fixes** with a full before/after comparison
- **Five correlation methods** — Pearson, Spearman, Kendall τ, Cramér's V, and Phi-k — all from scratch, no scipy
- **ASCII charts** in the terminal and **SVG heatmaps** in the HTML report
- A **CLI tool** so you can audit any CSV without writing a line of code

Most data tools give you statistics. `dfdoctor` gives you a treatment plan.

```python
from dfdoctor import audit

report = audit(df)
report.pretty_print()
```

```
════════════════════════════════════════════════════════════════════
  dfdoctor — Dataset Audit Report
════════════════════════════════════════════════════════════════════
  Rows        : 10,000
  Columns     : 12
  Duplicates  : 215
  Memory      : 2.4 MB

  Issues Found: 8  (high: 3, medium: 4, low: 1)
════════════════════════════════════════════════════════════════════

  !!! [signup_date]  (HIGH)
      'signup_date' looks like a date column stored as text (98% parseable).
      Why it matters : Dates as strings block time-series ops and sorting.
      Recommendation : df['signup_date'] = pd.to_datetime(df['signup_date'])
      Safe auto-fix  : No   |  Score: 2.55
  ────────────────────────────────────────────────────────────────
  ...
```

---

## Installation

```bash
pip install dfdoctor
```

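**Requirements:** Python 3.9+ · pandas 1.3+ · no other required dependencies. Reading `.xlsx` files additionally needs `openpyxl`, which the optional `excel` extra installs (`pip install "dfdoctor[excel]"`).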

To install with development tools:

```bash
pip install "dfdoctor[dev]"
```

---

## Five-minute quick start

```python
import pandas as pd
from dfdoctor import audit, auto_fix, compare, correlate

df = pd.read_csv("your_data.csv")

# 1. Audit — find every issue
report = audit(df)
report.pretty_print()

# 2. Fix — apply safe fixes automatically
cleaned, log = auto_fix(df)
print(log)   # ["Dropped all-null column 'notes'", "Converted 'revenue' to numeric", ...]

# 3. Compare — see exactly what changed
compare(df, cleaned).pretty_print()

# 4. Correlate — full five-method correlation analysis
corr = correlate(df)          # or: report.correlations()
corr.pretty_print()

# 5. Visualise — ASCII charts in terminal, HTML report with SVG heatmaps
report.plot()
report.to_html("audit_report.html")
```

---

## Feature walkthrough

### Audit

`audit(df)` runs every detector and returns an `AuditReport` with all issues ranked by priority.

```python
from dfdoctor import audit

report = audit(df)

report.pretty_print()          # formatted terminal output
report.summary()               # dict: row/col/dup/memory counts + issue totals
report.high_priority()         # list of HIGH severity issues
report.sorted_by_priority()    # all issues, highest score first
report.by_column("revenue")    # issues for one specific column
report.to_dict()               # full JSON-serialisable output
```

---

### Auto-fix

`auto_fix()` applies safe, reversible fixes automatically. Risky fixes (like deduplication) are opt-in with `safe_only=False`.

```python
from dfdoctor import auto_fix, compare

# Safe fixes only (default)
cleaned, log = auto_fix(df)

# All fixes including risky ones
cleaned, log = auto_fix(df, safe_only=False)

# See a full before/after breakdown
compare(df, cleaned).pretty_print()
```

**Safe fixes applied automatically:**

| Issue | Fix |
|---|---|
| All-null column | Drop the column |
| Constant column | Drop the column |
| Numeric stored as string | `pd.to_numeric()` |
| Suspected identifier | Cast to string |

**Risky fixes (opt-in with `safe_only=False`):**

| Issue | Fix |
|---|---|
| Duplicate rows | `drop_duplicates()` |
| Suspicious placeholders | Replace with `pd.NA` |
| Date stored as string | `pd.to_datetime()` |

---

### Correlation analysis

`correlate(df)` computes all five correlation methods with **zero extra dependencies**: no scipy, no statsmodels, nothing beyond pandas and numpy (which pandas already installs).

```python
from dfdoctor import correlate

corr = correlate(df)          # or: report.correlations()
corr.pretty_print()
```

```
════════════════════════════════════════════════════════════════════
  dfdoctor — Correlation Report
════════════════════════════════════════════════════════════════════

  Pearson r  (numeric × numeric, linear) — 4 cols
  ─────────────────────────────────────────────
            revenue    spend    visits    age
  revenue      1.00     0.87      0.43  -0.12
  spend        0.87     1.00      0.51  -0.09
  ...

  Phi-k  (ALL columns × ALL columns, 0=none 1=perfect) — 8 cols
  ─────────────────────────────────────────────────────────────
  ...covers numeric AND categorical columns in one matrix...

  Top Correlated Pairs  (|value| ≥ 0.4):
  ────────────────────────────────────────
  revenue          × spend           [pearson ]  ████████████████████  +0.870  (strong)
  country          × region          [cramers_v]  █████████████░░░░░░░  +0.650  (strong)
```

| Method | Type | Range | What it measures |
|---|---|---|---|
| **Pearson r** | Numeric × Numeric | −1 … +1 | Linear relationship |
| **Spearman ρ** | Numeric × Numeric | −1 … +1 | Rank-order relationship |
| **Kendall τ** | Numeric × Numeric | −1 … +1 | Concordance (robust on ties) |
| **Cramér's V** | Categorical × Categorical | 0 … 1 | Association strength |
| **Phi-k** | **Any × Any** | 0 … 1 | Universal association (numeric bins → Cramér's V) |
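
To make the last two rows concrete, here is a minimal from-scratch sketch of Cramér's V, plus the binning step the table describes for numeric columns. It is an illustration under assumptions, not dfdoctor's code: the function names, the equal-width binning, and the bin count are invented for the example, and no bias correction is applied.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Cramér's V for two equal-length sequences of category labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals))
    if n == 0 or k < 2:
        return 0.0  # undefined when either side has a single level
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def equal_width_bins(values, bins=4):
    """Map numbers to bin indices so a numeric column can be treated as categorical."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid dividing by zero on constant input
    return [min(int((v - lo) / width), bins - 1) for v in values]


country = ["US", "US", "DE", "DE", "FR", "FR"]
region = ["NA", "NA", "EU", "EU", "EU", "EU"]
print(round(cramers_v(country, region), 3))

# Numeric vs categorical: bin the numeric side first, then reuse cramers_v.
spend = [1, 2, 3, 10, 11, 12]
tier = ["low", "low", "low", "high", "high", "high"]
print(round(cramers_v(equal_width_bins(spend, bins=2), tier), 3))
```

Because both inputs end up categorical, the result always falls in the 0 to 1 range and carries no sign, which is why the table lists no negative values for these two methods.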

---

### Exploratory data analysis

```python
from dfdoctor import quick_eda

insights = quick_eda(df, target="churn")

insights.pretty_print()

print(insights.high_missing_columns)   # [{"column": "notes", "missing_pct": 0.82}, ...]
print(insights.skewed_columns)         # ["revenue", "session_length"]
print(insights.strong_correlations)    # [("revenue", "spend", 0.87)]
print(insights.target_correlations)    # {"revenue": 0.61, "spend": 0.55, ...}
print(insights.top_findings)           # plain-English list of key insights
```

---

### Visualisations

Two output modes, zero extra dependencies:

#### Terminal — ASCII bar charts

```python
report.plot()
# or standalone:
from dfdoctor import plot_ascii
plot_ascii(report)
```

```
════════════════════════════════════════════════════════════════════
  dfdoctor — Visualizations
════════════════════════════════════════════════════════════════════

  Issue Severity Breakdown
  ────────────────────────────────────────────────────
  HIGH        █████████████████████░░░░░░░   3.0
  MEDIUM      ████████████████████████████   4.0
  LOW         ███████░░░░░░░░░░░░░░░░░░░░░   1.0

  Top Missing-Value Columns  (% missing)
  ────────────────────────────────────────────────────
  notes             ████████████████████████████  82.0%
  phone             ████████░░░░░░░░░░░░░░░░░░░░  23.0%
```
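
One simple way to draw bars like these is to scale each value against the largest one in the chart. The sketch below shows that idea; the function name `ascii_bar`, the 28-character width, and the rounding rule are assumptions read off the sample output, not dfdoctor's actual rendering code.

```python
def ascii_bar(value, max_value, width=28):
    """Return a fixed-width bar whose filled part is proportional to value / max_value."""
    if max_value <= 0:
        ratio = 0.0
    else:
        # Clamp so out-of-range inputs never produce a bar of the wrong length.
        ratio = max(0.0, min(value / max_value, 1.0))
    filled = round(width * ratio)
    return "█" * filled + "░" * (width - filled)


missing_pct = {"notes": 82.0, "phone": 23.0}
top = max(missing_pct.values())
for column, pct in missing_pct.items():
    print(f"{column:<16}  {ascii_bar(pct, top)}  {pct:.1f}%")
```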

#### HTML — self-contained report with SVG charts

```python
report.to_html("audit_report.html")   # write to file
html = report.to_html()               # get string
```

The HTML report includes a stats dashboard, colour-coded issue table, SVG bar charts, and five interactive SVG correlation heatmaps — all in a single self-contained file with no external assets.
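
For a sense of how a heatmap can live inside a single HTML file with no external assets, the sketch below turns a dict-of-dicts matrix (the shape documented for `corr.pearson_matrix`) into an inline SVG string. The function name `heatmap_svg`, the colours, and the use of `<title>` tooltips are assumptions for illustration; dfdoctor's own charts may be built differently.

```python
from html import escape


def heatmap_svg(matrix, cell=40):
    """Render a dict-of-dicts correlation matrix as a standalone SVG string."""
    labels = list(matrix)
    size = cell * len(labels)
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for i, row in enumerate(labels):
        for j, col in enumerate(labels):
            value = matrix[row].get(col, 0.0)
            # Assumed colour scale: red for positive, blue for negative,
            # with opacity tracking the absolute strength.
            colour = "#c0392b" if value >= 0 else "#2980b9"
            parts.append(
                f'<rect x="{j * cell}" y="{i * cell}" width="{cell}" height="{cell}" '
                f'fill="{colour}" fill-opacity="{abs(value):.2f}">'
                f"<title>{escape(row)} × {escape(col)}: {value:+.2f}</title></rect>"
            )
    parts.append("</svg>")
    return "".join(parts)


pearson = {
    "revenue": {"revenue": 1.0, "spend": 0.87},
    "spend": {"revenue": 0.87, "spend": 1.0},
}
svg = heatmap_svg(pearson)
print(svg[:60])
```

The returned string can be pasted directly into an HTML document, and browsers show each `<title>` as a hover tooltip.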

---

### Command-line interface

No Python code required. Audit any CSV, TSV, or Excel file directly from your terminal:

```bash
# Audit a file
dfdoctor audit data.csv

# Auto-fix and save
dfdoctor fix data.csv --output cleaned.csv

# Apply all fixes (including risky)
dfdoctor fix data.csv --all --output cleaned.csv

# Generate HTML report
dfdoctor html data.csv --output report.html
```
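
The commands above map naturally onto `argparse` subcommands. The parser below is a hypothetical sketch of that layout, covering only the subcommands and flags shown in the examples; the real `cli.py` may define additional options, defaults, or validation.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the commands shown above."""
    parser = argparse.ArgumentParser(prog="dfdoctor")
    sub = parser.add_subparsers(dest="command", required=True)

    audit_cmd = sub.add_parser("audit", help="print an audit report")
    audit_cmd.add_argument("path")

    fix_cmd = sub.add_parser("fix", help="apply fixes and save the result")
    fix_cmd.add_argument("path")
    fix_cmd.add_argument("--all", action="store_true", help="include risky fixes")
    fix_cmd.add_argument("--output")  # whether this is mandatory is not documented

    html_cmd = sub.add_parser("html", help="write an HTML report")
    html_cmd.add_argument("path")
    html_cmd.add_argument("--output")
    return parser


args = build_parser().parse_args(["fix", "data.csv", "--all", "--output", "cleaned.csv"])
print(args.command, args.path, args.all, args.output)
```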

---

### Reading files

```python
from dfdoctor import read_file

df = read_file("data.csv")       # CSV
df = read_file("data.tsv")       # TSV
df = read_file("data.xlsx")      # Excel (requires openpyxl)
df = read_file("data.xls")       # Legacy Excel
```

---

### Prioritised cleaning suggestions

```python
from dfdoctor import suggest_cleaning

suggestions = suggest_cleaning(df)   # sorted by priority score, highest first

for issue in suggestions:
    print(f"[{issue.severity.upper()}]  {issue.message}")
    print(f"  → {issue.recommendation}\n")
```

---

## What dfdoctor detects

| Issue | Severity | Auto-fixable |
|---|---|---|
| All-null column | HIGH | ✅ Safe |
| High missing values (≥ 50%) | HIGH | — |
| Moderate missing values (≥ 20% and < 50%) | MEDIUM | — |
| Duplicate rows | HIGH | ⚠️ Risky |
| Constant column (one unique value) | MEDIUM | ✅ Safe |
| Near-constant column (≥ 95% one value) | MEDIUM | — |
| Numeric stored as string | MEDIUM | ✅ Safe |
| Date column stored as string | HIGH | ⚠️ Risky |
| Mixed date formats in one column | HIGH | — |
| Suspected identifier column | MEDIUM | ✅ Safe |
| High-cardinality categorical | MEDIUM | — |
| Inconsistent category labels (e.g. "US" vs "U.S.") | MEDIUM | — |
| Suspicious placeholder values ("NA", "?", "unknown") | LOW | ⚠️ Risky |
| Statistical outliers (IQR method) | MEDIUM | — |

---

## Priority scoring

Every issue gets a numeric score so you know what to tackle first — no more guessing:

```
priority_score = severity_weight × confidence × impact
```

| Component | Description |
|---|---|
| `severity_weight` | HIGH = 3, MEDIUM = 2, LOW = 1 |
| `confidence` | 0.0 – 1.0 — how certain the rule is |
| `impact` | 0.0 – 1.0 — how much this affects downstream analysis |

```python
for issue in report.sorted_by_priority():
    print(f"{issue.priority_score:.2f}  [{issue.severity}]  {issue.column}  —  {issue.message}")
```
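
The formula is simple enough to check by hand. The sketch below recomputes it for a few made-up issues and ranks them; the confidence and impact values are invented for illustration, and `priority_score` here is a standalone function rather than anything imported from dfdoctor.

```python
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}


def priority_score(severity, confidence, impact):
    """severity_weight × confidence × impact, following the formula above."""
    return SEVERITY_WEIGHT[severity] * confidence * impact


# Illustrative (made-up) inputs: (label, severity, confidence, impact)
candidates = [
    ("placeholder values", "low", 0.90, 0.50),
    ("date stored as text", "high", 0.85, 1.00),
    ("numeric stored as text", "medium", 0.95, 0.80),
]
ranked = sorted(candidates, key=lambda c: priority_score(*c[1:]), reverse=True)
for label, severity, confidence, impact in ranked:
    print(f"{priority_score(severity, confidence, impact):.2f}  [{severity}]  {label}")
```

Since confidence and impact are both capped at 1.0, a HIGH issue can score at most 3, and a LOW issue can never outrank a HIGH one whose confidence × impact exceeds one third.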

---

## The `Issue` object

Every issue returned by `audit()` or `suggest_cleaning()` is an `Issue` dataclass:

| Field | Type | Description |
|---|---|---|
| `column` | `str \| None` | Column the issue belongs to (`None` = dataset-level) |
| `issue_type` | `str` | Machine-readable key, e.g. `"date_as_string"` |
| `severity` | `str` | `"high"`, `"medium"`, or `"low"` |
| `confidence` | `float` | 0.0 – 1.0 |
| `impact` | `float` | 0.0 – 1.0 |
| `message` | `str` | Plain-English description |
| `why_it_matters` | `str` | Why this issue is a problem |
| `recommendation` | `str` | Exact code fix to apply |
| `safe_to_auto_fix` | `bool` | Whether `auto_fix()` will apply this by default |
| `priority_score` | `float` | `severity_weight × confidence × impact` |

---

## Why zero dependencies?

Most alternative libraries (`ydata-profiling`, `sweetviz`, `dataprep`) pull in heavy dependencies such as matplotlib, scipy, and seaborn. `dfdoctor` requires only pandas, which you already have.

This means:
- Works in any environment: CI/CD pipelines, serverless functions, Docker containers, Jupyter, Colab, bare scripts
- Installs in seconds with no dependency conflicts
- Five correlation methods including Kendall τ and Phi-k — all implemented with pure numpy, no scipy required

---

## Project structure

```
dfdoctor/
├── src/
│   └── dfdoctor/
│       ├── __init__.py        # public API exports
│       ├── audit.py           # main audit() function
│       ├── types.py           # AuditReport, Issue dataclasses
│       ├── suggest.py         # suggest_cleaning()
│       ├── eda.py             # quick_eda(), EDAReport
│       ├── fix.py             # auto_fix()
│       ├── compare.py         # compare(), CompareReport
│       ├── correlations.py    # correlate(), five methods, zero-dep
│       ├── viz.py             # plot_ascii(), SVG chart generators
│       ├── cli.py             # dfdoctor CLI (argparse)
│       ├── utils.py           # read_file(), memory helpers
│       └── rules/
│           ├── missing.py     # null / high-missing detection
│           ├── duplicates.py  # duplicate row detection
│           ├── datatypes.py   # numeric-as-string, type inference
│           ├── identifiers.py # suspected ID column detection
│           ├── dates.py       # date-as-string, mixed formats
│           ├── cardinality.py # high-cardinality categoricals
│           ├── categories.py  # inconsistent labels, placeholders
│           └── outliers.py    # IQR-based outlier detection
├── tests/                     # 132 tests, 0 warnings
├── demo/
│   ├── messy_sales_data.csv   # example messy dataset (215 rows)
│   └── run_demo.py            # full end-to-end demo script
├── pyproject.toml
├── LICENSE
└── README.md
```

---

## API reference

### `audit(df) → AuditReport`

```python
report = audit(df)
report.pretty_print()
report.summary()               # → dict
report.to_dict()               # → dict (JSON-serialisable)
report.high_priority()         # → list[Issue]
report.sorted_by_priority()    # → list[Issue]
report.by_column("col")        # → list[Issue]
report.plot()                  # print ASCII charts
report.to_html("report.html")  # save HTML report
report.correlations()          # → CorrelationReport
```

### `auto_fix(df, safe_only=True) → tuple[DataFrame, list[str]]`

```python
cleaned, log = auto_fix(df)                  # safe fixes only
cleaned, log = auto_fix(df, safe_only=False) # all fixes
```

### `compare(df_before, df_after) → CompareReport`

```python
rep = compare(df, cleaned)
rep.pretty_print()
rep.to_dict()
```

### `correlate(df) → CorrelationReport`

```python
corr = correlate(df)
corr.pearson_matrix     # dict[str, dict[str, float]]
corr.spearman_matrix    # dict[str, dict[str, float]]
corr.kendall_matrix     # dict[str, dict[str, float]]
corr.cramers_matrix     # dict[str, dict[str, float]]
corr.phik_matrix        # dict[str, dict[str, float]] — ALL column pairs
corr.top_pairs          # list[CorrelationPair], sorted by |value|
corr.pretty_print()
corr.to_dict()
```

### `quick_eda(df, target=None) → EDAReport`

```python
insights = quick_eda(df, target="churn")
insights.pretty_print()
insights.top_findings           # list[str]
insights.strong_correlations    # list[tuple]
insights.skewed_columns         # list[str]
insights.high_missing_columns   # list[dict]
insights.target_correlations    # dict[str, float]
```

### `suggest_cleaning(df) → list[Issue]`

Returns issues sorted by priority score (highest first).

### `read_file(path) → DataFrame`

Supports `.csv`, `.tsv`, `.xlsx`, `.xls`, `.xlsm`.

### `plot_ascii(report) → None`

Prints ASCII bar charts for issue severity, missing values, and outliers.

---

## Contributing

Pull requests are welcome. To get started:

```bash
git clone https://github.com/ajayvarmaramineni/dfdoctor
cd dfdoctor
pip install -e ".[dev]"
pytest tests/
```

Please open an issue first to discuss what you'd like to change. All contributions should include tests and pass with `0 warnings`.

---

## License

MIT © [Ajay Ramineni](https://github.com/ajayvarmaramineni)
