Metadata-Version: 2.4
Name: framelint
Version: 0.1.0
Summary: A lightweight data-quality profiler and CI gate for tabular data.
Project-URL: Homepage, https://github.com/AnoopIbrampur/framelint
Project-URL: Repository, https://github.com/AnoopIbrampur/framelint
Project-URL: Issues, https://github.com/AnoopIbrampur/framelint/issues
Project-URL: Changelog, https://github.com/AnoopIbrampur/framelint/blob/main/CHANGELOG.md
Author-email: Anoop Ibrampur <anoopibrampur@gmail.com>
License: MIT
License-File: LICENSE
Keywords: ci,data-engineering,data-profiling,data-quality,data-validation,dataframe,pandas,schema-drift
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: pandas>=1.3
Requires-Dist: rich>=13.0
Requires-Dist: tomli>=2.0; python_version < '3.11'
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: build>=1.0; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pandas-stubs; extra == 'dev'
Requires-Dist: pre-commit>=3.0; extra == 'dev'
Requires-Dist: pyarrow>=10.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Requires-Dist: twine>=5.0; extra == 'dev'
Provides-Extra: parquet
Requires-Dist: pyarrow>=10.0; extra == 'parquet'
Description-Content-Type: text/markdown

# framelint

[![PyPI version](https://img.shields.io/pypi/v/framelint.svg)](https://pypi.org/project/framelint/)
[![Python versions](https://img.shields.io/pypi/pyversions/framelint.svg)](https://pypi.org/project/framelint/)
[![CI](https://github.com/AnoopIbrampur/framelint/actions/workflows/ci.yml/badge.svg)](https://github.com/AnoopIbrampur/framelint/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/AnoopIbrampur/framelint/branch/main/graph/badge.svg)](https://codecov.io/gh/AnoopIbrampur/framelint)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Typed](https://img.shields.io/badge/typed-yes-brightgreen.svg)](https://peps.python.org/pep-0561/)

**A lightweight data-quality profiler and CI gate for tabular data.**

`framelint` scans a pandas DataFrame or a CSV/Parquet file and produces a clear
data-quality report — nulls, duplicates, constant columns, likely-ID columns,
type inconsistencies, numeric outliers, format violations, and schema drift.

Its standout feature: it doubles as a **CI gate**. Point it at your data, set
thresholds, and it exits non-zero when quality drops — so a bad dataset fails
the build instead of silently flowing downstream.

---

## Why this exists

Data pipelines break quietly. A column starts arriving 40% null, an upstream job
starts writing numbers as strings, a join silently doubles your rows — and
nobody notices until a dashboard looks wrong weeks later. `framelint` turns
those failures into loud, early, automated signals you can drop into CI in one
line.

## Install

```bash
pip install framelint
# Parquet support:
pip install "framelint[parquet]"
```

Requires Python 3.9+.

## 30-second quickstart

```python
import framelint

report = framelint.scan("sales.csv")   # or pass a DataFrame
report.summary()                        # pretty console table
print(report.passed)                    # -> True / False

report.to_json("report.json")           # machine-readable
report.to_html("report.html")           # shareable report
```

Example console output:

```
framelint  FAILED  rows=1000 cols=6  errors=1 warnings=3 info=1
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Severity ┃ Check            ┃ Column  ┃ Message                               ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ error    │ missingness      │ region  │ Column 'region' is 62.0% null.        │
│ warning  │ duplicates       │ —       │ Found 12 duplicate rows (full-row).   │
│ warning  │ type_consistency │ price   │ Column 'price' holds numbers as ...   │
│ warning  │ outliers         │ amount  │ Column 'amount' has 18 outliers ...   │
│ info     │ cardinality      │ id      │ Column 'id' looks like an identifier. │
└──────────┴──────────────────┴─────────┴───────────────────────────────────────┘
```

## Features

- **Missingness** — per-column null counts and rates, with severity thresholds.
- **Duplicate rows** — full-row or by a subset of key columns.
- **Constant / zero-variance** and all-null columns.
- **Cardinality** — likely-identifier and high-cardinality column detection.
- **Type consistency** — numbers stored as strings, mixed-type columns.
- **Outliers** — numeric outliers via IQR or z-score (configurable).
- **Format validation** (opt-in) — email, date/datetime, numeric ranges,
  regex, and allowed-value sets, per column.
- **Schema drift** — save a baseline, then detect added/removed columns, dtype
  changes, null-rate jumps, and distribution shifts.
- **Severity levels** — every finding is `info`, `warning`, or `error`.
- **Pass/fail decision** — based on configurable thresholds, for use in CI.
- **Outputs** — rich console, `dict`, JSON, HTML, and Markdown.
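
For intuition about the outlier check listed above, here is a minimal sketch of IQR-based detection using only the standard library. It is not framelint's implementation: the helper name `iqr_outliers` and the choice of quantile interpolation are assumptions, and the `factor` argument only mirrors the role of the `iqr_factor` setting.

```python
from statistics import quantiles


def iqr_outliers(values, factor=1.5):
    """Return the values lying outside the IQR fences.

    Illustrative only: framelint's own quantile method and edge-case
    handling may differ.
    """
    if len(values) < 2:
        return []  # quantiles() needs at least two data points
    # Quartiles with linear interpolation between the closest ranks.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - factor * iqr
    upper = q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


amounts = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(amounts))  # [95]
```

A larger `factor` widens the fences and flags fewer values; a smaller one makes the check stricter.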

## CLI

```bash
# Scan and write reports
framelint scan sales.csv --html report.html --json report.json

# Fail the build if any error-level finding is present
framelint scan sales.csv --fail-on error

# Save a baseline, then scan a new file for drift
framelint baseline save sales.csv baseline.json
framelint scan new.csv --baseline baseline.json
```

Exit codes: **0** = passed, **1** = quality failure, **2** = usage error.
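
If the gate is driven from a Python orchestrator rather than a shell step, the exit codes can be interpreted along these lines. This is a sketch: `describe_exit_code` and `run_gate` are hypothetical helper names, not part of framelint, and the command line is the `scan ... --fail-on` form shown above.

```python
import subprocess

# Meanings taken from the exit-code list above.
EXIT_MEANINGS = {0: "passed", 1: "quality failure", 2: "usage error"}


def describe_exit_code(code):
    """Map a framelint exit code to a short human-readable label."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_gate(path, fail_on="error"):
    """Run `framelint scan` on `path` and return (exit_code, label).

    Requires the framelint CLI to be installed and on PATH.
    """
    result = subprocess.run(
        ["framelint", "scan", path, "--fail-on", fail_on],
        check=False,  # a non-zero exit is an expected outcome, not an exception
    )
    return result.returncode, describe_exit_code(result.returncode)
```

Calling `run_gate("data/sales.csv")` would then return a pair such as `(1, "quality failure")` when the scan fails the gate.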

### Use it in CI to gate data quality

```yaml
# .github/workflows/data-quality.yml
name: data-quality
on: [push, pull_request]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: "3.12" }
      - run: pip install framelint
      - run: framelint scan data/sales.csv --fail-on error --baseline data/baseline.json
```

If quality drops below your thresholds, the step exits non-zero and the build
fails — no extra glue code required.

## Configuration

Thresholds and per-column rules can be set in five places, listed here in
increasing order of precedence (each source overrides the ones above it):

1. Built-in defaults
2. `[tool.framelint]` in `pyproject.toml`
3. A standalone TOML file (`--config rules.toml`)
4. A `dict` / `Config` passed to `scan(...)`
5. Individual CLI flags (e.g. `--fail-on`, `--outlier-method`)

```toml
# pyproject.toml  (or a standalone --config file, same schema)
[tool.framelint]
null_rate_warning = 0.10
null_rate_error = 0.50
duplicate_rate_error = 0.05
outlier_method = "iqr"      # or "zscore"
fail_on = "error"

[tool.framelint.columns.email]
type = "email"

[tool.framelint.columns.age]
min = 0
max = 120
```
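
The precedence order can be pictured as a layered merge in which each later source overrides the earlier ones. The sketch below is an illustration of that idea, not framelint's code: `resolve_config` is a hypothetical helper, and merging the `columns` table column by column (instead of replacing it wholesale) is an assumption.

```python
def resolve_config(*layers):
    """Merge config layers; later layers override earlier ones.

    Top-level keys are replaced outright, except "columns", whose
    per-column rule tables are merged column by column. That nested
    merge is an assumption made for this sketch.
    """
    resolved = {}
    for layer in layers:
        if not layer:
            continue  # a source that was not provided contributes nothing
        for key, value in layer.items():
            if key == "columns":
                columns = resolved.setdefault("columns", {})
                for name, rules in value.items():
                    columns[name] = {**columns.get(name, {}), **rules}
            else:
                resolved[key] = value
    return resolved


defaults = {"null_rate_warning": 0.10, "null_rate_error": 0.50, "fail_on": "error"}
pyproject = {"null_rate_error": 0.40, "columns": {"age": {"min": 0}}}
inline = {"fail_on": "warning", "columns": {"age": {"max": 120}}}
cli_flags = {"fail_on": "error"}

# None stands in for a standalone --config file that was not supplied.
print(resolve_config(defaults, pyproject, None, inline, cli_flags))
```

Here the CLI value for `fail_on` wins over the inline one, while `null_rate_warning` falls through from the defaults because no later layer sets it.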

| Key | Default | Meaning |
| --- | --- | --- |
| `null_rate_warning` / `null_rate_error` | 0.10 / 0.50 | Null-rate thresholds |
| `duplicate_rate_warning` / `duplicate_rate_error` | 0.0 / 0.10 | Duplicate-row thresholds |
| `duplicate_subset` | `null` | Key columns for duplicate detection |
| `id_cardinality_ratio` | 0.95 | Unique-ratio to flag a likely ID |
| `high_cardinality_ratio` | 0.50 | Unique-ratio to flag high cardinality |
| `outlier_method` | `"iqr"` | `iqr` or `zscore` |
| `iqr_factor` / `zscore_threshold` | 1.5 / 3.0 | Outlier sensitivity |
| `outlier_rate_warning` / `outlier_rate_error` | 0.01 / 0.10 | Outlier-rate thresholds |
| `drift_mean_shift` | 3.0 | Mean shift (in baseline std) to flag drift |
| `drift_null_rate_increase` | 0.10 | Null-rate jump to flag drift |
| `fail_on` | `"error"` | Severity at/above which `passed` is `False` |
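
To show how the threshold pairs and `fail_on` in the table might interact, here is a small sketch. The function names are hypothetical, and treating a rate exactly equal to a threshold as a hit is an assumption; framelint's actual comparison may differ.

```python
SEVERITY_ORDER = ["info", "warning", "error"]


def null_rate_severity(null_rate, warning=0.10, error=0.50):
    """Classify a null rate; returns None when below both thresholds."""
    if null_rate >= error:
        return "error"
    if null_rate >= warning:
        return "warning"
    return None


def passed(severities, fail_on="error"):
    """True unless some finding is at or above the `fail_on` severity."""
    limit = SEVERITY_ORDER.index(fail_on)
    return all(SEVERITY_ORDER.index(s) < limit for s in severities)


print(null_rate_severity(0.62))                # error
print(passed(["warning", "info"]))             # True
print(passed(["warning", "info"], "warning"))  # False
```

With the default `fail_on = "error"`, warnings are reported but do not fail the run; lowering it to `"warning"` makes the gate stricter.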

Per-column rules (`[tool.framelint.columns.<name>]`) accept `type` (`email`,
`date`, or `datetime`), `min`, `max`, `regex`, and `allowed`.
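
As a rough picture of what the `min`, `max`, `regex`, and `allowed` rules express, the sketch below checks one value against a rule table. `check_value` is a hypothetical helper; matching the whole string for `regex` and treating `allowed` as a collection of literal values are assumptions, not documented framelint behavior.

```python
import re


def check_value(value, rules):
    """Return the names of the rules that `value` violates."""
    violations = []
    if "min" in rules and value < rules["min"]:
        violations.append("min")
    if "max" in rules and value > rules["max"]:
        violations.append("max")
    # Assumption: the pattern must match the entire value.
    if "regex" in rules and not re.fullmatch(rules["regex"], str(value)):
        violations.append("regex")
    if "allowed" in rules and value not in rules["allowed"]:
        violations.append("allowed")
    return violations


age_rules = {"min": 0, "max": 120}
print(check_value(35, age_rules))   # []
print(check_value(134, age_rules))  # ['max']
print(check_value("west", {"allowed": ["north", "south"]}))  # ['allowed']
```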

## Programmatic API

```python
import framelint

# Baseline + drift
framelint.save_baseline("sales.csv", "baseline.json")
report = framelint.scan("new.csv", baseline="baseline.json")

# Inline configuration
report = framelint.scan(df, config={"fail_on": "warning", "outlier_method": "zscore"})

report.to_dict()        # full machine-readable result
report.to_markdown()    # Markdown string
report.counts_by_severity()
```
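
The configuration table describes `drift_mean_shift` as a mean shift measured in baseline standard deviations. A minimal sketch of that measure follows, assuming the sample standard deviation and a strict `>` comparison; `mean_shift` and `has_drifted` are hypothetical names, not framelint functions.

```python
from statistics import mean, stdev


def mean_shift(baseline, current):
    """Shift of the current mean, in units of the baseline standard deviation."""
    base_std = stdev(baseline)  # sample standard deviation (an assumption)
    if base_std == 0:
        # A constant baseline: any change in the mean is an unbounded shift.
        return 0.0 if mean(current) == mean(baseline) else float("inf")
    return abs(mean(current) - mean(baseline)) / base_std


def has_drifted(baseline, current, threshold=3.0):
    """True when the mean moved by more than `threshold` baseline std devs."""
    return mean_shift(baseline, current) > threshold


baseline = [10, 12, 11, 13, 14]
print(has_drifted(baseline, [11, 12, 13, 12, 13]))  # False
print(has_drifted(baseline, [20, 22, 21, 23, 24]))  # True
```

Scaling by the baseline spread means a shift of a few units counts as drift for a tightly clustered column but not for a widely dispersed one.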

## Contributing

Contributions are welcome — see [CONTRIBUTING.md](CONTRIBUTING.md) and the
[Code of Conduct](CODE_OF_CONDUCT.md). In short:

```bash
pip install -e ".[dev]"
ruff check . && ruff format --check .
mypy
pytest
```

## License

[MIT](LICENSE) © Anoop Ibrampur
