Metadata-Version: 2.4
Name: leaklint
Version: 0.1.0
Summary: Detect data leakage in ML datasets and pipelines at runtime — focused, framework-agnostic, CI-friendly.
Author: Atharva Khambete
License: MIT
Keywords: machine-learning,data-leakage,validation,reproducibility,mlops
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.3
Requires-Dist: scikit-learn>=1.0
Provides-Extra: yaml
Requires-Dist: pyyaml; extra == "yaml"
Provides-Extra: progress
Requires-Dist: tqdm; extra == "progress"
Provides-Extra: all
Requires-Dist: pyyaml; extra == "all"
Requires-Dist: tqdm; extra == "all"
Dynamic: license-file

# leaklint

**Runtime data-leakage detection for ML datasets and pipelines.** Point it at your
actual data and splits — `leaklint` tells you, ranked and in plain language, where
leakage is hiding.

Data leakage (information from outside the training data sneaking in) is a common
reason a model looks great offline and fails in production. A survey documents leakage
affecting **294 papers in 17 fields** ([Kapoor & Narayanan, *Patterns* 2023](https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9)).

## Why another tool?

| Approach | Examples | Limitation |
|---|---|---|
| IDE / static **code** analysis | LeakageDetector (PyCharm/VSCode) | reads your *code*, not your *data*; not an importable library |
| broad validation suites | deepchecks | leakage is a small, shallow part of a heavy suite |
| narrow black-box trick | leak-detect | dormant since 2020; instruments one function |

`leaklint` is focused, framework-agnostic (numpy / pandas / any sklearn-style
estimator), zero-config, and CI-friendly — it inspects the **data + splits** directly.

## Install

```bash
pip install -e .        # from this repo (PyPI release TBD)
```

## Usage

```python
from leaklint import detect_leakage

report = detect_leakage(
    X_train, y_train, X_test, y_test,
    groups_train=cust_ids_tr, groups_test=cust_ids_te,  # optional
    time_train=dates_tr,      time_test=dates_te,        # optional
)
report.summary()          # ranked, human-readable findings
if not report.clean:      # gate your CI
    raise SystemExit(1)
```

## What it detects

- **Exact** train/test duplicate rows (overlap contamination) — NaN- and float-safe hashing
- **Near-duplicate** rows across splits (jittered copies / augmentation overlap) — *opt-in*
  (`enable_near_dup=True`); distance-based, can false-positive on discrete/low-cardinality data
- **Group leakage** — same entity/group in both train and test (partial overlap counts)
- **Temporal leakage** — training rows dated at/after the test period (mixed-tz / sub-day safe)
- **Leaky / target-proxy features** — a single feature that ~perfectly predicts the target,
  via AUC/correlation **and mutual information** (catches non-linear, non-monotonic proxies)
- **Target / mean-encoding** — a feature whose values equal the per-group target mean
- **Identifier features** — near-unique id-like columns that shouldn't be features
- **Within-train duplicates** — inflate cross-validated scores
- **Preprocessing leakage** — a scaler/imputer (`transformer=`) fit on train+test, plus
  **encoders / feature-selectors** (categories or selected-features fit on the full data)
- **Cross-validation fold leakage** — pass `cv_splits=[(train_idx, test_idx), ...]` to catch
  index bleed / duplicate rows shared across folds

Multi-output / multi-label targets, NaN, all-constant columns, and integer-encoded
categoricals (`categorical_features=[...]`, excluded from near-dup distance) are handled.
All stochastic steps are deterministic via `random_state`.
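
Several of the data-level checks listed above reduce to a few set and comparison operations. The following is a minimal sketch of four of them (exact duplicates, group overlap, temporal overlap, fold index bleed) in plain pandas/numpy. It is an illustration only, not `leaklint`'s implementation: the function names are hypothetical, and the temporal check assumes "the test period" starts at the earliest test timestamp.

```python
import numpy as np
import pandas as pd


def exact_overlap(train: pd.DataFrame, test: pd.DataFrame) -> int:
    """Count test rows whose values also appear as a row in train."""
    # One uint64 hash per row; index=False ignores row labels.
    # Both frames need the same column order and dtypes for hashes to match.
    train_hashes = pd.util.hash_pandas_object(train, index=False).to_numpy()
    test_hashes = pd.util.hash_pandas_object(test, index=False).to_numpy()
    return int(np.isin(test_hashes, train_hashes).sum())


def group_overlap(groups_train, groups_test) -> set:
    """Entities that occur on both sides of the split."""
    return set(groups_train) & set(groups_test)


def temporal_overlap(time_train, time_test) -> int:
    """Count training rows dated at or after the earliest test row (assumed cutoff)."""
    cutoff = pd.to_datetime(pd.Series(time_test)).min()
    return int((pd.to_datetime(pd.Series(time_train)) >= cutoff).sum())


def fold_index_bleed(cv_splits) -> list:
    """Per fold, the row indices present in both the train and test index arrays."""
    return [np.intersect1d(tr, te).tolist() for tr, te in cv_splits]
```

A non-zero count or a non-empty set from any of these helpers is the kind of signal the corresponding check reports.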

## scikit-learn pipelines

```python
from leaklint import audit_pipeline
report = audit_pipeline(fitted_pipeline, X_train, y_train, X_test, y_test)
```
`audit_pipeline` pulls the transformer steps out of the fitted pipeline and checks them
for fit-on-full-data leakage, in addition to running the usual data-level checks.
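
One way a fit-on-full-data check can work for a scaler is to compare its fitted statistics against statistics recomputed from the training split alone and from train+test combined. The sketch below assumes a fitted `StandardScaler` and numpy arrays; `fit_on_full_data_suspected` is a hypothetical helper, and `audit_pipeline` may work differently internally.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def fit_on_full_data_suspected(scaler, X_train, X_test, rtol=1e-6) -> bool:
    """True if the scaler's means match train+test but not train alone."""
    train_mean = X_train.mean(axis=0)
    full_mean = np.vstack([X_train, X_test]).mean(axis=0)
    matches_train = np.allclose(scaler.mean_, train_mean, rtol=rtol)
    matches_full = np.allclose(scaler.mean_, full_mean, rtol=rtol)
    return matches_full and not matches_train


rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, size=(100, 3))
X_test = rng.normal(loc=5.0, size=(50, 3))   # shifted, so the two means differ

leaky = StandardScaler().fit(np.vstack([X_train, X_test]))  # fit before the split
clean = StandardScaler().fit(X_train)                       # fit on train only
```

Here `fit_on_full_data_suspected(leaky, X_train, X_test)` is true and the same call with `clean` is false. The comparison is only informative when train and test statistics actually differ.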

## Use in CI

CLI (exits non-zero when leakage is found, so it gates the build):

```bash
leaklint --train train.csv --test test.csv --target label
# optional: --groups customer_id --time signup_date --enable-near-dup
```

GitHub Actions:

```yaml
- run: pip install leaklint   # needs the PyPI release (TBD); until then, `pip install -e .` from a checkout
- run: leaklint --train data/train.csv --test data/test.csv --target label --sarif > leaklint.sarif
```

Machine-readable output for artifact storage / diff-on-PR:

```python
report.to_json()      # {"clean": bool, "findings": [...]}
report.to_sarif()     # SARIF 2.1.0 (GitHub code-scanning compatible)
```
```bash
leaklint --train t.csv --test e.csv --target y --json     # or --sarif
```
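
A CI step that stores the JSON as an artifact can also gate on it. The sketch below relies only on the documented top-level shape (`{"clean": bool, "findings": [...]}`); it assumes the JSON arrives as a string, and `gate` is a hypothetical helper rather than part of `leaklint`.

```python
import json


def gate(report_json: str) -> int:
    """Return a process exit code: 0 if the report is clean, 1 otherwise."""
    payload = json.loads(report_json)
    if payload.get("clean", False):
        return 0
    # The per-finding schema is not assumed here; only the count is used.
    print(f"leakage findings: {len(payload.get('findings', []))}")
    return 1
```

In a script this would typically end with `raise SystemExit(gate(text))`, where `text` is the saved JSON output.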

Per-check severity is configurable via `leaklint.yaml` or a dict, since not every project treats, say, group leakage as fatal:

```yaml
# leaklint.yaml   (auto-discovered, or pass --config / config=)
severity:
  group_leakage: low        # downgrade
  within_train_duplicates: ignore   # mute
```

`leaklint` can also run as a local pre-commit hook (the check runs before each commit):

```yaml
# .pre-commit-config.yaml
- repo: local
  hooks:
    - id: leaklint
      name: leaklint
      entry: leaklint --train data/train.csv --test data/test.csv --target label
      language: system
      pass_filenames: false
```

## Threshold validation

The default thresholds were measured with `scripts/validate_thresholds.py` on 13 public OpenML datasets:

- **Near-duplicate eps** (default = data-relative `0.1 × median nearest-neighbour distance`):
  **recall 1.00** on injected near-duplicates across all 13 datasets; clean-data
  false-positive rate **mean 0.028, max 0.163** (spambase). Because that FP-rate is
  dataset-dependent, near-dup ships **opt-in** (`enable_near_dup=True`).
- **Identifier near-unique threshold** (≥ 95% distinct): **0 / 251 columns** flagged across
  the (id-free) benchmark datasets → empirical FPR ≈ **0.0000** on this benchmark; none of
  its valid ordinal columns were flagged.
- **Leaky-feature AUC threshold** is configurable (`leaky_auc_threshold`, default 0.99). AUC
  is rank-based and threshold-free, so it is robust to class imbalance for *ranking*; under
  extreme imbalance its variance grows, so treat near-threshold hits as suspicions, not proof.
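
The AUC part of the leaky-feature check can be pictured as scoring each feature on its own as if it were a classifier output. The sketch below assumes a binary target and a dense numeric array; `leaky_features` is a hypothetical name, the 0.99 default mirrors `leaky_auc_threshold`, and the mutual-information part of the check is not covered.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def leaky_features(X, y, threshold=0.99) -> list:
    """Indices of columns whose single-feature AUC reaches the threshold."""
    flagged = []
    for j in range(X.shape[1]):
        auc = roc_auc_score(y, X[:, j])
        auc = max(auc, 1.0 - auc)   # a perfectly inverted feature is just as leaky
        if auc >= threshold:
            flagged.append(j)
    return flagged


rng = np.random.default_rng(0)
y = np.array([0, 1] * 100)
X = np.column_stack([
    rng.normal(size=200),                    # unrelated noise
    y + rng.normal(scale=0.01, size=200),    # target plus tiny jitter
    -y + rng.normal(scale=0.01, size=200),   # inverted proxy
])
```

With this data, `leaky_features(X, y)` flags columns 1 and 2 and leaves the noise column alone.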

## Scale

- Exact-duplicate detection is hash-based and **chunked** (`chunk_size=`) — O(n) memory.
- Near-duplicate uses tree-based nearest-neighbour search with a **sampling fallback**
  (`max_rows=`, default 20k; sampling lowers recall but never invents matches) and is
  deterministic via `random_state`. Optional `progress=True` shows a tqdm bar.
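
The data-relative eps and the sampling fallback can be sketched with scikit-learn's `NearestNeighbors`. This is an assumption-laden illustration, not the library's code: `near_duplicates` and `eps_factor` are hypothetical names, the median nearest-neighbour distance is taken within the (possibly sampled) training rows, and only the training side is sampled.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def near_duplicates(X_train, X_test, eps_factor=0.1, max_rows=20_000, random_state=0):
    """Indices of test rows lying within eps of some training row."""
    X_train, X_test = np.asarray(X_train), np.asarray(X_test)
    rng = np.random.default_rng(random_state)
    if len(X_train) > max_rows:
        # Sampling can miss matches (lower recall) but cannot create them.
        keep = rng.choice(len(X_train), size=max_rows, replace=False)
        X_train = X_train[keep]

    nn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree").fit(X_train)
    # Column 0 is each training row itself (distance 0); column 1 is its nearest other row.
    dist_within, _ = nn.kneighbors(X_train)
    eps = eps_factor * np.median(dist_within[:, 1])

    dist_cross, _ = nn.kneighbors(X_test, n_neighbors=1)
    return np.flatnonzero(dist_cross[:, 0] <= eps)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))
X_test = rng.normal(size=(50, 4))
X_test[0] = X_train[10] + 1e-6   # inject a jittered copy of a training row
hits = near_duplicates(X_train, X_test)
```

`hits` contains index 0, the injected copy. On discrete or low-cardinality data the median nearest-neighbour distance can collapse toward zero, which is one reason a distance-based check like this can misfire.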

## Honest limitations

- Some leakage is **context-dependent** (e.g., "is this feature known at prediction
  time?"). `leaklint` flags *mechanically detectable* leakage and *suspicious* signals;
  it does not claim to certify a dataset leak-free.
- The near-duplicate threshold is a documented, tunable heuristic (`near_dup_eps=`).
- `leaklint` is complementary to static-analysis tools, which catch code-level mistakes
  (such as calling `scaler.fit(X)` before the split) that a data-only view can miss.

## Roadmap (not yet implemented)

- **scipy.sparse** input support for high-dimensional text/embedding data.
- **LSH** (e.g. `datasketch`) for near-dup beyond the sampling cap; today large data is
  handled by the sampling fallback + tree search rather than true sub-linear LSH.
- True **out-of-core** streaming from disk (current chunking still expects an in-memory frame).
