Metadata-Version: 2.4
Name: tsauditor
Version: 0.1.2
Summary: A data quality auditing library for time-series tabular data in financial and sensor domains.
Author-email: Iman <iman@example.com>
License: MIT
Keywords: time-series,data-quality,leakage-detection,finance,sensor,audit
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas<3,>=1.5
Requires-Dist: numpy<3,>=1.23
Requires-Dist: scipy<2,>=1.9
Requires-Dist: statsmodels<1,>=0.13
Requires-Dist: rich<16,>=13.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# tsauditor
[![CI](https://github.com/imann128/tsauditor/actions/workflows/ci.yml/badge.svg)](https://github.com/imann128/tsauditor/actions/workflows/ci.yml)
[![codecov](https://codecov.io/github/imann128/tsauditor/graph/badge.svg)](https://codecov.io/github/imann128/tsauditor)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

A data-quality auditing library for **time-series tabular data**, with a focus on
financial and sensor domains. `tsauditor` scans a `DataFrame` and returns a
structured report covering structural problems, anomalies, and, as its core
contribution, **data leakage** between features and the prediction target.

The project grew out of a real bug in a Pakistani equity (OGDC) direction-prediction
model: a same-day percentage-change feature (`ChangeP`) was mathematically near-identical
to the target it was meant to predict. With `ChangeP` included, a Random Forest
classifier reached 99.68% accuracy (AUC 0.9987); a Gradient Boosting classifier reached
the same 99.68% accuracy (AUC 0.9967). Removing it — along with same-day `Open`, `High`,
and `Low`, which are equally unavailable at prediction time — dropped accuracy to 69.81%
(RF, AUC 0.7795) and 73.70% (GBM, AUC 0.8072) on a held-out test period
(2025-01-09 to 2026-04-03). Both models still beat a 50% baseline, but the headline
accuracy had been almost entirely an artifact of the leak. `tsauditor` exists to catch
this class of mistake automatically before it reaches a model.
See [`examples/ogdc_leakage_case`](examples/ogdc_leakage_case) for the full experiment,
script, and measured results.

## Installation

```bash
pip install tsauditor
```

Requires Python ≥ 3.9. Core dependencies: `pandas`, `numpy`, `scipy`, `statsmodels`, `rich`.

### Development setup

```bash
git clone https://github.com/imann128/tsauditor.git
cd tsauditor
pip install -e ".[dev]"
```

## Quickstart

```python
import tsauditor as tsa

report = tsa.scan(df, target="Direction", domain="finance")

report.summary()                 # rich-formatted CLI table
report.critical                  # list[Issue] that block modeling
report.filter(module="leakage")  # programmatic filtering
report.to_json("report.json")    # structured export
```

`scan()` returns a `GuardReport` holding `Issue` dataclasses bucketed by severity
(`critical`, `warnings`, `info`) plus dataset metadata.

## What it checks

| Module | Code | Severity | Detects |
|--------|------|----------|---------|
| profiler | PRF001 | warning | Irregular timestamp frequency |
| profiler | PRF002 | warning | Clustered missing values |
| profiler | PRF003 | info | Non-stationarity (Augmented Dickey-Fuller) |
| profiler | PRF004 | warning | Duplicate timestamps |
| profiler | PRF005 | warning | Clustered gaps |
| profiler | PRF006 | warning | High overall missing rate |
| anomaly | ANO001 | warning | Stuck / repeated constant values |
| anomaly | ANO002 | warning | Point outliers (z-score + IQR) |
| anomaly | ANO003 | warning | Contextual spikes (local rolling z-score) |
| leakage | LEK001 | critical | Target equivalence (feature reproduces the target) |
| leakage | LEK002 | warning | Positive-lag cross-correlation peak (future info) |
| leakage | LEK003 | warning | Rolling-window lookahead (excess over persistence) |
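
As a rough illustration of the kind of rule behind ANO002, the sketch below flags a point only when a global z-score test and an IQR fence test both agree. This is a minimal stdlib-only sketch, not the library's implementation: the function name `point_outliers`, the thresholds (`z_thresh=3.0`, `iqr_k=1.5`), and the choice to require both tests to agree are all assumptions made for the example.

```python
import statistics


def point_outliers(values, z_thresh=3.0, iqr_k=1.5):
    """Return indices flagged by BOTH a z-score test and an IQR fence test.

    Hypothetical helper for illustration; thresholds are assumed defaults.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    low_fence = q1 - iqr_k * (q3 - q1)
    high_fence = q3 + iqr_k * (q3 - q1)

    flagged = []
    for i, v in enumerate(values):
        z_hit = sd > 0 and abs(v - mean) / sd > z_thresh
        iqr_hit = v < low_fence or v > high_fence
        if z_hit and iqr_hit:
            flagged.append(i)
    return flagged


# A flat sensor trace with one injected spike at position 30.
readings = [20.0 + 0.1 * (i % 5) for i in range(50)]
readings[30] = 35.0

print(point_outliers(readings))
```

Requiring agreement between the two tests trades some sensitivity for fewer false alarms on heavy-tailed series; flagging on either test alone would be the more aggressive alternative.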

### Leakage detection (the research core)

Leakage checks are **rank-based**, chosen by target type:

- **LEK001 — equivalence.** Continuous targets use `|Spearman ρ|`; binary targets use
  **AUC separation** (`max(AUC, 1−AUC)`). This is deliberate: Pearson against a binary
  0/1 target is point-biserial correlation, which for a roughly normal, zero-centered
  feature tops out near `√(2/π) ≈ 0.798`. A feature whose sign *defines* the target
  therefore scores only ~0.80 and slips under a naive threshold, while AUC scores it 1.0.
- **LEK002 — cross-correlation.** Flags features whose peak association with the target
  falls at a *positive* lag (the feature aligns with the target's future).
- **LEK003 — temporal lookahead.** Flags features that correlate with the future target
  *beyond* what the target's own autocorrelation can explain — the signature of a
  forward-looking or centered window. The persistence baseline is what keeps a
  legitimate trailing feature from being false-flagged.
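
The LEK001 argument can be checked numerically. The sketch below draws a normally distributed feature, defines a binary target as its sign, and compares Pearson correlation against AUC separation. It uses only the standard library; the helper names `pearson` and `auc` and the synthetic `change` / `direction` series are assumptions for the example, not part of the `tsauditor` API.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def auc(scores, labels):
    """ROC AUC via the Mann-Whitney rank-sum identity (ties get average ranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
change = [rng.gauss(0, 1) for _ in range(20000)]   # synthetic same-day change
direction = [1 if c > 0 else 0 for c in change]    # target defined by its sign

r = pearson(change, direction)
a = auc(change, direction)
separation = max(a, 1 - a)

print(f"Pearson r      = {r:.3f}")                 # close to sqrt(2/pi)
print(f"sqrt(2/pi)     = {math.sqrt(2 / math.pi):.3f}")
print(f"AUC separation = {separation:.3f}")        # perfect separation
```

A threshold such as `|r| > 0.9` would miss this feature even though it determines the target exactly, which is the failure mode the rank-based check is meant to avoid.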

LEK002/LEK003 are WARNING-level *suspicions*: in pure cross-correlation a genuine strong
predictor and a leak are distinguishable only by magnitude. LEK001 is CRITICAL because
equivalence is near-deterministic.
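
The ideas behind LEK002 and LEK003 can be sketched on synthetic data. The example below builds an autocorrelated target and three features: one shifted from the future, one trailing 3-point mean, and one centered 3-point mean. It then runs a lag scan and compares each feature's correlation with the next target value against the target's own lag-1 autocorrelation. Everything here is an assumption for illustration (the AR(1) target, the helper names `lagged_corr`, `peak_lag`, `persistence_excess`, and the use of Pearson instead of a rank statistic); it is not the library's implementation.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def lagged_corr(feature, target, lag):
    """Correlate feature[t] with target[t + lag]; lag > 0 looks into the future."""
    if lag >= 0:
        return pearson(feature[: len(feature) - lag], target[lag:])
    return pearson(feature[-lag:], target[: len(target) + lag])


def peak_lag(feature, target, max_lag=5):
    """Lag in [-max_lag, max_lag] with the largest absolute correlation."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: abs(lagged_corr(feature, target, k)))


def persistence_excess(feature, target, horizon=1):
    """Feature-vs-future-target correlation minus the target's own autocorrelation."""
    return (abs(lagged_corr(feature, target, horizon))
            - abs(lagged_corr(target, target, horizon)))


# Synthetic AR(1) series; the target is a slice so neighbours exist on both sides.
rng = random.Random(7)
n = 5000
x = [0.0]
for _ in range(n + 3):
    x.append(0.8 * x[-1] + rng.gauss(0, 1))

target = x[2 : n + 2]                                         # target[t] = x[t + 2]
shifted = [x[t + 3] + rng.gauss(0, 0.5) for t in range(n)]    # noisy target[t + 1]
trailing = [(x[t] + x[t + 1] + x[t + 2]) / 3 for t in range(n)]      # past + present
centered = [(x[t + 1] + x[t + 2] + x[t + 3]) / 3 for t in range(n)]  # includes t + 1

for name, feat in [("shifted", shifted), ("trailing", trailing), ("centered", centered)]:
    print(f"{name:9s} peak lag = {peak_lag(feat, target):+d}  "
          f"excess over persistence = {persistence_excess(feat, target):+.3f}")
```

In this setup the shifted feature peaks at a positive lag, so a lag scan alone catches it. The centered mean peaks at lag 0 and would pass a lag scan, yet it beats the persistence baseline, while the trailing mean stays below it. That is one way to see why a persistence comparison is useful alongside the cross-correlation check.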

## Architecture

```
tsauditor/
├── scanner.py          # scan() — orchestrates all modules into a GuardReport
├── profiler/           # structural checks: frequency, missing values, stationarity
├── anomaly/            # point.py, contextual.py
├── leakage/            # equivalence.py, correlation.py, temporal.py
├── report/summary.py   # GuardReport + Issue dataclasses, rich/JSON output
└── utils/validation.py # input validation & DataFrame normalization
```

## Testing

```bash
pytest -q
```

## Contributing

Contributions are welcome. Check [open issues](https://github.com/imann128/tsauditor/issues)
for ideas, or look for the `good first issue` label. Run `pytest -q` before opening a PR —
all 93 tests must pass, and CI will verify this across Python 3.9–3.14 on Linux, Windows, and macOS.


## Status

Beta (`0.1.2`). Profiler, anomaly, and leakage modules are implemented and tested
(93 tests passing, CI across Python 3.9–3.14 on Linux, Windows, macOS).

## License

MIT — see [LICENSE](LICENSE).
