Metadata-Version: 2.4
Name: spotless
Version: 0.3.0
Summary: The Ultimate Data Cleaning Engine for Python
Author-email: Spotless Maintainers <maintainers@spotless.org>
License: MIT
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.12
Requires-Dist: polars>=0.20.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: tzdata>=2024.1; sys_platform == 'win32'
Description-Content-Type: text/markdown

# Spotless: The Operating System for Data Quality

Spotless is a production-grade Python package that acts as **"The Operating System for Data Quality."** Instead of introducing custom data wrappers, Spotless integrates seamlessly into existing pipelines by accepting and returning standard Pandas DataFrames, Polars DataFrames/LazyFrames, and PyArrow Tables.

Spotless is built around two primitives:

1. `sp.inspect(df)`: Generates a Dataset Intelligence Report detailing Trust Scores, DNA signatures, and Semantics.
2. `sp.clean(df)`: Generates an explainable, deterministic cleaning plan that handles missing data, duplicate rows, memory bloat, and semantically noisy strings (dates, emails, phone numbers).

---

## 🚀 Key Features

1. **Zero Friction API**: Call `sp.inspect()` or `sp.clean()` on any Pandas DataFrame, Polars DataFrame/LazyFrame, or PyArrow Table.
2. **Lighthouse Dataset Trust Scores**: Computes multi-dimensional quality scores (Overall, Reliability, ML Readiness, Memory Efficiency, Schema Stability, and Semantic Quality).
3. **Deep Semantic Engine**: Uses heuristic regexes and checksum algorithms (Luhn for credit cards, Verhoeff for Aadhaar) to validate PAN, GSTIN, IP addresses, emails, phone numbers, and currencies.
4. **Explainable Cleaning**: Automatically converts types, normalizes PII formats, imputes missing values, and drops exact duplicates, reporting exactly *what* changed, *why* it changed, and how much the Trust Score moved as a result. By default, Spotless does not forward-fill missing values, because forward-filling can invent values in cross-sectional data where row order carries no meaning; it uses constant or mode imputation instead.
5. **Streaming Native**: Built on Polars, `sp.clean()` supports `collect(streaming=True)` for out-of-core datasets that do not fit in memory.

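As an illustration of the checksum validation mentioned above, here is a minimal, standalone sketch of the Luhn algorithm. It is not Spotless's internal implementation; the function name `luhn_valid` and the choice to ignore spaces and hyphens are assumptions made for this example.

```python
def luhn_valid(number: str) -> bool:
    """Return True if `number` passes the Luhn checksum.

    Hypothetical helper for illustration only; not part of the Spotless API.
    Spaces and hyphens are ignored; any other non-digit character fails.
    """
    cleaned = number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    for candidate in ("4111 1111 1111 1111", "4111 1111 1111 1112"):
        print(candidate, luhn_valid(candidate))
```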
---

## 📦 Installation

To install Spotless in your project, use `pip` or `uv`:

```bash
pip install spotless
```

or

```bash
uv add spotless
```

*(Note: On Windows, the `tzdata` package is installed automatically as a dependency to support timezone-aware datetime validation.)*

---

## ⚡ Quick Start

### 1. Dataset Inspection

```python
import spotless as sp
import polars as pl

# 1. Load your standard dataframe
df = pl.read_csv("messy_sales.csv")

# 2. Inspect the dataset
profile = sp.inspect(df)

# Retrieve metrics programmatically
print(f"Overall Trust Score: {profile.trust_score.overall}/100")
print(f"ML Readiness: {profile.trust_score.ml_readiness}/100")

# 3. Display the visual report in your terminal
profile.show()
```

### 2. Explainable Cleaning

```python
import spotless as sp
import polars as pl

df = pl.read_csv("messy_sales.csv")

# Generate the plan, show it in the terminal, and execute it
clean_df = sp.clean(df)

# Alternatively, step through it manually:
plan = sp.plan(df)
plan.show()

# Dry run to see exactly which rows will be affected before mutating
plan.execute(dry_run=True)

# Execute
clean_df = plan.execute()
```

### 3. Command Line Interface (CLI)

Spotless exposes a Typer-based CLI for instant dataset diagnostics directly from your terminal:

```bash
# Get a visual diagnostic report
spotless inspect --input messy_sales.csv
```

![Spotless Demo](demo.gif)

---

## 🛠️ Benchmarks

Spotless is built for speed. Our benchmarking suite compares it against `pyjanitor`, `Pandera`, `ydata-profiling`, and `Great Expectations`.

### 100,000 Rows (19 MB DataFrame)

| Tool | Time (s) | Peak Memory (MB) |
|------|----------|------------------|
| **Spotless** | **1.02** | 113.79 |
| Pandera | 1.18 | **14.38** |
| PyJanitor | N/A* | N/A* |
| Great Expectations | N/A* | N/A* |
| ydata-profiling | N/A* | N/A* |

*\*N/A: the tool could not be run in the benchmark. As of Pandas 2.x/3.x, `pyjanitor` and `ydata-profiling` have severe internal breaking changes that cause crashes, and Great Expectations 1.0+ has removed its standard `from_pandas` API.*

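Despite producing a detailed heuristic semantic analysis and executing data transformations, **Spotless finished in about 14% less time than Pandera, a pure schema-validation library** (1.02 s vs. 1.18 s). The trade-off is memory: Spotless peaked at roughly 8x Pandera's usage (113.79 MB vs. 14.38 MB).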

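For readers who want to take comparable measurements on their own data, the sketch below times a callable and records its peak memory using only the standard library. It is not the project's benchmarking suite: the `measure` helper is hypothetical, and `tracemalloc` only sees allocations made through Python's allocator, so memory used natively by libraries such as Polars may be under-reported.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run `fn` once and return (result, elapsed_seconds, peak_mib).

    Hypothetical helper for illustration; not part of Spotless.
    """
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current_bytes, peak_bytes).
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)


if __name__ == "__main__":
    # Stand-in workload; replace with e.g. `lambda: sp.clean(df)` to time a real call.
    _, seconds, peak_mib = measure(lambda: [i * i for i in range(100_000)])
    print(f"{seconds:.3f} s, peak {peak_mib:.2f} MiB (Python allocations only)")
```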
## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details on how to set up your development environment, run tests, and submit pull requests.
