Metadata-Version: 2.4
Name: dataxid-profiling
Version: 0.3.0
Summary: Fast, Polars-native data profiling with interactive HTML reports and data quality alerts.
Project-URL: Homepage, https://dataxid.com
Project-URL: Repository, https://github.com/dataxid/dataxid-profiling
Project-URL: Changelog, https://github.com/dataxid/dataxid-profiling/blob/main/CHANGELOG.md
Project-URL: Issues, https://github.com/dataxid/dataxid-profiling/issues
Author-email: DataXID <dev@dataxid.com>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-profiling,data-quality,eda,exploratory-data-analysis,pandas-profiling,polars,profiling,ydata-profiling
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.1
Requires-Dist: phik>=0.12
Requires-Dist: polars-statistics>=0.4
Requires-Dist: polars>=1.0
Requires-Dist: scipy>=1.12
Description-Content-Type: text/markdown

# dataxid-profiling

[![PyPI version](https://img.shields.io/pypi/v/dataxid-profiling.svg)](https://pypi.org/project/dataxid-profiling/)
[![Python versions](https://img.shields.io/pypi/pyversions/dataxid-profiling.svg)](https://pypi.org/project/dataxid-profiling/)
[![License](https://img.shields.io/pypi/l/dataxid-profiling.svg)](https://github.com/dataxid/dataxid-profiling/blob/main/LICENSE)

Fast, Polars-native data profiling with interactive HTML reports and data quality alerts.

## Quickstart

```python
import polars as pl
from dataxid_profiling import ProfileReport

df = pl.read_csv("data.csv")
report = ProfileReport(df)
report.to_html("report.html")
```

Pandas works too:

```python
import pandas as pd

report = ProfileReport(pd.read_csv("data.csv"))
```

## Report Preview

**Dataset overview** — row/column counts, missing cells, duplicates, memory usage, and column type distribution at a glance.

<p align="center">
  <img src="docs/report-overview.png" alt="Dataset overview and alerts" width="700">
</p>

**Column details** — per-column statistics, top value distribution, and word clouds for categorical data.

<p align="center">
  <img src="docs/report-columns.png" alt="Column details with charts and word cloud" width="700">
</p>

**Correlations** — interactive heatmap showing relationships between numeric columns.

<p align="center">
  <img src="docs/report-correlations.png" alt="Correlation heatmap" width="700">
</p>

**Interactions** — scatter plots for numeric pairs and box plots for categorical × numeric pairs, with dynamic column selection.

## Highlights

- Built on Polars — fast, memory-efficient, Rust-powered
- 3 lines to profile any dataset
- Programmatic-first: `.to_dict()`, `.stats`, `.alerts`
- Interactive HTML reports with ECharts
- Accepts Polars, Pandas, CSV, and Parquet
- 5 column types: numeric, categorical, boolean, datetime, text
- 7 data quality alerts out of the box
- 5 correlation types: Pearson, Spearman, Kendall, Cramér's V, Phi K
- Interactions: scatter plot + box plot with dynamic column selection
- Two modes: `"complete"` for deep analysis, `"overview"` for speed
- Fully typed
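
For readers unfamiliar with Cramér's V from the correlation list above, here is a minimal plain-Python sketch of the textbook definition, `V = sqrt(chi2 / (n * (min(rows, cols) - 1)))`. The function name `cramers_v` is hypothetical, and this is not the package's implementation, which may differ (for example by applying a bias correction).

```python
import math


def cramers_v(table):
    """Textbook Cramér's V for a contingency table (list of rows of counts)."""
    n = sum(sum(row) for row in table)
    if n == 0:
        return 0.0
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-squared: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    if k <= 0:
        return 0.0  # a single row or column carries no association
    return math.sqrt(chi2 / (n * k))
```

The result is 0 for independent columns and 1 for a perfect association, which is what makes it usable alongside the numeric correlation measures.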

## Installation

```bash
pip install dataxid-profiling
```

## Usage

### Programmatic access

```python
report = ProfileReport(df, title="Customer Data Profile")

stats = report.to_dict()
alerts = report.alerts
column_stats = report.stats["age"]
correlations = report.correlations
```
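
One way to use `report.alerts` in a pipeline is as a quality gate. The sketch below assumes only that `report.alerts` is iterable; the helper name `require_no_alerts` is hypothetical and not part of the package, and the structure of individual alert objects is not assumed.

```python
def require_no_alerts(alerts):
    """Raise ValueError if any data quality alerts are present."""
    found = list(alerts)  # assumption: `alerts` can be iterated
    if found:
        summary = "; ".join(str(alert) for alert in found)
        raise ValueError(f"{len(found)} data quality alert(s): {summary}")
```

Calling `require_no_alerts(report.alerts)` after building the report would then stop a job early when the data does not pass profiling.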

### JSON export

```python
report.to_json("report.json")
```
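
The exported file can be read back with the standard library. This is a minimal sketch: the helper name `load_profile` is hypothetical, and the layout of the JSON document is not assumed here, so inspect the loaded object before relying on specific keys.

```python
import json
from pathlib import Path


def load_profile(path):
    """Load a previously exported JSON report into plain Python objects."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```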

### Configuration

```python
from dataxid_profiling import ProfileReport, ProfileConfig

config = ProfileConfig(
    title="Customer Data Profile",
    mode="overview",
    missing_threshold=0.1,
    histogram_bins=30,
)
report = ProfileReport(df, config=config)
```
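
To give a feel for what a setting like `missing_threshold=0.1` could mean, the sketch below flags columns whose share of missing values exceeds the threshold. This is an assumption about the semantics for illustration only: the helper names are hypothetical, columns are plain Python lists with `None` for missing values, and the package's actual rule (including whether the comparison is strict) may differ.

```python
def missing_fraction(values):
    """Share of entries that are None; 0.0 for an empty column."""
    values = list(values)
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def columns_over_threshold(columns, missing_threshold=0.1):
    """Names of columns whose missing share is strictly above the threshold."""
    return [
        name
        for name, values in columns.items()
        if missing_fraction(values) > missing_threshold
    ]
```

With `{"age": [31, None, 45, 52], "city": ["a", "b", "c", "d"]}`, only `age` is flagged, since one of its four values is missing.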

### Modes

| Feature | `"complete"` | `"overview"` |
|---|:-:|:-:|
| Basic stats | ✓ | ✓ |
| Histograms & value counts | ✓ | ✓ |
| Correlations | ✓ | ✗ |
| Interactions | ✓ | ✗ |
| Character analysis | ✓ | ✗ |
| Duplicate rows table | ✓ | ✗ |
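
Correlations and interactions are computed over pairs of columns, and the number of pairs grows quadratically with the column count. A simple way to choose a mode is to cap the number of pairs. The helper `pick_mode` and its `max_pairs` default are hypothetical and not part of the package; only the two mode strings come from the table above.

```python
def pick_mode(n_columns, max_pairs=1000):
    """Return "complete" while the column-pair count stays within budget."""
    n_pairs = n_columns * (n_columns - 1) // 2  # unordered column pairs
    return "complete" if n_pairs <= max_pairs else "overview"
```

The result can be passed on as `ProfileConfig(mode=pick_mode(len(df.columns)))`.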

## Output formats

| Format | Method | Use case |
|---|---|---|
| HTML | `report.to_html("report.html")` | Interactive report |
| JSON | `report.to_json("report.json")` | Machine-readable |
| Dict | `report.to_dict()` | Python-native |
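
When the output path comes from user input, the format can be chosen from the file extension. The sketch below relies only on the `to_html` and `to_json` methods listed above; the helper name `export_report` is hypothetical.

```python
from pathlib import Path


def export_report(report, path):
    """Write the report as HTML or JSON depending on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix == ".html":
        report.to_html(str(path))
    elif suffix == ".json":
        report.to_json(str(path))
    else:
        raise ValueError(f"Unsupported report extension: {suffix!r}")
```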

## Contributing

Contributions are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for details.

## Links

- [PyPI](https://pypi.org/project/dataxid-profiling/)
- [Changelog](https://github.com/dataxid/dataxid-profiling/blob/main/CHANGELOG.md)
- [GitHub Issues](https://github.com/dataxid/dataxid-profiling/issues)

## License

[Apache-2.0](https://github.com/dataxid/dataxid-profiling/blob/main/LICENSE)
