Metadata-Version: 2.4
Name: pandas-cat
Version: 0.1.5
Summary: Profile pandas DataFrames and generate self-contained HTML reports with categorical and continuous column analysis.
Author-email: Petr Masa <masa@petrmasa.com>
License: MIT
Project-URL: Homepage, https://petrmasa.com/pandas-cat
Project-URL: Documentation, https://petrmasa.com/pandas-cat/index.html
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 5 - Production/Stable
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: seaborn
Requires-Dist: matplotlib
Requires-Dist: jinja2
Requires-Dist: cleverminer>=1.0.7
Requires-Dist: packaging
Requires-Dist: scipy
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: check-wheel-contents; extra == "dev"
Dynamic: license-file

# pandas-cat

<img alt="PyPI - License" src="https://img.shields.io/pypi/l/pandas-cat">
<img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/pandas-cat">
<img alt="PyPI - Wheel" src="https://img.shields.io/pypi/wheel/pandas-cat">
<img alt="PyPI - Status" src="https://img.shields.io/pypi/status/pandas-cat">
<img alt="PyPI - Downloads" src="https://img.shields.io/pypi/dm/pandas-cat">

**pandas-cat** (PANDAS-CATegorical profiling) is a library for profiling categorical datasets
and preparing them for analysis. It generates HTML reports with category distributions,
correlations, and missing-value summaries, and automatically reorders numeric-like categories
into their natural order.

Real datasets usually mix categorical and continuous variables, and pandas-cat profiles both.
Variable types are detected automatically. Because pandas-cat focuses on categorical profiling, the default
preparation engine converts numeric columns with few distinct values (<= `cat_limit`, default 20) to ordered categoricals,
so a 0/1 flag or a 1–5 rating scale gets a frequency bar chart rather than a histogram. Numeric columns with more distinct
values are left as continuous. To override this behaviour, set `auto_prepare` to `False`
or cast the column to `float` before calling `profile()`.

Pass any DataFrame and get a self-contained HTML report in one call:

```python
import pandas_cat
pandas_cat.profile(df, dataset_name="Road accidents")
```

The report gives you:

- **Bar charts** — frequency counts and percentages for every categorical column.
- **Histograms** — distribution for every numeric column.
- **Correlations** — between all variables and between categorical values.
- **Missing-value summary** — sentinel detection and gap counts per column.
- **Memory breakdown** — usage by column.

Two preparation helpers clean the data before profiling (both can also be called on their own):

- **prepare(df)** detects numeric-like categories and converts them to ordered
  `CategoricalDtype` so charts and correlations respect natural order.

  Without `prepare()`, pandas sorts categories alphabetically — a common trap:

  ```
  # Alphabetical (wrong) — pandas default
  16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 6–10, 76+, Under 6
  ```

  After `prepare()`, the natural numeric order is restored:

  ```
  # Natural order (correct) — after prepare()
  Under 6, 6–10, 16–20, 21–25, 26–35, 36–45, 46–55, 56–65, 66–75, 76+
  ```

- **handle_missing_values(df)** replaces 75+ sentinel strings
  (`"Unknown"`, `"N/A"`, `"–"`, `"Missing"`, …) with `pd.NA`
  so they are counted as missing rather than treated as valid categories.

## Installation

```bash
pip install pandas-cat
```

## Quick start

```python
import pandas as pd
import pandas_cat

df = pd.read_csv('data.csv')
pandas_cat.profile(df=df, dataset_name="My dataset")
# generates report/report.html
```

## Continuous variables

Numeric columns with many unique values (above `cat_limit`) are profiled as
continuous out of the box. The built-in preparation engine (`auto_prepare=True`)
keeps them in the report and draws histograms for them rather than excluding
them. Low-cardinality numeric columns are converted to ordered categoricals
instead.

```python
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')

pandas_cat.profile(df=df, dataset_name="Accidents", out_html="accidents_full.html")
# Columns like Driver_Age, Hour, Engine_Capacity are profiled with histograms.
# String-encoded categories like Driver_Age_Band are ordered correctly.
```
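The cardinality rule described above can be illustrated with plain pandas. This is a minimal sketch, not pandas-cat's actual implementation: the helper names `split_numeric_columns` and `to_ordered_categorical` are hypothetical, and the only behaviour taken from this README is the `<= cat_limit` threshold with a default of 20.

```python
import pandas as pd


def split_numeric_columns(df: pd.DataFrame, cat_limit: int = 20):
    """Hypothetical helper: split numeric columns by number of distinct values."""
    categorical, continuous = [], []
    for col in df.select_dtypes(include="number").columns:
        # Missing values are not counted as a distinct level.
        if df[col].nunique(dropna=True) <= cat_limit:
            categorical.append(col)
        else:
            continuous.append(col)
    return categorical, continuous


def to_ordered_categorical(series: pd.Series) -> pd.Series:
    """Hypothetical helper: cast a low-cardinality numeric column to an ordered categorical."""
    levels = sorted(series.dropna().unique())
    return series.astype(pd.CategoricalDtype(categories=levels, ordered=True))


demo = pd.DataFrame({
    "flag": [0, 1] * 50,             # 2 distinct values
    "rating": [1, 2, 3, 4, 5] * 20,  # 5 distinct values
    "speed": range(100),             # 100 distinct values
    "name": ["a"] * 100,             # non-numeric, ignored by the split
})
cat_cols, cont_cols = split_numeric_columns(demo)
rating = to_ordered_categorical(demo["rating"])
```

Here `flag` and `rating` fall under the threshold and would get bar charts, while `speed` stays continuous and would get a histogram.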

## Categorical data with numeric-looking values

Many real datasets store ordered categories as strings: `"0-10"`, `"Over 75"`,
`"60+"`. Alphabetical sorting places `"Over 75"` before `"Under 5"`, which
reverses the intended order. pandas-cat fixes this automatically:

```python
df = pd.read_csv('https://petrmasa.com/pandas-cat/data/accidents.zip',
                 encoding='cp1250', sep='\t')
df = df[['Driver_Age_Band', 'Driver_IMD', 'Sex', 'Journey']]

pandas_cat.profile(df=df, dataset_name="Accidents")
```

`auto_prepare=True` (default) converts `Driver_Age_Band` to an ordered
`pandas.Categorical` sorted by the extracted numeric values before profiling.
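To see what sorting by extracted numeric values can look like, here is a minimal sketch in plain pandas. It is an illustration under stated assumptions, not the library's own algorithm: the `natural_key` function and its handling of `Under`/`Over` prefixes are hypothetical.

```python
import re

import pandas as pd


def natural_key(label: str) -> tuple[float, int]:
    """Hypothetical sort key: first number in the label, then a prefix-based tie-break."""
    match = re.search(r"\d+(?:\.\d+)?", label)
    # Labels without any number are pushed to the end.
    number = float(match.group()) if match else float("inf")
    lowered = label.strip().lower()
    if lowered.startswith(("under", "<")):
        rank = -1  # "Under 6" sorts before "6–10"
    elif lowered.startswith(("over", ">")):
        rank = 1   # "Over 75" sorts after a band that starts at 75
    else:
        rank = 0
    return (number, rank)


bands = pd.Series(["16–20", "Under 6", "76+", "6–10", "21–25"])
ordered_labels = sorted(bands.unique(), key=natural_key)
band_dtype = pd.CategoricalDtype(categories=ordered_labels, ordered=True)
bands_cat = bands.astype(band_dtype)
```

Once the column carries an ordered `CategoricalDtype`, `sort_values()`, group-bys, and plots follow the category order instead of the string order.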

## Report templates

```python
# Default — static HTML with SVG charts
pandas_cat.profile(df=df, dataset_name="Accidents")

# Modern — same content, refreshed visual style
pandas_cat.profile(df=df, dataset_name="Accidents", template="modern")

# Interactive — three correlation metrics (Cramér's V, Spearman, Theil's U),
# per-category crosstabs, raw data driven
pandas_cat.profile(df=df, dataset_name="Accidents", template="interactive")
```
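For readers unfamiliar with the first of these metrics, the sketch below computes the textbook form of Cramér's V from a contingency table, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). It is an assumption-laden illustration: the `cramers_v` function is hypothetical, applies no bias or continuity correction, and may differ from how the interactive report computes the value.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Hypothetical helper: uncorrected Cramér's V between two categorical Series."""
    table = pd.crosstab(x, y).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    min_dim = min(table.shape) - 1
    if min_dim == 0:
        return float("nan")  # undefined when either variable has a single level
    return float(np.sqrt(chi2 / (n * min_dim)))


# Perfectly associated columns give 1, independent columns give 0.
v_same = cramers_v(pd.Series(["a", "b", "a", "b"]), pd.Series(["u", "v", "u", "v"]))
v_indep = cramers_v(pd.Series(["a", "a", "b", "b"]), pd.Series(["u", "v", "u", "v"]))
```

Cramér's V is symmetric and ignores category order, which is why the interactive template pairs it with Spearman (order-aware) and Theil's U (asymmetric).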

## All options

```python
pandas_cat.profile(
    df=df,
    dataset_name="Accidents",
    out_html="report.html",   # written to report/<out_html>
    opts={
        "auto_prepare":    True,         # convert numeric-string categories
        "cat_limit":       20,           # numeric columns with <= cat_limit distinct values become categorical
        "na_values":       ["MyNA"],     # extra missing-value sentinels
        "na_ignore":       ["NA"],       # built-in sentinels to keep as-is
        "keep_default_na": True,         # False = use only na_values
    }
)
```

## Data preparation only

```python
# Built-in engine (default) — preserves high-cardinality numeric columns as continuous
df = pandas_cat.prepare(df)

# CleverMiner engine — opt in explicitly
df = pandas_cat.prepare(df, auto_data_prep="CLM")
```

## Missing-value handling only

```python
df, detected, counts = pandas_cat.handle_missing_values(
    df,
    na_values=["TBD"],
    na_ignore=["-"],
)
```

75+ built-in sentinel strings are detected by default (NA, N/A, NULL, None,
Unknown, Missing, …).
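The idea behind sentinel replacement can be sketched with plain pandas. This is not the library's implementation: the `replace_sentinels` helper, its small `SENTINELS` set, and the case-insensitive, whitespace-trimmed matching are all assumptions made for illustration.

```python
import pandas as pd

# Illustrative subset only; the built-in list is much longer.
SENTINELS = {"unknown", "n/a", "na", "null", "none", "missing", "-"}


def replace_sentinels(df: pd.DataFrame, sentinels=SENTINELS):
    """Hypothetical helper: replace sentinel strings with pd.NA in text columns."""
    cleaned = df.copy()
    counts = {}
    for col in cleaned.columns:
        series = cleaned[col]
        is_text = pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series)
        if not is_text:
            continue  # numeric, boolean, and datetime columns are left untouched
        # Assumption: matching ignores case and surrounding whitespace.
        normalized = series.astype("string").str.strip().str.lower()
        mask = normalized.isin(sentinels)
        cleaned[col] = series.mask(mask, pd.NA)
        counts[col] = int(mask.sum())
    return cleaned, counts


raw = pd.DataFrame({"sex": ["M", "Unknown", "F", " n/a "], "age": [31, 45, 27, 60]})
cleaned, counts = replace_sentinels(raw)
```

After the replacement, `cleaned["sex"].isna()` flags the two sentinel rows, so they count as missing instead of forming their own categories.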

## Sample reports

- [Short report — default template](https://petrmasa.com/pandas-cat/accidents_short.html)
- [Full report — all columns, continuous variables](https://petrmasa.com/pandas-cat/accidents_full.html)
- [Modern template](https://petrmasa.com/pandas-cat/accidents_modern.html)
- [Interactive report](https://petrmasa.com/pandas-cat/accidents_interactive_report.html)

## Credits

Petr Masa: base package, data preparation, maintenance, and development

Jan Nejedly: interactive report (first version), missing-value handling (first version)
