Metadata-Version: 2.4
Name: autoeda-plus
Version: 1.0.0
Summary: Automated EDA reports with quality scores, human-readable insights, and actionable recommendations
Project-URL: Homepage, https://github.com/arijit1204/autoeda
Project-URL: Bug Tracker, https://github.com/arijit1204/autoeda/issues
Author: AutoEDA+ Contributors
License: MIT
License-File: LICENSE
Keywords: data-profiling,eda,exploratory-data-analysis,machine-learning,pandas
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Visualization
Requires-Python: >=3.8
Requires-Dist: numpy>=1.21
Requires-Dist: pandas>=1.3
Requires-Dist: plotly>=5.0
Requires-Dist: scipy>=1.7
Provides-Extra: dev
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# AutoEDA+

Automated Exploratory Data Analysis with quality scores, human-readable insights, and actionable preprocessing recommendations.

[![Python](https://img.shields.io/badge/Python-3.8%2B-blue)](https://python.org)
[![License: MIT](https://img.shields.io/badge/License-MIT-green)](LICENSE)

---

## What is AutoEDA+?

AutoEDA+ is a Python library for exploring Pandas DataFrames and CSV files. It goes beyond raw statistics to provide:

- **Dataset Quality Score** (0-100) with weighted breakdown
- **Per-Column Health Scores** with specific checks and flags
- **Human-readable insights** (e.g., "Salary is highly right-skewed")
- **Actionable recommendations** (e.g., "Impute Age using Median")

---

## Installation

```bash
pip install autoeda-plus
```

---

## Quick Start

```python
import autoeda as ae
import pandas as pd

# Load your dataset
df = pd.read_csv("titanic.csv")

# Run the complete analysis pipeline
result = ae.analyze(df)

# View the overall quality score
print(result.quality.score)
```

---

## Features Overview

### 1. Dataset Overview
Provides rows, columns, memory usage, missing cells, duplicate rows, and column type breakdown.

### 2. Dataset Quality Score
An overall health score from 0-100 computed from 5 weighted dimensions:
* **Completeness (30%)**: Fraction of non-missing cells
* **Duplicates (20%)**: Fraction of non-duplicate rows
* **Outliers (20%)**: Inverse average outlier rate
* **Consistency (15%)**: No mismatched data types
* **Balance (15%)**: Category frequency balance

### 3. Column Health Score
Every column receives a score (0-100) with individual checks:
* Correct data type
* Missing value evaluation
* Outlier detection
* Cardinality checks
* Skewness detection

### 4. Human-Readable Insights
Transforms statistical metrics into accessible natural language.
```text
Instead of: Skewness = 2.4
AutoEDA+ says: 'Salary' is highly right-skewed (skewness = 2.40). Most values are concentrated at the lower end with a long upper tail.
```
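
As a rough illustration of how a skewness value could be turned into a sentence like the one above, here is a minimal sketch. It is not AutoEDA+'s implementation: the helper names and the 0.5 / 1.0 cut-offs are assumptions chosen for the example.

```python
def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    if m2 == 0:
        return 0.0  # constant column: no asymmetry to report
    return m3 / m2 ** 1.5


def describe_skew(name, values, moderate=0.5, high=1.0):
    """Map a skewness value to a short sentence (thresholds are assumed)."""
    skew = sample_skewness(values)
    if abs(skew) < moderate:
        return f"'{name}' is approximately symmetric (skewness = {skew:.2f})."
    strength = "highly" if abs(skew) >= high else "moderately"
    direction = "right" if skew > 0 else "left"
    return f"'{name}' is {strength} {direction}-skewed (skewness = {skew:.2f})."


# Made-up salaries (in thousands) with one very large value.
salaries = [30, 32, 35, 36, 38, 40, 42, 45, 50, 250]
print(describe_skew("Salary", salaries))
```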

### 5. Smart Recommendations
Provides prioritized rule-based suggestions for data preprocessing.
```text
[HIGH]   Investigate outliers in 'Income'
[MEDIUM] Impute 'Age' using Median
[MEDIUM] Convert 'JoinDate' to datetime
[LOW]    One-Hot Encode 'Gender'
```
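
A prioritized, rule-based recommender can be sketched as a set of threshold checks followed by a sort on priority. The rules, thresholds, and input shape below are hypothetical and only illustrate the idea; they are not the rules AutoEDA+ ships with.

```python
PRIORITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


def recommend_from_stats(column_stats):
    """Build prioritized suggestions from per-column summary statistics.

    column_stats maps a column name to a dict with optional
    "missing_pct" and "outlier_pct" keys (both in percent).
    """
    recs = []
    for column, stats in column_stats.items():
        if stats.get("outlier_pct", 0.0) > 10.0:  # assumed threshold
            recs.append(("HIGH", f"Investigate outliers in '{column}'"))
        if 0.0 < stats.get("missing_pct", 0.0) <= 30.0:  # assumed threshold
            recs.append(("MEDIUM", f"Impute '{column}' using Median"))
    # list.sort is stable, so equal priorities keep their insertion order.
    recs.sort(key=lambda rec: PRIORITY_ORDER[rec[0]])
    return [f"[{priority}] {message}" for priority, message in recs]


# Made-up statistics for two columns.
stats = {
    "Age": {"missing_pct": 19.87, "outlier_pct": 0.0},
    "Income": {"missing_pct": 0.0, "outlier_pct": 12.0},
}
for line in recommend_from_stats(stats):
    print(line)
```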

### 6. Dataset Comparison (Train vs Test)
Compare two datasets to detect **Data Drift** and **Schema Mismatches**:
* Detects missing columns (e.g., target missing from test set).
* Calculates numeric distribution drift using the Kolmogorov-Smirnov (KS) test.
* Calculates categorical frequency shifts.
* Visualizes differences with overlaid histograms and grouped bar charts.

---

## Complete Public API Reference

The public functions are listed below. `dataframe`, `train_dataframe`, `test_dataframe`, and the `*_name` arguments are placeholders for your own DataFrames and column names.

```python
import autoeda as ae

# Full Pipeline Analysis
result = ae.analyze(dataframe)

# Dataset Comparison (Train vs Test)
comp = ae.compare(train_dataframe, test_dataframe)

# Dataset Profiling & Quality
ae.overview(dataframe)
ae.quality(dataframe)
ae.column_health(dataframe)[column_name]

# Statistical DataFrames & Metrics
ae.missing(dataframe)
ae.duplicates(dataframe)
ae.dtypes(dataframe)
ae.statistics(dataframe)
ae.outliers(dataframe)
ae.correlation(dataframe)
ae.distribution(dataframe)

# Insights & Recommendations
ae.insights(dataframe)
ae.recommend(dataframe)

# Visualizations (Standard EDA Plots)
ae.histogram(dataframe, column_name)
ae.boxplot(dataframe, column_name)
ae.countplot(dataframe, column_name)
ae.heatmap(dataframe)
ae.missing_heatmap(dataframe)
ae.scatter(dataframe, x_column_name, y_column_name)

# Visualizations (Dataset Comparison Plots)
ae.compare_histogram(train_dataframe, test_dataframe, column_name)
ae.compare_countplot(train_dataframe, test_dataframe, column_name)
```

---

## API Examples & Outputs

### Dataset Profiling & Quality

**`ae.overview()`**  
Returns a high-level summary profile of the dataset including shape and memory footprint.
```python
In [1]: ae.overview(df)
Out[1]: DatasetProfile(n_rows=891, n_cols=12, memory='83.7 KB', ...)
```

**`ae.quality()`**  
Calculates the overall 0-100 quality score from the five weighted dimensions: completeness, duplicates, outliers, consistency, and balance.
```python
In [2]: ae.quality(df)
Out[2]: QualityResult(score=78.5, grade='C+', sub_scores=[...])
```
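
To make the weighting concrete, the sketch below combines five sub-scores using the weights listed under Dataset Quality Score. It assumes each dimension is already expressed on a 0-100 scale and that the overall score is a plain weighted sum; the function name and the sub-score values are hypothetical, and the library's own calculation may differ.

```python
# Weights from the Dataset Quality Score section; they sum to 1.0.
WEIGHTS = {
    "completeness": 0.30,
    "duplicates": 0.20,
    "outliers": 0.20,
    "consistency": 0.15,
    "balance": 0.15,
}


def weighted_quality_score(sub_scores):
    """Combine per-dimension scores (each 0-100) into one 0-100 score."""
    missing = set(WEIGHTS) - set(sub_scores)
    if missing:
        raise ValueError(f"missing sub-scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


# Hypothetical sub-scores, for illustration only.
example = {
    "completeness": 90.0,
    "duplicates": 100.0,
    "outliers": 80.0,
    "consistency": 100.0,
    "balance": 60.0,
}
score = weighted_quality_score(example)
print(round(score, 1))  # 27 + 20 + 16 + 15 + 9 = 87.0
```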

**`ae.column_health()`**  
Returns health results for every column, indexed by column name. Each entry holds a 0-100 score and the issues flagged for that column.
```python
In [3]: ae.column_health(df)['Age']
Out[3]: ColumnHealth(score=80.0, missing_pct=19.87, outlier_pct=0.0)
```

### Statistical DataFrames

**`ae.missing()`**  
Generates a pandas DataFrame listing the exact count and percentage of missing values per column.
```python
In [4]: ae.missing(df).head(2)
Out[4]: 
       missing_count  missing_pct
Cabin            687        77.10
Age              177        19.87
```
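
For reference, a table of this shape can be built with plain pandas. The percentage is the missing count divided by the number of rows, so 687 of 891 rows gives 77.10. The function below is an illustrative sketch with a hypothetical name, not the library's source.

```python
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(df) * 100).round(2),
    })
    return summary.sort_values("missing_count", ascending=False)


# Tiny made-up frame: "cabin" has 3 gaps, "age" has 1, "fare" has none.
toy = pd.DataFrame({
    "fare": [7.25, 71.28, 7.92, 53.10],
    "age": [22.0, None, 26.0, 35.0],
    "cabin": [None, "C85", None, None],
})
print(missing_summary(toy))
```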

**`ae.duplicates()`**  
Evaluates the dataset for completely identical rows and returns the count and percentage.
```python
In [5]: ae.duplicates(df)
Out[5]: {'count': 0, 'percentage': 0.0}
```
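
The same kind of summary can be reproduced with `DataFrame.duplicated()`, which marks every row that exactly repeats an earlier one. This is a minimal sketch under that assumption; the helper name is hypothetical, and whether AutoEDA+ counts the first occurrence as a duplicate is not stated here.

```python
import pandas as pd


def duplicate_summary(df):
    """Count rows that exactly repeat an earlier row (first occurrence kept)."""
    count = int(df.duplicated().sum())
    percentage = round(count / len(df) * 100, 2) if len(df) else 0.0
    return {"count": count, "percentage": percentage}


# Made-up frame in which the third row repeats the first.
people = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "age": [31, 45, 31]})
print(duplicate_summary(people))
```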

**`ae.dtypes()`**  
Returns a pandas DataFrame listing the pandas dtype of each column.
```python
In [6]: ae.dtypes(df).head(2)
Out[6]: 
             type
PassengerId  int64
Survived     int64
```

**`ae.statistics()`**  
Computes core descriptive statistics (mean, standard deviation, min, max) for all numeric columns.
```python
In [7]: ae.statistics(df).head(2)
Out[7]: 
             mean        std   min   max
PassengerId  446.00   257.35     1   891
Survived       0.38     0.48     0     1
```

**`ae.outliers()`**  
Detects statistical outliers in numeric columns using the IQR method and returns their frequencies.
```python
In [8]: ae.outliers(df).head(2)
Out[8]: 
       outlier_count  outlier_pct
Fare             116        13.02
SibSp             46         5.16
```
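
The IQR method flags values that fall outside the fences `Q1 - k * IQR` and `Q3 + k * IQR`. The sketch below uses the conventional multiplier `k = 1.5` and computes percentages over non-missing values. Both choices are assumptions, as is the helper name; the library may use different settings.

```python
import pandas as pd


def iqr_outlier_summary(df, k=1.5):
    """Per-column outlier count and percentage using IQR fences."""
    rows = {}
    for column in df.select_dtypes(include="number").columns:
        values = df[column].dropna()
        if values.empty:
            continue  # nothing to evaluate in an all-missing column
        q1, q3 = values.quantile(0.25), values.quantile(0.75)
        iqr = q3 - q1
        mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
        rows[column] = {
            "outlier_count": int(mask.sum()),
            "outlier_pct": round(float(mask.mean()) * 100, 2),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up data: 100 sits far above the upper fence; "label" is non-numeric.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    "label": list("abcdefghij"),
})
print(iqr_outlier_summary(toy))
```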

**`ae.correlation()`**  
Constructs a Pearson correlation matrix DataFrame mapping relationships between numeric variables.
```python
In [9]: ae.correlation(df).iloc[:2, :2]
Out[9]: 
             PassengerId  Survived
PassengerId     1.000000 -0.005007
Survived       -0.005007  1.000000
```
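
A matrix like this can be obtained from pandas directly by restricting the frame to numeric columns and calling `DataFrame.corr` with `method="pearson"`. The wrapper below is a hypothetical helper for illustration only.

```python
import pandas as pd


def pearson_matrix(df):
    """Pearson correlation between every pair of numeric columns."""
    return df.select_dtypes(include="number").corr(method="pearson")


# Made-up data: y rises with x, z falls with x, "label" is ignored.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [4, 3, 2, 1],
    "label": list("abcd"),
})
print(pearson_matrix(toy))
```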

### Insights & Recommendations

**`ae.insights()`**  
Turns dataset statistics and detected anomalies into a list of human-readable insight strings.
```python
In [10]: ae.insights(df)[0]
Out[10]: "Dataset has 891 rows and 12 columns."
```

**`ae.recommend()`**  
Generates a prioritized list of actionable data cleaning steps (e.g. dropping columns, imputing values).
```python
In [11]: ae.recommend(df)[0]
Out[11]: "[HIGH] Drop 'Cabin' due to excessive missing values (77.10%)."
```

### Dataset Comparison (Data Drift)

**`ae.compare()`**  
Cross-references two datasets to detect schema mismatches and distribution drift (Kolmogorov-Smirnov test).
```python
In [12]: comp = ae.compare(train_df, test_df)
    ...: print(comp.insights[0])
    ...: print([c.column for c in comp.drifted_columns])
Drift detected in 'Fare' (KS p-value = 0.0031).
['Fare', 'Age']
```
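
For a single numeric column, a drift check based on the two-sample Kolmogorov-Smirnov test can be sketched with `scipy.stats.ks_2samp`. The 0.05 significance level, the helper name, and the synthetic samples are assumptions for illustration; how AutoEDA+ chooses its threshold is not documented here.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift(train_values, test_values, alpha=0.05):
    """Two-sample KS test; flags drift when the p-value is below alpha."""
    result = ks_2samp(train_values, test_values)
    return {
        "statistic": float(result.statistic),
        "p_value": float(result.pvalue),
        "drifted": bool(result.pvalue < alpha),
    }


# Synthetic samples: the second one has its mean shifted by one
# standard deviation relative to the first.
rng = np.random.default_rng(0)
train = rng.normal(loc=30.0, scale=10.0, size=500)
shifted = rng.normal(loc=40.0, scale=10.0, size=500)

print(ks_drift(train, shifted)["drifted"])
print(ks_drift(train, train)["drifted"])
```

Comparing a sample with itself gives a KS statistic of 0, so the second call reports no drift.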

---

## Visualizations

When called in a Jupyter Notebook, these visualization functions render interactive Plotly `graph_objects.Figure` charts inline.

### Standard EDA Plots

**`ae.histogram()`**  
Displays the distribution of a numeric column with a Kernel Density Estimate (KDE) overlay.
```python
In [13]: ae.histogram(df, 'Age')
Out[13]:
```
![Histogram Example](https://github.com/arijit1204/autoeda/raw/main/docs/histogram.png)

**`ae.boxplot()`**  
Visualizes data dispersion and isolates statistical outliers for a specified numeric column.
```python
In [14]: ae.boxplot(df, 'Fare')
Out[14]:
```
![Boxplot Example](https://github.com/arijit1204/autoeda/raw/main/docs/boxplot.png)

**`ae.heatmap()`**  
Renders an interactive correlation matrix heatmap connecting all numeric variables.
```python
In [15]: ae.heatmap(df)
Out[15]:
```
![Heatmap Example](https://github.com/arijit1204/autoeda/raw/main/docs/heatmap.png)

**`ae.missing_heatmap()`**  
Generates a visual nullity matrix indicating precisely where missing values occur across rows.
```python
In [16]: ae.missing_heatmap(df)
Out[16]:
```
![Missing Heatmap Example](https://github.com/arijit1204/autoeda/raw/main/docs/missing_heatmap.png)

### Dataset Comparison Plots

**`ae.compare_histogram()`**  
Overlays two distributions to visually compare a numeric column between a training and testing set.
```python
In [17]: ae.compare_histogram(train_df, test_df, 'Fare')
Out[17]:
```
![Compare Histogram Example](https://github.com/arijit1204/autoeda/raw/main/docs/compare_histogram.png)

**`ae.compare_countplot()`**  
Plots category frequencies side by side as grouped bars to compare a categorical column across two datasets.
```python
In [18]: ae.compare_countplot(train_df, test_df, 'Pclass')
Out[18]:
```
![Compare Countplot Example](https://github.com/arijit1204/autoeda/raw/main/docs/compare_countplot.png)

---

## Dependencies

| Package | Version | Purpose |
|---|---|---|
| pandas | ≥ 1.3 | DataFrame operations |
| numpy | ≥ 1.21 | Numerical computations |
| scipy | ≥ 1.7 | KDE, skewness, kurtosis |
| plotly | ≥ 5.0 | Interactive Visualizations |

---

## License

MIT License — see [LICENSE](LICENSE).
