Metadata-Version: 2.4
Name: dissectml
Version: 0.1.2
Summary: The missing middle layer between EDA and AutoML - deep data understanding meets model comparison
Project-URL: Homepage, https://github.com/rupeshbharambe24/dissectML
Project-URL: Documentation, https://dissectml.readthedocs.io
Project-URL: Repository, https://github.com/rupeshbharambe24/dissectML
Project-URL: Bug Tracker, https://github.com/rupeshbharambe24/dissectML/issues
Author-email: Rupesh Bharambe <rupeshbharambe2004@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: automl,data-science,eda,machine-learning,model-comparison
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: jinja2>=3.1
Requires-Dist: joblib>=1.3
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: plotly>=5.18
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: statsmodels>=0.14
Requires-Dist: tqdm>=4.65
Provides-Extra: boost
Requires-Dist: catboost>=1.2; extra == 'boost'
Requires-Dist: lightgbm>=4.0; extra == 'boost'
Requires-Dist: xgboost>=2.0; extra == 'boost'
Provides-Extra: dev
Requires-Dist: catboost>=1.2; extra == 'dev'
Requires-Dist: hatchling; extra == 'dev'
Requires-Dist: kaleido>=0.2; extra == 'dev'
Requires-Dist: lightgbm>=4.0; extra == 'dev'
Requires-Dist: mkdocs-material>=9.5; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: optuna>=3.4; extra == 'dev'
Requires-Dist: polars>=0.20; extra == 'dev'
Requires-Dist: pre-commit>=3.6; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest-xdist>=3.5; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.3; extra == 'dev'
Requires-Dist: shap>=0.44; extra == 'dev'
Requires-Dist: weasyprint>=60.0; extra == 'dev'
Requires-Dist: xgboost>=2.0; extra == 'dev'
Provides-Extra: explain
Requires-Dist: shap>=0.44; extra == 'explain'
Provides-Extra: full
Requires-Dist: catboost>=1.2; extra == 'full'
Requires-Dist: kaleido>=0.2; extra == 'full'
Requires-Dist: lightgbm>=4.0; extra == 'full'
Requires-Dist: optuna>=3.4; extra == 'full'
Requires-Dist: polars>=0.20; extra == 'full'
Requires-Dist: shap>=0.44; extra == 'full'
Requires-Dist: weasyprint>=60.0; extra == 'full'
Requires-Dist: xgboost>=2.0; extra == 'full'
Provides-Extra: report
Requires-Dist: kaleido>=0.2; extra == 'report'
Requires-Dist: weasyprint>=60.0; extra == 'report'
Provides-Extra: scale
Requires-Dist: optuna>=3.4; extra == 'scale'
Requires-Dist: polars>=0.20; extra == 'scale'
Description-Content-Type: text/markdown

<div align="center">

# DissectML

[![PyPI version](https://img.shields.io/pypi/v/dissectml)](https://pypi.org/project/dissectml/)
[![Python](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue)](https://github.com/rupeshbharambe24/dissectML)
[![CI](https://github.com/rupeshbharambe24/dissectML/actions/workflows/ci.yml/badge.svg)](https://github.com/rupeshbharambe24/dissectML/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](https://github.com/rupeshbharambe24/dissectML/blob/master/LICENSE)

**The missing middle layer between EDA and AutoML.**

*Deep data understanding meets model comparison -- the full journey from
"What is my data?" to "Which model is best and WHY?", in as few as 3 function calls.*

[Quick Start](#quick-start) | [Features](#key-features) | [Installation](#installation) | [Documentation](https://dissectml.readthedocs.io) | [Contributing](#contributing)

</div>

<p align="center">
  <img src="docs/assets/report_preview.png" alt="DissectML HTML Report Preview" width="90%">
</p>

---

## Why DissectML?

Most data science workflows look the same: run YData Profiling (formerly
pandas-profiling) for a quick summary, switch to scikit-learn for preprocessing,
try a handful of models with PyCaret or LazyPredict, then stitch SHAP plots
together in a notebook. By the time you have answers, you have imported three to
five separate libraries, written hundreds of lines of glue code, and lost the
thread that connects your data findings to your modelling decisions.

DissectML (`dissectml`) closes that gap. It is a single, unified pipeline that
runs deep exploratory data analysis, pre-model intelligence checks (leakage
detection, readiness scoring, algorithm recommendations), a multi-model battle
arena, cross-model statistical comparison, and publication-ready HTML report
generation -- all driven by a consistent API. Three function calls replace three
notebooks.

---

## Key Features

### Exploratory Data Analysis

- **Unified correlation matrix** --
  Pearson, Cramér's V, and point-biserial correlation computed together and
  rendered in a single heatmap, regardless of column types.

- **Missing data intelligence** --
  Little's MCAR test plus MAR/MNAR classification, with automatic imputation
  strategy recommendations tailored to each column.

- **Statistical test battery** --
  Normality, independence, and variance tests auto-selected based on data type
  and sample size. No manual test selection required.

- **Auto cluster discovery** --
  K-Means and DBSCAN with automatically tuned parameters (elbow method, silhouette
  scoring) to surface natural groupings in your data.

- **Feature interaction and non-linearity detection** --
  Identifies non-linear relationships and interaction effects that linear models
  would miss.
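
To make one ingredient of the unified correlation matrix concrete, here is a
minimal standard-library sketch of Cramér's V for two categorical columns. It
is an illustration of the statistic only, not DissectML's implementation:
`cramers_v` is a hypothetical helper, the data is a toy example, and no bias
correction is applied.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of category labels."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # observed cell counts
    row = Counter(xs)             # marginal counts for the first column
    col = Counter(ys)             # marginal counts for the second column
    # Pearson chi-squared statistic over every cell of the contingency table.
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    k = min(len(row), len(col)) - 1
    if k == 0:
        return 0.0  # a constant column carries no association
    return sqrt(chi2 / (n * k))


# Toy data in which the two labels are perfectly associated.
sex = ["m", "m", "f", "f", "m", "f", "m", "f"]
survived = ["no", "no", "yes", "yes", "no", "yes", "no", "yes"]
print(cramers_v(sex, survived))  # 1.0
```

The value ranges from 0 (no association) to 1 (one column determines the
other), which is what lets it share a heatmap with absolute Pearson and
point-biserial coefficients.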

### Pre-Model Intelligence

- **Target leakage detection** --
  Four-pronged analysis covering correlation leakage, mutual information leakage,
  temporal leakage, and derived-feature leakage.

- **Data readiness score** --
  A 0-100 composite score with waterfall breakdown showing exactly what is
  dragging your data quality down (missing values, cardinality, class balance,
  outliers, and more).

- **Algorithm recommendations** --
  A rules engine that maps your EDA findings (data size, feature types,
  non-linearity, multicollinearity) to a ranked list of recommended model
  families.

### Model Comparison

- **36-model battle arena** --
  19 classifiers and 17 regressors (plus optional XGBoost, LightGBM, and
  CatBoost) trained and evaluated with parallel cross-validation in a single
  call.

- **Cross-model error analysis** --
  Identifies the hardest samples, builds a model complementarity matrix, and
  highlights where ensemble strategies could improve performance.

- **Statistical significance testing** --
  McNemar's test for classifiers and corrected repeated k-fold paired t-test for
  regressors, so you know which performance differences are real.
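
The two significance tests named above can be sketched with the standard
library alone. This is a hedged illustration of the statistics, not DissectML's
code: the function names are hypothetical, McNemar's test is shown in its exact
binomial form, and the corrected t-test uses the Nadeau-Bengio variance
correction, which is one common form of the corrected repeated k-fold test.
Only the t statistic is computed; compare it against a t distribution with
`len(diffs) - 1` degrees of freedom to obtain a p-value.

```python
from math import comb, sqrt
from statistics import mean, variance


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same samples."""
    triples = list(zip(y_true, pred_a, pred_b))
    # Discordant pairs: exactly one of the two models is correct.
    b = sum(pa == t and pb != t for t, pa, pb in triples)
    c = sum(pa != t and pb == t for t, pa, pb in triples)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    # Under the null hypothesis the discordant pairs split like fair coin flips.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected t statistic for per-fold score differences between two models."""
    k = len(diffs)
    # The n_test / n_train term inflates the variance to account for the
    # overlap between training sets across folds and repeats.
    return mean(diffs) / sqrt((1 / k + n_test / n_train) * variance(diffs))


y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
model_a = list(y_true)   # correct on every sample
model_b = [0] * 10       # wrong on the six positive samples
print(mcnemar_exact(y_true, model_a, model_b))  # 0.03125
```

Without the correction, repeated k-fold scores look more independent than they
are and the plain paired t-test reports too many significant differences.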

### Reporting

- **Publication-ready HTML reports** --
  Interactive Plotly charts, narrative summaries, and structured sections covering
  every stage of the pipeline, exportable as a single self-contained HTML file.

---

## Quick Start

```python
import dissectml as dml

# Load a built-in dataset
df = dml.load_titanic()
```

### 1. Deep Exploratory Data Analysis

```python
eda = dml.explore(df)

eda.overview.show()           # Shape, dtypes, memory usage
eda.correlations.heatmap()    # Unified correlation matrix
eda.missing.patterns()        # Missing data analysis with MCAR test
eda.outliers.plot()           # Outlier detection across numeric columns
eda.clusters.summary()        # Auto-discovered clusters
```

### 2. Model Battle Arena

```python
models = dml.battle(df, target="survived")

models.leaderboard()          # Ranked models with CV scores
models.timing()               # Training time comparison
```

### 3. Full Pipeline (EDA + Intelligence + Battle + Compare + Report)

```python
report = dml.analyze(df, target="survived", task="classification")

report.summary()              # High-level findings
report.export("report.html")  # Self-contained interactive report
```

The `analyze` function runs all five stages end-to-end: EDA, intelligence
checks, model training, cross-model comparison, and report generation. For
fine-grained control, call each stage individually.

---

## Installation

### Core package

```bash
pip install dissectml
```

### Optional extras

```bash
pip install "dissectml[boost]"     # XGBoost, LightGBM, CatBoost
pip install "dissectml[explain]"   # SHAP explainability
pip install "dissectml[report]"    # PDF export (WeasyPrint + Kaleido)
pip install "dissectml[scale]"     # Polars backend + Optuna tuning
pip install "dissectml[full]"      # Everything above
```
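
To see which optional backends are importable in the current environment, a
small standard-library check along these lines can help. The `EXTRAS` mapping
mirrors the extras declared in the package metadata and assumes each
distribution is imported under its own name; `missing_modules` is a
hypothetical helper, not part of DissectML.

```python
from importlib.util import find_spec

# Import names grouped by extra (assumed equal to the distribution names).
EXTRAS = {
    "boost": ["xgboost", "lightgbm", "catboost"],
    "explain": ["shap"],
    "report": ["weasyprint", "kaleido"],
    "scale": ["polars", "optuna"],
}


def missing_modules(extra):
    """Return the modules of `extra` that cannot be found in this environment."""
    return [name for name in EXTRAS[extra] if find_spec(name) is None]


for extra in EXTRAS:
    missing = missing_modules(extra)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"{extra:8s} {status}")
```

`find_spec` only locates a module without importing it, so the check stays fast
even when heavy packages are installed.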

### Development

```bash
git clone https://github.com/rupeshbharambe24/dissectML.git
cd dissectML
pip install -e ".[dev]"
```

**Requirements:** Python 3.10 or later.

---

## Comparison with Alternatives

| Feature                    | DissectML | PyCaret | LazyPredict | YData Profiling |
|----------------------------|:---------:|:-------:|:-----------:|:---------------:|
| Deep EDA                   | Yes       | --      | --          | Yes             |
| Statistical Tests          | Yes       | --      | --          | Partial         |
| Model Training             | Yes       | Yes     | Yes         | --              |
| Model Comparison           | Yes       | Yes     | Partial     | --              |
| SHAP Analysis              | Yes       | Yes     | --          | --              |
| Interactive Reports        | Yes       | --      | --          | Yes             |
| Target Leakage Detection   | Yes       | --      | --          | --              |
| Data Readiness Score       | Yes       | --      | --          | --              |

Of the tools compared here, DissectML is the only one that covers the full
spectrum from statistical data profiling through model comparison with a single,
coherent API. The others excel at individual stages but leave you to bridge the
gaps yourself.

---

## Architecture

DissectML is organized into five pipeline stages, each backed by a dedicated
subpackage:

```
Stage 1: EDA            dissectml.eda           9 sub-modules (overview, correlations,
                                                missing, outliers, univariate, bivariate,
                                                clusters, interactions, statistical_tests)

Stage 2: Intelligence   dissectml.intelligence  Leakage detection, multicollinearity,
                                                feature importance, readiness scoring,
                                                algorithm recommendations

Stage 3: Battle         dissectml.battle        Model catalog, preprocessing pipeline,
                                                parallel CV runner, hyperparameter tuner

Stage 4: Compare        dissectml.compare       Metrics tables, significance tests,
                                                error analysis, Pareto frontiers,
                                                ROC/PR curves, SHAP comparison

Stage 5: Report         dissectml.report        Jinja2 HTML builder, narrative generator,
                                                section renderers, PDF export
```

---

## Configuration

DissectML uses a global configuration object for controlling default behavior:

```python
import dissectml as dml

# View current config
print(dml.get_config())

# Temporarily override settings for a single run
df = dml.load_housing()    # bundled regression dataset (house prices)
with dml.config_context(n_jobs=4, cv_folds=10):
    report = dml.analyze(df, target="price")
```
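
For readers unfamiliar with the pattern, a temporary-override context manager
can be built as in the sketch below. This is a generic illustration, not
DissectML's source: the `_CONFIG` store and its values `n_jobs=1` and
`cv_folds=5` are placeholders chosen for the example, and the two functions
merely reuse the names shown above.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a library-wide settings store (placeholder values).
_CONFIG = {"n_jobs": 1, "cv_folds": 5}


def get_config():
    """Return a copy so callers cannot mutate the global state by accident."""
    return dict(_CONFIG)


@contextmanager
def config_context(**overrides):
    """Apply `overrides` inside the with-block, then restore the old values."""
    unknown = set(overrides) - set(_CONFIG)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    previous = get_config()
    _CONFIG.update(overrides)
    try:
        yield
    finally:
        # Restore the previous settings even if the block raised.
        _CONFIG.clear()
        _CONFIG.update(previous)


with config_context(n_jobs=4, cv_folds=10):
    print(get_config())  # {'n_jobs': 4, 'cv_folds': 10}
print(get_config())      # {'n_jobs': 1, 'cv_folds': 5}
```

Restoring in a `finally` clause is the important design choice: an exception
inside the block cannot leave the overrides active for later calls.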

---

## Built-in Datasets

Two datasets are bundled for quick experimentation:

```python
df_titanic = dml.load_titanic()    # Binary classification (survival)
df_housing = dml.load_housing()    # Regression (house prices)
```

---

## Documentation

Full documentation, API reference, and tutorials are available at:

**[https://dissectml.readthedocs.io](https://dissectml.readthedocs.io)**

---

## Contributing

Contributions are welcome. Please see
[CONTRIBUTING.md](https://github.com/rupeshbharambe24/dissectML/blob/master/CONTRIBUTING.md)
for guidelines on setting up a development environment, running the test suite,
and submitting pull requests.

If you find a bug or have a feature request, please open an issue on the
[GitHub issue tracker](https://github.com/rupeshbharambe24/dissectML/issues).

---

## License

DissectML is released under the [MIT License](https://github.com/rupeshbharambe24/dissectML/blob/master/LICENSE).

---

<div align="center">

**Built by [Rupesh Bharambe](https://github.com/rupeshbharambe24)**

</div>
