Metadata-Version: 2.2
Name: megaprofiler
Version: 1.0.0
Summary: megaprofiler is a highly customizable and extensible data profiling library designed to help data scientists and engineers understand their datasets before performing analysis or building models.
Home-page: https://github.com/sidkris/megaprofiler
Author: Siddharth Krishnan
Author-email: sid@sidkrishnan.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: numpy
Requires-Dist: tabulate
Requires-Dist: scikit-learn
Requires-Dist: imbalanced-learn
Requires-Dist: statsmodels
Requires-Dist: scipy
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

[![Downloads](https://img.shields.io/pypi/dm/megaprofiler)](https://pypi.org/project/megaprofiler/)

`MegaProfiler` is an easy-to-use, highly customizable Python library for profiling and analyzing datasets. It reports on a dataset's structure, distributions, missing values, anomalies, and more. With built-in support for data validation, anomaly detection, and data drift tracking, it is aimed at data scientists and engineers who want to automate exploratory data analysis (EDA) and quality checks on large datasets.

Other profiling libraries exist, such as `pandas-profiling` (since renamed `ydata-profiling`). **MegaProfiler** emphasizes extensibility, scalability, and built-in data validation and anomaly detection, which makes it a good fit for data preprocessing and ETL pipelines.

---

## Key Features

- **Automatic Data Summaries**:
  - Generate statistical summaries, distributions, unique values, missing values, and data types for each column.
  
- **Anomaly Detection**:
  - Flag unusual distributions, outliers, or inconsistent data using z-score, IQR, or machine learning techniques (e.g., Isolation Forest).
  
- **Data Validation**:
  - Set custom validation rules (e.g., no missing values in specific columns, data type constraints) and receive alerts for rule violations.
  
- **Custom Reports**:
  - Generate configurable reports in various formats (e.g., HTML, PDF), with customizable thresholds for anomalies.
  
- **Data Drift Detection**:
  - Track changes in data distributions over time to detect shifts in data quality or content, useful for continuous monitoring of data pipelines.

- **Multicollinearity and Correlation Analysis**:
  - Perform advanced correlation analysis and detect multicollinearity with Variance Inflation Factor (VIF).

- **Time Series Analysis**:
  - Decompose and analyze time series data to identify trends, seasonality, and residuals.
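
The z-score and IQR rules named under **Anomaly Detection** can be sketched with NumPy alone. This is an illustration of the two techniques, not `MegaProfiler`'s own API: the function names `zscore_outliers` and `iqr_outliers`, their parameters, and the sample data are assumptions made for this example.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Return a boolean mask of values whose absolute z-score exceeds `threshold`."""
    arr = np.asarray(values, dtype=float)
    std = arr.std(ddof=0)  # population standard deviation
    if std == 0:
        # A constant column has no spread, so nothing can be flagged.
        return np.zeros(arr.shape, dtype=bool)
    return np.abs((arr - arr.mean()) / std) > threshold


def iqr_outliers(values, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(arr, [25, 75])
    iqr = q3 - q1
    return (arr < q1 - k * iqr) | (arr > q3 + k * iqr)


# Hypothetical sample: eight ordinary readings and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 95]

iqr_mask = iqr_outliers(data)                       # flags only the last value
z_mask_default = zscore_outliers(data)              # flags nothing at threshold 3.0
z_mask_loose = zscore_outliers(data, threshold=2.5) # flags only the last value
```

On a sample this small the extreme value inflates the standard deviation, so its z-score stays below 3 and the default threshold misses it, while the IQR rule still flags it. That difference is one reason a profiler might offer both methods.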

---

## Benefits

`MegaProfiler` is useful for:
- **Data Scientists and Engineers**: Automate exploratory data analysis, saving time and reducing manual inspection of large datasets.
- **ETL Pipelines**: Detect issues such as missing data, outliers, or data drift, and check the quality of data moving through your pipeline.
- **Data Quality Assurance**: Validate the integrity of your data before model training or analysis, reducing the risk of poor model performance caused by flawed data.
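
As a rough illustration of what a data drift check in a pipeline can look like, the sketch below computes the Population Stability Index (PSI) between a reference sample and a newer sample of the same column. PSI is one common drift metric chosen for this example; the source does not say which method `MegaProfiler` uses, and the function name `population_stability_index`, its parameters, and the synthetic data are all assumptions.

```python
import numpy as np


def population_stability_index(reference, current, bins=10, eps=1e-6):
    """Compute PSI between two 1-D samples, binning by reference quantiles."""
    ref = np.asarray(reference, dtype=float)
    cur = np.asarray(current, dtype=float)

    # Interior bin edges come from the reference quantiles; values below the
    # first edge or above the last edge fall into the outermost bins.
    interior = np.quantile(ref, np.linspace(0, 1, bins + 1)[1:-1])

    ref_counts = np.bincount(np.searchsorted(interior, ref, side="right"), minlength=bins)
    cur_counts = np.bincount(np.searchsorted(interior, cur, side="right"), minlength=bins)

    # Clip to a small floor so empty bins do not produce log(0) or division by zero.
    ref_frac = np.clip(ref_counts / ref.size, eps, None)
    cur_frac = np.clip(cur_counts / cur.size, eps, None)

    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


# Synthetic data: the "current" batch has its mean shifted by 1.5.
rng = np.random.default_rng(0)
reference_batch = rng.normal(loc=0.0, scale=1.0, size=5000)
shifted_batch = rng.normal(loc=1.5, scale=1.0, size=5000)

psi_same = population_stability_index(reference_batch, reference_batch)
psi_shifted = population_stability_index(reference_batch, shifted_batch)
```

Comparing a sample with itself gives a PSI of zero, and the shifted batch gives a clearly larger value. A pipeline would compare the score against a threshold of its own choosing; values around 0.1 to 0.25 are often quoted as rules of thumb, but the right cutoff depends on the data.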

---

## Installation

You can install `MegaProfiler` using pip:

```bash
pip install megaprofiler
