Metadata-Version: 2.4
Name: jan883-eda
Version: 0.2.2
Summary: Library that provides helper functions for data science preprocessing and exploratory data analysis.
Project-URL: Repository, https://github.com/janduplessis883/jan883-eda
Project-URL: Issues, https://github.com/janduplessis883/jan883-eda/issues
Author-email: Jan du Plessis <drjanduplessis@icloud.com>
License-Expression: MIT
Requires-Python: >=3.12
Requires-Dist: imbalanced-learn>=0.13.0
Requires-Dist: ipython>=9.0.2
Requires-Dist: joblib>=1.4.2
Requires-Dist: matplotlib>=3.10.1
Requires-Dist: numpy>=2.2.4
Requires-Dist: pandas>=2.2.3
Requires-Dist: scikit-learn>=1.6.1
Requires-Dist: scipy>=1.15.2
Requires-Dist: seaborn>=0.13.2
Requires-Dist: setuptools>=69
Requires-Dist: statsmodels>=0.14.4
Requires-Dist: tqdm>=4.67.1
Requires-Dist: xgboost>=3.0.0
Requires-Dist: yellowbrick>=1.5
Description-Content-Type: text/markdown

# jan883-eda

A collection of utility functions for data analysis, preprocessing, model evaluation, and clustering in Python. Designed to streamline the workflow of data scientists and machine learning practitioners.

## Installation

Install the package via pip:

```bash
pip install jan883-eda
```

For local development from this repository:

```bash
uv sync
uv run python -c "import jan883_eda; print(jan883_eda.__all__)"
```

## Usage

Below are examples demonstrating how to use some of the key functions in the package. These examples assume you have a DataFrame (`your_dataframe`) or feature matrix (`X`) and target vector (`y`) ready.
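For a fully self-contained run, the placeholders can be stood in with a small synthetic dataset. The column names below are purely illustrative and not required by the library:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `your_dataframe`; any real dataset works the same way.
rng = np.random.default_rng(42)
your_dataframe = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "income": rng.normal(50_000, 12_000, size=200).round(2),
    "segment": rng.choice(["a", "b", "c"], size=200),
    "churned": rng.integers(0, 2, size=200),
})

# Numeric feature matrix and target vector for the modelling examples.
X = your_dataframe.drop(columns="churned").select_dtypes("number")
y = your_dataframe["churned"]
```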

### Exploratory Data Analysis (EDA)

- **Inspect DataFrame:**

```python
from jan883_eda import inspect_df

inspect_df(your_dataframe)
```

This displays the DataFrame's head, shape, summary statistics, NaN counts, and duplicate rows.

- **Column Summary:**

```python
from jan883_eda import column_summary

summary = column_summary(your_dataframe)
print(summary)
```

- **Data Quality Report:**

```python
from jan883_eda import data_quality_report

quality = data_quality_report(your_dataframe)
print(quality)
```

### Data Preprocessing

- **Update Column Names:**

```python
from jan883_eda import update_column_names

updated_df = update_column_names(your_dataframe)
```

- **Label Encoding:**

```python
from jan883_eda import label_encode_column

encoded_df = label_encode_column(your_dataframe, 'column_name')
```

- **Train-Test Safe Preprocessor:**

```python
from jan883_eda import fit_transform_preprocessor

preprocessor, X_train_ready, X_test_ready = fit_transform_preprocessor(X_train, X_test)
```
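The split arguments suggest the usual fit-on-train, transform-on-test pattern for avoiding data leakage. A minimal NumPy sketch of that pattern, independent of the library's actual implementation:

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

# Fit the scaling statistics on the training split only...
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

# ...then apply those same statistics to both splits, so no information
# from the test split leaks into the preprocessing step.
X_train_ready = (X_train - mean) / std
X_test_ready = (X_test - mean) / std
```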

### Model Evaluation

- **Evaluate Classification Model:**

```python
from jan883_eda import evaluate_classification_model
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
evaluate_classification_model(model, X, y)
```

- **Test Multiple Regression Models:**

```python
from jan883_eda import best_regression_models

results = best_regression_models(X, y)
print(results)
```

- **Cross-Validated Model Comparison:**

```python
from jan883_eda import compare_classifiers_cv, compare_regressors_cv

classification_results = compare_classifiers_cv(X, y, scoring="f1_weighted")
regression_results = compare_regressors_cv(X, y, scoring="r2")
```
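These helpers presumably average scores over cross-validation folds. The underlying fold-splitting idea can be sketched with NumPy alone (the library's own strategy may differ, e.g. by shuffling or stratifying):

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    indices = np.arange(n_samples)
    for fold in np.array_split(indices, k):
        # Train on everything outside the held-out fold.
        yield np.setdiff1d(indices, fold), fold

splits = list(kfold_indices(10, 5))
```

Each sample appears in exactly one test fold, so per-fold scores can be averaged into a single comparison number per model.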

### Diagnostics

- **Classification and Regression Diagnostics:**

```python
from jan883_eda import (
    class_balance_report,
    classification_metrics_table,
    plot_confusion_matrix,
    regression_metrics,
    plot_regression_diagnostics,
)

balance = class_balance_report(y)
metrics = classification_metrics_table(y_test, y_pred)
plot_confusion_matrix(y_test, y_pred)
regression_summary = regression_metrics(y_test, y_pred)
residuals = plot_regression_diagnostics(y_test, y_pred)
```

### Feature Selection

- **Feature Ranking and Pruning:**

```python
from jan883_eda import (
    low_variance_features,
    correlation_prune,
    mutual_information_ranking,
    permutation_importance_table,
)

low_variance = low_variance_features(X)
correlated = correlation_prune(X, threshold=0.9)
mi_scores = mutual_information_ranking(X, y, problem_type="classification")
importance = permutation_importance_table(fitted_model, X_test, y_test)
```

### Clustering

- **Evaluate and Profile Clusters:**

```python
from jan883_eda import evaluate_kmeans_clusters, cluster_profile, pca_cluster_projection

k_scores = evaluate_kmeans_clusters(X_scaled, k_range=range(2, 10))
profiles = cluster_profile(your_dataframe, labels)
projection = pca_cluster_projection(X_scaled, labels)
```

### Time Series

- **Analyze Stationarity:**

```python
from jan883_eda import analyze_stationarity

stationary_series = your_time_series.diff().dropna()
analyze_stationarity(stationary_series, alpha=0.05, lags=15)
```

This runs an Augmented Dickey-Fuller test, prints a plain-English stationarity interpretation, and plots ACF/PACF charts to help inspect autoregressive and moving-average structure.

- **Forecasting Helpers:**

```python
from jan883_eda import (
    stationarity_report,
    plot_rolling_statistics,
    seasonal_decomposition_plot,
    make_lag_features,
    time_series_train_test_split,
    forecast_metrics,
)

report = stationarity_report(your_time_series)
rolling = plot_rolling_statistics(your_time_series, window=12)
decomposition = seasonal_decomposition_plot(your_time_series, period=12)
lagged = make_lag_features(your_time_series, lags=(1, 2, 3), rolling_windows=(7, 14))
train_ts, test_ts = time_series_train_test_split(lagged, test_size=0.2)
scores = forecast_metrics(y_true, y_pred)
```
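`forecast_metrics` presumably expects aligned arrays of actual and predicted values. As an illustration of where `y_true` and `y_pred` come from, here is a naive last-value baseline with MAE and RMSE computed by hand (the exact metric set returned by the library is an assumption):

```python
import numpy as np

series = np.array([112.0, 118.0, 132.0, 129.0, 121.0,
                   135.0, 148.0, 148.0, 136.0, 119.0])
train, test = series[:8], series[8:]

# Naive one-step-ahead forecast: predict each point with the previous observation.
y_true = test
y_pred = series[7:9]  # last training value, then the first test value

mae = float(np.mean(np.abs(y_true - y_pred)))
rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

Any real forecasting model can be benchmarked against this baseline: if it cannot beat "repeat the last value", the extra complexity is not paying off.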

### Drift and Pipelines

- **Train-Test Drift and Production Pipelines:**

```python
from jan883_eda import (
    compare_train_test_distributions,
    build_model_pipeline,
    validate_prediction_columns,
    save_pipeline,
    load_pipeline,
)

drift = compare_train_test_distributions(X_train, X_test)
pipeline = build_model_pipeline(X_train, estimator)
pipeline.fit(X_train, y_train)
validated = validate_prediction_columns(new_data, X_train.columns)
save_pipeline(pipeline, "model.joblib")
loaded_pipeline = load_pipeline("model.joblib")
```
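`population_stability_index` (listed in the functions overview) presumably implements the standard PSI formula; a NumPy sketch under the usual convention of binning on the training distribution (an assumption about the library's approach):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference and a new sample.

    Bin edges come from the reference sample; values of `actual` outside
    that range are dropped, and a small epsilon guards empty bins.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6
    e_pct = np.clip(e_pct, eps, None)
    a_pct = np.clip(a_pct, eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift.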

## Functions Overview

The package provides a variety of functions grouped by their purpose:

- **EDA Functions:** `inspect_df`, `column_summary`, `univariate_analysis`, and more.
- **Data Quality:** `data_quality_report`, `duplicate_summary`.
- **Data Preprocessing:** `update_column_names`, `label_encode_column`, `one_hot_encode_column`, `build_preprocessor`, `fit_transform_preprocessor`, and more.
- **Model Evaluation:** `evaluate_classification_model`, `evaluate_regression_model`, `best_classification_models`, `best_regression_models`, `compare_classifiers_cv`, `compare_regressors_cv`, and more.
- **Diagnostics:** `class_balance_report`, `classification_metrics_table`, `plot_confusion_matrix`, `regression_metrics`, `plot_regression_diagnostics`, and more.
- **Feature Selection:** `low_variance_features`, `correlation_prune`, `mutual_information_ranking`, `permutation_importance_table`.
- **Clustering Analysis:** `plot_elbow_method`, `plot_intercluster_distance`, `plot_silhouette_visualizer`, `evaluate_kmeans_clusters`, `cluster_profile`, and more.
- **Time Series:** `analyze_stationarity`, `stationarity_report`, `make_lag_features`, `forecast_metrics`, and more.
- **Drift and Pipelines:** `compare_train_test_distributions`, `population_stability_index`, `build_model_pipeline`, `save_pipeline`, `load_pipeline`.

For a complete list of functions and their detailed documentation, refer to the docstrings within the source code.

## Requirements

The following dependencies are required to use the package:

- Python >= 3.12
- pandas >= 2.2.3
- numpy >= 2.2.4
- scipy >= 1.15.2
- matplotlib >= 3.10.1
- seaborn >= 0.13.2
- scikit-learn >= 1.6.1
- statsmodels >= 0.14.4
- yellowbrick >= 1.5
- imbalanced-learn >= 0.13.0
- xgboost >= 3.0.0
- joblib >= 1.4.2
- tqdm >= 4.67.1
- ipython >= 9.0.2
- setuptools >= 69

These are installed automatically when you install the package with pip.

## License

This package is distributed under the MIT License.

## Contact

For questions, bug reports, or contributions, open an issue or pull request at the project repository: https://github.com/janduplessis883/jan883-eda.
