Metadata-Version: 2.4
Name: imbreg
Version: 0.1.2
Summary: A Python library for Imbalanced Regression with SMOGN, stratified CV, and utility-based metrics.
Author: Gabriel Oliveros
License-Expression: MIT
Project-URL: Homepage, https://github.com/goliverosj/imbreg
Project-URL: Issues, https://github.com/goliverosj/imbreg/issues
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: scipy
Requires-Dist: scikit-learn
Requires-Dist: xgboost
Requires-Dist: matplotlib
Requires-Dist: seaborn
Requires-Dist: plotly
Dynamic: license-file

# imbreg

[![PyPI Version](https://img.shields.io/pypi/v/imbreg.svg)](https://pypi.org/project/imbreg/)
![Python Version](https://img.shields.io/badge/python-3.9%2B-blue.svg)
![Status](https://img.shields.io/badge/status-production_ready-brightgreen.svg)
![License](https://img.shields.io/badge/license-MIT-green)

**imbreg** is a Python library for **imbalanced regression**, where the target values of greatest interest are rare. It processes datasets with missing values, applies synthetic over-sampling with SMOGN (Synthetic Minority Over-sampling Technique for Regression with Gaussian Noise), evaluates predictive models with utility-based metrics, and manages stratified cross-validation partitioning.

---

## Key Features

- **SMOGN Resampling:** Generates synthetic examples for rare, extreme target values in continuous domains using the SMOGN strategy, which combines SmoteR interpolation with GaussNoise perturbation.
- **Stratified Partitioning:** Builds stratified cross-validation (CV) folds so that extreme target values are evenly distributed across folds.
- **Robust Data Imputation:** Integrates scikit-learn's `IterativeImputer` in a way that prevents data leakage between training and test partitions.
- **Utility-based Metrics:** Computes metrics designed for imbalanced regression:
  - **Utility-based F1-Score** ($\beta$-measure).
  - **SERA** (Squared Error Relevance Area).
- **Dataset Loading (KEEL/CSV/ARFF):** A smart data loader that infers categorical variables, caps decimals, maps ranges, and cleans noisy values automatically.
- **Data Visualization:** Built-in 2D and 3D plotting modules (using Plotly and Seaborn) for analyzing the relevance of the target variable and the impact of noise and distribution.
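
To illustrate the leakage-free imputation listed above, the general pattern is to fit the imputer on the training partition only and then apply it to both partitions. The sketch below shows that pattern with scikit-learn directly. It is not imbreg's internal code, and the array values are made up for illustration.

```python
import numpy as np
# IterativeImputer is experimental in scikit-learn and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy feature matrices with missing entries (illustrative values only).
X_train = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 6.0], [4.0, 8.0]])
X_test = np.array([[np.nan, 10.0], [6.0, np.nan]])

imputer = IterativeImputer(random_state=0)
X_train_imp = imputer.fit_transform(X_train)  # imputation model learned from training rows only
X_test_imp = imputer.transform(X_test)        # test rows are filled in, never fitted on
```

Fitting on the full dataset before splitting would let test-set values influence the training features, which is the leakage this pattern avoids.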

---

## Requirements and Installation

To use this library, ensure you have Python 3.9 or higher installed. The library is available on [PyPI](https://pypi.org/project/imbreg/):

```bash
pip install imbreg
```

---

## Quickstart Guide

### Examples

Check the `examples/` directory in the repository for ready-to-run scripts:

- `quickstart.py`: Minimal example of all features using synthetic data.
- `plot_examples.py`: How to use the plotting functions.
- `generate_cv_partitions.py` & `evaluate_models.py`: Full cross-validation pipeline.
- `evaluate_external_predictions.py`: Utility to evaluate arbitrary external predictions against ground truth using the library's relevance metrics.

The following sections show how to use the core functions:

### 1. Generate Partitions (Cross-Validation)

The `cv_partitions` function reads your original dataset, cleans it, imputes missing values, and applies SMOGN over-sampling to each repetition automatically.

```python
from imbreg import cv_partitions

cv_partitions(
    ds_name="my_dataset.csv",
    ds_location="raw_data/",
    times=1,                 # Number of repetitions
    folds=10,                # Number of partitions (k-fold)
    strat=True,              # Enable stratification
    smogn=True,              # Apply SMOGN during training
    impute=True,             # Impute missing values (NaNs)
    out_dir="Output/"        # Output directory for raw data partitions
)
```
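
For intuition about what stratification means when the target is continuous, one common approach is to sort the samples by target value and deal them into folds block by block, so that every fold receives a share of the extremes. The sketch below uses only NumPy. `stratified_fold_ids` is a hypothetical helper and not part of imbreg, and the library's own algorithm may differ.

```python
import numpy as np

def stratified_fold_ids(y, n_folds, seed=0):
    """Assign a fold id to each sample so that every fold spans the target range."""
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    order = np.argsort(y, kind="stable")  # sample indices, smallest to largest target
    fold_ids = np.empty(len(y), dtype=int)
    # Walk the sorted samples in blocks of n_folds and give each sample
    # in a block a different, randomly chosen fold.
    for start in range(0, len(y), n_folds):
        block = order[start:start + n_folds]
        fold_ids[block] = rng.permutation(n_folds)[:len(block)]
    return fold_ids

y = np.random.default_rng(1).exponential(size=100)  # skewed toy target
folds = stratified_fold_ids(y, n_folds=5)
```

With this scheme the five largest targets in the toy data end up in five different folds, which a plain random split does not guarantee.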

### 2. Evaluate Predictions

Once the folds have been written to disk, `evaluate_folds` trains the selected model on them and returns a results summary that includes the SERA and F1 metrics.

```python
from imbreg import evaluate_folds

results = evaluate_folds(
    output_dir="Output/",    # Directory containing the generated folds
    dataset="my_dataset",
    model_type="rf",         # 'rf' (Random Forest), 'et' (Extra Trees), 'xgb' (XGBoost)
    n_reps=1,
    n_folds=10,
    use_imputation=True,
    use_smogn=True,
    thr_rel=0.8              # Relevance threshold to define "rare" cases
)

# You can export these results to a flat structure using the built-in exporter
from imbreg.validation import export_experiment_summaries
export_experiment_summaries(results, output_dir="Results/", dataset_name="my_dataset", flat_output=True)
```
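
As background on the SERA metric reported in the summary: it integrates, over relevance thresholds `t` from 0 to 1, the sum of squared errors of the samples whose relevance is at least `t`. The sketch below approximates that integral with the trapezoidal rule. The `sera` function here is a hypothetical standalone helper that takes precomputed relevance values. It is not the function exposed by imbreg, whose signature may differ.

```python
import numpy as np

def sera(y_true, y_pred, relevance, n_steps=1000):
    """Approximate the Squared Error Relevance Area for given relevance values in [0, 1]."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    sq_err = (y_pred - y_true) ** 2
    thresholds = np.linspace(0.0, 1.0, n_steps + 1)
    # SER_t: total squared error over samples with relevance >= t.
    ser = np.array([sq_err[relevance >= t].sum() for t in thresholds])
    # Trapezoidal rule over t in [0, 1].
    dt = thresholds[1] - thresholds[0]
    return float(np.sum((ser[:-1] + ser[1:]) * dt / 2.0))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.0, 2.0, 5.0])
relevance = np.array([0.0, 0.2, 1.0])
# Only the third sample has an error (squared error 4) and its relevance is 1,
# so it counts at every threshold and the score is approximately 4.0.
score = sera(y_true, y_pred, relevance)
```

Because errors on highly relevant samples are counted at every threshold, they weigh more in the final score than errors on common values. Lower is better.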

### 3. Visualize the Data

Analyze the relevance curve of your target variable:

```python
import matplotlib.pyplot as plt
from imbreg import read_dataset, phi_control, plot_target_distribution

# Load dataset and create relevance control structure
df = read_dataset("my_dataset.csv", "raw_data/")
ctrl = phi_control(df["y"].values, method="extremes")

# Visualize distribution vs relevance
fig = plot_target_distribution(df, target_col="y", phi_ctrl=ctrl, thr_rel=0.8)
plt.show()
```
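
For readers new to relevance functions, one common construction derives them from boxplot statistics: the median gets relevance 0, the whisker ends get relevance 1, and a monotone cubic interpolation fills in the values between. The sketch below implements that simplified version with SciPy. `boxplot_relevance` is a hypothetical helper, and whether `phi_control(method="extremes")` works exactly this way is an assumption.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def boxplot_relevance(y):
    """Return a function mapping target values to a relevance score in [0, 1]."""
    y = np.asarray(y, dtype=float)
    q1, med, q3 = np.percentile(y, [25, 50, 75])
    iqr = q3 - q1
    # Adjacent values: the most extreme observations still inside the whiskers.
    lo = y[y >= q1 - 1.5 * iqr].min()
    hi = y[y <= q3 + 1.5 * iqr].max()
    # Control points (requires lo < med < hi): relevance 1 at both whiskers, 0 at the median.
    interp = PchipInterpolator([lo, med, hi], [1.0, 0.0, 1.0])

    def phi(values):
        # Values beyond the whiskers are clipped, so they receive relevance 1.
        clipped = np.clip(np.asarray(values, dtype=float), lo, hi)
        return np.clip(interp(clipped), 0.0, 1.0)

    return phi

y = np.concatenate([np.linspace(0.0, 10.0, 50), [50.0]])  # toy target with one high outlier
phi = boxplot_relevance(y)
rare_mask = phi(y) >= 0.8  # samples treated as "rare" at a 0.8 relevance threshold
```

In this simplified version both tails are always treated as relevant. A fuller implementation might assign high relevance only to the side of the distribution that actually contains outliers.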

---

## Project Structure

```text
imbreg/
│
├── data_loader.py    # I/O functions (CSV/KEEL) and imputation wrappers
├── metrics.py        # Mathematical evaluation functions (Utility F1, SERA, Bumps)
├── models.py         # Training and prediction wrappers (RF, ET, XGBoost)
├── plots.py          # Advanced visualizations (Histograms, Scatters, Prediction Error)
├── resampling.py     # Core engine for the SMOGN strategy (SmoteR + GaussNoise)
├── stratification.py # Phi function (relevance) and K-Folds generators
├── utils.py          # Math operations, distance metrics, and internal helpers
└── validation.py     # Cross-validation evaluation pipeline and result export
```

### Datasets

A sample dataset (`Datasets/servo/`) is included for quick testing. Additional regression datasets can be downloaded from [KEEL](https://sci2s.ugr.es/keel/category.php?cat=reg).

### Folder Architecture for Experiments

When running the full pipeline (e.g., `examples/evaluate_models.py`), the project enforces a clean separation of concerns:

- **`Output/`**: Stores all heavy, raw data partitions generated by cross-validation and SMOGN.
- **`Results/`**: A flat, clean directory containing only the final `.txt` and `.csv` summary metrics.
- **`Plots/`**: Directory where generated visualizations and figures are saved.

---

## Testing

The project includes a suite of unit tests implemented with `pytest` that covers:

- Parser resilience against null values and troublesome column formats.
- Correct handling of array dimensions when flattening.
- Robustness against memory leaks and crashes caused by empty variables.

To run the test suite locally:

```bash
python -m pytest tests/ -v
```

---

**Author: Gabriel Oliveros**
