Metadata-Version: 2.4
Name: robustmodelmaker
Version: 0.3.1
Summary: A reproducible stability-selection pipeline for scientific machine learning
Author: Amanda S Barnard
License: MIT License
        
        Copyright (c) 2026 Amanda S Barnard
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/amaxiom/RobustModelMaker
Project-URL: Repository, https://github.com/amaxiom/RobustModelMaker
Project-URL: Bug Tracker, https://github.com/amaxiom/RobustModelMaker/issues
Project-URL: Documentation, https://github.com/amaxiom/RobustModelMaker/tree/main/docs
Project-URL: Changelog, https://github.com/amaxiom/RobustModelMaker/blob/main/CHANGELOG.md
Keywords: machine learning,feature selection,bootstrap stability selection,nested cross-validation,scientific computing,reproducibility
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=1.5
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Provides-Extra: xgb
Requires-Dist: xgboost>=1.7; extra == "xgb"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: benchmake; extra == "dev"
Provides-Extra: all
Requires-Dist: xgboost>=1.7; extra == "all"
Requires-Dist: pytest>=7.0; extra == "all"
Requires-Dist: benchmake; extra == "all"
Dynamic: license-file

# RobustModelMaker

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/amaxiom/RobustModelMaker/blob/main/LICENSE)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/)
[![Version](https://img.shields.io/badge/version-0.3.1-green.svg)](https://github.com/amaxiom/RobustModelMaker/blob/main/CHANGELOG.md)
[![PyPI](https://img.shields.io/pypi/v/robustmodelmaker.svg)](https://pypi.org/project/robustmodelmaker/)

**A reproducible model-building pipeline for small-to-medium scientific datasets.**

RobustModelMaker (ROBUST) combines bootstrap stability selection with leakage-safe nested cross-validation to identify a stable, minimal feature subset and produce honest performance estimates. It is designed for scientific datasets where reproducibility, interpretability, and honest generalisation estimates matter as much as raw predictive performance.

---

## Why RobustModelMaker?

Standard machine learning pipelines applied to scientific data suffer from two problems that ROBUST addresses directly:

**Optimistic performance estimates.** When feature selection, hyperparameter tuning, and model evaluation share the same data, the reported score describes how well the model fits the data used to build it, not how it will perform on future data. ROBUST uses strict nested cross-validation in which each of those steps is performed entirely on the training partition of each fold. The test partition is used only to evaluate the final fold model, never to inform any modelling decision.
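
The nested structure described above can be sketched with scikit-learn alone. This is a minimal illustration of the general technique, not ROBUST's internal implementation: the synthetic data, the `SelectKBest` selector, and the parameter grid are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a small scientific dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=42)

# Every data-dependent step lives inside the pipeline, so it is refitted on
# the training partition of each fold and never sees the test partition.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner loop: tune hyperparameters using the training partition only.
search = GridSearchCV(pipe, {"select__k": [5, 10], "model__C": [0.1, 1.0]},
                      scoring="roc_auc", cv=inner_cv)

# Outer loop: each test fold scores a model it played no part in building.
outer_scores = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"Nested CV AUC: {outer_scores.mean():.3f} +/- {outer_scores.std():.3f}")
```

Passing the whole `GridSearchCV` object to `cross_val_score` is what makes the procedure nested: tuning is repeated from scratch inside every outer training partition.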

**Unstable feature selection.** Single-run feature selection produces a feature set that can change substantially with small changes in the data. ROBUST runs bootstrap stability selection instead: features are ranked by how consistently they are selected across repeated bootstrap resamples of the training data (`n_bootstrap=100` in the quick start below). Only features that exceed the stability threshold (`stability_threshold`, 70% of bootstrap runs by default) are retained.
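
As a rough sketch of the idea, selection frequency can be computed by refitting a sparse model on bootstrap resamples and counting how often each coefficient is non-zero. The Lasso base selector, its `alpha`, and the synthetic data are assumptions made for this example; ROBUST's own selector and defaults may differ.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)

# Synthetic regression data: only features 0 and 1 carry signal.
n_samples, n_features = 120, 10
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n_samples)

n_bootstrap = 100
stability_threshold = 0.7
selected_counts = np.zeros(n_features)

for _ in range(n_bootstrap):
    # Draw a bootstrap resample (rows sampled with replacement).
    idx = rng.integers(0, n_samples, size=n_samples)
    model = Lasso(alpha=0.2).fit(X[idx], y[idx])
    # A feature counts as selected when its coefficient is non-zero.
    selected_counts += model.coef_ != 0

selection_frequency = selected_counts / n_bootstrap
stable_features = np.flatnonzero(selection_frequency >= stability_threshold)
print("Selection frequency:", np.round(selection_frequency, 2))
print("Stable features:", stable_features.tolist())
```

Features that are picked up only in a handful of resamples fall below the threshold and are dropped, which is what makes the retained subset less sensitive to small changes in the data.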

The result is a model built on a smaller, more reproducible feature set, with a performance estimate that comes only from data the model never saw during selection or tuning.

---

## Installation

```bash
pip install robustmodelmaker
```

For XGBoost support:

```bash
pip install "robustmodelmaker[xgb]"
```

**Requirements:** Python >= 3.9, numpy >= 1.24, pandas >= 1.5, scikit-learn >= 1.3, scipy >= 1.10

---

## Quick start

```python
import pandas as pd
from robustmodelmaker import RobustModelMaker

X = pd.read_csv("features.csv")
y = pd.read_csv("labels.csv").squeeze()

maker = RobustModelMaker(
    alg="eln",           # elastic net: interpretable and fast
    task_type="binary",  # always set explicitly: "binary", "multiclass", or "regression"
    outer_cv=5,
    inner_cv=5,
    n_bootstrap=100,
    stability_threshold=0.7,
    random_state=42,
).fit(X, y)

result = maker.result_
print(f"Selected {len(result.selected_features)} of {len(result.feature_names)} features")
print(f"Nested CV AUC: {result.mean_score:.4f} +/- {result.std_score:.4f}")

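# Predict on new data (preprocessing and feature selection applied automatically).
# "new_features.csv" is a placeholder for a file with the same columns as features.csv.
X_new = pd.read_csv("new_features.csv")
predictions = maker.predict(X_new)
probabilities = maker.predict_proba(X_new)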
```

The functional API is also available:

```python
from robustmodelmaker import run_pipeline

result = run_pipeline(X, y, alg="eln", task_type="binary",
                      outer_cv=5, inner_cv=5, random_state=42)
```

---

## Algorithms

| Code | Model | Tasks | Notes |
|---|---|---|---|
| `eln` | Elastic net | all | Fastest; coefficient-based importance; auto-scales |
| `rdg` | Ridge (L2) | all | Stable; good default for many scientific datasets |
| `las` | Lasso (L1) | all | Sparse coefficients; strong feature selector |
| `log` | L2 logistic regression | classification only | Reliable baseline |
| `svm` | Linear SVM | all | Effective in high-dimensional spaces |
| `rf` | Random forest | all | Non-linear; no scaling needed; `class_weight="balanced"` |
| `xgb` | XGBoost | all | Highest raw performance; requires `pip install "robustmodelmaker[xgb]"` |
| `mlp` | Multi-layer perceptron | all | Neural baseline; slower on small datasets |
| `lin` | Linear regression (OLS) | regression only | Interpretable; no regularisation |

---

## Key capabilities

| Capability | Detail |
|---|---|
| Task types | Binary classification, multiclass classification, regression |
| Feature selection | Bootstrap stability selection with configurable threshold and bootstrap count |
| Performance estimation | Nested CV (outer + inner), repeated nested CV, grouped CV |
| Preprocessing | Median imputation + optional standard scaling, fitted inside each fold |
| Missing data | NaN-tolerant by default; optional data-driven missingness filter |
| Cutoff determination | Bootstrap specificity-targeted threshold for binary classification |
| Probability calibration | Platt scaling (sigmoid) or isotonic regression |
| Post-hoc analysis | Permutation importance, SHAP-ready export, feature stability plots |
| External validation | One-call evaluation on a held-out set with full metric suite |
| Reproducibility | Fully deterministic given a fixed random seed, verified by test suite |
| Save/load | JSON metadata, CSV tables, and pickle of the fitted result |
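
To give a feel for the cutoff determination row, the sketch below shows one plausible reading of a bootstrap specificity-targeted threshold: resample the negative-class scores, take the quantile that yields the target specificity, and use the median across resamples. The procedure, the 90% target, and the simulated scores are all assumptions for illustration; ROBUST's actual method may differ.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical predicted probabilities for negative and positive samples.
neg_scores = rng.beta(2, 5, size=300)   # negatives skew towards 0
pos_scores = rng.beta(5, 2, size=100)   # positives skew towards 1

target_specificity = 0.90
n_bootstrap = 200
cutoffs = np.empty(n_bootstrap)

for b in range(n_bootstrap):
    # Resample the negatives with replacement and find the score below
    # which the target fraction of them falls.
    resampled = rng.choice(neg_scores, size=neg_scores.size, replace=True)
    cutoffs[b] = np.quantile(resampled, target_specificity)

# The median across resamples is less sensitive to any single draw.
cutoff = float(np.median(cutoffs))
specificity = float(np.mean(neg_scores < cutoff))
sensitivity = float(np.mean(pos_scores >= cutoff))
print(f"cutoff={cutoff:.3f} specificity={specificity:.3f} sensitivity={sensitivity:.3f}")
```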

---

## Saving results

```python
# Save at fit time
maker = RobustModelMaker(
    alg="eln", task_type="binary",
    save_results=True,
    output_dir="results/",
    output_prefix="my_model",
    random_state=42,
).fit(X, y)

# Or save afterwards
maker.save_results(output_dir="results/", output_prefix="my_model")
```

Saves: JSON metadata, full pickle, per-fold score CSVs, stability selection table, and a formatted text summary.

---

## External validation

```python
maker = RobustModelMaker(alg="eln", task_type="binary", random_state=42)
maker.fit(X_train, y_train, X_validation=X_val, y_validation=y_val)

val = maker.result_.validation_result
print(val.metrics)   # auc, accuracy, sensitivity, specificity, ...
```
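
For readers unfamiliar with the metric names in the comment above, the following sketch computes them from first principles with scikit-learn. The labels, probabilities, and 0.5 cutoff are made up for the example and say nothing about how `validation_result` is populated internally.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Made-up held-out labels and predicted probabilities (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.45, 0.70, 0.90, 0.60])
y_pred = (y_prob >= 0.5).astype(int)   # assumed cutoff of 0.5

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

metrics = {
    "auc": roc_auc_score(y_true, y_prob),       # threshold-free ranking quality
    "accuracy": accuracy_score(y_true, y_pred),
    "sensitivity": tp / (tp + fn),              # true positive rate
    "specificity": tn / (tn + fp),              # true negative rate
}
print(metrics)
```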

---

## Permutation importance

```python
pi = maker.permutation_importance(X_val, y_val, n_repeats=20, random_state=42)
print(pi.summary().head(10))
```
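
The underlying technique can be illustrated with scikit-learn's own `permutation_importance`: each feature is shuffled in turn on held-out data and the resulting drop in score is recorded. This sketch uses synthetic data and a plain logistic regression as assumptions; it is not a description of what `maker.permutation_importance` does internally.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data in which only feature 0 determines the label.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42)
model = LogisticRegression().fit(X_train, y_train)

# Shuffle each column of the held-out set 20 times and measure the AUC drop.
result = permutation_importance(model, X_val, y_val, scoring="roc_auc",
                                n_repeats=20, random_state=42)

ranking = np.argsort(result.importances_mean)[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing importance on held-out data, as in the `X_val, y_val` call above, measures what the model relies on to generalise rather than what it memorised during training.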

---

## SHAP integration

```python
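import shap

shap_data = maker.result_.export_shap_ready(X)

# LinearExplainer suits linear models such as "eln"; tree-based models
# would need a different explainer (for example shap.TreeExplainer).
explainer = shap.LinearExplainer(shap_data["model"], shap_data["X"])
shap_values = explainer.shap_values(shap_data["X"])
shap.summary_plot(shap_values, shap_data["X"])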
```

---

## Grouped cross-validation

```python
maker = RobustModelMaker(alg="eln", task_type="regression", random_state=42)
maker.fit(X, y, groups=sample_ids)   # prevents leakage across experimental units
```
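
What grouped splitting guarantees can be checked directly with scikit-learn's `GroupKFold`. The layout below (six experimental units with four measurements each) is a hypothetical example, and whether ROBUST uses `GroupKFold` itself is not stated here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical layout: 6 experimental units, 4 repeated measurements each.
sample_ids = np.repeat(np.arange(6), 4)
X = np.arange(len(sample_ids), dtype=float).reshape(-1, 1)
y = np.zeros(len(sample_ids))

fold_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=3).split(X, y, groups=sample_ids):
    train_groups = set(sample_ids[train_idx].tolist())
    test_groups = set(sample_ids[test_idx].tolist())
    # No unit may contribute rows to both sides of a split.
    fold_overlaps.append(train_groups & test_groups)
    print("test units:", sorted(test_groups))

print("any unit shared between train and test:", any(fold_overlaps))
```

With a plain K-fold split, repeated measurements of the same unit could land on both sides of a fold, letting the model recognise the unit rather than learn the relationship.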

---

## Benchmark results

Three real scientific datasets were evaluated against a full-feature nested-CV baseline that uses the same algorithm and fold structure. All three benchmarks use random forest (`rf`) for both ROBUST and the baseline, which isolates the effect of bootstrap stability selection from any algorithm differences:

| Dataset | Task | n x p | ROBUST feats | Reduction | Metric | Outcome |
|---|---|---|---|---|---|---|
| SECOM Manufacturing | binary | 1254 x 590 | ~47 | ~92% | AUC (higher=better) | preserved |
| Urban Land Cover | multiclass | 540 x 147 | ~31 | ~79% | AUC-OVR (higher=better) | preserved |
| Graphene Oxide Bulk | regression | 1294 x 412 | ~68 | ~83% | RMSE in eV (lower=better) | preserved |

`preserved` is the primary success criterion: the performance of the stability-selected feature subset shows no statistically significant difference from the full-feature baseline (paired Wilcoxon signed-rank test, p >= 0.05) while using a fraction of the features. The selected features are chosen to be robust across bootstrap resamples of the training data rather than optimal for any single model fit, so a small, non-significant performance difference from the baseline is the expected and intended outcome.
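
The paired test behind this criterion can be reproduced with `scipy.stats.wilcoxon` on matched per-fold scores. The scores below are invented purely to show the mechanics and are not benchmark results; the `"not preserved"` label is likewise an assumption for the example.

```python
import numpy as np
from scipy.stats import wilcoxon

# Invented per-fold scores for illustration only (not benchmark results).
baseline_scores = np.array([0.80, 0.82, 0.79, 0.81, 0.83,
                            0.78, 0.84, 0.80, 0.82, 0.81])
differences = 0.001 * np.array([1, -2, 3, -4, 5, -6, 7, -8, 9, -10])
robust_scores = baseline_scores + differences

# Paired signed-rank test: both score sets must come from the same folds.
statistic, p_value = wilcoxon(robust_scores, baseline_scores)

outcome = "preserved" if p_value >= 0.05 else "not preserved"
print(f"W={statistic:.1f}, p={p_value:.3f} -> {outcome}")
```

A non-significant result means the test found no evidence of a difference; it does not by itself prove that the two feature sets perform identically.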

**Note on split methodology:** All benchmarks use [BenchMake](https://github.com/amaxiom/benchmake) archetypal splits, which are adversarial by design. BenchMake selects maximally representative train/test partitions that keep the two sets apart in feature space, producing more conservative scores than conventional random splits. This is intentional: the benchmark is a worst-case assessment. Scores on your own data with default random splits will typically be higher. The ROBUST vs. full-feature baseline comparison within each benchmark is internally consistent because both models use the same split.

---

## Citing this work

```
Barnard, A. S. (2026). RobustModelMaker: A reproducible stability-selection pipeline
for scientific machine learning (v0.3). GitHub: https://github.com/amaxiom/RobustModelMaker
```

---

## Documentation

Full documentation is available in the [GitHub repository](https://github.com/amaxiom/RobustModelMaker):

- [User Guide](https://github.com/amaxiom/RobustModelMaker/blob/main/docs/USER_GUIDE.md): parameters, methods, prediction, validation, SHAP, saving
- [Implementation Guide](https://github.com/amaxiom/RobustModelMaker/blob/main/docs/IMPLEMENTATION_GUIDE.md): internal design, algorithm details, tuning for speed and rigor
- [Interpretation Guide](https://github.com/amaxiom/RobustModelMaker/blob/main/docs/INTERPRETATION_GUIDE.md): reading results correctly, statistical tests, what to report in a paper

---

## Author

**Prof Amanda S Barnard**

GitHub: [amaxiom](https://github.com/amaxiom)

RobustModelMaker is developed and maintained as a tool for rigorous, reproducible machine learning in scientific research.
