Metadata-Version: 2.4
Name: optuml
Version: 0.2.7
Summary: Hyperparameter optimization for multiple machine learning algorithms using Optuna, with Scikit-learn API
Home-page: https://github.com/filipsPL/optuml
Author: Filip S.
Author-email: filip.ursynow@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: optuna>=3.0.1
Requires-Dist: scikit-learn
Requires-Dist: numpy
Provides-Extra: catboost
Requires-Dist: catboost; extra == "catboost"
Provides-Extra: xgboost
Requires-Dist: xgboost; extra == "xgboost"
Provides-Extra: lightgbm
Requires-Dist: lightgbm; extra == "lightgbm"
Provides-Extra: all
Requires-Dist: catboost; extra == "all"
Requires-Dist: xgboost; extra == "all"
Requires-Dist: lightgbm; extra == "all"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# OptuML: Hyperparameter Optimization for Machine Learning Algorithms using Optuna

```
 ⣰⡁ ⡀⣀ ⢀⡀ ⣀⣀    ⢀⡀ ⣀⡀ ⣰⡀ ⡀⢀ ⣀⣀  ⡇   ⠄ ⣀⣀  ⣀⡀ ⢀⡀ ⡀⣀ ⣰⡀   ⡎⢱ ⣀⡀ ⣰⡀ ⠄ ⣀⣀  ⠄ ⣀⣀ ⢀⡀ ⡀⣀
 ⢸  ⠏  ⠣⠜ ⠇⠇⠇   ⠣⠜ ⡧⠜ ⠘⠤ ⠣⠼ ⠇⠇⠇ ⠣   ⠇ ⠇⠇⠇ ⡧⠜ ⠣⠜ ⠏  ⠘⠤   ⠣⠜ ⡧⠜ ⠘⠤ ⠇ ⠇⠇⠇ ⠇ ⠴⠥ ⠣⠭ ⠏ 
```

`OptuML` (*Optu*na + *ML*) is a Python module providing hyperparameter optimization for machine learning algorithms using the [Optuna](https://optuna.org/) framework. The module offers a scikit-learn compatible API and adds early stopping, timeouts, and parallel cross-validation for more robust optimization.

[![Python manual install](https://github.com/filipsPL/optuml/actions/workflows/python-package.yml/badge.svg)](https://github.com/filipsPL/optuml/actions/workflows/python-package.yml) [![Python pip install](https://github.com/filipsPL/optuml/actions/workflows/python-pip.yml/badge.svg)](https://github.com/filipsPL/optuml/actions/workflows/python-pip.yml) [![pypi version](https://img.shields.io/pypi/v/optuml)](https://pypi.org/project/optuml/) [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17305964.svg)](https://doi.org/10.5281/zenodo.17305963)

## tl;dr

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train optimizer
clf = Optimizer(algorithm="RandomForestClassifier", n_trials=50, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)
# e.g. 0.9111111111111111 (varies between runs because no random_state is set)
print(y_pred[:10])
# e.g. [1 1 1 1 0 0 2 2 0 0]
```

## tl;dr why this module?

*I want to make a fair comparison of ML methods, where 'fair' means that each method has tuned hyperparameters, making it the best version of itself.*


## Key Features

- **Comprehensive Algorithm Support**: A broad set of scikit-learn classifiers and regressors, plus optional CatBoost, XGBoost, and LightGBM
- **Full Scikit-learn Compatibility**: Seamless integration with pipelines, cross-validation, and all sklearn tools
- **Robust Optimization**: Powered by Optuna with early stopping, timeout protection, and parallel execution
- **Type-Safe Design**: Separate optimizers for classification and regression with proper type checking
- **Production Ready**: Cross-platform compatibility, comprehensive error handling, and extensive validation
- **Flexible Configuration**: Control every aspect of the optimization process
- **Algorithm Benchmarking**: Compare multiple algorithms in a single run (see [Algorithm Benchmarking](#algorithm-benchmarking))

## Installation

### Option A: pip (recommended)

```bash
pip install optuml
```

With optional algorithm support:

```bash
pip install optuml[all]          # CatBoost + XGBoost + LightGBM
pip install optuml[catboost]     # CatBoost only
pip install optuml[xgboost]      # XGBoost only
pip install optuml[lightgbm]     # LightGBM only
```

To upgrade an existing installation:

```bash
pip install optuml --upgrade
```

### Option B: Manual installation

```bash
# Install required dependencies
pip install optuna scikit-learn numpy

# Optional: Install additional algorithms
pip install catboost xgboost lightgbm

# Download the module
wget https://raw.githubusercontent.com/filipsPL/optuml/main/optuml/optuml.py
```

## Quick Start

### Classification Example

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from optuml import Optimizer

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
clf = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=50,
    cv=5,
    scoring="accuracy",
    random_state=42,
    show_progress_bar=True
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# View results
print(f"Accuracy: {accuracy:.3f}")
print(f"Best parameters: {clf.best_params_}")
print(f"Optimization took: {clf.study_time_:.2f} seconds")
print(f"Trials completed: {clf.n_trials_completed_}")
```

### Regression Example

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from optuml import Optimizer

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train optimizer
reg = Optimizer(
    algorithm="XGBRegressor",  # Requires the optional xgboost extra
    n_trials=100,
    cv=5,
    scoring="r2",
    early_stopping_patience=10,  # Stop if no improvement for 10 trials
    n_jobs=-1,  # Use all CPU cores for CV
    verbose=True
)
reg.fit(X_train, y_train)

# Evaluate
y_pred = reg.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.3f}")
```

## Supported Algorithms

### Classification Algorithms

| Algorithm                        | Description                     | Key Features                              |
| -------------------------------- | ------------------------------- | ----------------------------------------- |
| `SVC`                            | Support Vector Classifier       | Non-linear kernels, probability estimates |
| `LogisticRegression`             | Logistic Regression             | L1/L2/Elastic-Net regularization          |
| `RidgeClassifier`                | Ridge Classifier                | L2 regularization, fast linear model      |
| `KNeighborsClassifier`           | k-Nearest Neighbors             | Distance weighting, various metrics       |
| `RandomForestClassifier`         | Random Forest                   | Feature importance, OOB score             |
| `ExtraTreesClassifier`           | Extremely Randomized Trees      | Faster than RF, reduced variance          |
| `AdaBoostClassifier`             | AdaBoost                        | Boosted ensemble, learning rate tuning    |
| `GradientBoostingClassifier`     | Gradient Boosting               | Sequential boosting, feature subsampling  |
| `HistGradientBoostingClassifier` | Histogram Gradient Boosting     | Fast GBDT, native NaN support             |
| `MLPClassifier`                  | Neural Network                  | Multiple architectures, early stopping    |
| `GaussianNB`                     | Gaussian Naive Bayes            | Fast, probabilistic                       |
| `QDA`                            | Quadratic Discriminant Analysis | Non-linear boundaries                     |
| `DecisionTreeClassifier`         | Decision Tree                   | Multiple criteria, pruning                |
| `SGDClassifier`                  | Stochastic Gradient Descent     | Multiple losses, L1/L2/ElasticNet, online |
| `CatBoostClassifier`*            | CatBoost                        | Categorical features, GPU support         |
| `XGBClassifier`*                 | XGBoost                         | Regularization, missing values            |
| `LGBMClassifier`*                | LightGBM                        | Fast GBDT, leaf-wise growth               |

### Regression Algorithms

| Algorithm                       | Description                 | Key Features                             |
| ------------------------------- | --------------------------- | ---------------------------------------- |
| `SVR`                           | Support Vector Regression   | Epsilon-insensitive loss                 |
| `LinearRegression`              | Linear Regression           | Simple, interpretable                    |
| `Ridge`                         | Ridge Regression            | L2 regularization, stable on collinear   |
| `Lasso`                         | Lasso Regression            | L1 regularization, feature selection     |
| `ElasticNet`                    | Elastic Net                 | L1+L2 regularization, sparse solutions   |
| `KNeighborsRegressor`           | k-Nearest Neighbors         | Local regression                         |
| `RandomForestRegressor`         | Random Forest               | Reduces overfitting                      |
| `ExtraTreesRegressor`           | Extremely Randomized Trees  | Faster than RF, reduced variance         |
| `AdaBoostRegressor`             | AdaBoost                    | Sequential learning                      |
| `GradientBoostingRegressor`     | Gradient Boosting           | Sequential boosting, feature subsampling |
| `HistGradientBoostingRegressor` | Histogram Gradient Boosting | Fast GBDT, native NaN support            |
| `MLPRegressor`                  | Neural Network              | Non-linear patterns                      |
| `DecisionTreeRegressor`         | Decision Tree               | Non-parametric                           |
| `SGDRegressor`                  | Stochastic Gradient Descent | Multiple losses, L1/L2/ElasticNet, online |
| `CatBoostRegressor`*            | CatBoost                    | Handles categoricals                     |
| `XGBRegressor`*                 | XGBoost                     | High performance                         |
| `LGBMRegressor`*                | LightGBM                    | Fast GBDT, leaf-wise growth              |

*Optional dependencies: install them separately, for example with `pip install optuml[all]`

## Advanced Features

### Early Stopping

Stop optimization when no improvement is observed:

```python
optimizer = Optimizer(
    algorithm="XGBClassifier",
    n_trials=1000,
    early_stopping_patience=20  # Stop after 20 trials without improvement
)
```
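
The patience rule can be pictured with a small, library-independent sketch. This is not OptuML's actual implementation: `trials_before_stop` is a hypothetical helper, the scores are made up, and it assumes a maximized metric where only a strictly better score resets the counter.

```python
def trials_before_stop(scores, patience):
    """Return how many trials run before a patience-based stop (maximizing)."""
    best = float("-inf")
    stale = 0  # consecutive trials without a new best score
    for n, score in enumerate(scores, start=1):
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
            if stale >= patience:
                return n  # patience exhausted: stop after this trial
    return len(scores)  # trial budget used up before patience ran out


# Made-up CV scores, one per trial
scores = [0.80, 0.85, 0.84, 0.86, 0.85, 0.86, 0.84, 0.83]
print(trials_before_stop(scores, patience=3))  # 7
```

With `patience=3` the best score (0.86) is found at trial 4 and the run stops at trial 7, after three trials that fail to beat it.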

### Parallel Cross-Validation

Speed up optimization using multiple CPU cores:

```python
optimizer = Optimizer(
    algorithm="RandomForestClassifier",
    n_trials=100,
    cv=10,
    n_jobs=-1  # Use all available cores
)
```
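
If `n_jobs` is forwarded to scikit-learn's cross-validation (an assumption about OptuML's internals), negative values follow the joblib convention: `-1` means all cores, `-2` all but one, and so on. The hypothetical helper `resolve_n_jobs` below only mirrors that arithmetic so you can estimate how many workers a setting would request.

```python
import os


def resolve_n_jobs(n_jobs, n_cpus=None):
    """Translate a joblib-style n_jobs value into a worker count."""
    n_cpus = n_cpus or os.cpu_count() or 1
    if n_jobs == 0:
        raise ValueError("n_jobs=0 is not a valid setting")
    if n_jobs < 0:
        # joblib convention: n_cpus + 1 + n_jobs, but never fewer than 1
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(resolve_n_jobs(-1, n_cpus=8))  # 8
print(resolve_n_jobs(-2, n_cpus=8))  # 7
```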

### Custom Scoring Metrics

Use any scikit-learn compatible scoring metric:

```python
optimizer = Optimizer(
    algorithm="SVC",
    scoring="roc_auc",  # For classification
    # scoring="neg_mean_squared_error",  # For regression
    # scoring="f1_weighted",  # For imbalanced classes
)
```
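
scikit-learn scorers follow a "greater is better" convention, which is why error metrics carry a `neg_` prefix and work with a maximizing search. The sketch below uses made-up score values and a hypothetical helper, `to_rmse`, to show how a `neg_mean_squared_error` result can be read back as a positive RMSE.

```python
import math


def to_rmse(neg_mse_score):
    """Convert a `neg_mean_squared_error` score into a positive RMSE."""
    if neg_mse_score > 0:
        raise ValueError("neg_mean_squared_error scores are never positive")
    return math.sqrt(-neg_mse_score)


# Made-up CV scores from three trials; the largest (closest to zero) is best
cv_scores = [-2500.0, -3600.0, -4900.0]
best = max(cv_scores)
print(best, to_rmse(best))  # -2500.0 50.0
```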

### Timeout Protection

Set time limits for optimization:

```python
optimizer = Optimizer(
    algorithm="MLPClassifier",
    timeout=300,  # Total optimization timeout (5 minutes)
    cv_timeout=30,  # Per-trial timeout (30 seconds)
    n_trials=1000  # Will stop at timeout even if trials remain
)
```
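
The interaction between `n_trials` and `timeout` can be sketched without the library. The example assumes the total limit is checked between trials rather than interrupting a running one (as with Optuna's own `timeout` argument); `run_with_budget` and the trial durations are hypothetical.

```python
def run_with_budget(trial_durations, n_trials, timeout):
    """Simulate a study that stops at the trial budget or the time budget."""
    elapsed = 0.0
    completed = 0
    for duration in trial_durations[:n_trials]:
        if elapsed >= timeout:
            break  # time budget spent: do not start another trial
        elapsed += duration  # a started trial is allowed to finish
        completed += 1
    return completed, elapsed


# Made-up per-trial durations in seconds
durations = [20.0, 25.0, 30.0, 40.0, 15.0]
print(run_with_budget(durations, n_trials=1000, timeout=60))  # (3, 75.0)
```

Under this assumption the total run time can exceed `timeout` by up to one trial, which is where a per-trial limit such as `cv_timeout` becomes useful.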

### Access to Optuna Study

Get detailed optimization information:

```python
# After fitting
optimizer.fit(X_train, y_train)

# Access the Optuna study object
study = optimizer.study_
print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.4f}")

# Plot optimization history (requires plotly)
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()

# Plot parameter importances
fig = vis.plot_param_importances(study)
fig.show()
```

### Pipeline Integration

Full compatibility with scikit-learn pipelines:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Create pipeline with OptuML
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('optimizer', Optimizer(algorithm="SVC", n_trials=50))
])

# Use like any sklearn pipeline
pipe.fit(X_train, y_train)
predictions = pipe.predict(X_test)
```

### Algorithm Benchmarking

`AlgorithmBenchmark` runs every supported algorithm (or a chosen subset) on your data, optimizes each one independently, and reports a ranked comparison, without any scikit-learn estimator constraints. See the [sample script](examples/example-benchmark-all.py) and its [outputs](examples/benchmark_results/).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from optuml import AlgorithmBenchmark

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

bench = AlgorithmBenchmark(
    task="classification",   # "classification" or "regression"
    n_trials=50,
    random_state=42,
)
bench.fit(X_train, y_train)

# Ranked results as a DataFrame (requires pandas) or list of dicts
print(bench.summary())
#                      algorithm  best_score  n_trials_completed  fit_time error
# 0         RandomForestClassifier    0.983333                  50      4.21  None
# 1           ExtraTreesClassifier    0.975000                  50      3.87  None
# ...

print(bench.best_algorithm_)   # e.g. "RandomForestClassifier"
print(bench.best_score_)       # best CV score across all algorithms

# Use the winning estimator directly
predictions = bench.best_estimator_.predict(X_test)

# Or drill into any individual optimizer
rf_optimizer = bench.optimizers_["RandomForestClassifier"]
print(rf_optimizer.best_params_)
```

To benchmark a specific subset of algorithms:

```python
bench = AlgorithmBenchmark(
    task="regression",
    algorithms=["Ridge", "RandomForestRegressor", "XGBRegressor"],
    n_trials=50,
    scoring="r2",
)
bench.fit(X_train, y_train)  # X_train, y_train must come from a regression dataset
```

Run algorithms in parallel across CPU cores with `n_jobs_algorithms`:

```python
bench = AlgorithmBenchmark(
    task="classification",
    n_trials=50,
    n_jobs_algorithms=-1,   # one process per algorithm, all cores
)
```

#### `AlgorithmBenchmark` Parameters

| Parameter                 | Type           | Default      | Description                                         |
| ------------------------- | -------------- | ------------ | --------------------------------------------------- |
| `task`                    | str            | required     | `"classification"` or `"regression"`                |
| `algorithms`              | list or `"all"`| `"all"`      | Algorithms to benchmark                             |
| `n_trials`                | int            | 50           | Optuna trials per algorithm                         |
| `timeout`                 | float/None     | None         | Per-algorithm study timeout (seconds)               |
| `cv`                      | int            | 5            | Cross-validation folds                              |
| `scoring`                 | str/None       | Auto*        | Scoring metric                                      |
| `cv_timeout`              | float          | 120          | Per-trial CV timeout (seconds)                      |
| `random_state`            | int/None       | None         | Random seed forwarded to every `Optimizer`          |
| `early_stopping_patience` | int/None       | None         | Early stopping patience per algorithm               |
| `n_jobs`                  | int            | 1            | Parallel CV jobs inside each `Optimizer`            |
| `n_jobs_algorithms`       | int            | 1            | Algorithms to run in parallel (`-1` = all cores)    |
| `verbose`                 | bool/int       | False        | Verbosity forwarded to each `Optimizer`             |

*Auto defaults: `"accuracy"` for classification, `"r2"` for regression

#### Attributes after `fit()`

| Attribute           | Description                                              |
| ------------------- | -------------------------------------------------------- |
| `best_algorithm_`   | Name of the best-scoring algorithm                       |
| `best_estimator_`   | Fitted sklearn estimator from the winning optimizer      |
| `best_score_`       | Best CV score across all algorithms                      |
| `best_params_`      | Hyperparameters of the winning optimizer                 |
| `results_`          | List of per-algorithm result dicts (including failures)  |
| `optimizers_`       | `dict[algorithm_name, Optimizer]` for full introspection |
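
Because `results_` also contains failed runs, it can help to separate them before ranking. The sketch below works on plain dicts and assumes their keys match the `summary()` columns shown earlier (`algorithm`, `best_score`, `error`); the records and the `rank_results` helper are made up for illustration.

```python
# Made-up records shaped like the summary() columns (an assumption)
results = [
    {"algorithm": "SVC", "best_score": 0.958, "error": None},
    {"algorithm": "RandomForestClassifier", "best_score": 0.983, "error": None},
    {"algorithm": "XGBClassifier", "best_score": None, "error": "xgboost not installed"},
]


def rank_results(results):
    """Sort successful runs by score (best first) and list failures last."""
    ok = [r for r in results if r["error"] is None]
    failed = [r for r in results if r["error"] is not None]
    return sorted(ok, key=lambda r: r["best_score"], reverse=True) + failed


for row in rank_results(results):
    print(row["algorithm"], row["best_score"], row["error"])
```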

### Type-Specific Optimizers

For more control, use the specific optimizer classes:

```python
from optuml.optuml import ClassifierOptimizer, RegressorOptimizer

# Classifier with all classifier-specific methods
clf = ClassifierOptimizer(
    algorithm="RandomForestClassifier",
    n_trials=100
)
clf.fit(X_train, y_train)
probas = clf.predict_proba(X_test)
decision = clf.decision_function(X_test)  # If supported

# Regressor with regression-specific defaults
reg = RegressorOptimizer(
    algorithm="RandomForestRegressor",
    n_trials=100,
    scoring="r2"  # Default for regressors
)
```

## API Reference

### Main Classes

#### `Optimizer`
Universal optimizer that automatically selects between classification and regression.

#### `ClassifierOptimizer`
Specialized optimizer for classification algorithms with methods like `predict_proba()` and `decision_function()`.

#### `RegressorOptimizer`
Specialized optimizer for regression algorithms with appropriate default scoring metrics.

### Common Parameters

| Parameter                 | Type       | Default    | Description                                |
| ------------------------- | ---------- | ---------- | ------------------------------------------ |
| `algorithm`               | str        | required   | ML algorithm to optimize                   |
| `n_trials`                | int        | 100        | Number of optimization trials              |
| `cv`                      | int        | 5          | Cross-validation folds                     |
| `scoring`                 | str/None   | Auto*      | Scoring metric for CV                      |
| `direction`               | str        | "maximize" | Optimization direction                     |
| `timeout`                 | float/None | None       | Total optimization timeout (seconds)       |
| `cv_timeout`              | float      | 120        | Single CV evaluation timeout (seconds)     |
| `random_state`            | int/None   | None       | Random seed for reproducibility            |
| `n_jobs`                  | int        | 1          | Parallel jobs for CV (-1 for all cores)    |
| `early_stopping_patience` | int/None   | None       | Trials without improvement before stopping |
| `verbose`                 | bool/int   | False      | Verbosity level                            |
| `show_progress_bar`       | bool       | False      | Show optimization progress                 |

*Auto defaults: "accuracy" for classifiers, "r2" for regressors

### Methods

| Method                 | Description                        | Available For    |
| ---------------------- | ---------------------------------- | ---------------- |
| `fit(X, y)`            | Optimize hyperparameters and train | All              |
| `predict(X)`           | Make predictions                   | All              |
| `score(X, y)`          | Evaluate model performance         | All              |
| `predict_proba(X)`     | Predict class probabilities        | Classifiers      |
| `decision_function(X)` | Get decision values                | Some classifiers |
| `get_params()`         | Get optimizer parameters           | All              |
| `set_params(**params)` | Set optimizer parameters           | All              |

### Attributes (after fitting)

| Attribute             | Description                        |
| --------------------- | ---------------------------------- |
| `best_estimator_`     | Trained model with best parameters |
| `best_params_`        | Best hyperparameters found         |
| `best_score_`         | Best cross-validation score        |
| `study_`              | Optuna study object                |
| `study_time_`         | Total optimization time (seconds)  |
| `n_trials_completed_` | Number of completed trials         |
| `classes_`            | Class labels (classifiers only)    |
| `n_features_in_`      | Number of input features           |
| `feature_names_in_`   | Feature names (if available)       |

## Troubleshooting

### Issue: "No successful trials completed"
**Solution**: Increase `cv_timeout` or reduce `cv` folds:
```python
optimizer = Optimizer(algorithm="SVC", cv_timeout=300, cv=3)
```

### Issue: CatBoost/XGBoost/LightGBM not available
**Solution**: Install optional dependencies:
```bash
pip install optuml[all]
# or individually:
pip install catboost xgboost lightgbm
```
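
To see which optional packages are missing before starting a long run, a standard-library check is enough. This is a generic sketch, not part of OptuML; `missing_backends` is a hypothetical helper.

```python
from importlib.util import find_spec

OPTIONAL_BACKENDS = ("catboost", "xgboost", "lightgbm")


def missing_backends(names=OPTIONAL_BACKENDS):
    """Return the names of packages that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_backends()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All optional backends are available")
```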

### Issue: Optimization takes too long
**Solutions**:
1. Use parallel CV: `n_jobs=-1`
2. Set timeout: `timeout=600`
3. Use early stopping: `early_stopping_patience=10`
4. Reduce trials: `n_trials=50`

### Issue: Memory errors with large datasets
**Solutions**:
1. Use algorithms with lower memory footprint (e.g., `LogisticRegression`, `SGDClassifier`, or `SGDRegressor`)
2. Reduce CV folds

## Best Practices

1. **Start with fewer trials**: Begin with `n_trials` between 20 and 50 for exploration, then increase it for the final optimization

2. **Use appropriate scoring metrics**:
   - Imbalanced classification: `"f1_weighted"`, `"roc_auc"`
   - Regression: `"r2"`, `"neg_mean_squared_error"`
   
3. **Enable early stopping** for large trial counts:
   ```python
   Optimizer(algorithm="XGBClassifier", n_trials=1000, early_stopping_patience=20)
   ```

4. **Set random state** for reproducibility:
   ```python
   Optimizer(algorithm="RandomForestClassifier", random_state=42)
   ```

5. **Use parallel processing** for faster optimization:
   ```python
   Optimizer(algorithm="RandomForestClassifier", n_jobs=-1)
   ```
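
These practices can be combined by keeping two settings profiles, one for quick exploration and one for the final run. The sketch below only builds keyword-argument dicts from parameters documented above; the profile names, the concrete values, and the `optimizer_kwargs` helper are illustrative assumptions.

```python
# Illustrative values only; tune them to your dataset and time budget
EXPLORE = {"n_trials": 30, "cv": 5, "random_state": 42}
FINAL = {**EXPLORE, "n_trials": 1000, "early_stopping_patience": 20, "n_jobs": -1}


def optimizer_kwargs(algorithm, profile):
    """Build keyword arguments for `Optimizer` from a settings profile."""
    return {"algorithm": algorithm, **profile}


kwargs = optimizer_kwargs("RandomForestClassifier", FINAL)
print(kwargs)
# With OptuML installed: clf = Optimizer(**kwargs)
```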

## Benchmark Results

See [this page](benchmark/README.md) for benchmark results.

## Citation

If you use OptuML in your research, please cite:

```bibtex
@software{stefaniak_optuml_2024,
  author       = {Filip Stefaniak},
  title        = {OptuML: Hyperparameter Optimization for Multiple Machine Learning Algorithms using Optuna},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17305963},
  url          = {https://doi.org/10.5281/zenodo.17305963}
}
```
