Metadata-Version: 2.4
Name: mlfastopt
Version: 0.0.10.1
Summary: ML Fast Opt - Advanced ensemble optimization system for LightGBM hyperparameter tuning
Author-email: GenX AI Lab <contact@genxai.cc>
License-Expression: MIT
Project-URL: Homepage, https://github.com/example/mlfastopt
Project-URL: Documentation, https://github.com/example/mlfastopt/docs
Project-URL: Repository, https://github.com/example/mlfastopt
Project-URL: Issues, https://github.com/example/mlfastopt/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ax-platform>=1.0.0
Requires-Dist: sqlalchemy<2.0.0
Requires-Dist: lightgbm<5.0.0,>=4.6.0
Requires-Dist: polars>=1.31.0
Requires-Dist: pandas<3.0.0,>=2.3.0
Requires-Dist: numpy>=2.3.1
Requires-Dist: scikit-learn<2.0.0,>=1.7.0
Requires-Dist: matplotlib<=3.10.0,>=3.5.0
Requires-Dist: joblib>=1.5.1
Requires-Dist: flask<4.0.0,>=2.2.5
Requires-Dist: plotly<7.0.0,>=6.2.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: pyarrow>=20.0.0
Requires-Dist: keyring>=25.5.0
Requires-Dist: build>=1.2.2.post1
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: gcsfs>=2025.12.0
Requires-Dist: fastparquet>=2025.12.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: xgboost>=3.1.2
Requires-Dist: shap>=0.50.0
Provides-Extra: dev
Requires-Dist: pytest>=9.0.2; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: coverage>=7.13.0; extra == "dev"
Requires-Dist: black>=25.12.0; extra == "dev"
Requires-Dist: flake8>=7.3.0; extra == "dev"
Requires-Dist: mypy>=1.19.1; extra == "dev"
Dynamic: license-file

<p align="center">
  <h1 align="center">🚀 MLFastOpt</h1>
  <p align="center">
    <strong>High-Speed Bayesian Hyperparameter Optimization for ML Ensembles</strong>
  </p>
</p>

<p align="center">
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://badge.fury.io/py/mlfastopt.svg" alt="PyPI version">
  </a>
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://img.shields.io/pypi/pyversions/mlfastopt.svg" alt="Python versions">
  </a>
  <a href="https://opensource.org/licenses/MIT">
    <img src="https://img.shields.io/badge/License-MIT-yellow.svg" alt="License: MIT">
  </a>
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://img.shields.io/pypi/dm/mlfastopt.svg" alt="Downloads">
  </a>
</p>

<p align="center">
  <a href="#-installation">Installation</a> •
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-features">Features</a> •
  <a href="#-documentation">Documentation</a> •
  <a href="#-contributing">Contributing</a>
</p>

---

**MLFastOpt** is a production-ready framework for Bayesian hyperparameter optimization of **LightGBM**, **XGBoost**, and **Random Forest** ensemble models. It combines two-phase Bayesian optimization (quasi-random exploration followed by Bayesian exploitation) with per-trial ensemble training and aggregation.

## ✨ Features

| Feature | Description |
|---------|-------------|
| 🎯 **Bayesian Optimization** | Two-phase optimization: quasi-random exploration followed by Bayesian exploitation |
| 🧩 **Multi-Model Support** | LightGBM, XGBoost, and Random Forest with unified interface |
| 🔄 **Ensemble Learning** | Train N models per trial with different seeds, aggregate via soft/hard voting |
| ⚡ **Parallel Training** | Optional parallel ensemble training with joblib |
| 💾 **Model Serialization** | Trained model objects saved to disk automatically — deploy the actual ensemble, not a retrained single model |
| 📊 **Rich Visualizations** | Auto-generated optimization plots and feature importance charts |
| 🎛️ **Flexible Configuration** | Hierarchical JSON configs with YAML/Python parameter spaces |
| 🔬 **SHAP Integration** | Built-in SHAP feature importance analysis |
| 🌐 **Web Dashboard** | Interactive Flask-based visualization tools |

## 📦 Installation

```bash
pip install mlfastopt
```

### Prerequisites

- **Python**: 3.12+
- **macOS Users**: Install OpenMP for LightGBM/XGBoost support:
  ```bash
  brew install libomp
  ```

## 🚀 Quick Start

### 1. Install the Package

```bash
pip install mlfastopt
```

### 2. Create Configuration Files

**config.json** - Main configuration:

```json
{
  "data": {
    "path": "data/train.parquet",
    "label_column": "target",
    "features": ["feature1", "feature2", "feature3"],
    "class_weight": {"0": 1, "1": 5}
  },
  "model": {
    "type": "lightgbm",
    "hyperparameter_path": "config/hyperparameters.yaml",
    "ensemble_size": 10
  },
  "training": {
    "total_trials": 30,
    "sobol_trials": 10,
    "metric": "soft_recall",
    "parallel": true,
    "n_jobs": 4
  },
  "output": {
    "dir": "outputs/runs"
  }
}
```

**config/hyperparameters.yaml** - Parameter search space:

```yaml
parameters:
  - name: learning_rate
    type: range
    bounds: [0.01, 0.3]
    value_type: float
    log_scale: true

  - name: max_depth
    type: range
    bounds: [3, 12]
    value_type: int

  - name: num_leaves
    type: range
    bounds: [20, 150]
    value_type: int

  - name: min_child_samples
    type: range
    bounds: [5, 100]
    value_type: int
```
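The search space file is plain YAML, so it is easy to sanity-check before launching a run. A quick standalone check with PyYAML (already a core dependency; this snippet is purely illustrative and not part of the mlfastopt API):

```python
import yaml  # PyYAML

# Inline copy of one search-space entry, for demonstration only.
SPACE = """
parameters:
  - name: learning_rate
    type: range
    bounds: [0.01, 0.3]
    value_type: float
    log_scale: true
"""

spec = yaml.safe_load(SPACE)
for p in spec["parameters"]:
    # Every parameter should at least define a name and bounds.
    assert "name" in p and "bounds" in p
    print(p["name"], p["bounds"])  # learning_rate [0.01, 0.3]
```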

### 3. Run Optimization

MLFastOpt offers **two ways** to run optimization:

#### Option A: Command Line (CLI)

```bash
# Set OMP_NUM_THREADS=1 to avoid LightGBM/XGBoost deadlocks
OMP_NUM_THREADS=1 mlfastopt-optimize --config config.json
```

**Additional CLI options:**
```bash
# Validate configuration without running
mlfastopt-optimize --config config.json --validate

# Override trials from command line
mlfastopt-optimize --config config.json --trials 50

# Start web dashboard
mlfastopt-web

# Analysis tools
mlfastopt-analyze
```

#### Option B: Python API

```python
from mlfastopt import AEModelTuner

# Initialize with config file
tuner = AEModelTuner(config_path="config.json")

# Run optimization
results = tuner.run_complete_optimization()

# Access results programmatically
print(f"Best parameters: {results['best_parameters']}")
print(f"Output directory: {results['output_dir']}")
```

| Method | Best For |
|--------|----------|
| **CLI** | Quick runs, shell scripts, cron jobs, CI/CD pipelines |
| **Python API** | Jupyter notebooks, integration with larger applications, programmatic access to results |

### 4. View Results

Results are saved to `outputs/runs/<timestamp>/`:
- `best_parameters.json` — Optimal hyperparameters + metrics (always written)
- `qualifying_trials_*.json` — All trials meeting the threshold, with per-trial params + metrics
- `models/manifest.json` — Index of every serialized model file
- `models/trial_NNNN_seed_SS.txt` — Trained model binaries (LightGBM native format; `.ubj` for XGBoost, `.pkl` for Random Forest)
- `optimization_progress.png` — Training curves
- `feature_importance.png` — Feature importance plots
- `README.md` — Run summary report

## 📖 How It Works

MLFastOpt uses a **two-level nested optimization loop**:

```
┌─────────────────────────────────────────────────────────────────┐
│ OUTER LOOP: Trial Iteration (total_trials = 30)                │
│                                                                 │
│  Trial 1: {learning_rate: 0.05, max_depth: 7, ...}             │
│  ├── Train Model 1 (seed=42)                                   │
│  ├── Train Model 2 (seed=43)                                   │
│  ├── ...                                                        │
│  └── Train Model 10 (seed=51)                                  │
│  └── Ensemble Prediction → Calculate Metrics → Update Optimizer│
│                                                                 │
│  Trial 2: {learning_rate: 0.12, max_depth: 5, ...}             │
│  └── ... (same ensemble training)                               │
│                                                                 │
│  Phase 1: Quasi-random exploration (sobol_trials)              │
│  Phase 2: Bayesian optimization (remaining trials)             │
└─────────────────────────────────────────────────────────────────┘
```

**Key concepts:**
- **Trial**: One hyperparameter configuration tested
- **Ensemble**: N models trained per trial (different random seeds)
- **Soft Voting**: Average probabilities across ensemble members
- **Hard Voting**: Average binary predictions across ensemble members
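
The two voting schemes can be sketched in a few lines of NumPy (illustrative only, assuming a 0.5 decision threshold; this is not mlfastopt's internal code):

```python
import numpy as np

# Positive-class probabilities from a 3-model ensemble on 3 samples
# (rows = ensemble members, columns = samples). Made-up numbers.
probas = np.array([
    [0.9, 0.70, 0.2],
    [0.8, 0.45, 0.1],
    [0.7, 0.45, 0.3],
])

# Soft voting: average the probabilities first, then threshold once.
soft = (probas.mean(axis=0) >= 0.5).astype(int)           # [1 1 0]

# Hard voting: threshold each member first, then take the majority.
hard = ((probas >= 0.5).mean(axis=0) >= 0.5).astype(int)  # [1 0 0]
```

The middle sample shows where the two schemes disagree: the averaged probability clears 0.5, but only one of the three members votes positive.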

## ⚙️ Configuration Reference

### Data Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `path` | `string` | Path to dataset (CSV, Parquet, or URL) | Required |
| `label_column` | `string` | Target column name | Required |
| `features` | `list/string` | Feature names or path to YAML file | Required |
| `class_weight` | `dict` | Class weights for imbalanced data | `None` |
| `test_size` | `float` | Validation set proportion | `0.2` |

### Model Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `type` | `string` | `lightgbm`, `xgboost`, or `random_forest` | `lightgbm` |
| `hyperparameter_path` | `string` | Path to parameter space file | Required |
| `ensemble_size` | `int` | Models per ensemble | `10` |

### Training Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `total_trials` | `int` | Total optimization trials | `30` |
| `sobol_trials` | `int` | Initial exploration trials | `10` |
| `metric` | `string` | Optimization metric | `soft_recall` |
| `parallel` | `bool` | Parallel ensemble training | `false` |
| `n_jobs` | `int` | CPU cores for parallel training | `4` |

### Selection Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `threshold_saving_enabled` | `bool` | Save all trials meeting the metric threshold (and serialize their model files) | `true` |
| `metric` | `string` | Metric used for threshold comparison | `soft_recall` |
| `threshold_value` | `float` | Minimum metric value to qualify a trial for saving | `0.85` |
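
Threshold-based selection keeps every trial whose chosen metric clears `threshold_value`, not just the single best one. The rule amounts to a simple filter; the sketch below uses made-up trial records, not the actual `qualifying_trials_*.json` schema:

```python
THRESHOLD = 0.85  # selection.threshold_value

# Hypothetical per-trial metrics collected during optimization.
trials = [
    {"trial": 1, "soft_recall": 0.91},
    {"trial": 2, "soft_recall": 0.78},
    {"trial": 3, "soft_recall": 0.87},
]

qualifying = [t for t in trials if t["soft_recall"] >= THRESHOLD]
print([t["trial"] for t in qualifying])  # [1, 3]
```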

### Available Metrics

| Metric | Description |
|--------|-------------|
| `soft_recall` | Recall using probability averaging |
| `soft_f1_score` | F1 score using soft voting |
| `soft_precision` | Precision using soft voting |
| `soft_roc_auc` | AUC-ROC score |
| `neg_log_loss` | Negative log loss |
| `hard_recall` | Recall using hard voting |
| `hard_f1_score` | F1 using hard voting |
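
The `soft_*` metrics correspond to standard scikit-learn metrics applied to the ensemble-averaged probabilities. A sketch with made-up numbers and a 0.5 threshold (illustrative, not the library's internals):

```python
import numpy as np
from sklearn.metrics import log_loss, recall_score, roc_auc_score

y_true = np.array([1, 1, 0, 0])                  # validation labels
soft_proba = np.array([0.80, 0.53, 0.43, 0.20])  # ensemble-averaged probabilities

soft_pred = (soft_proba >= 0.5).astype(int)      # threshold at 0.5

soft_recall = recall_score(y_true, soft_pred)    # 1.0: both positives caught
soft_roc_auc = roc_auc_score(y_true, soft_proba) # 1.0: positives ranked above negatives
neg_log_loss = -log_loss(y_true, soft_proba)     # negated so that higher is better
```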

## 📊 Output Files

After optimization, find results in `outputs/runs/<timestamp>/`:

```
outputs/runs/20240205_143022/
├── best_parameters.json        # Best trial's hyperparameters & metrics (always written)
├── qualifying_trials_*.json    # All threshold-qualifying trials (threshold mode)
├── config.json                 # Configuration used for this run
├── optimization_progress.png   # Metric curves across all trials
├── feature_importance.png      # Feature importance chart
├── feature_importance.csv      # Numerical importance data
├── README.md                   # Run summary report
└── models/
    ├── manifest.json           # Index: trial → seed → file path + metrics
    ├── trial_0003_seed_00.txt  # LightGBM native format (.ubj for XGBoost,
    ├── trial_0003_seed_01.txt  #   .pkl for RandomForest)
    └── ...                     # One file per sub-model in each qualifying trial
```

### Loading Saved Models for Inference

```python
import json
import lightgbm as lgb
import numpy as np

# Read the manifest
with open("outputs/runs/<timestamp>/models/manifest.json") as f:
    manifest = json.load(f)

# Load all sub-models for the first qualifying trial
trial = manifest["trials"][0]
models = [lgb.Booster(model_file=sub["file"]) for sub in trial["sub_models"]]

# Ensemble soft-vote prediction
probas = np.mean([m.predict(X_new) for m in models], axis=0)
```

> **Why save model files?** Metrics reported during optimization reflect ensemble performance
> (N models averaged together). Deploying the saved ensemble directly guarantees you get the
> same performance at inference — no re-training required.


## 📧 Support

For questions, issues, or feature requests, please contact us at [contact@genxai.cc](mailto:contact@genxai.cc).

## 📄 License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## 🏢 About

Developed by [GenX AI Lab](https://genxai.cc), building intelligent AI solutions.
