Metadata-Version: 2.4
Name: mlfastopt
Version: 0.0.10
Summary: ML Fast Opt - Advanced ensemble optimization system for LightGBM hyperparameter tuning
Author-email: GenX AI Lab <contact@genxai.cc>
License: GENX AI LAB COMMUNITY LICENSE AGREEMENT
        Version 1.0
        
        Copyright (c) 2025 GenX AI Lab. All Rights Reserved.
        
        1. GRANT OF LICENSE
        GenX AI Lab ("Licensor") hereby grants to you ("Licensee") a non-exclusive, non-transferable, revocable license to install and use the software "mlfastopt" (the "Software") for personal, educational, research, and internal business purposes, free of charge.
        
        2. RESTRICTIONS
        You may NOT:
        (a) Distribute, sub-license, rent, lease, or sell the Software or any derivative works thereof;
        (b) Modify, reverse engineer, decompile, or disassemble the Software, except to the extent that such activity is expressly permitted by applicable law notwithstanding this limitation;
        (c) Remove or alter any copyright, trademark, or other proprietary notices from the Software.
        
        3. OWNERSHIP
        The Software is licensed, not sold. GenX AI Lab retains all ownership, title, copyright, and other intellectual property rights in the Software. This license does not grant you any rights to use GenX AI Lab's trademarks or service marks.
        
        4. DISCLAIMER OF WARRANTY
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT, OR OTHERWISE, ARISING FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
        
        5. TERMINATION
        This license is effective until terminated. Your rights under this license will terminate automatically without notice from the Licensor if you fail to comply with any term(s) of this license. Upon termination, you shall cease all use of the Software and destroy all copies, full or partial, of the Software.
        
Project-URL: Homepage, https://github.com/example/mlfastopt
Project-URL: Documentation, https://github.com/example/mlfastopt/docs
Project-URL: Repository, https://github.com/example/mlfastopt
Project-URL: Issues, https://github.com/example/mlfastopt/issues
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: ax-platform>=1.0.0
Requires-Dist: sqlalchemy<2.0.0
Requires-Dist: lightgbm<5.0.0,>=4.6.0
Requires-Dist: polars>=1.31.0
Requires-Dist: pandas<3.0.0,>=2.3.0
Requires-Dist: numpy>=2.3.1
Requires-Dist: scikit-learn<2.0.0,>=1.7.0
Requires-Dist: matplotlib<=3.10.0,>=3.5.0
Requires-Dist: joblib>=1.5.1
Requires-Dist: flask<4.0.0,>=2.2.5
Requires-Dist: plotly<7.0.0,>=6.2.0
Requires-Dist: seaborn>=0.13.2
Requires-Dist: pyarrow>=20.0.0
Requires-Dist: keyring>=25.5.0
Requires-Dist: build>=1.2.2.post1
Requires-Dist: PyYAML>=6.0.2
Requires-Dist: gcsfs>=2025.12.0
Requires-Dist: fastparquet>=2025.12.0
Requires-Dist: openpyxl>=3.1.5
Requires-Dist: xgboost>=3.1.2
Requires-Dist: shap>=0.50.0
Provides-Extra: dev
Requires-Dist: pytest>=9.0.2; extra == "dev"
Requires-Dist: pytest-cov>=7.0.0; extra == "dev"
Requires-Dist: coverage>=7.13.0; extra == "dev"
Requires-Dist: black>=25.12.0; extra == "dev"
Requires-Dist: flake8>=7.3.0; extra == "dev"
Requires-Dist: mypy>=1.19.1; extra == "dev"
Dynamic: license-file

<p align="center">
  <h1 align="center">🚀 MLFastOpt</h1>
  <p align="center">
    <strong>High-Speed Bayesian Hyperparameter Optimization for ML Ensembles</strong>
  </p>
</p>

<p align="center">
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://badge.fury.io/py/mlfastopt.svg" alt="PyPI version">
  </a>
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://img.shields.io/pypi/pyversions/mlfastopt.svg" alt="Python versions">
  </a>
  <a href="https://github.com/example/mlfastopt/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/License-Proprietary-lightgrey.svg" alt="License: Proprietary">
  </a>
  <a href="https://pypi.org/project/mlfastopt/">
    <img src="https://img.shields.io/pypi/dm/mlfastopt.svg" alt="Downloads">
  </a>
</p>

<p align="center">
  <a href="#-installation">Installation</a> •
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-features">Features</a> •
  <a href="#-how-it-works">How It Works</a> •
  <a href="#-support">Support</a>
</p>

---

**MLFastOpt** is a production-ready framework for Bayesian hyperparameter optimization of **LightGBM**, **XGBoost**, and **Random Forest** ensemble models. It combines state-of-the-art Bayesian optimization algorithms with ensemble learning techniques.

## ✨ Features

| Feature | Description |
|---------|-------------|
| 🎯 **Bayesian Optimization** | Two-phase optimization: quasi-random exploration followed by Bayesian exploitation |
| 🧩 **Multi-Model Support** | LightGBM, XGBoost, and Random Forest with unified interface |
| 🔄 **Ensemble Learning** | Train N models per trial with different seeds, aggregate via soft/hard voting |
| ⚡ **Parallel Training** | Optional parallel ensemble training with joblib |
| 💾 **Model Serialization** | Trained model objects saved to disk automatically — deploy the actual ensemble, not a retrained single model |
| 📊 **Rich Visualizations** | Auto-generated optimization plots and feature importance charts |
| 🎛️ **Flexible Configuration** | Hierarchical JSON configs with YAML/Python parameter spaces |
| 🔬 **SHAP Integration** | Built-in SHAP feature importance analysis |
| 🌐 **Web Dashboard** | Interactive Flask-based visualization tools |

## 📦 Installation

```bash
pip install mlfastopt
```

### Prerequisites

- **Python**: 3.12+
- **macOS Users**: Install OpenMP for LightGBM/XGBoost support:
  ```bash
  brew install libomp
  ```

## 🚀 Quick Start

### 1. Install the Package

```bash
pip install mlfastopt
```

### 2. Create Configuration Files

**config.json** - Main configuration:

```json
{
  "data": {
    "path": "data/train.parquet",
    "label_column": "target",
    "features": ["feature1", "feature2", "feature3"],
    "class_weight": {"0": 1, "1": 5}
  },
  "model": {
    "type": "lightgbm",
    "hyperparameter_path": "config/hyperparameters.yaml",
    "ensemble_size": 10
  },
  "training": {
    "total_trials": 30,
    "sobol_trials": 10,
    "metric": "soft_recall",
    "parallel": true,
    "n_jobs": 4
  },
  "output": {
    "dir": "outputs/runs"
  }
}
```

**config/hyperparameters.yaml** - Parameter search space:

```yaml
parameters:
  - name: learning_rate
    type: range
    bounds: [0.01, 0.3]
    value_type: float
    log_scale: true

  - name: max_depth
    type: range
    bounds: [3, 12]
    value_type: int

  - name: num_leaves
    type: range
    bounds: [20, 150]
    value_type: int

  - name: min_child_samples
    type: range
    bounds: [5, 100]
    value_type: int
```
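The features table above also mentions Python parameter spaces. The YAML entries mirror Ax-style parameter dictionaries (ax-platform is a core dependency), so a hypothetical Python equivalent of the same search space is a plain list of dicts. How `hyperparameter_path` would point at a Python module is an assumption; check the package docs for the exact hook:

```python
# Hypothetical Python equivalent of config/hyperparameters.yaml.
# Ax-style parameter dictionaries; the loader hook for .py files is assumed.
PARAMETERS = [
    {"name": "learning_rate", "type": "range", "bounds": [0.01, 0.3],
     "value_type": "float", "log_scale": True},
    {"name": "max_depth", "type": "range", "bounds": [3, 12],
     "value_type": "int"},
    {"name": "num_leaves", "type": "range", "bounds": [20, 150],
     "value_type": "int"},
    {"name": "min_child_samples", "type": "range", "bounds": [5, 100],
     "value_type": "int"},
]
```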

### 3. Run Optimization

MLFastOpt offers **two ways** to run optimization:

#### Option A: Command Line (CLI)

```bash
# Set OMP_NUM_THREADS=1 to avoid LightGBM/XGBoost deadlocks
OMP_NUM_THREADS=1 mlfastopt-optimize --config config.json
```

**Additional CLI options:**
```bash
# Validate configuration without running
mlfastopt-optimize --config config.json --validate

# Override trials from command line
mlfastopt-optimize --config config.json --trials 50

# Start web dashboard
mlfastopt-web

# Analysis tools
mlfastopt-analyze
```

#### Option B: Python API

```python
from mlfastopt import AEModelTuner

# Initialize with config file
tuner = AEModelTuner(config_path="config.json")

# Run optimization
results = tuner.run_complete_optimization()

# Access results programmatically
print(f"Best parameters: {results['best_parameters']}")
print(f"Output directory: {results['output_dir']}")
```

| Method | Best For |
|--------|----------|
| **CLI** | Quick runs, shell scripts, cron jobs, CI/CD pipelines |
| **Python API** | Jupyter notebooks, integration with larger applications, programmatic access to results |

### 4. View Results

Results are saved to `outputs/runs/<timestamp>/`:
- `best_parameters.json` — Optimal hyperparameters + metrics (always written)
- `qualifying_trials_*.json` — All trials meeting the threshold, with per-trial params + metrics
- `models/manifest.json` — Index of every serialized model file
- `models/trial_NNNN_seed_SS.txt` — Trained model binaries (LightGBM native format; `.pkl` for other types)
- `optimization_progress.png` — Training curves
- `feature_importance.png` — Feature importance plots
- `README.md` — Run summary report

## 📖 How It Works

MLFastOpt uses a **two-level nested optimization loop**:

```
┌─────────────────────────────────────────────────────────────────┐
│ OUTER LOOP: Trial Iteration (total_trials = 30)                │
│                                                                 │
│  Trial 1: {learning_rate: 0.05, max_depth: 7, ...}             │
│  ├── Train Model 1 (seed=42)                                   │
│  ├── Train Model 2 (seed=43)                                   │
│  ├── ...                                                        │
│  └── Train Model 10 (seed=51)                                  │
│  └── Ensemble Prediction → Calculate Metrics → Update Optimizer│
│                                                                 │
│  Trial 2: {learning_rate: 0.12, max_depth: 5, ...}             │
│  └── ... (same ensemble training)                               │
│                                                                 │
│  Phase 1: Quasi-random exploration (sobol_trials)              │
│  Phase 2: Bayesian optimization (remaining trials)             │
└─────────────────────────────────────────────────────────────────┘
```

**Key concepts:**
- **Trial**: One hyperparameter configuration tested
- **Ensemble**: N models trained per trial (different random seeds)
- **Soft Voting**: Average probabilities across ensemble members, then threshold once (see the sketch below)
- **Hard Voting**: Threshold each member's prediction first, then take the majority vote
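
To make the two voting schemes concrete, here is a minimal NumPy sketch of the aggregation step. The numbers are illustrative, and the package's internal implementation may differ in details such as the 0.5 threshold:

```python
import numpy as np

# Predicted probabilities for 4 samples from an ensemble of 3 models.
probas = np.array([
    [0.9, 0.4, 0.6, 0.2],  # model 1 (seed=42)
    [0.8, 0.6, 0.4, 0.1],  # model 2 (seed=43)
    [0.7, 0.3, 0.7, 0.3],  # model 3 (seed=44)
])

# Soft voting: average probabilities first, threshold once.
soft_proba = probas.mean(axis=0)             # [0.8, 0.433, 0.567, 0.2]
soft_pred = (soft_proba > 0.5).astype(int)   # [1, 0, 1, 0]

# Hard voting: threshold each model first, then majority vote.
votes = (probas > 0.5).astype(int)
hard_pred = (votes.mean(axis=0) > 0.5).astype(int)  # [1, 0, 1, 0]
```

Soft voting preserves each model's confidence, which is why borderline samples (like the 0.567 above) can land differently under the two schemes on other inputs.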

## ⚙️ Configuration Reference

### Data Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `path` | `string` | Path to dataset (CSV, Parquet, or URL) | Required |
| `label_column` | `string` | Target column name | Required |
| `features` | `list/string` | Feature names or path to YAML file | Required |
| `class_weight` | `dict` | Class weights for imbalanced data | `None` |
| `test_size` | `float` | Validation set proportion | `0.2` |
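
When `features` is a path rather than an inline list, a YAML file keeps long feature lists out of the main config. The layout below is a hypothetical sketch; verify the expected schema against the package docs:

```yaml
# config/features.yaml (hypothetical layout; the exact schema is an assumption)
features:
  - feature1
  - feature2
  - feature3
```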

### Model Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `type` | `string` | `lightgbm`, `xgboost`, or `random_forest` | `lightgbm` |
| `hyperparameter_path` | `string` | Path to parameter space file | Required |
| `ensemble_size` | `int` | Models per ensemble | `10` |

### Training Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `total_trials` | `int` | Total optimization trials | `30` |
| `sobol_trials` | `int` | Initial exploration trials | `10` |
| `metric` | `string` | Optimization metric | `soft_recall` |
| `parallel` | `bool` | Parallel ensemble training | `false` |
| `n_jobs` | `int` | CPU cores for parallel training | `4` |

### Selection Section

| Parameter | Type | Description | Default |
|-----------|------|-------------|---------|
| `threshold_saving_enabled` | `bool` | Save all trials meeting the metric threshold (and serialize their model files) | `true` |
| `metric` | `string` | Metric used for threshold comparison | `soft_recall` |
| `threshold_value` | `float` | Minimum metric value to qualify a trial for saving | `0.85` |
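
The quick-start `config.json` omits this section. Assuming a top-level `selection` key alongside `data`/`model`/`training` (the nesting is an assumption; the field names come from the table above), enabling threshold saving would look like:

```json
{
  "selection": {
    "threshold_saving_enabled": true,
    "metric": "soft_recall",
    "threshold_value": 0.85
  }
}
```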

### Available Metrics

| Metric | Description |
|--------|-------------|
| `soft_recall` | Recall using probability averaging |
| `soft_f1_score` | F1 score using soft voting |
| `soft_precision` | Precision using soft voting |
| `soft_roc_auc` | AUC-ROC score |
| `neg_log_loss` | Negative log loss |
| `hard_recall` | Recall using hard voting |
| `hard_f1_score` | F1 using hard voting |

## 📊 Output Files

After optimization, find results in `outputs/runs/<timestamp>/`:

```
outputs/runs/20240205_143022/
├── best_parameters.json        # Best trial's hyperparameters & metrics (always written)
├── qualifying_trials_*.json    # All threshold-qualifying trials (threshold mode)
├── config.json                 # Configuration used for this run
├── optimization_progress.png   # Metric curves across all trials
├── feature_importance.png      # Feature importance chart
├── feature_importance.csv      # Numerical importance data
├── README.md                   # Run summary report
└── models/
    ├── manifest.json           # Index: trial → seed → file path + metrics
    ├── trial_0003_seed_00.txt  # LightGBM native format (.ubj for XGBoost,
    ├── trial_0003_seed_01.txt  #   .pkl for RandomForest)
    └── ...                     # One file per sub-model in each qualifying trial
```

### Loading Saved Models for Inference

```python
import json
import lightgbm as lgb
import numpy as np

# Read the manifest
with open("outputs/runs/<timestamp>/models/manifest.json") as f:
    manifest = json.load(f)

# Load all sub-models for the first qualifying trial
trial = manifest["trials"][0]
models = [lgb.Booster(model_file=sub["file"]) for sub in trial["sub_models"]]

# Ensemble soft-vote prediction (X_new: 2-D array with the training feature columns)
probas = np.mean([m.predict(X_new) for m in models], axis=0)
```
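
If the run was tuned on a hard metric, the matching hard-vote aggregation is a short variant of the same loop (0.5 thresholds assumed, as in the voting sketch above):

```python
# Hard-vote prediction: majority of per-model 0/1 votes.
votes = np.stack([(m.predict(X_new) > 0.5).astype(int) for m in models])
preds = (votes.mean(axis=0) > 0.5).astype(int)
```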

> **Why save model files?** Metrics reported during optimization reflect ensemble performance
> (N models averaged together). Deploying the saved ensemble directly guarantees you get the
> same performance at inference — no re-training required.


## 📧 Support

For questions, issues, or feature requests, please contact us at [contact@genxai.cc](mailto:contact@genxai.cc).

## 📄 License

This is proprietary software. See the [LICENSE](LICENSE) file for details.

## 🏢 About

Developed by [GenX AI Lab](https://genxai.cc) - Building intelligent AI solutions.
