Metadata-Version: 2.4
Name: pymlpipeline
Version: 1.0.1
Summary: End-to-End ML Pipeline — BigQuery / GCS / Vertex AI + Local CSV
Home-page: https://gitlab.com/chears_package/pymlpipeline
Author: CHEARS
Author-email: CHEARS <project@chears.in>
License: MIT
Project-URL: Homepage, https://gitlab.com/chears_package/pymlpipeline
Project-URL: Repository, https://gitlab.com/chears_package/pymlpipeline
Project-URL: Bug Tracker, https://gitlab.com/chears_package/pymlpipeline
Project-URL: Changelog, https://gitlab.com/chears_package/pymlpipeline
Keywords: machine-learning,mlops,bigquery,gcp,preprocessing,scikit-learn,xgboost,lightgbm,catboost,pipeline
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: PyYAML>=6.0
Requires-Dist: python-docx>=1.0.0
Provides-Extra: gcp
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == "gcp"
Requires-Dist: google-cloud-storage>=2.10.0; extra == "gcp"
Requires-Dist: google-auth>=2.22.0; extra == "gcp"
Requires-Dist: db-dtypes>=1.1.0; extra == "gcp"
Provides-Extra: vertex
Requires-Dist: google-cloud-aiplatform>=1.38.0; extra == "vertex"
Provides-Extra: boosting
Requires-Dist: xgboost>=2.0.0; extra == "boosting"
Requires-Dist: lightgbm>=4.0.0; extra == "boosting"
Requires-Dist: catboost>=1.2.0; extra == "boosting"
Provides-Extra: all
Requires-Dist: pymlpipeline[gcp]; extra == "all"
Requires-Dist: pymlpipeline[vertex]; extra == "all"
Requires-Dist: pymlpipeline[boosting]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Dynamic: license-file

# pymlpipeline

**End-to-End ML Pipeline** — data cleaning, model training, evaluation and prediction.  
Works with **GCP** (BigQuery / GCS / Vertex AI) and **local CSV** files from the same config.

---

## Features

| | Preprocessor | Model Builder | Predictor |
|---|---|---|---|
| **Data input** | BigQuery, Local CSV, Demo | BigQuery, GCS, Local CSV | BigQuery, GCS, Local CSV |
| **Output** | Cleaned CSV + BQ table + GCS | Trained `.pkl` models + GCS | Predictions CSV |
| **Report** | Word `.docx` preprocessing report | Word `.docx` model evaluation report | — |
| **Environment** | GCP or local (auto-detected) | GCP or local | GCP or local |

### Preprocessor
- Reads from BigQuery (4 query modes) or local CSV
- Full column profile CSV uploaded to GCS before target selection
- Target encoding, stratified reload, identifier sidecar
- Keyword drop, high-null drop, dtype normalisation, imputation, outlier handling
- EDA charts, correlation filter, one-hot/label encoding, normalisation
- Writes output to BigQuery and/or local folder

### Model Builder
- Reads from BigQuery output table or local CSV
- **81 models**: sklearn (33 classifiers, 25 regressors, 23 clusterers) + XGBoost, LightGBM, CatBoost
- 5-method feature importance (MI, F-stat, Random Forest, Permutation, RFE)
- Correlation-based top-N feature selection
- Full evaluation: AUC-ROC, PR curve, MCC, Kappa, Log-Loss, Brier score, calibration plot, learning curve
- AI-generated training script via **Gemini 2.5 Pro** on Vertex AI (no API key)
- Saves all `.pkl` models + `best_model.pkl` + `predict.py` to GCS and locally

---

## Installation

```bash
# Core only (local CSV, no GCP)
pip install pymlpipeline

# With GCP support (BigQuery + GCS)
pip install "pymlpipeline[gcp]"

# With Vertex AI / Gemini code generation
pip install "pymlpipeline[gcp,vertex]"

# With XGBoost, LightGBM, CatBoost
pip install "pymlpipeline[gcp,vertex,boosting]"

# Everything
pip install "pymlpipeline[all]"
```

**Python 3.10+ required.**

---

## Quick Start

### 1 · Initialise config

```bash
pymlpipeline init
# Creates pipeline_config.yaml in the current directory
# Edit it for your environment (see Configuration below)
```

### 2 · Preprocess data

```bash
pymlpipeline preprocess --config pipeline_config.yaml
```

Outputs:
```
ml_pipeline_output/2026-03-21_14-30-00/
  profile/   column_profile.csv          ← review this first
  output/    processed_output.csv
  report/    ML_Preprocessing_Report.docx
  charts/    *.png
```

### 3 · Train models

```bash
pymlpipeline build --config pipeline_config.yaml
```

Outputs:
```
ml_model_output/2026-03-21_14-35-00/
  models/    *.pkl  best_model.pkl  scaler.pkl  predict.py
  charts/    confusion matrix, ROC, PR, learning curve, calibration, ...
  report/    ML_Model_Report.docx  results.json
  code/      model_training_code.py  gemini_prompt.txt
```

### 4 · Predict on new data

```bash
# Local CSV
pymlpipeline predict \
  --model  ml_model_output/.../models/best_model.pkl \
  --scaler ml_model_output/.../models/scaler.pkl \
  --data   new_data.csv

# BigQuery table
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --bq     my-project.my_dataset.new_customers

# GCS file
pymlpipeline predict \
  --model  models/best_model.pkl \
  --scaler models/scaler.pkl \
  --gcs    gs://my-bucket/data/new_data.csv
```
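The CLI above is the supported entry point, but the saved artifacts can also be loaded directly in Python. A minimal sketch, assuming the `.pkl` files are standard pickled scikit-learn objects (the package defines the actual serialization format, and `predict_csv` is a hypothetical helper, not part of the package API):

```python
import pickle

import pandas as pd


def predict_csv(model_path: str, scaler_path: str, data_path: str) -> pd.DataFrame:
    """Load a pickled model and scaler, then score a CSV of features."""
    with open(model_path, "rb") as f:
        model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        scaler = pickle.load(f)

    df = pd.read_csv(data_path)
    X = scaler.transform(df)  # apply the same scaling used at training time
    df["prediction"] = model.predict(X)
    return df
```

The generated `predict.py` in the models folder plays a similar role and should be preferred when available.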

---

## Configuration

A single `pipeline_config.yaml` controls both tools. Run `pymlpipeline init` to get a pre-filled template.

### Environment

```yaml
pipeline:
  environment: "auto"    # auto | gcp | local
  data_source: "bigquery"  # bigquery | csv | demo
```

| `environment` | Behaviour |
|---|---|
| `auto` | GCP if `google-cloud-*` + ADC credentials are available, otherwise local |
| `gcp`  | Force GCP mode — fail clearly if libraries/credentials are missing |
| `local`| Skip all GCP calls; read/write local files only |

### Local mode (no GCP)

```yaml
pipeline:
  environment: "local"
  data_source: "csv"

local:
  csv_path:   "/path/to/your/data.csv"   # single file
  csv_folder: ""                          # or point to a folder (newest CSV used)
  separator:  ","
  encoding:   "utf-8"
```

### GCP mode

```yaml
pipeline:
  environment: "gcp"
  data_source: "bigquery"

bigquery:
  project_id:  "my-gcp-project"
  dataset_id:  "my_dataset"
  table_id:    "my_table"
  query_mode:  "full_table"   # full_table | columns | filter | custom_sql

gcs:
  bucket:      "my-ml-bucket"
  base_folder: "preprocessing/runs"
```

### GCP authentication (no API key needed)

```bash
# Local development
gcloud auth application-default login

# CI / servers — set env var
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# GCE / Cloud Run / GKE — automatic, no setup needed
```

### Gemini 2.5 Pro code generation

```yaml
gemini:
  vertex_project:  "my-gcp-project"   # billing target
  vertex_location: "us-central1"
```

No API key — uses ADC. Falls back to a static template if Vertex AI is unavailable.
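The fallback behaviour can be pictured as a try/except around the Vertex AI call. A sketch under assumptions (the model name, template string, and `generate_training_code` helper are illustrative, not the package's API):

```python
STATIC_TEMPLATE = "# fallback training script template\n"


def generate_training_code(prompt: str) -> str:
    """Ask Gemini on Vertex AI for a training script; fall back to a
    static template if the library or ADC credentials are unavailable."""
    try:
        from vertexai.generative_models import GenerativeModel

        return GenerativeModel("gemini-2.5-pro").generate_content(prompt).text
    except Exception:
        return STATIC_TEMPLATE
```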

---

## Python API

Both tools can be used programmatically:

```python
from pymlpipeline import run_pipeline, run_model_builder
from pymlpipeline import preprocessor_cfg, model_cfg

# Preprocessing
preprocessor_cfg.load("pipeline_config.yaml")
df_clean, df_ids, report_path = run_pipeline()

# Model building
model_cfg.load("pipeline_config.yaml")
run_model_builder()
```

---

## Models Available

### Classification (33 total + XGBoost/LightGBM/CatBoost when installed)

| Category | Models |
|---|---|
| 🚀 Boosting | Gradient Boosting, Hist GBM, AdaBoost, **XGBoost**, **XGBoost(dart)**, **LightGBM**, **LightGBM(DART/GOSS)**, **CatBoost**, **CatBoost(balanced)** |
| 🌲 Forest | Random Forest, Extra Trees |
| 📐 Linear | Logistic Regression (L1/L2), Ridge, SGD, Passive-Aggressive, Perceptron |
| ⚡ SVM | RBF, Linear, Poly, Nu-SVM, Linear SVC |
| 🧠 Neural | MLP (3 sizes) |
| 📍 KNN | k=3, k=5, k=11 |
| 📊 Naive Bayes | Gaussian, Bernoulli, Complement |
| Others | LDA, QDA, Gaussian Process, Label Spreading/Propagation |

### Regression (25 + optional), Clustering (23 algorithms)

Full lists are shown in the interactive model-selection menu.

---

## Supported Evaluation Metrics

**Classification:** Accuracy, Precision, Recall, F1 (weighted), ROC-AUC, Average Precision, MCC, Cohen's Kappa, Log-Loss, Brier Score, CV score  
**Regression:** MAE, RMSE, R², MAPE, CV R²  
**Clustering:** Silhouette, Calinski-Harabasz, Davies-Bouldin
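Most of the classification metrics listed above map directly onto scikit-learn functions. A short sketch computing a few of them on toy predictions (toy data is illustrative only):

```python
from sklearn.metrics import (
    accuracy_score,
    brier_score_loss,
    cohen_kappa_score,
    f1_score,
    log_loss,
    matthews_corrcoef,
)

y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6]  # predicted probability of class 1

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
    "mcc": matthews_corrcoef(y_true, y_pred),
    "kappa": cohen_kappa_score(y_true, y_pred),
    "log_loss": log_loss(y_true, y_prob),
    "brier": brier_score_loss(y_true, y_prob),
}
```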


---

## License

MIT — see [LICENSE](LICENSE)
