Metadata-Version: 2.4
Name: caketool
Version: 1.9.0
Summary: Common tools for MLOps
Author-email: Khoi Do <khoidd@outlook.com>
License-Expression: MIT
Project-URL: Homepage, https://mazino2d.github.io/caketool
Project-URL: Repository, https://github.com/mazino2d/caketool
Keywords: cake,ds,mlops
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: tqdm>=4.60
Requires-Dist: xgboost>=2.0
Requires-Dist: category-encoders>=2.6
Requires-Dist: python-dotenv>=1.0
Requires-Dist: shap>=0.45
Requires-Dist: plotly>=5.0
Requires-Dist: nbformat>=5.0
Requires-Dist: google-cloud-bigquery>=3.0
Provides-Extra: onprem
Requires-Dist: mlflow>=2.0; extra == "onprem"
Provides-Extra: gcp
Requires-Dist: google-cloud-aiplatform>=1.50; extra == "gcp"
Requires-Dist: google-cloud-bigquery>=3.0; extra == "gcp"
Requires-Dist: google-cloud-storage>=2.0; extra == "gcp"
Requires-Dist: bigframes>=1.0; extra == "gcp"
Provides-Extra: wandb
Requires-Dist: wandb>=0.17; extra == "wandb"
Provides-Extra: spark
Requires-Dist: pyspark>=3.3; extra == "spark"
Provides-Extra: polars
Requires-Dist: polars>=0.20; extra == "polars"
Provides-Extra: all
Requires-Dist: caketool[onprem]; extra == "all"
Requires-Dist: caketool[gcp]; extra == "all"
Requires-Dist: caketool[wandb]; extra == "all"
Requires-Dist: caketool[spark]; extra == "all"
Requires-Dist: caketool[polars]; extra == "all"
Provides-Extra: docs
Requires-Dist: pdoc>=14.0; extra == "docs"
Requires-Dist: mkdocs-material[imaging]>=9.0; extra == "docs"
Provides-Extra: dev
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Requires-Dist: pip-tools; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ipykernel; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Dynamic: license-file

# Caketool

A Python MLOps toolkit for common machine learning and data science workflows.
It provides EDA, feature engineering, model training, explainability, experiment tracking, model monitoring, calibration, and metrics for production ML pipelines.

---

## Installation

```bash
# Core (pandas, numpy, sklearn, xgboost)
pip install caketool

# With MLflow support
pip install "caketool[onprem]"

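# With Google Cloud (Vertex AI, BigQuery)
pip install "caketool[gcp]"

# With Weights & Biases, Spark, or Polars support
pip install "caketool[wandb]"
pip install "caketool[spark]"
pip install "caketool[polars]"

# Everything
pip install "caketool[all]"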
```

---

## Quick Start

### EDA

```python
from caketool import eda

# Dataset overview
eda.profile(df)
eda.plot_correlations(df)

# Univariate
eda.plot_numeric_distribution(df["age"])
eda.plot_categorical_frequency(df["category"])

# Bivariate
eda.plot_scatter(df, x="income", y="spend", color_by="segment")
eda.plot_distribution_by_group(df, cat_col="segment", num_col="income", mode="box")
```

### Feature Generation

```python
from caketool.feature import generate_features_by_window

result = generate_features_by_window(
    df,
    client_id_col="user_id",
    report_date_col="event_date",
    fs_event_timestamp="snapshot_date",
    numeric_cols=("amount", "balance"),
    string_cols=("category",),
    boolean_cols=("is_active",),
    lookback_days=(0, 7, 30),   # 0 = lifetime
    backend="pandas",           # "pandas" | "polars" | "spark" | "bigframes"
)
```
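To make the idea of lookback windows concrete, here is a minimal pandas sketch of a per-client windowed sum. It is not the code behind `generate_features_by_window`: the helper `window_sums`, the output column names, and the boundary rule (an event counts if it happened on or before the snapshot and fewer than `days` days before it) are all assumptions made for illustration. Only the convention that `0` means lifetime comes from the example above.

```python
import pandas as pd


def window_sums(df, id_col, event_col, snapshot_col, value_col, lookback_days=(0, 7, 30)):
    """Hypothetical helper: sum `value_col` per client over several lookback windows."""
    # Age of each event in whole days, relative to the snapshot date
    age = (df[snapshot_col] - df[event_col]).dt.days
    out = {}
    for days in lookback_days:
        # Never use events that happened after the snapshot
        mask = age >= 0
        if days > 0:
            # Assumed boundary rule: keep events strictly newer than `days` days
            mask &= age < days
        suffix = "lifetime" if days == 0 else f"{days}d"
        out[f"{value_col}_sum_{suffix}"] = df.loc[mask].groupby(id_col)[value_col].sum()
    # Clients with no events in a window get 0 instead of NaN
    return pd.DataFrame(out).fillna(0.0)
```

Calling `window_sums(df, "user_id", "event_date", "snapshot_date", "amount")` returns one row per client and one column per window.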

### Model Training

```python
from caketool.model import BoostTree

model = BoostTree()
model.fit(X_train, y_train, eval_set=[(X_val, y_val)])
proba = model.predict_proba(X_test)[:, 1]
importance = model.get_feature_importance()
```

### Out-of-Fold Cross-Validation

```python
from caketool.model import BoostTree, EnsembleBoostTree

models, oof_preds, oof_labels = BoostTree.fit_oof(X_train, y_train, n_splits=5)
ensemble = EnsembleBoostTree(models)
proba = ensemble.predict_proba(X_test)[:, 1]
```
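For readers new to out-of-fold (OOF) prediction, the sketch below shows the general technique with scikit-learn. It is an illustration only and does not reproduce `BoostTree.fit_oof` or `EnsembleBoostTree`: the helpers `fit_oof_sketch` and `ensemble_proba` are hypothetical, and averaging the fold models' probabilities is an assumption about how an ensemble might combine them.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold


def fit_oof_sketch(estimator, X, y, n_splits=5, random_state=0):
    """Hypothetical helper: train one model per fold and collect out-of-fold scores."""
    X, y = np.asarray(X), np.asarray(y)
    oof_preds = np.zeros(len(y), dtype=float)
    models = []
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, valid_idx in folds.split(X, y):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        # Each row is scored exactly once, by a model that never saw it in training
        oof_preds[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        models.append(model)
    return models, oof_preds


def ensemble_proba(models, X):
    """Assumed combination rule: average the positive-class probability across folds."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Because every OOF score comes from a model that did not train on that row, metrics computed on `oof_preds` estimate generalization more reliably than scores on the training data.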

### Explainability

```python
from caketool.explainability import PermutationExplainer

explainer = PermutationExplainer(model=model)
explainer.fit(X_test)

# Global feature importance
importance = explainer.get_feature_importance()

# Local explanation for a single sample
local = explainer.get_local_explanation(row_index=0)

# Visualize
explainer.show_summary()
explainer.show_waterfall(row_index=0)
explainer.show_dependence(feature="income")
```

### Score Calibration

```python
from caketool.calibration import calibrate_score_to_normal

calibrated = calibrate_score_to_normal(raw_scores, standard=False)
```
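One common way to map arbitrary scores onto a normal shape is a rank-based inverse normal transform, sketched below. Whether `calibrate_score_to_normal` works this way, and what its `standard` flag controls, is not stated in this README, so treat the helper `rank_to_normal` and its `mean` and `std` parameters as hypothetical.

```python
from statistics import NormalDist

import numpy as np


def rank_to_normal(scores, mean=0.0, std=1.0):
    """Hypothetical helper: map scores to normal quantiles by rank."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # Ordinal ranks 1..n; ties are broken by position, which is fine for a sketch
    ranks = np.argsort(np.argsort(scores, kind="mergesort")) + 1
    # Shift ranks into the open interval (0, 1) so inv_cdf never receives 0 or 1
    quantiles = (ranks - 0.5) / n
    dist = NormalDist(mu=mean, sigma=std)
    return np.array([dist.inv_cdf(q) for q in quantiles])
```

The transform preserves the ordering of the raw scores and changes only their spacing, so rank-based metrics such as Gini are unaffected.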

### Metrics

```python
from caketool.metric import gini, psi

print(gini(y_true, y_pred))          # Gini coefficient
print(psi(expected, actual))          # Population Stability Index
```
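For reference, the textbook definitions of both metrics fit in a few lines of NumPy. This is a sketch under stated assumptions, not the code in `caketool.metric`: the helper names, the quantile binning, and the `eps` floor are illustrative choices, and the library's own functions may bin or handle edge cases differently.

```python
import numpy as np


def gini_sketch(y_true, y_score):
    """Gini = 2 * AUC - 1, with AUC computed by comparing every positive/negative pair."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise differences; quadratic in memory, so only suitable for small samples
    diff = pos[:, None] - neg[None, :]
    auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))
    return 2 * auc - 1


def psi_sketch(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum((a - e) * ln(a / e)) over bins built from quantiles of `expected`."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges; values outside the expected range fall into the end bins
    inner = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])
    e_cnt = np.bincount(np.searchsorted(inner, expected, side="right"), minlength=n_bins)
    a_cnt = np.bincount(np.searchsorted(inner, actual, side="right"), minlength=n_bins)
    # Floor the proportions so empty bins do not produce log(0) or division by zero
    e_pct = np.clip(e_cnt / len(expected), eps, None)
    a_pct = np.clip(a_cnt / len(actual), eps, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A Gini of 0 corresponds to random ranking and 1 to perfect separation. PSI is 0 for identical distributions and grows as the two samples diverge.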

### Risk Report

```python
from caketool.report import decribe_risk_score

report = decribe_risk_score(score_df, pred_col="score", label_col="label")
```

### Drift Detection

```python
from caketool.monitor import AdversarialModel

model = AdversarialModel()
model.fit(reference_df, current_df)
model.show()    # prints ROC AUC and the most important features
```
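The idea behind adversarial drift detection can be sketched with scikit-learn: label reference rows 0 and current rows 1, train a classifier to tell them apart, and read the ROC AUC. This is an illustration of the general technique rather than the `AdversarialModel` implementation; the helper `adversarial_auc` and the choice of `GradientBoostingClassifier` are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(reference, current, random_state=0):
    """Hypothetical helper: AUC of a classifier separating reference from current rows."""
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    clf = GradientBoostingClassifier(random_state=random_state)
    # Out-of-fold probabilities avoid scoring the classifier on rows it trained on
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)
```

An AUC near 0.5 means the two datasets are hard to tell apart. An AUC well above 0.5 signals drift, and the classifier's feature importances point to the columns responsible.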

### Experiment Tracking

```python
from caketool.experiment import create_tracker

# MLflow (requires the "onprem" extra)
with create_tracker("mlflow", experiment_name="my-exp", run_name="run-001") as tracker:
    tracker.log_params({"lr": 0.01, "depth": 6})
    tracker.log_metrics({"gini": 0.72})
    tracker.log_pickle(model, "model")

# Vertex AI (requires the "gcp" extra)
with create_tracker("vertex_ai", experiment_name="my-exp", run_name="run-001",
                    project="my-gcp-project", location="us-central1",
                    bucket_name="my-bucket") as tracker:
    tracker.log_params({"lr": 0.01})
```

---

## API Overview

| Module | Key exports | Description |
| ------ | ----------- | ----------- |
| `caketool.eda` | `profile`, `plot_correlations`, `plot_scatter`, `plot_distribution_by_group`, `rank_associations` | Exploratory data analysis with Plotly |
| `caketool.feature` | `generate_features_by_window` | Multi-backend aggregated feature engineering |
| `caketool.model` | `BoostTree`, `EnsembleBoostTree`, `VotingModel` | XGBoost training & ensemble |
| `caketool.model` | `FeatureEncoder`, `FeatureRemover`, `ColinearFeatureRemover`, `UnivariateFeatureRemover`, `InfinityHandler` | sklearn-compatible preprocessing transformers |
| `caketool.explainability` | `PermutationExplainer` | SHAP-based model-agnostic explainability |
| `caketool.calibration` | `calibrate_score_to_normal` | Normal distribution score calibration |
| `caketool.metric` | `gini`, `psi`, `psi_from_distribution` | Classification and stability metrics |
| `caketool.report` | `decribe_risk_score` | Risk score band report |
| `caketool.monitor` | `AdversarialModel` | Dataset drift detection |
| `caketool.experiment` | `create_tracker`, `MLflowTracker`, `VertexAITracker` | Experiment tracking abstraction |

---

## Development

```bash
conda create -n caketool python=3.10
conda activate caketool
pip-compile pyproject.toml --all-extras
pip install -e ".[dev,all]"
pre-commit install
```

### Linting

Pre-commit hooks run ruff automatically on each commit. To run the checks manually:

```bash
ruff check src/ tests/ --fix  # Lint and auto-fix
ruff format src/ tests/        # Format code
pre-commit run --all-files     # Run all hooks
```

### Tests

```bash
pytest tests/ -v --tb=short
```

### Docs

```bash
pip install -e ".[docs]"
pdoc src/caketool   # Preview at http://localhost:8080
```

Docs are published automatically to [GitHub Pages](https://mazino2d.github.io/caketool) when a version tag is pushed.

---

## Publishing

The version is derived automatically from git tags via `setuptools-scm`.

```bash
# Test on TestPyPI (RC/beta/alpha tags)
git tag v1.8.0-rc1
git push origin v1.8.0-rc1

# Publish to PyPI (stable tags)
git tag v1.8.0
git push origin v1.8.0
```

GitHub Actions builds and publishes automatically on tag push.

---

## Local Development

```bash
python -m pip install -e .
python -c "from caketool import __version__; print(__version__)"
```
