Metadata-Version: 2.4
Name: td-model-store
Version: 0.1.5
Summary: Store and load ML models (XGBoost, LightGBM, CatBoost) in Treasure Data Plazma via pytd
Author-email: Gurbaksh Sharma <gurbaksh.sharma@treasuredata.com>
License: MIT
Project-URL: Homepage, https://github.com/treasure-data-ps/td_ml_libraries/tree/gurbaksh_dev/td_model_store
Project-URL: Repository, https://github.com/treasure-data-ps/td_ml_libraries
Project-URL: Documentation, https://github.com/treasure-data-ps/td_ml_libraries/tree/gurbaksh_dev/td_model_store#readme
Project-URL: Issues, https://github.com/treasure-data-ps/td_ml_libraries/issues
Keywords: treasure-data,machine-learning,xgboost,lightgbm,catboost,model-store,pytd
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytd>=1.0.0
Requires-Dist: pandas>=1.0.0
Provides-Extra: xgboost
Requires-Dist: xgboost>=1.0.0; extra == "xgboost"
Provides-Extra: lightgbm
Requires-Dist: lightgbm>=3.0.0; extra == "lightgbm"
Provides-Extra: catboost
Requires-Dist: catboost>=1.0.0; extra == "catboost"
Provides-Extra: all
Requires-Dist: xgboost>=1.0.0; extra == "all"
Requires-Dist: lightgbm>=3.0.0; extra == "all"
Requires-Dist: catboost>=1.0.0; extra == "all"
Dynamic: license-file

# td-model-store

Store and load ML models (XGBoost, LightGBM, CatBoost) in a Treasure Data database via pytd.

## Features

- **Simple API**: Save and load models with just a few lines of code
- **Multiple Model Types**: Supports XGBoost, LightGBM, and CatBoost
- **Chunked Storage**: Automatically handles large models by chunking data
- **Session Management**: Track model versions with session IDs
- **Model Versioning**: Load specific versions or the latest model
- **Seamless Integration**: Works directly with Treasure Data via pytd

## Installation

### Basic Installation

```bash
pip install td-model-store
```
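
### With Optional Dependencies

The package declares extras for each supported framework (see the package metadata), so you can install only the backend you need:

```bash
# Install with a single backend
pip install "td-model-store[xgboost]"
pip install "td-model-store[lightgbm]"
pip install "td-model-store[catboost]"

# Install with all backends
pip install "td-model-store[all]"
```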

## Quick Start

### Saving a Model

```python
from td_model_store import save_model
import xgboost as xgb

# Train your model
model = xgb.XGBClassifier()
model.fit(X_train, y_train)

# Save to Treasure Data
metadata = save_model(
    model=model,
    database="my_database",
    table="ml_model_store",
    model_name="my_xgboost_model",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

print(f"Model saved with session_id: {metadata['session_id']}")
```

### Loading a Model

```python
from td_model_store import load_model

# Load the latest version of a model
model = load_model(
    database="my_database",
    table="ml_model_store",
    model_name="my_xgboost_model",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Use the model for predictions
predictions = model.predict(X_test)
```

## API Reference

### `save_model()`

Save an ML model to a Treasure Data table.

#### Parameters

- **model** *(required)*: Trained model object (XGBClassifier, XGBRegressor, LGBMClassifier, LGBMRegressor, CatBoostClassifier, or CatBoostRegressor)
- **database** *(required)*: Target TD database name
- **table** *(str, default="ml_model_store")*: Target table name
- **model_name** *(str, optional)*: Name tag for the model. Defaults to `"xgboost_model"`, `"lightgbm_model"`, or `"catboost_model"` based on model type
- **session_id** *(int, optional)*: Unique session identifier. Defaults to current unix timestamp
- **apikey** *(str, required)*: TD API key. Can also be set via `TD_API_KEY` or `TDX_API_KEY*` env vars
- **endpoint** *(str, default="https://api.treasuredata.com")*: TD API endpoint. Use region-specific endpoints:
  - US: `https://api.treasuredata.com`
  - EU: `https://api.eu01.treasuredata.com`
  - Japan: `https://api.treasuredata.co.jp`
  - Asia Pacific: `https://api.ap02.treasuredata.com`
- **chunk_size** *(int, default=130000)*: Max characters per chunk (must be < 131072)
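
The defaults above can be reproduced explicitly if you prefer deterministic values; a minimal illustration, not library internals:

```python
import time

# Default session identifier: the current Unix timestamp, as an int.
# Pass your own session_id for reproducible, human-readable versioning.
session_id = int(time.time())

# chunk_size must stay below the documented 131072-character limit
chunk_size = 130000
assert chunk_size < 131072
```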

#### Returns

Dictionary containing:
- `session_id`: Session identifier
- `model_name`: Model name
- `model_type`: Model type (xgboost, lightgbm, or catboost)
- `model_format`: Serialization format
- `num_chunks`: Number of chunks
- `size_bytes`: Model size in bytes

#### Example

```python
metadata = save_model(
    model=my_model,
    database="production_db",
    table="models",
    model_name="fraud_detector_v1",
    session_id=20240420001,
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)
```

### `load_model()`

Load an ML model from a Treasure Data table.

#### Parameters

- **database** *(required)*: Source TD database name
- **table** *(str, default="ml_model_store")*: Source table name
- **model_name** *(str, optional)*: Filter by model name. If not provided, loads the latest session
- **session_id** *(int, optional)*: Load a specific session. If not provided, loads the max session_id (optionally filtered by model_name)
- **apikey** *(str, required)*: TD API key. Can also be set via `TD_API_KEY` or `TDX_API_KEY*` env vars
- **endpoint** *(str, default="https://api.treasuredata.com")*: TD API endpoint. Use region-specific endpoints:
  - US: `https://api.treasuredata.com`
  - EU: `https://api.eu01.treasuredata.com`
  - Japan: `https://api.treasuredata.co.jp`
  - Asia Pacific: `https://api.ap02.treasuredata.com`

#### Returns

The reconstructed ML model (XGBoost, LightGBM, or CatBoost)

#### Example

```python
# Load latest version of a specific model
model = load_model(
    database="production_db",
    table="models",
    model_name="fraud_detector_v1",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Load a specific session
model = load_model(
    database="production_db",
    table="models",
    session_id=20240420001,
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Load the absolute latest model (any name)
model = load_model(
    database="production_db",
    table="models",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)
```

## Complete Examples

### XGBoost Example

```python
from td_model_store import save_model, load_model
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Prepare data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X_train, y_train)

# Save to TD
metadata = save_model(
    model=model,
    database="ml_models",
    model_name="iris_classifier",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Load from TD
loaded_model = load_model(
    database="ml_models",
    model_name="iris_classifier",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Verify
score = loaded_model.score(X_test, y_test)
print(f"Model accuracy: {score:.3f}")
```

### LightGBM Example

```python
from td_model_store import save_model, load_model
import lightgbm as lgb
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# Prepare data (load_boston was removed from scikit-learn; use load_diabetes)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = lgb.LGBMRegressor(n_estimators=100)
model.fit(X_train, y_train)

# Save to TD
metadata = save_model(
    model=model,
    database="ml_models",
    model_name="diabetes_progression_predictor",
    session_id=20240420001,
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Load from TD
loaded_model = load_model(
    database="ml_models",
    session_id=20240420001,
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Make predictions
predictions = loaded_model.predict(X_test)
```

### CatBoost Example

```python
from td_model_store import save_model, load_model
from catboost import CatBoostClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Prepare data
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train model
model = CatBoostClassifier(iterations=100, verbose=False)
model.fit(X_train, y_train)

# Save to TD
metadata = save_model(
    model=model,
    database="ml_models",
    model_name="digit_classifier",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

print(f"Saved {metadata['num_chunks']} chunks, {metadata['size_bytes']} bytes")

# Load from TD
loaded_model = load_model(
    database="ml_models",
    model_name="digit_classifier",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# Make predictions
predictions = loaded_model.predict(X_test)
```

## Configuration

### API Key and Endpoint

The `apikey` parameter is **required** for all operations. Specify the correct endpoint for your Treasure Data region:

```python
# US region (default)
save_model(
    model=model,
    database="my_database",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.com"
)

# EU region
save_model(
    model=model,
    database="my_database",
    apikey="your_td_api_key",
    endpoint="https://api.eu01.treasuredata.com"
)

# Japan region
save_model(
    model=model,
    database="my_database",
    apikey="your_td_api_key",
    endpoint="https://api.treasuredata.co.jp"
)

# Asia Pacific region
save_model(
    model=model,
    database="my_database",
    apikey="your_td_api_key",
    endpoint="https://api.ap02.treasuredata.com"
)
```

## Model Storage Schema

Models are stored in the following table schema:

| Column | Type | Description |
|--------|------|-------------|
| time | int | Unix timestamp |
| session_id | int | Unique session identifier |
| model_name | string | Model name tag |
| model_type | string | Model type (xgboost, lightgbm, catboost) |
| model_format | string | Serialization format (ubj, txt, cbm) |
| chunk_index | int | Chunk index (0-based) |
| total_chunks | int | Total number of chunks |
| chunk_data | string | Base64-encoded model data chunk |
| created_at | string | ISO timestamp |
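
Given this schema, you can inspect stored models with an ordinary SQL query. Below is a sketch that builds such a query; the helper name is hypothetical and not part of this library's API:

```python
def build_model_listing_sql(database, table="ml_model_store"):
    """Build a SQL query that lists one row per stored model session."""
    return (
        "SELECT session_id, model_name, model_type, model_format, "
        "MAX(total_chunks) AS total_chunks, MAX(created_at) AS created_at "
        f"FROM {database}.{table} "
        "GROUP BY session_id, model_name, model_type, model_format "
        "ORDER BY session_id DESC"
    )

# Run it with pytd, e.g.:
#   import pytd
#   client = pytd.Client(apikey="your_td_api_key",
#                        endpoint="https://api.treasuredata.com",
#                        database="my_database")
#   result = client.query(build_model_listing_sql("my_database"))

print(build_model_listing_sql("my_database"))
```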

## How It Works

1. **Serialization**: Models are serialized to their native format:
   - XGBoost: `.ubj` (Universal Binary JSON)
   - LightGBM: `.txt` (text format)
   - CatBoost: `.cbm` (CatBoost binary)

2. **Chunking**: Large models are split into chunks (130,000 characters per chunk by default, matching the `chunk_size` parameter) to comply with Treasure Data's import limits

3. **Storage**: Chunks are stored as base64-encoded strings in a Treasure Data table via pytd's bulk_import

4. **Retrieval**: When loading, chunks are reassembled, decoded, and deserialized back into the original model object
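
The chunk-and-reassemble round trip described in steps 2–4 can be sketched as follows; this mirrors the idea, not the library's exact internals:

```python
import base64
from typing import List

def chunk_model_bytes(model_bytes: bytes, chunk_size: int = 130000) -> List[str]:
    """Base64-encode serialized model bytes and split into fixed-size chunks."""
    encoded = base64.b64encode(model_bytes).decode("ascii")
    return [encoded[i:i + chunk_size] for i in range(0, len(encoded), chunk_size)]

def reassemble_model_bytes(chunks: List[str]) -> bytes:
    """Concatenate chunks in order and decode back to the original bytes."""
    return base64.b64decode("".join(chunks))

# Round trip: the reassembled bytes match the original payload
payload = b"\x00\x01" * 100_000  # stand-in for a serialized model
chunks = chunk_model_bytes(payload, chunk_size=1000)
assert reassemble_model_bytes(chunks) == payload
print(f"{len(chunks)} chunks")
```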

## Requirements

- Python >= 3.8
- pytd >= 1.0.0
- pandas >= 1.0.0
- At least one of: xgboost, lightgbm, or catboost

## Logging

The library uses Python's standard logging module. Enable logging to see save/load operations:

```python
import logging
logging.basicConfig(level=logging.INFO)
```

Example output:
```
INFO:td_model_store.model_store:Saved model 'my_model' (type=xgboost) to ml_models.ml_model_store | session_id=1713628800 | chunks=5 | size=12345 bytes
INFO:td_model_store.model_loader:Loaded model 'my_model' (type=xgboost) from ml_models.ml_model_store | session_id=1713628800 | chunks=5 | size=12345 bytes
```



## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## Support

For issues and questions, please open an issue on GitHub or contact the maintainers.
