Metadata-Version: 2.4
Name: tdstone2
Version: 0.1.9.2
Summary: A package for Script Table Operator that applies set theory to machine learning in Python.
Author: Denis Molin
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: teradataml>=17.20
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: sqlparse
Requires-Dist: torch
Requires-Dist: plotly
Requires-Dist: transformers
Requires-Dist: sentence-transformers
Requires-Dist: onnx
Requires-Dist: onnxruntime
Requires-Dist: optimum[onnxruntime]>=2.1.0
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# `tdstone2` Package

## Overview

`tdstone2` operationalizes Python code for machine learning and data analysis on **Teradata Vantage** using the **Script Table Operator (STO)**. It leverages Teradata's MPP architecture to run hundreds of Python scripts in parallel across hundreds of data partitions — enabling hyper-segmented model deployment with full lineage, versioning, and minimal data movement.

## Features

- **Hyper-segmented Model Deployment**: Train one independent model per data partition in parallel across Teradata AMPs
- **Scikit-learn Pipeline Integration**: Auto-generate STO scripts from sklearn Pipelines (classifier, regressor, anomaly detection)
- **Feature Engineering**: Deploy custom or reducer-based feature engineering per partition
- **Vector Embeddings**: Install HuggingFace/ONNX models and compute embeddings in-database
- **Seq2Seq Inference**: Deploy summarization and translation models (e.g., flan-t5) via STO
- **Model Lineage & Versioning**: Temporal tables track every version of every trained model
- **Two Execution Backends**: Script Table Operator (Vantage Enterprise + Vantage Cloud Enterprise) and `Apply` (Vantage Cloud Lake / OAF) — same `HyperModel` API, picked via `use_apply=True`
- **Per-batch Timing Instrumentation**: Every scored row carries `SCRIPT_UUID`, `TOTAL_TIME`, `IMPORT_TIME`, `LOAD_TIME`, `PROCESS_TIME`, `PRINT_TIME`, and `BATCH_NO`, so a single `SELECT` shows which phase dominated wall-clock time for each partition
- **HTML Reports**: `train(report=True)` / `score(report=True)` emit inline reports — partition-duration histograms, top-10 longest/fastest, input/output row-count distributions, model-size histogram

## Installation

```bash
pip install tdstone2
```

Requires access to a Teradata Vantage system with the Script Table Operator enabled.

---

## Quick Start

### 1. Generate Test Data

```python
import teradataml as tdml
from tdstone2.dataset_generation import GenerateEquallyDistributedDataSet

tdml.create_context(**Param)  # Param = {'host': ..., 'user': ..., 'password': ...}

# Generate a synthetic partitioned dataset (21.6M rows, 216 partitions)
dataset = GenerateEquallyDistributedDataSet(n_partitions=216, n_rows=100000)
dataset.to_sql('dataset_00', schema_name=Param['database'])
```

Output schema: `Partition_ID`, `ID`, `X1`–`X9` (features), `Y1`, `Y2` (targets), `flag`, `FOLD`

### 2. Initialize the Framework

```python
from tdstone2.tdstone import TDStone

sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'])
sto.setup()  # creates repository tables + installs STO files in Vantage
```

For **Vantage Cloud Lake (OpenAF / STO-via-OAF)**:

```python
sto = TDStone(schema_name=Param['database'], SEARCHUIFDBPATH=Param['user'], oaf_env='my_env')
sto.setup()
```

For **Vantage Cloud Lake via the `Apply` backend** (recommended on Lake — STO is not always available there):

```python
sto = TDStone(
    schema_name    = Param['database'],
    use_apply      = True,                  # route train/score/FE through tdml.Apply
    apply_env_name = 'tdstone2_sklearn',    # OAF user-environment with sklearn installed
    compute_group  = 'CG_BusGrpB_ANL',      # sets QueryBand 'compute=...' for ACC routing
    connect_kwargs = {                      # explicit cluster pinning (avoids the
        'host':     Param['host'],          # multi-cluster trap when several VCL_*_HOST
        'user':     Param['user'],          # env vars are set at once)
        'password': Param['password'],
        'database': Param['database'],
    },
)
sto.setup()  # creates OFS-resident repository tables + installs Apply scripts in the env
```

The `Apply` path stores all repository tables with `STORAGE = TD_OFSSTORAGE`, replaces the
`PERIOD FOR ValidPeriod` temporal column with a plain `CREATION_DATE` (OFS rejects temporal
periods), and runs the analog `tds_*_apply.py` scripts inside the user environment. The
public `HyperModel` / `FeatureEngineering` API is identical to the STO path.

---

## Hyper-segmented Models

### Deploy a Scikit-learn Classifier

```python
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from tdstone2.tdshypermodel import HyperModel

steps = [
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y2',
    "column_categorical": ['flag', 'Y2'],
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'flag']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'my_project'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    convert_to_onnx=False,   # set True for ONNX export
    store_pickle=True,
)

model.train()   # trains 216 independent models in parallel (one per partition)
model.score()   # scores all data using each partition's model
```

### Deploy a Scikit-learn Regressor

```python
from sklearn.ensemble import RandomForestRegressor

steps = [
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(max_depth=5, n_estimators=95))
]

model_parameters = {
    "target": 'Y1',
    "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9']
}

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'regression'},
    skl_pipeline_steps=steps,
    model_parameters=model_parameters,
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()
```

### Deploy Anomaly Detection (OneClassSVM)

```python
from sklearn.svm import OneClassSVM

steps = [
    ('scaler', StandardScaler()),
    ('anomaly', OneClassSVM(nu=0.05))
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'anomaly_detection'},
    skl_pipeline_steps=steps,
    model_parameters={"column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()
# Output: anomaly flag (1=inlier, -1=outlier), decision_function, anomaly_score
```

### Deploy LassoLarsCV (Feature Selection + Regression)

```python
from sklearn.linear_model import LassoLarsCV

steps = [
    ('scaler', StandardScaler()),
    ('lasso', LassoLarsCV())
]

model = HyperModel(
    tdstone=sto,
    metadata={'project': 'lasso_regression'},
    skl_pipeline_steps=steps,
    model_parameters={"target": 'Y1', "column_names_X": ['X1', 'X2', 'X3', 'X4', 'X5']},
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
)
model.train()
model.score()
```

### Retrieve Predictions

```python
# Denormalized view: predictions joined with original features
predictions = model.get_model_predictions()

# Raw (normalized) predictions table
predictions_raw = model.get_model_predictions(denormalized_view=False)

# Include per-batch timing columns (SCRIPT_UUID, TOTAL_TIME, IMPORT_TIME, LOAD_TIME,
# PROCESS_TIME, PRINT_TIME, BATCH_NO) — useful for diagnosing per-partition slowness
predictions_with_timing = model.get_model_predictions(include_timing=True)

# Trained model metadata
trained_models = model.get_trained_models()
```
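Once pulled client-side, the timing columns make per-partition bottlenecks easy to spot. A minimal sketch of the analysis, using a synthetic pandas frame to stand in for the result of `get_model_predictions(include_timing=True)` (the real call returns a teradataml DataFrame you would first convert with `to_pandas()`):

```python
import pandas as pd

# Synthetic stand-in for the timing columns of the predictions table
timing = pd.DataFrame({
    "Partition_ID": [1, 1, 2, 2],
    "IMPORT_TIME":  [0.8, 0.8, 0.9, 0.9],
    "LOAD_TIME":    [0.2, 0.2, 3.5, 3.5],
    "PROCESS_TIME": [1.1, 1.2, 1.0, 1.1],
    "PRINT_TIME":   [0.1, 0.1, 0.1, 0.1],
})

phases = ["IMPORT_TIME", "LOAD_TIME", "PROCESS_TIME", "PRINT_TIME"]

# Average each phase per partition, then ask which phase dominated
per_partition = timing.groupby("Partition_ID")[phases].mean()
dominant = per_partition.idxmax(axis=1)
print(dominant)  # partition 1 → PROCESS_TIME, partition 2 → LOAD_TIME
```

Here partition 2 is load-bound (e.g., a large pickled model), while partition 1 spends its time in actual scoring.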

### Inline HTML Reports

Pass `report=True` to `train()` or `score()` to render an inline HTML report with
partition-duration histograms, top-10 longest/fastest partitions, input/output
row-count distributions, and (for training) a model-size histogram:

```python
model.train(report=True)
model.score(report=True)
```

### Inspect the Underlying SQL

```python
# View the generated Script Table Operator SQL
print(model.mapper_scoring.generate_sto_query())
```

---

## Reload an Existing Hyper-segmented Model

```python
# List all registered hyper-models
sto.list_hyper_models()

# Reload by UUID
existing_model = HyperModel(tdstone=sto)
existing_model.download(id='0286d259-ecde-4cd0-ae4a-bcb3191383d1')

# Retrain and rescore (no code needed — everything is stored in Vantage)
existing_model.train()
existing_model.score()
```

---

## Feature Engineering

### Dimensionality Reduction (Reducer)

```python
from tdstone2.tdsfeature_engineering import FeatureEngineering

fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering reducer',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    id_fold='FOLD',
    fold_training='train',
    metadata={'project': 'pca_reduce'},
)
fe.reduce()
reduced = fe.get_reduced_features()
```

### Custom Feature Engineering

Provide a Python script that computes new columns from existing ones:

```python
fe = FeatureEngineering(
    tdstone=sto,
    feature_engineering_type='feature engineering',
    script_path='path/to/Feature_Interactions.py',
    dataset=tdml.in_schema(Param['database'], 'dataset_00'),
    id_row='ID',
    id_partition='Partition_ID',
    metadata={'project': 'interactions'},
)
fe.transform()

# All original + derived features denormalized
features = fe.get_computed_features()

# Raw (normalized) features table
features_raw = fe.get_computed_features(denormalized_view=False)
```
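The kind of computation such a script performs can be illustrated with a plain pandas function (this sketches the transformation only, not the script's actual STO I/O contract; the function name and column-naming scheme are hypothetical):

```python
import pandas as pd

def add_interactions(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Add pairwise product columns X_i * X_j for the given feature columns."""
    out = df.copy()
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            out[f"{a}_x_{b}"] = out[a] * out[b]
    return out

df = pd.DataFrame({"X1": [1.0, 2.0], "X2": [3.0, 4.0]})
wide = add_interactions(df, ["X1", "X2"])
print(wide.columns.tolist())  # ['X1', 'X2', 'X1_x_X2']
```

In the deployed script, the same per-partition logic runs inside the STO, so the derived columns are computed in-database without moving data out of Vantage.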

---

## Vector Embeddings

### Install a HuggingFace Model (ONNX)

```python
from tdstone2.tdsgenai import (
    install_model_in_vantage_from_name,
    customize_existing_model,
    install_zip_in_vantage,
    list_installed_files_byom,
    get_model_dimension,
)

# Download, convert to ONNX, customize tokenizer, upload
install_model_in_vantage_from_name(
    model_name='intfloat/multilingual-e5-small',
    model_task='sentence-similarity',
    upload=True,
    generate_zip=True,
)

list_installed_files_byom()  # verify installation
get_model_dimension(model_name='intfloat/multilingual-e5-small')  # e.g., 384
```

For **Vantage Cloud Lake (OAF)**:

```python
from tdstone2.tdsgenai_lake import install_model_in_vantage_from_name
install_model_in_vantage_from_name(model_name='intfloat/multilingual-e5-small', ...)
```

#### Compute embeddings on Vantage Cloud Lake (OAF)

```python
from tdstone2.tdsgenai_lake import compute_vector_embedding

# device='cuda' triggers a pre-flight Apply probe that verifies torch.cuda.is_available()
# in the OAF env *before* launching the real call. With cuda_strict=True (default), the
# call raises RuntimeError with an actionable diagnostic instead of silently falling back
# to CPU when torch is built for a different CUDA major than the cluster's NVIDIA driver
# supports (e.g. cu13 wheel vs CUDA 12 driver).
embeddings = compute_vector_embedding(
    model_name         = 'models--BAAI--bge-small-en-v1.5',
    dataset            = text_dataframe,
    text_column        = 'txt',
    accumulate_columns = ['id'],
    hash_columns       = ['id'],
    oaf_env            = 'bq20251218',         # see OAF env setup below
    schema_name        = Param['database'],
    table_name         = 'embeddings_bge',
    device             = 'cuda',
    cuda_strict        = True,   # raise on real CUDA failure; pass-through on Apply throttling
    compute_group      = 'GPU_Cluster',        # routes Apply to the GPU compute group
)
```

#### OAF environment setup for Tesla T4 (CUDA 12.x driver)

The OAF package mirror evolves independently of Vantage compute-node drivers. As of 2026,
the mirror's default `torch` resolves to `2.11.0+cu130`, which requires CUDA 13.0 — but
current Lake GPU nodes run driver `12.7/12.8`. Use an env that already carries
`torch 2.9.1+cu128` (e.g. `bq20251218`), or create a new one and install the cu128 build
before any other torch package touches the env.

**Required packages** (install in order via `tdml.get_env(env_name).install_lib([...])`):

| Package | Why |
|---------|-----|
| `torch==2.9.1` | cu128 wheel — matches CUDA 12.x driver |
| `sentence-transformers` | ST 5.x |
| `pydantic>=2.0` | ST 5.x requires pydantic v2; base image ships only v1 |

**Known compatibility patches** (already baked into `tds_vector_embedding_lake.py`):

- **`huggingface-hub` version guard** — the mirror pins hub at `1.2.3`, but ST 5.x requires
  `>=1.5.0`. ST checks the version via `importlib.metadata.version`, not `__version__`, so both
  the package metadata and the module attribute must be spoofed before importing ST.
- **`types.UnionType` in hub validator** — `transformers 5.x` uses Python 3.10+ `str | None`
  union syntax in its config dataclasses. Hub 1.2.3's `dataclasses.py` type-validator has no
  handler for `types.UnionType` and raises `TypeError: Unsupported type for field
  'transformers_version': str | None`. Fixed by registering `types.UnionType` in
  `_BASIC_TYPE_VALIDATORS` before model load.

Both patches are applied automatically inside the STO script; no user action needed.
`pydantic v2` must be installed in the env for ST 5.x's model-config schema to load.

**Validated configuration** (Tesla T4, CUDA driver 12.7, VCL2 `GPU_Cluster`):

| Component | Version |
|-----------|---------|
| torch | 2.9.1+cu128 |
| sentence-transformers | 5.4.1 |
| huggingface-hub (mirror) | 1.2.3 (spoofed to 1.9.0 at runtime) |
| transformers | 5.7.0 |
| pydantic | 2.13.3 |
| BGE model | `BAAI/bge-small-en-v1.5` (384-dim) |
| Throughput | ~1,000 rows / 69 s on a single T4 |

See `demos/notebooks Demo OAF - Vector Embedding/` for the full provisioning notebooks.

### Compute Vector Embeddings In-Database (BYOM)

```python
from tdstone2.tdsgenai import compute_vector_embedding_byom

embeddings = compute_vector_embedding_byom(
    model='tdstone2_emb_384_intfloat_multilingual_e5_small',
    dataset=text_dataframe,
    text_column='content',
    accumulate_columns=['id'],
    schema_name=Param['database'],
    table_name='embeddings',
    primary_index=['id'],
)
```

Output: table with `id` + 384-dimensional embedding vector per row.
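A common next step is semantic search over the stored vectors (see the chat demo). The ranking logic can be sketched with NumPy, using toy 4-dim vectors in place of the 384-dim embeddings pulled from the table:

```python
import numpy as np

# Toy stand-ins for embedding rows fetched from the 'embeddings' table
emb = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.9, 0.1, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])  # embedding of the search text

# Cosine similarity of each row to the query, then rank by closeness
sims = emb @ query / (np.linalg.norm(emb, axis=1) * np.linalg.norm(query))
ranking = np.argsort(-sims)
print(ranking)  # row 0 is the closest match
```

In production the same ranking would typically be pushed into SQL over the embeddings table rather than computed client-side.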

---

## Seq2Seq Models (Summarization / Translation / Language Detection)

```python
from tdstone2.tdsgenai_seq2seq import install_seq2seq_model, run_seq2seq

# Install model (e.g., flan-t5-small for summarization)
install_seq2seq_model(
    model_name='google/flan-t5-small',
    model_task='summarization',
    upload=True,
)

# Run via Script Table Operator
results = run_seq2seq(
    dataset=text_dataframe,
    text_column='content',
    model='tdstone2_seq2seq_google_flan_t5_small',
    schema_name=Param['database'],
)
```

---

## Lineage & Registry

```python
sto.list_codes()                       # registered Python scripts/classes
sto.list_models()                      # model configs with all arguments
sto.list_mappers()                     # training/scoring/feature-engineering mappers
sto.list_hyper_models()                # hyper-segmented model registrations
sto.list_feature_engineering_models()  # feature engineering pipeline registrations
```

---

## Local Validation / Debugging

Execute the generated model code locally before deploying to Vantage:

```python
code_and_data = model.get_code_and_data(partition_id=1)

# Executes the generated script, defining the model class (MyModel) locally
exec(code_and_data['code'])
local_model = MyModel(**code_and_data['arguments'])

# Reproduce the categorical dtypes declared in model_parameters
df_local = code_and_data['data']
df_local['flag'] = df_local['flag'].astype('category')
df_local['Y2'] = df_local['Y2'].astype('category')

local_model.fit(df_local)
local_model.score(df_local)
```

---

## Demo Notebooks

The `demos/` folder contains end-to-end worked examples:

| Series | Location | Content |
|--------|----------|---------|
| Core workflow | `demos/notebooks/` | Data generation → setup → HyperModel → feature engineering → retrieval (01–16) |
| Scikit-learn models | `demos/notebooks Demo Hypermodel Scikit-Learn/` | Classifier, Regressor, Anomaly, LassoLarsCV (STO path) |
| **Scikit-learn models on VCL Apply** | `demos/notebooks Demo Hypermodel Scikit-Learn with OAF/` | Same flow on Vantage Cloud Lake via `use_apply=True` (VCL1 / VCL2 variants) |
| Script Table Operator | `demos/demo script csae/notebooks/` | Raw STO usage with anomaly detection |
| BYOM Vector Embedding | `demos/notebooks Demo BYOM - Vector Embedding/` | HuggingFace ONNX install + BYOM embedding |
| OAF Vector Embedding | `demos/notebooks Demo OAF - Vector Embedding/` | HuggingFace model on OpenAF platform (VCL1 + VCL2 variants, dedicated `tdstone2_embeddings` env, CUDA pre-flight) |
| STO Vector Embedding | `demos/notebooks Demo STO - Vector Embedding/` | Embedding via Script Table Operator |
| Seq2Seq | `demos/notebooks Demo Seq2Seq/` | Summarization and language detection |
| Chat / Semantic Search | `demos/demo chat/notebooks/` | Chunking, embedding, vector search |

---

## Repository Tables (Default Names)

| Object | Table |
|--------|-------|
| Code | `TDS_CODE_REPOSITORY` |
| Model | `TDS_MODEL_REPOSITORY` |
| Trained Model | `TDS_TRAINED_MODEL_REPOSITORY` |
| Mapper | `TDS_MAPPER_REPOSITORY` |
| HyperModel | `TDS_HYPER_MODEL_REPOSITORY` |
| Feature Engineering | `TDS_FEATURE_ENGINEERING_PROCESS_REPOSITORY` |

All names are overridable via `TDStone.__init__()` kwargs.
