Metadata-Version: 2.4
Name: tdfs4ds
Version: 0.2.6.0
Summary: A python package to simplify the usage of feature store using Teradata Vantage ...
Author: Denis Molin
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: teradataml>=17.20
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: plotly
Requires-Dist: tqdm
Requires-Dist: networkx
Requires-Dist: sqlparse
Requires-Dist: langchain_openai
Requires-Dist: nbformat>=4.2.0
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![tdfs4ds logo](https://github.com/denismolin/tdfs4ds/raw/main/tdfs4ds_logo.png)

# tdfs4ds — A Feature Store Library for Data Scientists working with ClearScape Analytics

`tdfs4ds` (Teradata Feature Store for Data Scientists) is a Python package for managing temporal feature stores in Teradata Vantage databases. It provides easy-to-use functions for creating, registering, storing, and retrieving features — with full time-travel support, lineage tracking, and process operationalization.

## Installation

```bash
pip install tdfs4ds
```

## Quick Start

Import `tdfs4ds` **after** establishing a teradataml connection so the package can auto-detect your default database:

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
# tdfs4ds.SCHEMA is auto-set from the teradataml context;
# override if needed: tdfs4ds.SCHEMA = 'my_database'

# Data domain management — use the dedicated functions:
tdfs4ds.create_data_domain('MY_PROJECT')   # create and activate a new domain
# or
tdfs4ds.select_data_domain('MY_PROJECT')   # activate an existing domain
# or
tdfs4ds.get_data_domains()                 # list all available domains (* marks the active one)
```

## Core API

| Function | Description |
|----------|-------------|
| `tdfs4ds.setup(database)` | Create feature catalog, process catalog, and follow-up tables in `database` |
| `tdfs4ds.upload_features(df, entity_id, feature_names, metadata={})` | Ingest features from a teradataml DataFrame into the feature store |
| `tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment='dataset')` | Assemble a dataset view from registered features |
| `tdfs4ds.run(process_id)` | Re-execute a registered feature engineering process |
| `tdfs4ds.roll_out(...)` | Operationalize processes at scale |
| `tdfs4ds.connect(database)` | Connect to an existing feature store |

### `entity_id` must specify SQL data types (dict, not list)

```python
entity_id = {'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'}   # correct
entity_id = ['CUSTOMER_ID', 'EVENT_DATE']                      # wrong
```

## Walkthrough Example

### Step 1 — Set up a feature store

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
tdfs4ds.setup(database='my_database')
```

### Step 2 — Configure the active context

```python
tdfs4ds.SCHEMA = 'my_database'   # override if not auto-detected

# Use dedicated functions to manage the data domain:
tdfs4ds.create_data_domain('DATA_QUALITY')   # create and activate (first time)
# tdfs4ds.select_data_domain('DATA_QUALITY') # activate an existing domain
# tdfs4ds.get_data_domains()                 # list all domains
```

### Step 3 — Define your feature engineering view

```python
df = tdml.DataFrame(tdml.in_schema('my_database', 'my_feature_view'))
# If teradataml created intermediate views, make them permanent first:
# tdfs4ds.crystallize_view(df)
```
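
If the feature view does not exist yet, it can also be derived in Python with a teradataml expression. A sketch under the assumption of a hypothetical source table `my_events` with columns `EVENT_DT`, `ID`, and `AMOUNT`:

```python
# Hypothetical source table with EVENT_DT, ID and AMOUNT columns
events = tdml.DataFrame(tdml.in_schema('my_database', 'my_events'))

# Derive the KPI columns used in Step 4 (the expressions are illustrative only)
df = events.assign(
    KPI1=events.AMOUNT * 2,
    KPI2=events.AMOUNT / 100
).select(['EVENT_DT', 'ID', 'KPI1', 'KPI2'])
```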

### Step 4 — Upload and operationalize

```python
entity_id     = {'EVENT_DT': 'DATE', 'ID': 'BIGINT'}
feature_names = ['KPI1', 'KPI2']

tdfs4ds.upload_features(
    df=df,
    entity_id=entity_id,
    feature_names=feature_names,
    metadata={'project': 'data quality'}
)
```

This registers entities and features (if not already present), registers a feature engineering process in the process catalog, and writes the feature values into the feature store.
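
To check what was registered, the catalogs can be inspected right after the upload. A quick sketch, assuming `feature_catalog()` is re-exported at the package level like `process_catalog()`:

```python
# Inspect the catalogs for the active data domain
tdfs4ds.feature_catalog()   # registered features and their entities
tdfs4ds.process_catalog()   # the feature engineering process created by upload_features
```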

### Step 5 — Re-run a process

```python
# List all registered processes to find the process ID
tdfs4ds.process_catalog()

# Re-execute by process ID (the UUID shown in the process catalog)
process_id = '<process_uuid>'
tdfs4ds.run(process_id)
```

### Step 6 — Build a dataset

```python
selected_features = {
    'KPI1': '<process_uuid>',
    'KPI2': '<process_uuid>',
}

dataset = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset',
    comment='Dataset for churn model'
)
```

`selected_features` maps each feature name to the UUID of the process that computed it.
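
Rather than copying UUIDs by hand, you can look them up in the process catalog. A sketch, assuming `process_catalog()` returns a teradataml DataFrame that can be pulled into pandas:

```python
# Assumption: process_catalog() returns a teradataml DataFrame
processes = tdfs4ds.process_catalog().to_pandas()
print(processes.head())   # locate the UUID of the process that computed KPI1/KPI2

selected_features = {
    'KPI1': '<process_uuid>',   # paste the UUID found above
    'KPI2': '<process_uuid>',
}
```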

## Key Configuration Variables

```python
tdfs4ds.SCHEMA                = 'my_database'        # target database (auto-set from context)
# Data domain: use tdfs4ds.create_data_domain() / select_data_domain() / get_data_domains()
tdfs4ds.FEATURE_STORE_TIME    = None                 # None = current; '2024-01-01 00:00:00' = time travel
tdfs4ds.DISPLAY_LOGS          = True                 # verbose logging
tdfs4ds.DEBUG_MODE            = False
tdfs4ds.STORE_FEATURE         = 'MERGE'              # 'MERGE' or 'UPDATE_INSERT'

# GenAI documentation
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'           # or 'bedrock'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'sk-...'
```

## Time Travel

All catalogs and feature stores are temporal. Point-in-time queries are available via:

```python
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'   # query historical state
tdfs4ds.FEATURE_STORE_TIME = None                     # back to current state
```
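
In practice, this lets you rebuild a dataset exactly as it looked at a past date. A sketch reusing the Step 6 call; the timestamp is purely illustrative:

```python
# Build the dataset as of a past date, then return to the current state
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'
dataset_jan = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset_20240101',
    comment='Churn dataset as of 2024-01-01'
)
tdfs4ds.FEATURE_STORE_TIME = None
```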

## Package Structure

```text
tdfs4ds/
├── __init__.py                    — Global config variables & re-exported public API
├── lifecycle.py                   — setup(), connect()
├── execution.py                   — run(), upload_features(), roll_out()
├── catalog.py                     — feature_catalog(), process_catalog(), dataset_catalog()
├── data_domain.py                 — get_data_domains(), select_data_domain(), create_data_domain()
├── datasets.py                    — Utility dataset helpers
├── feature_store/
│   ├── entity_management.py       — register_entity(), remove_entity()
│   ├── feature_data_processing.py — prepare_feature_ingestion(), store_feature(), apply_collect_stats()
│   ├── feature_query_retrieval.py — get_list_features(), get_available_features(), get_feature_versions()
│   └── feature_store_management.py — register_features(), feature_store_table_creation()
├── process_store/
│   ├── process_followup.py        — followup_open(), followup_close(), follow_up_report()
│   ├── process_query_administration.py — list_processes(), get_process_id(), remove_process()
│   ├── process_registration_management.py — register_process_view()
│   └── process_store_catalog_management.py — process_store_catalog_creation()
├── dataset/
│   ├── builder.py                 — build_dataset(), build_dataset_opt(), augment_source_with_features()
│   ├── dataset.py                 — Dataset class
│   └── dataset_catalog.py        — DatasetCatalog class
├── genai/
│   └── documentation.py          — LLM-powered auto-documentation of SQL processes (OpenAI / Bedrock)
├── lineage/
│   ├── lineage.py                 — SQL query parsing, DDL analysis
│   ├── network.py                 — Dependency graph construction
│   └── indexing.py                — Lineage indexing utilities
└── utils/
    ├── query_management.py        — execute_query(), execute_query_wrapper()
    ├── filter_management.py       — FilterManager class
    ├── time_management.py         — TimeManager class
    ├── lineage.py                 — crystallize_view(), analyze_sql_query(), generate_view_dependency_network()
    ├── info.py                    — update_varchar_length(), get_column_types(), seconds_to_dhms()
    └── visualization.py           — plot_graph(), visualize_graph(), display_table()
```

## Discover Registered Features

```python
from tdfs4ds.feature_store.feature_query_retrieval import (
    get_list_entity,
    get_list_features,
    get_available_features,
    get_feature_versions,
)
```
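
A hedged usage sketch; the exact signatures and return types are assumptions, so check them against your installed version:

```python
# Assumption: these helpers report on the active data domain and can be
# called without arguments; adjust to the actual signatures in your version.
print(get_list_entity())         # entities known to the feature store
print(get_list_features())       # all registered features
print(get_available_features())  # features that currently have stored values
```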

## Requirements

- Python >= 3.6
- teradataml >= 17.20
- Active Teradata Vantage connection
- **VALIDTIME temporal tables must be enabled** on the Teradata Vantage system — all feature catalogs, process catalogs, and feature stores rely on `VALIDTIME` support
