Metadata-Version: 2.4
Name: tdfs4ds
Version: 0.2.6.5
Summary: A Python package that simplifies feature store usage with Teradata Vantage ...
Author: Denis Molin
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: teradataml>=17.20
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: plotly
Requires-Dist: tqdm
Requires-Dist: networkx
Requires-Dist: sqlparse
Requires-Dist: langchain_openai
Requires-Dist: nbformat>=4.2.0
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![tdfs4ds logo](https://github.com/denismolin/tdfs4ds/raw/main/tdfs4ds_logo.png)

# tdfs4ds — A Feature Store Library for Data Scientists working with ClearScape Analytics

`tdfs4ds` (Teradata Feature Store for Data Scientists) is a Python package for managing temporal feature stores in Teradata Vantage databases. It provides easy-to-use functions for creating, registering, storing, and retrieving features — with full time-travel support, lineage tracking, and process operationalization.

## Installation

```bash
pip install tdfs4ds
```

## Quick Start

Import `tdfs4ds` **after** establishing a teradataml connection so the package can auto-detect your default database:

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
# tdfs4ds.SCHEMA is auto-set from the teradataml context;
# override if needed: tdfs4ds.SCHEMA = 'my_database'

# Data domain management — use the dedicated functions:
tdfs4ds.create_data_domain('MY_PROJECT')   # create and activate a new domain
# or
tdfs4ds.select_data_domain('MY_PROJECT')   # activate an existing domain
# or
tdfs4ds.get_data_domains()                 # list all available domains (* marks the active one)
```

## Core API

| Function | Description |
|----------|-------------|
| `tdfs4ds.setup(database)` | Create feature catalog, process catalog, and follow-up tables in `database` |
| `tdfs4ds.upload_features(df, entity_id, feature_names, metadata={})` | Ingest features from a teradataml DataFrame into the feature store |
| `tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment='dataset')` | Assemble a dataset view from registered features |
| `tdfs4ds.run(process_id)` | Re-execute a registered feature engineering process |
| `tdfs4ds.roll_out(...)` | Operationalize processes at scale |
| `tdfs4ds.connect(database)` | Connect to an existing feature store |

### `entity_id` must specify SQL data types (dict, not list)

```python
entity_id = {'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'}   # correct
entity_id = ['CUSTOMER_ID', 'EVENT_DATE']                      # wrong
```

## Walkthrough Example

### Step 1 — Set up a feature store

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
tdfs4ds.setup(database='my_database')
```

### Step 2 — Configure the active context

```python
tdfs4ds.SCHEMA = 'my_database'   # override if not auto-detected

# Use dedicated functions to manage the data domain:
tdfs4ds.create_data_domain('DATA_QUALITY')   # create and activate (first time)
# tdfs4ds.select_data_domain('DATA_QUALITY') # activate an existing domain
# tdfs4ds.get_data_domains()                 # list all domains
```

### Step 3 — Define your feature engineering view

```python
df = tdml.DataFrame(tdml.in_schema('my_database', 'my_feature_view'))
# If teradataml created intermediate views, make them permanent first:
# tdfs4ds.crystallize_view(df)
```

### Step 4 — Upload and operationalize

```python
entity_id     = {'EVENT_DT': 'DATE', 'ID': 'BIGINT'}
feature_names = ['KPI1', 'KPI2']

tdfs4ds.upload_features(
    df=df,
    entity_id=entity_id,
    feature_names=feature_names,
    metadata={'project': 'data quality'}
)
```

This registers entities and features (if not already present), registers a feature engineering process in the process catalog, and writes the feature values into the feature store.

### Step 5 — Re-run a process

```python
# List all registered processes to find the process ID
tdfs4ds.process_catalog()

# Re-execute by process ID (taken from the catalog output above)
tdfs4ds.run(process_id)
```

### Step 6 — Build a dataset

```python
selected_features = {
    'KPI1': '<process_uuid>',
    'KPI2': '<process_uuid>',
}

dataset = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset',
    comment='Dataset for churn model'
)
```

`selected_features` maps each feature name to the UUID of the process that computed it.
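The mapping can also be assembled programmatically. A minimal sketch, assuming you have already extracted (feature name, process UUID) pairs from the process catalog; the pair format and UUID values below are illustrative, not the actual output shape of `tdfs4ds.process_catalog()`:

```python
# Illustrative rows; in practice, read them from tdfs4ds.process_catalog().
catalog_rows = [
    ('KPI1', 'a1b2c3d4-0000-0000-0000-000000000001'),
    ('KPI2', 'a1b2c3d4-0000-0000-0000-000000000001'),
]

# Keep only the features the dataset needs, keyed by name.
wanted = {'KPI1', 'KPI2'}
selected_features = {name: uuid for name, uuid in catalog_rows if name in wanted}
```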

## Configuration

### Programmatic (in-session)

```python
tdfs4ds.SCHEMA                = 'my_database'        # target database (auto-set from context)
# Data domain: use tdfs4ds.create_data_domain() / select_data_domain() / get_data_domains()
tdfs4ds.FEATURE_STORE_TIME    = None                 # None = current; '2024-01-01 00:00:00' = time travel
tdfs4ds.DISPLAY_LOGS          = True                 # verbose logging
tdfs4ds.DEBUG_MODE            = False
tdfs4ds.STORE_FEATURE         = 'MERGE'              # 'MERGE' or 'UPDATE_INSERT'

# GenAI documentation
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'           # or 'bedrock'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'sk-...'           # prefer env var instead (see below)
```

### Config file (persistent per-project or per-user)

Create a `tdfs4ds.json` file in your project directory (or `~/.tdfs4ds/config.json` for user-wide defaults) to avoid repeating the setup cell in every notebook:

```json
{
    "schema": "MY_DATABASE",
    "data_domain": "MY_PROJECT",
    "display_logs": true,
    "store_feature": "MERGE",
    "varchar_size": 1024,
    "instruct_model_provider": "openai",
    "instruct_model_model": "gpt-4o",
    "instruct_model_url": null
}
```

Keys are case-insensitive. `instruct_model_api_key` is rejected from JSON config to prevent accidental commits — use a `.env` file or OS env var for credentials.
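The two rules above can be illustrated with a small stand-in normalizer. This is a sketch of the documented behaviour, not the package's actual loader (which lives in `tdfs4ds.config`); the function name `normalize_config` is hypothetical:

```python
def normalize_config(raw: dict) -> dict:
    """Sketch: lower-case keys, refuse the API key (keep secrets out of JSON)."""
    cfg = {}
    for key, value in raw.items():
        k = key.lower()                       # keys are case-insensitive
        if k == 'instruct_model_api_key':     # credentials belong in .env / env vars
            raise ValueError(
                'instruct_model_api_key is not allowed in JSON config; '
                'use a .env file or TDFS4DS_INSTRUCT_MODEL_API_KEY'
            )
        cfg[k] = value
    return cfg

normalize_config({'Schema': 'MY_DATABASE', 'DISPLAY_LOGS': True})
# {'schema': 'MY_DATABASE', 'display_logs': True}
```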

### `.env` file (local secrets and overrides)

Place a `.env` file in your project directory (or `~/.tdfs4ds/.env` for user-wide defaults). Only `TDFS4DS_*` variables are read — the file is parsed without touching `os.environ`:

```dotenv
TDFS4DS_SCHEMA=MY_DATABASE
TDFS4DS_DATA_DOMAIN=MY_PROJECT
TDFS4DS_INSTRUCT_MODEL_API_KEY=sk-...
TDFS4DS_INSTRUCT_MODEL_PROVIDER=openai
TDFS4DS_INSTRUCT_MODEL_MODEL=gpt-4o
```

Add `.env` to `.gitignore` to keep secrets out of source control. Quoted values and `export KEY=VALUE` syntax are supported.
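For intuition, here is how a single `.env` line is interpreted under the rules above. This re-implementation is an assumption for illustration only; the real parser lives inside `tdfs4ds.config`:

```python
def parse_dotenv_line(line: str):
    """Sketch: return (key, value) for a TDFS4DS_* line, else None."""
    line = line.strip()
    if not line or line.startswith('#'):
        return None                              # blank lines and comments skipped
    if line.startswith('export '):
        line = line[len('export '):]             # 'export KEY=VALUE' is accepted
    key, sep, value = line.partition('=')
    if not sep:
        return None
    key = key.strip()
    value = value.strip().strip('"').strip("'")  # quoted values supported
    if not key.startswith('TDFS4DS_'):
        return None                              # only TDFS4DS_* variables are read
    return key, value

parse_dotenv_line('export TDFS4DS_SCHEMA="MY_DATABASE"')
# ('TDFS4DS_SCHEMA', 'MY_DATABASE')
```

Note that nothing here writes to `os.environ`, matching the documented behaviour.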

### Environment variables

All settings can also be set via `TDFS4DS_<VAR_NAME>` OS environment variables (useful in CI/CD):

| Variable | Corresponding setting |
|---|---|
| `TDFS4DS_SCHEMA` | `tdfs4ds.SCHEMA` |
| `TDFS4DS_DATA_DOMAIN` | `tdfs4ds.DATA_DOMAIN` |
| `TDFS4DS_DISPLAY_LOGS` | `tdfs4ds.DISPLAY_LOGS` |
| `TDFS4DS_DEBUG_MODE` | `tdfs4ds.DEBUG_MODE` |
| `TDFS4DS_STORE_FEATURE` | `tdfs4ds.STORE_FEATURE` |
| `TDFS4DS_VARCHAR_SIZE` | `tdfs4ds.VARCHAR_SIZE` |
| `TDFS4DS_INSTRUCT_MODEL_PROVIDER` | `tdfs4ds.INSTRUCT_MODEL_PROVIDER` |
| `TDFS4DS_INSTRUCT_MODEL_MODEL` | `tdfs4ds.INSTRUCT_MODEL_MODEL` |
| `TDFS4DS_INSTRUCT_MODEL_URL` | `tdfs4ds.INSTRUCT_MODEL_URL` |
| `TDFS4DS_INSTRUCT_MODEL_API_KEY` | `tdfs4ds.INSTRUCT_MODEL_API_KEY` |

### `load_config()` — explicit reload

```python
# Reload from default search paths
tdfs4ds.load_config()

# Point at specific files
tdfs4ds.load_config(
    path='/configs/feature_store.json',
    dotenv_path='/project/.env.production',
)
```

### Priority chain

```
programmatic (tdfs4ds.X = value)
  > OS environment variable (TDFS4DS_X)
  > .env file (./.env or ~/.tdfs4ds/.env)
  > JSON config file (./tdfs4ds.json or ~/.tdfs4ds/config.json)
  > teradataml auto-detection (SCHEMA only)
  > built-in defaults
```
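A toy resolver makes the chain concrete. This mirrors the documented precedence only; the real resolution logic inside `tdfs4ds.config` may differ in detail:

```python
def resolve(programmatic=None, os_env=None, dotenv=None,
            json_cfg=None, auto_detected=None, default=None):
    """Sketch: first non-None source wins, in the documented order."""
    for value in (programmatic, os_env, dotenv, json_cfg, auto_detected, default):
        if value is not None:
            return value
    return None

# .env beats the JSON config file...
resolve(dotenv='ENV_DB', json_cfg='JSON_DB')          # 'ENV_DB'
# ...but an explicit in-session assignment beats both:
resolve(programmatic='MY_DB', dotenv='ENV_DB')        # 'MY_DB'
```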

## Time Travel

All catalogs and feature stores are temporal. Point-in-time queries are available via:

```python
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'   # query historical state
tdfs4ds.FEATURE_STORE_TIME = None                     # back to current state
```
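Because `FEATURE_STORE_TIME` is a plain module attribute, a small context-manager wrapper (not part of the package, just a convenience sketch) can guarantee it is restored even if a query fails. The demo below uses a stand-in object; in a live session you would pass the `tdfs4ds` module itself:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def as_of(store, timestamp):
    """Temporarily set FEATURE_STORE_TIME, always restoring the previous value."""
    previous = store.FEATURE_STORE_TIME
    store.FEATURE_STORE_TIME = timestamp
    try:
        yield store
    finally:
        store.FEATURE_STORE_TIME = previous

fs = SimpleNamespace(FEATURE_STORE_TIME=None)   # stand-in for the tdfs4ds module
with as_of(fs, '2024-01-01 00:00:00'):
    pass  # queries issued here see the historical state
# fs.FEATURE_STORE_TIME is back to None here
```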

## Package Structure

```text
tdfs4ds/
├── __init__.py                    — Global config variables & re-exported public API
├── config.py                      — External config loading (JSON, .env, env vars); load_config()
├── lifecycle.py                   — setup(), connect()
├── execution.py                   — run(), upload_features(), roll_out()
├── catalog.py                     — feature_catalog(), process_catalog(), dataset_catalog()
├── data_domain.py                 — get_data_domains(), select_data_domain(), create_data_domain()
├── datasets.py                    — Utility dataset helpers
├── feature_store/
│   ├── entity_management.py       — register_entity(), remove_entity()
│   ├── feature_data_processing.py — prepare_feature_ingestion(), store_feature(), apply_collect_stats()
│   ├── feature_query_retrieval.py — get_list_features(), get_available_features(), get_feature_versions()
│   └── feature_store_management.py — register_features(), feature_store_table_creation()
├── process_store/
│   ├── process_followup.py        — followup_open(), followup_close(), follow_up_report()
│   ├── process_query_administration.py — list_processes(), get_process_id(), remove_process()
│   ├── process_registration_management.py — register_process_view()
│   └── process_store_catalog_management.py — process_store_catalog_creation()
├── dataset/
│   ├── builder.py                 — build_dataset(), build_dataset_opt(), augment_source_with_features()
│   ├── dataset.py                 — Dataset class
│   └── dataset_catalog.py         — DatasetCatalog class
├── genai/
│   └── documentation.py           — LLM-powered auto-documentation of SQL processes (OpenAI / Bedrock)
├── lineage/
│   ├── lineage.py                 — SQL query parsing, DDL analysis
│   ├── network.py                 — Dependency graph construction
│   └── indexing.py                — Lineage indexing utilities
└── utils/
    ├── query_management.py        — execute_query(), execute_query_wrapper()
    ├── filter_management.py       — FilterManager class
    ├── time_management.py         — TimeManager class
    ├── lineage.py                 — crystallize_view(), analyze_sql_query(), generate_view_dependency_network()
    ├── info.py                    — update_varchar_length(), get_column_types(), seconds_to_dhms()
    └── visualization.py           — plot_graph(), visualize_graph(), display_table()
```

## GenAI Documentation

The `genai` module provides two complementary ways to document the feature store.

### LLM-powered process documentation

`document_process()` calls an LLM (OpenAI, Azure, vLLM, or AWS Bedrock) to generate:
- Business-logic description of the SQL query
- Entity description and per-column annotations
- EXPLAIN-plan quality score (1–5) with warnings and recommendations

```python
import tdfs4ds
from tdfs4ds.genai import document_process

# Configure the LLM (or use TDFS4DS_INSTRUCT_MODEL_* env vars / .env file)
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'sk-...'

process_info = document_process(process_id='<UUID>', show_explain_plan=True)
```

### LLM-powered dataset documentation

`document_dataset_incremental()` documents a **dataset** by walking its full lineage bottom-up:

1. Source tables — uses the business dictionary if available
2. Intermediate views — auto-documented via LLM if undocumented
3. Process views — actively calls `document_process_incremental` if undocumented
4. Feature/entity column descriptions are **propagated** from process docs (no extra LLM call)
5. A single JSON-constrained LLM call generates five structured sections for the dataset

```python
from tdfs4ds.genai import document_dataset_incremental

result = document_dataset_incremental(
    dataset_id   = '<UUID>',  # from dataset_catalog()
    force_update = False,
    upload       = True,
)

# result['DATASET_SECTIONS'] contains:
#   OVERVIEW, ENTITY, FEATURE_THEMES, BUSINESS_QUESTIONS, INTENDED_AUDIENCE
```

Each section is stored as an independent row in `FS_BUSINESS_DICTIONARY_SECTIONS` — no chunking needed for RAG retrieval.

### Business dictionary (no LLM required)

Three temporal tables store **business-oriented descriptions** for any database object, its columns, and its documentation sections. They form a 3-level hierarchy designed for chunking-free hierarchical RAG:

| Level | Table | Key | Purpose |
|-------|-------|-----|---------|
| 0 | `FS_BUSINESS_DICTIONARY_OBJECTS` | `(DATABASE_NAME, OBJECT_NAME)` | One summary per object (`OBJECT_TYPE`: `'T'`/`'V'`/`'D'`) |
| 1 | `FS_BUSINESS_DICTIONARY_SECTIONS` | `(DATABASE_NAME, OBJECT_NAME, SECTION_NAME)` | One row per documentation section per object |
| 2 | `FS_BUSINESS_DICTIONARY_COLUMNS` | `(DATABASE_NAME, TABLE_NAME, COLUMN_NAME)` | One description per column |

All tables are VALIDTIME temporal and provisioned automatically by `tdfs4ds.connect(create_if_missing=True)`.

```python
import pandas as pd
from tdfs4ds.genai import (
    upload_business_dictionary_objects,
    upload_business_dictionary_columns,
    upload_business_dictionary_sections,
)

# Level 0 — Object-level descriptions
upload_business_dictionary_objects(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'OBJECT_NAME'         : 'CUSTOMER',
        'OBJECT_TYPE'         : 'T',
        'BUSINESS_DESCRIPTION': 'Core customer table. Each row represents a unique enrolled customer.',
    },
]))

# Level 1 — Section-level descriptions (typically LLM-generated for datasets)
upload_business_dictionary_sections(pd.DataFrame([
    {
        'DATABASE_NAME'  : 'MY_DB',
        'OBJECT_NAME'    : 'DATASET_CUSTOMER',
        'SECTION_NAME'   : 'OVERVIEW',
        'SECTION_CONTENT': 'Customer-level analytical dataset combining spending and category features...',
    },
]))

# Level 2 — Column-level descriptions
upload_business_dictionary_columns(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'TABLE_NAME'          : 'CUSTOMER',
        'COLUMN_NAME'         : 'CUSTOMER_ID',
        'BUSINESS_DESCRIPTION': 'Unique customer identifier assigned at enrolment.',
    },
]))
```

All three functions validate required columns and perform a `CURRENT VALIDTIME MERGE` — re-running them updates existing descriptions and preserves the full change history.

## Discover Registered Features

```python
from tdfs4ds.feature_store.feature_query_retrieval import (
    get_list_entity,
    get_list_features,
    get_available_features,
    get_feature_versions,
)
```

## Lineage

The `lineage` module builds end-to-end dependency graphs from a SQL query or a dataset view DDL.

### Dependency graph

```python
from tdfs4ds.lineage import build_teradata_dependency_graph, plot_lineage_sankey, show_plotly_robust

# Start from a dataset view DDL (obtained via SHOW VIEW)
sql = tdml.execute_sql("SHOW VIEW DATASET_CUSTOMER").fetchall()[0][0]

graph = build_teradata_dependency_graph(sql_query=sql)
# Returns: {"nodes": {...}, "edges": [...], "roots": [...]}
```
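The returned dict can be walked with plain Python. The toy graph below hand-codes the lineage chain shown further down; representing edges as `(from, to)` name pairs is an assumption about the exact shape, so check the real output before relying on it:

```python
# Toy graph in the documented {"nodes", "edges", "roots"} shape (edge
# representation assumed for illustration).
graph = {
    'nodes': {'DATASET_CUSTOMER': {}, 'FEAT_ENG_CUST': {}, 'DB_SOURCE.TRANSACTIONS': {}},
    'edges': [('DATASET_CUSTOMER', 'FEAT_ENG_CUST'),
              ('FEAT_ENG_CUST', 'DB_SOURCE.TRANSACTIONS')],
    'roots': ['DATASET_CUSTOMER'],
}

# Leaves, i.e. nodes with no outgoing edges, are the raw source tables.
sources = set(graph['nodes']) - {src for src, _ in graph['edges']}
print(sorted(sources))   # ['DB_SOURCE.TRANSACTIONS']
```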

By default (`expand_datasets_via_process_catalog=True`) dataset nodes are resolved through
the process catalog: `FEATURE_VERSION` UUIDs embedded in the dataset DDL are matched to
`PROCESS_ID` entries in `FS_V_PROCESS_CATALOG`, and edges are drawn directly to the
registered feature-engineering views.

```
DATASET_CUSTOMER  →  FEAT_ENG_CUST  →  DB_SOURCE.TRANSACTIONS
```

Set `expand_datasets_via_process_catalog=False` to connect the dataset directly to the
raw feature-store storage tables (previous behaviour).

```python
fig = plot_lineage_sankey(graph, title="Customer Dataset Lineage")
show_plotly_robust(fig)
```

### Migration manifest

`graph_to_migration_manifest` converts any lineage graph into a flat, JSON-serialisable
dict — useful for planning a feature store migration.

```python
from tdfs4ds.lineage import graph_to_migration_manifest
import json

# All databases
manifest = graph_to_migration_manifest(graph)

# Scoped to the feature store schema only (cross-boundary edges excluded)
manifest_fs = graph_to_migration_manifest(graph, filter_database=tdfs4ds.SCHEMA)
print(json.dumps(manifest_fs, indent=2))
# {
#   "views":  [{"database": "demo_user", "name": "DATASET_CUSTOMER", "type": "dataset"},
#              {"database": "demo_user", "name": "FEAT_ENG_CUST",    "type": "view"}],
#   "tables": [],
#   "edges":  [{"from": "demo_user.DATASET_CUSTOMER", "to": "demo_user.FEAT_ENG_CUST"}]
# }

with open("migration_manifest.json", "w") as f:
    json.dump(manifest_fs, f, indent=2)
```
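One possible follow-up (not a package feature, just a sketch over the manifest shape printed above) is turning the manifest into a DDL-extraction worklist for the migration, one `SHOW VIEW` / `SHOW TABLE` per object:

```python
# Manifest literal copied from the example output above.
manifest = {
    'views':  [{'database': 'demo_user', 'name': 'DATASET_CUSTOMER', 'type': 'dataset'},
               {'database': 'demo_user', 'name': 'FEAT_ENG_CUST',    'type': 'view'}],
    'tables': [],
    'edges':  [{'from': 'demo_user.DATASET_CUSTOMER', 'to': 'demo_user.FEAT_ENG_CUST'}],
}

# Build one DDL-dump statement per object in the manifest.
statements = (
    [f"SHOW VIEW {v['database']}.{v['name']}"  for v in manifest['views']] +
    [f"SHOW TABLE {t['database']}.{t['name']}" for t in manifest['tables']]
)
# Each statement can then be run via tdml.execute_sql(...) on the source system.
```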

## Requirements

- Python >= 3.6
- teradataml >= 17.20
- Active Teradata Vantage connection
- **VALIDTIME temporal tables must be enabled** on the Teradata Vantage system — all feature catalogs, process catalogs, and feature stores rely on `VALIDTIME` support
