Metadata-Version: 2.4
Name: tdfs4ds
Version: 0.2.9.1
Summary: A Python package to simplify feature store usage with Teradata Vantage ...
Author: Denis Molin
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: teradataml>=17.20
Requires-Dist: pandas
Requires-Dist: numpy
Requires-Dist: plotly
Requires-Dist: tqdm
Requires-Dist: networkx
Requires-Dist: sqlparse
Requires-Dist: langchain_openai
Requires-Dist: langchain_aws
Requires-Dist: langchain_core
Requires-Dist: langchain_chroma
Requires-Dist: langchain_teradata
Requires-Dist: langgraph
Requires-Dist: chromadb
Requires-Dist: pydantic
Requires-Dist: gradio
Requires-Dist: nbformat>=4.2.0
Dynamic: author
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

![tdfs4ds logo](https://github.com/denismolin/tdfs4ds/raw/main/tdfs4ds_logo.png)

# tdfs4ds — A Feature Store Library for Data Scientists working with ClearScape Analytics

`tdfs4ds` (Teradata Feature Store for Data Scientists) is a Python package for managing temporal feature stores in Teradata Vantage databases. It provides easy-to-use functions for creating, registering, storing, and retrieving features — with full time-travel support, lineage tracking, and process operationalization.

## Installation

```bash
pip install tdfs4ds
```

## Quick Start

Import `tdfs4ds` **after** establishing a teradataml connection so the package can auto-detect your default database:

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
# tdfs4ds.SCHEMA is auto-set from the teradataml context;
# override if needed: tdfs4ds.SCHEMA = 'my_database'

# Data domain management — use the dedicated functions:
tdfs4ds.create_data_domain('MY_PROJECT')   # create and activate a new domain
# or
tdfs4ds.select_data_domain('MY_PROJECT')   # activate an existing domain
# or
tdfs4ds.get_data_domains()                 # list all available domains (* marks the active one)
```

## Core API

| Function | Description |
|----------|-------------|
| `tdfs4ds.setup(database)` | Create feature catalog, process catalog, and follow-up tables in `database` |
| `tdfs4ds.upload_features(df, entity_id, feature_names, metadata={})` | Ingest features from a teradataml DataFrame into the feature store |
| `tdfs4ds.build_dataset(entity_id, selected_features, view_name, comment='dataset')` | Assemble a dataset view from registered features |
| `tdfs4ds.run(process_id)` | Re-execute a registered feature engineering process |
| `tdfs4ds.roll_out(...)` | Operationalize processes at scale |
| `tdfs4ds.connect(database)` | Connect to an existing feature store |
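
A typical project starts with one of the two lifecycle calls from the table above; a minimal sketch, assuming the keyword argument is `database` as shown:

```python
import tdfs4ds

# First session: create the catalogs and follow-up tables
tdfs4ds.setup(database='my_database')

# Later sessions: attach to the existing feature store instead
# tdfs4ds.connect(database='my_database')
```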

### `entity_id` must specify SQL data types (dict, not list)

```python
entity_id = {'CUSTOMER_ID': 'BIGINT', 'EVENT_DATE': 'DATE'}   # correct
entity_id = ['CUSTOMER_ID', 'EVENT_DATE']                      # wrong
```

## Walkthrough Example

### Step 1 — Set up a feature store

```python
import teradataml as tdml
tdml.create_context(host=..., username=..., password=...)

import tdfs4ds
tdfs4ds.setup(database='my_database')
```

### Step 2 — Configure the active context

```python
tdfs4ds.SCHEMA = 'my_database'   # override if not auto-detected

# Use dedicated functions to manage the data domain:
tdfs4ds.create_data_domain('DATA_QUALITY')   # create and activate (first time)
# tdfs4ds.select_data_domain('DATA_QUALITY') # activate an existing domain
# tdfs4ds.get_data_domains()                 # list all domains
```

### Step 3 — Define your feature engineering view

```python
df = tdml.DataFrame(tdml.in_schema('my_database', 'my_feature_view'))
# If teradataml created intermediate views, make them permanent first:
# tdfs4ds.crystallize_view(df)
```

### Step 4 — Upload and operationalize

```python
entity_id     = {'EVENT_DT': 'DATE', 'ID': 'BIGINT'}
feature_names = ['KPI1', 'KPI2']

tdfs4ds.upload_features(
    df=df,
    entity_id=entity_id,
    feature_names=feature_names,
    metadata={'project': 'data quality'}
)
```

This registers entities and features (if not already present), records a feature engineering process in the process catalog, and writes the feature values into the feature store.
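
You can confirm the registration right after the upload; a quick check, assuming `feature_catalog()` is exported at the package top level like `process_catalog()` used in Step 5:

```python
tdfs4ds.feature_catalog()   # registered features and their entities
tdfs4ds.process_catalog()   # registered feature engineering processes
```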

### Step 5 — Re-run a process

```python
# List all registered processes to find the process ID
tdfs4ds.process_catalog()

# Re-execute by process ID
process_id = '<process_uuid>'   # copied from the process catalog listing above
tdfs4ds.run(process_id)
```

### Step 6 — Build a dataset

```python
selected_features = {
    'KPI1': '<process_uuid>',
    'KPI2': '<process_uuid>',
}

dataset = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset',
    comment='Dataset for churn model'
)
```

`selected_features` maps each feature name to the UUID of the process that computed it.
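
One way to look up these UUIDs is the process catalog from Step 5; a sketch, noting that the catalog's exact column layout is not documented here:

```python
# Browse registered processes to find the UUID behind each feature
processes = tdfs4ds.process_catalog()
processes.to_pandas()   # assumes a teradataml DataFrame is returned
```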

## Configuration

### Programmatic (in-session)

```python
tdfs4ds.SCHEMA                = 'my_database'        # target database (auto-set from context)
# Data domain: use tdfs4ds.create_data_domain() / select_data_domain() / get_data_domains()
tdfs4ds.FEATURE_STORE_TIME    = None                 # None = current; '2024-01-01 00:00:00' = time travel
tdfs4ds.DISPLAY_LOGS          = True                 # verbose logging
tdfs4ds.DEBUG_MODE            = False
tdfs4ds.STORE_FEATURE         = 'MERGE'              # 'MERGE' or 'UPDATE_INSERT'

# GenAI documentation
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'           # or 'bedrock', 'vllm', 'azure'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'sk-...'           # prefer env var instead (see below)

# Embedding model (consumer agent vector index — falls back to INSTRUCT_MODEL_* if unset)
tdfs4ds.EMBEDDING_MODEL_PROVIDER = 'vllm'
tdfs4ds.EMBEDDING_MODEL_URL      = 'https://api.example.com/v1/e5'
tdfs4ds.EMBEDDING_MODEL_MODEL    = 'text-embedding-3-small'
tdfs4ds.EMBEDDING_MODEL_DIM      = 1536

# Chroma vector store
tdfs4ds.CHROMA_MODE = 'local'                        # 'local' or 'server'
tdfs4ds.CHROMA_PATH = './tdfs4ds_chroma'             # persist directory (local mode)
```

### Config file (persistent per-project or per-user)

Create a `tdfs4ds.json` file in your project directory (or `~/.tdfs4ds/config.json` for user-wide defaults) to avoid repeating the setup cell in every notebook:

```json
{
    "schema": "MY_DATABASE",
    "data_domain": "MY_PROJECT",
    "display_logs": true,
    "store_feature": "MERGE",
    "varchar_size": 1024,
    "instruct_model_provider": "openai",
    "instruct_model_model": "gpt-4o",
    "instruct_model_url": null,
    "embedding_model_provider": "vllm",
    "embedding_model_model": "text-embedding-3-small",
    "embedding_model_dim": 1536,
    "chroma_mode": "local",
    "chroma_path": "./tdfs4ds_chroma"
}
```

Keys are case-insensitive. `instruct_model_api_key` is rejected from JSON config to prevent accidental commits — use a `.env` file or OS env var for credentials.

### `.env` file (local secrets and overrides)

Place a `.env` file in your project directory (or `~/.tdfs4ds/.env` for user-wide defaults). Only `TDFS4DS_*` variables are read — the file is parsed without touching `os.environ`:

```dotenv
TDFS4DS_SCHEMA=MY_DATABASE
TDFS4DS_DATA_DOMAIN=MY_PROJECT
TDFS4DS_INSTRUCT_MODEL_API_KEY=sk-...
TDFS4DS_INSTRUCT_MODEL_PROVIDER=openai
TDFS4DS_INSTRUCT_MODEL_MODEL=gpt-4o
TDFS4DS_EMBEDDING_MODEL_PROVIDER=vllm
TDFS4DS_EMBEDDING_MODEL_URL=https://api.example.com/v1/e5
TDFS4DS_EMBEDDING_MODEL_MODEL=text-embedding-3-small
TDFS4DS_EMBEDDING_MODEL_DIM=1536
TDFS4DS_CHROMA_MODE=local
TDFS4DS_CHROMA_PATH=./tdfs4ds_chroma
```

Add `.env` to `.gitignore` to keep secrets out of source control. Quoted values and `export KEY=VALUE` syntax are supported.

### Environment variables

All settings can also be set via `TDFS4DS_<VAR_NAME>` OS environment variables (useful in CI/CD):

| Variable | Corresponding setting |
|---|---|
| `TDFS4DS_SCHEMA` | `tdfs4ds.SCHEMA` |
| `TDFS4DS_DATA_DOMAIN` | `tdfs4ds.DATA_DOMAIN` |
| `TDFS4DS_DISPLAY_LOGS` | `tdfs4ds.DISPLAY_LOGS` |
| `TDFS4DS_DEBUG_MODE` | `tdfs4ds.DEBUG_MODE` |
| `TDFS4DS_STORE_FEATURE` | `tdfs4ds.STORE_FEATURE` |
| `TDFS4DS_VARCHAR_SIZE` | `tdfs4ds.VARCHAR_SIZE` |
| `TDFS4DS_INSTRUCT_MODEL_PROVIDER` | `tdfs4ds.INSTRUCT_MODEL_PROVIDER` |
| `TDFS4DS_INSTRUCT_MODEL_MODEL` | `tdfs4ds.INSTRUCT_MODEL_MODEL` |
| `TDFS4DS_INSTRUCT_MODEL_URL` | `tdfs4ds.INSTRUCT_MODEL_URL` |
| `TDFS4DS_INSTRUCT_MODEL_API_KEY` | `tdfs4ds.INSTRUCT_MODEL_API_KEY` |
| `TDFS4DS_EMBEDDING_MODEL_PROVIDER` | `tdfs4ds.EMBEDDING_MODEL_PROVIDER` |
| `TDFS4DS_EMBEDDING_MODEL_MODEL` | `tdfs4ds.EMBEDDING_MODEL_MODEL` |
| `TDFS4DS_EMBEDDING_MODEL_URL` | `tdfs4ds.EMBEDDING_MODEL_URL` |
| `TDFS4DS_EMBEDDING_MODEL_API_KEY` | `tdfs4ds.EMBEDDING_MODEL_API_KEY` |
| `TDFS4DS_EMBEDDING_MODEL_DIM` | `tdfs4ds.EMBEDDING_MODEL_DIM` |
| `TDFS4DS_CHROMA_MODE` | `tdfs4ds.CHROMA_MODE` |
| `TDFS4DS_CHROMA_PATH` | `tdfs4ds.CHROMA_PATH` |
| `TDFS4DS_CHROMA_HOST` | `tdfs4ds.CHROMA_HOST` |
| `TDFS4DS_CHROMA_PORT` | `tdfs4ds.CHROMA_PORT` |
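
In a scripted CI job, the same variables can be set from Python before the configuration is read, or re-read explicitly with `load_config()` (a sketch; assumes the values are picked up at import or on reload):

```python
import os

# Set configuration before tdfs4ds reads it
os.environ['TDFS4DS_SCHEMA'] = 'MY_DATABASE'
os.environ['TDFS4DS_DISPLAY_LOGS'] = 'false'

import tdfs4ds
tdfs4ds.load_config()   # explicit re-read if tdfs4ds was imported earlier
```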

### `load_config()` — explicit reload

```python
# Reload from default search paths
tdfs4ds.load_config()

# Point at specific files
tdfs4ds.load_config(
    path='/configs/feature_store.json',
    dotenv_path='/project/.env.production',
)
```

### Priority chain

```
programmatic (tdfs4ds.X = value)
  > OS environment variable (TDFS4DS_X)
  > .env file (./.env or ~/.tdfs4ds/.env)
  > JSON config file (./tdfs4ds.json or ~/.tdfs4ds/config.json)
  > teradataml auto-detection (SCHEMA only)
  > built-in defaults
```

## Time Travel

All catalogs and feature stores are temporal. Point-in-time queries are available via:

```python
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'   # query historical state
tdfs4ds.FEATURE_STORE_TIME = None                     # back to current state
```
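
For example, a historical snapshot of the walkthrough dataset can be assembled by setting the time before calling `build_dataset` (a sketch reusing the names from Step 6):

```python
tdfs4ds.FEATURE_STORE_TIME = '2024-01-01 00:00:00'

snapshot = tdfs4ds.build_dataset(
    entity_id={'ID': 'BIGINT'},
    selected_features=selected_features,
    view_name='my_dataset_20240101',
    comment='Point-in-time snapshot as of 2024-01-01'
)

tdfs4ds.FEATURE_STORE_TIME = None   # back to current state
```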

## Package Structure

```text
tdfs4ds/
├── __init__.py                    — Global config variables & re-exported public API
├── config.py                      — External config loading (JSON, .env, env vars); load_config()
├── lifecycle.py                   — setup(), connect()
├── execution.py                   — run(), upload_features(), roll_out()
├── catalog.py                     — feature_catalog(), process_catalog(), dataset_catalog()
├── data_domain.py                 — get_data_domains(), select_data_domain(), create_data_domain()
├── datasets.py                    — Utility dataset helpers
├── agent/
│   ├── __init__.py                — Public exports: consumer_agent, query_optimizer, aquery_optimizer, display_optimization_result, …
│   ├── consumer_agent.py          — Intent classifier, 7 skills, feature doc resolver, LLM helpers
│   ├── graph.py                   — LangGraph StateGraph: classify → detect_domain → skill → synthesize
│   ├── query_optimizer.py         — skill_optimize_query(), query_optimizer(), aquery_optimizer(), display_optimization_result()
│   ├── embedding.py               — get_embeddings(), list_embedding_models()
│   ├── vector_index.py            — build_vector_index(), search_vector_index() (Chroma)
│   └── chatbot.py                 — launch_chatbot(), launch_chatbot_with_index() Gradio UI
├── feature_store/
│   ├── entity_management.py       — register_entity(), remove_entity()
│   ├── feature_data_processing.py — prepare_feature_ingestion(), store_feature(), apply_collect_stats()
│   ├── feature_query_retrieval.py — get_list_features(), get_available_features(), get_feature_versions()
│   └── feature_store_management.py — register_features(), feature_store_table_creation()
├── process_store/
│   ├── process_followup.py        — followup_open(), followup_close(), follow_up_report()
│   ├── process_query_administration.py — list_processes(), get_process_id(), remove_process()
│   ├── process_registration_management.py — register_process_view()
│   └── process_store_catalog_management.py — process_store_catalog_creation()
├── dataset/
│   ├── builder.py                 — build_dataset(), build_dataset_opt(), augment_source_with_features()
│   ├── dataset.py                 — Dataset class
│   └── dataset_catalog.py         — DatasetCatalog class
├── genai/
│   └── documentation.py           — LLM-powered auto-documentation of SQL processes (OpenAI / Bedrock)
├── lineage/
│   ├── lineage.py                 — SQL query parsing, DDL analysis
│   ├── network.py                 — Dependency graph construction
│   └── indexing.py                — Lineage indexing utilities
└── utils/
    ├── query_management.py        — execute_query(), execute_query_wrapper()
    ├── filter_management.py       — FilterManager class
    ├── time_management.py         — TimeManager class
    ├── lineage.py                 — crystallize_view(), analyze_sql_query(), generate_view_dependency_network()
    ├── info.py                    — update_varchar_length(), get_column_types(), seconds_to_dhms()
    └── visualization.py           — plot_graph(), visualize_graph(), display_table()
```

## GenAI Documentation

The `genai` module provides two complementary ways to document the feature store.

### LLM-powered process documentation

`document_process()` calls an LLM (OpenAI, Azure, vLLM, or AWS Bedrock) to generate:
- Business-logic description of the SQL query
- Entity description and per-column annotations
- EXPLAIN-plan quality metrics: two 1–5 scores (User score / Overall score) plus two deterministic execution counters (`n_steps`, `n_spool_objects`) parsed from the raw Teradata EXPLAIN text, with warnings and recommendations

```python
import tdfs4ds
from tdfs4ds.genai import document_process

# Configure the LLM (or use TDFS4DS_INSTRUCT_MODEL_* env vars / .env file)
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'openai'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'gpt-4o'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'sk-...'

process_info = document_process(process_id='<UUID>', show_explain_plan=True)
```

### LLM-powered dataset documentation

`document_dataset_incremental()` documents a **dataset** by walking its full lineage bottom-up:

1. Source tables — uses the business dictionary if available
2. Intermediate views — auto-documented via LLM if undocumented
3. Process views — actively calls `document_process_incremental` if undocumented
4. Feature/entity column descriptions are **propagated** from process docs (no extra LLM call)
5. A single JSON-constrained LLM call generates five structured sections for the dataset

```python
from tdfs4ds.genai import document_dataset_incremental

result = document_dataset_incremental(
    dataset_id   = '<UUID>',  # from dataset_catalog()
    force_update = False,
    upload       = True,
)

# result['DATASET_SECTIONS'] contains:
#   OVERVIEW, ENTITY, FEATURE_THEMES, BUSINESS_QUESTIONS, INTENDED_AUDIENCE
```

Each section is stored as an independent row in `FS_BUSINESS_DICTIONARY_SECTIONS` — no chunking needed for RAG retrieval.
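
The stored sections can be browsed directly with teradataml; a sketch, assuming only the table name and schema shown in this README:

```python
import teradataml as tdml
import tdfs4ds

sections = tdml.DataFrame(tdml.in_schema(tdfs4ds.SCHEMA, 'FS_BUSINESS_DICTIONARY_SECTIONS'))
sections.to_pandas().head()
```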

### Full-store documentation in one call

`document_feature_store_incremental()` documents every registered process and dataset in a single optimised pass. Objects are processed in dependency order (leaves first, roots last) so upstream context is always available. A shared pair of visited-sets ensures each process view is documented at most once, even when referenced by multiple datasets.

```python
from tdfs4ds.genai import document_feature_store_incremental

summary = document_feature_store_incremental(
    language     = 'English',
    force_update = False,
    upload       = True,
)
# summary keys: processes_documented, datasets_documented,
#               processes_skipped, datasets_skipped
```

After documentation, process descriptions are automatically mirrored to the business dictionary (object overview + column-level feature descriptions) so the consumer agent can resolve them without any extra step.

### Business dictionary (no LLM required)

Three temporal tables store **business-oriented descriptions** for any database object, its columns, and its documentation sections. They form a 3-level hierarchy designed for chunking-free hierarchical RAG:

| Level | Table | Key | Purpose |
|-------|-------|-----|---------|
| 0 | `FS_BUSINESS_DICTIONARY_OBJECTS` | `(DATABASE_NAME, OBJECT_NAME)` | One summary per object (`OBJECT_TYPE`: `'T'`/`'V'`/`'D'`) |
| 1 | `FS_BUSINESS_DICTIONARY_SECTIONS` | `(DATABASE_NAME, OBJECT_NAME, SECTION_NAME)` | One row per documentation section per object |
| 2 | `FS_BUSINESS_DICTIONARY_COLUMNS` | `(DATABASE_NAME, TABLE_NAME, COLUMN_NAME)` | One description per column |

All tables are VALIDTIME temporal and provisioned automatically by `tdfs4ds.connect(create_if_missing=True)`.

```python
import pandas as pd
from tdfs4ds.genai import (
    upload_business_dictionary_objects,
    upload_business_dictionary_columns,
    upload_business_dictionary_sections,
)

# Level 0 — Object-level descriptions
upload_business_dictionary_objects(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'OBJECT_NAME'         : 'CUSTOMER',
        'OBJECT_TYPE'         : 'T',
        'BUSINESS_DESCRIPTION': 'Core customer table. Each row represents a unique enrolled customer.',
    },
]))

# Level 1 — Section-level descriptions (typically LLM-generated for datasets)
upload_business_dictionary_sections(pd.DataFrame([
    {
        'DATABASE_NAME'  : 'MY_DB',
        'OBJECT_NAME'    : 'DATASET_CUSTOMER',
        'SECTION_NAME'   : 'OVERVIEW',
        'SECTION_CONTENT': 'Customer-level analytical dataset combining spending and category features...',
    },
]))

# Level 2 — Column-level descriptions
upload_business_dictionary_columns(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'TABLE_NAME'          : 'CUSTOMER',
        'COLUMN_NAME'         : 'CUSTOMER_ID',
        'BUSINESS_DESCRIPTION': 'Unique customer identifier assigned at enrolment.',
    },
]))
```

All three functions validate required columns and perform a `CURRENT VALIDTIME MERGE` — re-running them updates existing descriptions and preserves the full change history.
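
For example, re-running the Level-0 upload with a revised text updates the current row while the previous description stays in the VALIDTIME history:

```python
# Same keys as before, new description — the MERGE updates the existing row
upload_business_dictionary_objects(pd.DataFrame([
    {
        'DATABASE_NAME'       : 'MY_DB',
        'OBJECT_NAME'         : 'CUSTOMER',
        'OBJECT_TYPE'         : 'T',
        'BUSINESS_DESCRIPTION': 'Core customer table, one row per enrolled customer, refreshed daily.',
    },
]))
```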

## Consumer Agent (Chatbot)

The `agent` module provides a conversational interface for business consumers. Non-technical users can ask natural-language questions about features, datasets, definitions, data freshness, usage guidance, data lineage, and calculation logic — in English or French.

### Architecture

```text
User question
  → Intent classifier (Pydantic structured output)
  → DATA_DOMAIN detector (finds which domain owns the feature; remembered across turns)
  → Skill dispatcher (7 skills)
  → Plain-language answer
```

The agent uses LangGraph `StateGraph` with `MemorySaver` for multi-turn conversations. Conversation context is persisted across turns:

| State field | What is remembered |
|---|---|
| `resolved_data_domain` | Which DATA_DOMAIN owns the last named feature/dataset |
| `resolved_object_name` | Last explicitly named feature or dataset |
| `resolved_feature_triplet` | Full resolution: feature name, entity, process ID, view name |
| `resolved_entity_name` | Entity type in focus (e.g. `CustomerID`) |
| `resolved_feature_list` | Feature names currently in focus (one or many) |
| `resolved_column_sources` | Column→source-table map from the last EXPLAIN result (used by DEFINITION drill-down) |

Follow-up questions that omit an explicit feature name (e.g. "when was it last updated?", "how is it calculated?") automatically reuse the previously resolved feature, entity, and domain — no need to repeat yourself. When a feature name is shared across multiple entity types, the remembered entity silently disambiguates without asking for clarification.

After an EXPLAIN turn, the agent lists every **variable involved** in the formula with its source table. Asking "what is `<column>`?" immediately after an EXPLAIN resolves the column through the business dictionary — even if it is not a registered feature. Vague references (e.g. "what does the date mean?") are fuzzy-matched against remembered column and table names.

Feature descriptions are resolved via the process documentation chain:
`entity → features → process_id → VIEW_NAME → FS_BUSINESS_DICTIONARY_COLUMNS`

### Quick start

```python
import tdfs4ds
from tdfs4ds.agent import launch_chatbot_with_index

# Configure LLM and embedding model
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'vllm'
tdfs4ds.INSTRUCT_MODEL_URL      = 'https://api.example.com/v1'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = 'my-key'
tdfs4ds.INSTRUCT_MODEL_MODEL    = 'mistral-7b-instruct'
tdfs4ds.EMBEDDING_MODEL_URL     = 'https://api.example.com/v1/e5'
tdfs4ds.EMBEDDING_MODEL_MODEL   = 'bge-m3'

# Build vector index (incremental) then launch the Gradio chatbot — one call
demo = launch_chatbot_with_index(port=7860)
```

Or call the agent programmatically:

```python
from tdfs4ds.agent import consumer_agent

answer = consumer_agent("What features are available?", thread_id="session-1")
answer = consumer_agent("How is nb_days_since_last_transactions calculated?", thread_id="session-1")
answer = consumer_agent("When was it last updated?", thread_id="session-1")  # feature + entity remembered
answer = consumer_agent("What about for CustomerID?", thread_id="session-1")  # entity remembered, new feature group
```

### Skills

| Intent | Trigger examples | What happens |
|--------|-----------------|--------------|
| `SEARCH` | "What features analyse customer spending?" | Semantic search across vector index + feature catalog |
| `DEFINITION` | "What does total_amount measure?" | Resolves feature → process view → column doc |
| `USAGE` | "How do I use avg_amount in Tableau?" | Audience, granularity, regulatory guidance |
| `FRESHNESS` | "Is total_amount up to date?" | Checks follow-up execution history |
| `SUMMARY` | "List all available features" | Full feature list with entity and description per feature |
| `LINEAGE` | "Where does total_amount come from?" | Walks upstream dependency graph via `build_teradata_dependency_graph` |
| `EXPLAIN` | "How is total_amount calculated?" | Fetches `SHOW VIEW` DDL → LLM explains logic in plain language + lists source columns so you can drill into any variable |

### Notebooks

- `09 - Consumer Agent Chatbot with tdfs4ds.ipynb` — architecture walkthrough, 7-intent test suite, 4-turn multi-turn demo
- `10 - Launch Consumer Agent Chatbot.ipynb` — minimal 4-cell one-command launch

### Gradio trace panel

The chatbot includes a collapsible **Agent Trace** accordion showing, for each turn:
- **Intent Classification** — detected intent, object name, domain
- **DATA_DOMAIN Detection** — available domains, resolved domain, source (detected / remembered)
- **Skill Executed** — skill name and inputs
- **Skill Result** — structured output summary; errors include per-step diagnostic messages

### Model listing

```python
from tdfs4ds.agent import list_instruct_models, list_embedding_models

list_instruct_models()                          # models on INSTRUCT_MODEL_URL
list_embedding_models(sub_paths=['e5', 'code']) # models on EMBEDDING_MODEL_URL sub-paths
```

## Query Optimizer Agent

The `query_optimizer` module analyses and rewrites Teradata SQL feature-engineering queries for better performance. It is process-aware — when a `process_id` is supplied it pulls the registered SQL and any stored EXPLAIN documentation directly from the tdfs4ds process catalog, avoiding redundant LLM calls.

### Pipeline (9 steps)

| Step | What happens |
|------|--------------|
| 0 | **Process context** — SQL + stored EXPLAIN analysis fetched from the catalog (skipped when no `process_id`) |
| 0.5 | **SQL simplification** — structural compaction pass merges unnecessary nesting layers into CTE + single outer SELECT, giving the LLM a cleaner baseline; accepts the simplified form only when its EXPLAIN score ≥ original |
| 1 | **Structured EXPLAIN analysis** — `document_sql_query_explain` scores the simplified query 1–5 with `[You]`-prefixed author-actionable warnings and recommendations |
| 2 | **Lineage graph** — Primary Index + partition columns collected for every underlying object |
| 3 | **DDL fetch** — `SHOW TABLE` / `SHOW VIEW` for every referenced object |
| 4 | **Candidate generation** — LLM proposes up to `N` rewrites focused on `[You]`-actionable items (`N = tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATES`, default 5) |
| 5 | **Candidate EXPLAIN** — `document_sql_query_explain` run per candidate |
| 6 | **Plan comparison** — LLM selects the best plan by score delta and resolved warnings |
| 7 | **FilterManager check** — partitioned-but-unfiltered objects are flagged for incremental processing |
| 8 | **Final report** — Markdown with Score Summary, Simplification section, 3-stage query comparison (Input → After Simplification → After Optimization), Candidates, Selected Optimisation, FilterManager |

### Quick start

```python
import tdfs4ds

# Configure LLM — vllm / OpenAI / Azure / Bedrock
tdfs4ds.INSTRUCT_MODEL_PROVIDER = 'vllm'   # or 'openai', 'azure', 'bedrock'
tdfs4ds.INSTRUCT_MODEL_MODEL    = '...'
tdfs4ds.INSTRUCT_MODEL_API_KEY  = '...'
# tdfs4ds.INSTRUCT_MODEL_URL = '...'  # required for vllm/azure; omit for openai/bedrock

tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATES       = 5     # max valid rewrites to evaluate (default 5)
tdfs4ds.QUERY_OPTIMIZER_MAX_FAILURES         = 5     # max EXPLAIN/syntax failures before stopping (default 5)
tdfs4ds.QUERY_OPTIMIZER_MAX_CANDIDATE_TOKENS = None  # cap completion tokens per candidate call (None = model default; set e.g. 4096 for small-context models)

# Process-aware — SQL and stored EXPLAIN docs pulled from the catalog
result = tdfs4ds.query_optimizer(process_id='<UUID>', thread_id='session-1')

# Or pass raw SQL directly
result = tdfs4ds.query_optimizer(
    sql_query="SELECT ... FROM db.tbl",
    thread_id='session-1',
)
```

Multi-turn follow-ups sharing the same `thread_id` use the `MemorySaver` singleton — the agent retrieves the SQL from history and re-runs the pipeline with additional context:

```python
result = tdfs4ds.query_optimizer(
    "Would adding a Secondary Index on the join column improve the plan?",
    thread_id='session-1',
)
```

Inside a Jupyter notebook, use the async entry point to avoid background-thread overhead:

```python
from tdfs4ds.agent import aquery_optimizer

result = await aquery_optimizer(process_id='<UUID>', thread_id='session-async')
```

### Result keys

| Key | Content |
|-----|---------|
| `answer` | Structured Markdown optimisation report (Summary, Score Summary, Lineage, …) |
| `best_sql` | Recommended SQL (original if already optimal) |
| `score_delta` | Before/after comparison of scores + execution metrics — see below |
| `original_analysis` | Scored EXPLAIN: `explanation`, `user_score`, `global_score`, `n_steps`, `n_spool_objects`, `warnings`, `recommendations` |
| `candidates` | List of candidate dicts: `sql`, `strategy`, `rationale`, `analysis` |
| `comparison` | Plan comparison: `best_index`, `reasoning`, `business_logic_preserved` |
| `filtermanager_applicable` | `True` if a FilterManager loop was recommended |
| `process_info` | Full process catalog record (when `process_id` is supplied) |
| `steps` | Every pipeline step with inputs and outputs |
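
A minimal way to consume the result in a notebook, using only the keys listed above:

```python
from IPython.display import Markdown, display

display(Markdown(result['answer']))   # full optimisation report
print(result['best_sql'])             # recommended SQL (original if already optimal)
print(result['score_delta'])          # before/after metrics, detailed below
```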

### Score comparison (`score_delta`)

After each optimization run, `score_delta` captures exactly how much the rewrite improved the query — across both LLM-assessed scores and deterministic execution-plan metrics:

```python
sd = result['score_delta']
# {
#   'optimized':                True,
#   'best_strategy':            'partition_pruning',
#   'original_user_score':      2,
#   'original_global_score':    3,
#   'best_user_score':          4,
#   'best_global_score':        4,
#   'user_score_delta':         2.0,   # +2 improvement
#   'global_score_delta':       1.0,   # +1 improvement
#   'original_n_steps':         14,
#   'best_n_steps':             9,
#   'steps_delta':              -5,    # 5 fewer execution steps
#   'original_n_spools':        6,
#   'best_n_spools':            4,
#   'spools_delta':             -2,    # 2 fewer spool materialisations
#   'business_logic_preserved': True,
# }
```

Four signals are reported side by side:

- `user_score` (1–5) — quality of what the SQL author controls.
- `global_score` (1–5) — overall plan quality, including infrastructure factors (Primary Index placement, statistics, etc.). When the bottleneck is infrastructure rather than the SQL itself, a rewrite may improve `user_score` without moving `global_score`.
- `n_steps` — number of numbered execution steps in the Teradata EXPLAIN plan. Parsed deterministically from the raw EXPLAIN text.
- `n_spool_objects` — number of distinct `Spool` objects the plan materialises. A rough proxy for intermediate-result memory/IO pressure.

Negative `steps_delta` / `spools_delta` mean the rewrite is lighter than the baseline. When the rewrite improves the score but adds a small number of steps or spools (typically single-row CTEs backing a precomputed threshold), the report appends an explanatory note describing the trade-off.

The optimizer `Score Summary` table in the generated report always shows all four metrics as **Before → After** (or **Input → Simplified → Optimized** when the simplification pass changed the SQL).
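
A small sketch that prints the same comparison from the `score_delta` keys shown above:

```python
sd = result['score_delta']

print(f"Strategy      : {sd['best_strategy']}")
print(f"User score    : {sd['original_user_score']} -> {sd['best_user_score']}  ({sd['user_score_delta']:+})")
print(f"Global score  : {sd['original_global_score']} -> {sd['best_global_score']}  ({sd['global_score_delta']:+})")
print(f"EXPLAIN steps : {sd['original_n_steps']} -> {sd['best_n_steps']}  ({sd['steps_delta']:+})")
print(f"Spool objects : {sd['original_n_spools']} -> {sd['best_n_spools']}  ({sd['spools_delta']:+})")
print(f"Business logic preserved: {sd['business_logic_preserved']}")
```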

### Standalone simplification

The simplification pass can be called independently of the full optimizer:

```python
result = tdfs4ds.simplify_query(sql_query="SELECT ...")
# or from a registered process
result = tdfs4ds.simplify_query(process_id='<UUID>')

# result keys: simplified_sql, original_sql, simplified (bool),
#              original_score (1-5), simplified_score (1-5)
if result["simplified"]:
    print(result["simplified_sql"])
```

### Notebook display

`display_optimization_result` renders a score comparison widget at the top of the report, followed by the full Markdown analysis. The widget shows a before/after table with colour-coded Δ Change cells (green for improvement, red for regression) and a footer line for strategy, optimization status, and business-logic preservation.

```python
from tdfs4ds.agent import display_optimization_result

display_optimization_result(result)
```

### FilterManager integration

When `filtermanager_applicable` is `True`, the report includes a ready-to-use code snippet for iterating over partitions one at a time — reducing per-run spool usage and enabling full partition elimination:

```python
fm = tdfs4ds.FilterManager(
    schema_name = tdfs4ds.SCHEMA,
    view_name   = 'TRANSACTIONS',
    col_names   = ['transaction_date'],
)

for filter_id in range(fm.nb_filters):
    fm.update(filter_id)
    tdfs4ds.run(process_id)
```

### Notebook

`11 - Query Optimizer Agent with tdfs4ds.ipynb` (`notebook dev/genai/`) walks through the full pipeline end-to-end — process-aware entry, multi-turn follow-ups, score comparison widget, and FilterManager recommendation.

## Discover Registered Features

```python
from tdfs4ds.feature_store.feature_query_retrieval import (
    get_list_entity,
    get_list_features,
    get_available_features,
    get_feature_versions,
)
```

## Lineage

The `lineage` module builds end-to-end dependency graphs from a SQL query or a dataset view DDL.

### Dependency graph

```python
from tdfs4ds.lineage import build_teradata_dependency_graph, plot_lineage_sankey, show_plotly_robust

# Start from a dataset view DDL (obtained via SHOW VIEW)
sql = tdml.execute_sql("SHOW VIEW DATASET_CUSTOMER").fetchall()[0][0]

graph = build_teradata_dependency_graph(sql_query=sql)
# Returns: {"nodes": {...}, "edges": [...], "roots": [...]}
```

By default (`expand_datasets_via_process_catalog=True`) dataset nodes are resolved through
the process catalog: `FEATURE_VERSION` UUIDs embedded in the dataset DDL are matched to
`PROCESS_ID` entries in `FS_V_PROCESS_CATALOG`, and edges are drawn directly to the
registered feature-engineering views.

```
DATASET_CUSTOMER  →  FEAT_ENG_CUST  →  DB_SOURCE.TRANSACTIONS
```

Set `expand_datasets_via_process_catalog=False` to connect the dataset directly to the
raw feature-store storage tables (previous behaviour).
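
Before plotting, the returned structure can be inspected directly; only the three top-level keys shown above are assumed:

```python
print('Roots:', graph['roots'])
for edge in graph['edges']:
    print(edge)
```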

```python
fig = plot_lineage_sankey(graph, title="Customer Dataset Lineage")
show_plotly_robust(fig)
```

### Migration manifest

`graph_to_migration_manifest` converts any lineage graph into a flat, JSON-serialisable
dict — useful for planning a feature store migration.

```python
from tdfs4ds.lineage import graph_to_migration_manifest
import json

# All databases
manifest = graph_to_migration_manifest(graph)

# Scoped to the feature store schema only (cross-boundary edges excluded)
manifest_fs = graph_to_migration_manifest(graph, filter_database=tdfs4ds.SCHEMA)
print(json.dumps(manifest_fs, indent=2))
# {
#   "views":  [{"database": "demo_user", "name": "DATASET_CUSTOMER", "type": "dataset"},
#              {"database": "demo_user", "name": "FEAT_ENG_CUST",    "type": "view"}],
#   "tables": [],
#   "edges":  [{"from": "demo_user.DATASET_CUSTOMER", "to": "demo_user.FEAT_ENG_CUST"}]
# }

with open("migration_manifest.json", "w") as f:
    json.dump(manifest_fs, f, indent=2)
```

## Requirements

- Python >= 3.6
- teradataml >= 17.20
- Active Teradata Vantage connection
- **VALIDTIME temporal tables must be enabled** on the Teradata Vantage system — all feature catalogs, process catalogs, and feature stores rely on `VALIDTIME` support
