Metadata-Version: 2.1
Name: teradatagenai
Version: 20.0.0.7
Summary: Teradata package for Generative-AI powered text analytics on Teradata Vantage
Home-page: https://teradata.com
Author: Teradata Corporation
License: Teradata License Agreement
Keywords: Teradata
Platform: MacOS X, Windows, Linux
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Database :: Front-Ends
Classifier: License :: Other/Proprietary License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: teradataml (>=20.00.00.10)
Provides-Extra: nv-ingest-client
Requires-Dist: nv-ingest-client ; extra == 'nv-ingest-client'

## Teradata Package for Generative-AI

`teradatagenai` is a Generative AI package developed by Teradata.
It offers a comprehensive suite of APIs designed for a wide range of text analytics applications 
and seamless access to the Enterprise Vector Store.
With `teradatagenai`, users can seamlessly process and analyze text data from various sources,
including emails, academic papers, social media posts, and product reviews.
This enables users to gain insights with precision and depth that rival or surpass human analysis.

For community support, please visit the [Teradata Community](https://support.teradata.com/community?id=community_forum&sys_id=14fe131e1bf7f304682ca8233a4bcb1d).

For Teradata customer support, please visit [Teradata Support](https://support.teradata.com/csm).

Copyright 2025, Teradata. All Rights Reserved.

### Table of Contents
* [Documentation](#documentation)
* [Release Notes](#release-notes)
* [Installation and Requirements](#installation-and-requirements)
* [Using the Teradata Package for Generative AI](#using-the-teradata-package-for-generative-ai)
* [License](#license)

## Documentation
General product information, including installation instructions, is available on the [Teradata Documentation website](https://docs.teradata.com/search/documents?query=Python+package+for+Generative-AI&sort=last_update&virtual-field=title_only&content-lang=en-US).
## Release Notes
### Version 20.00.00.07
* ##### New Features/Functionality
  * New features introduced in this release require Database version 20.00.30.XX
  * ##### Collection (Teradata Enterprise Vector Store V2)
    *  Added new `Ingestor` methods for file-based collection monitoring:
      * `get_file_store()` - Retrieve information about storage objects (tables) containing ingested file data for file-based collections. Tracks the underlying storage structures created during file ingestion.
      * `get_file_metadata()` - Retrieve metadata about individual files and their processing status for file-based collections. Tracks each file's ingestion progress including upload, extraction, and embedding generation stages. Applicable only for file-based collection types (FILE_CONTENT_BASED, FILE_EMBEDDING_BASED).
    * Added following new parameters for collection management:
      * `sort_by` - Specifies the field(s) and direction(s) to sort results by. Supports single field (str), single field with direction (tuple), or multi-field sorting (list of tuples).
      * `filter` - Specifies an advanced filtering expression. Supports operators: `=>` (greater than), `=<` (less than), `=!` (not equal), `=(val1,val2)` (in set), `=~` (contains/like), and logical operators AND, OR, NOT with parentheses grouping.
      * `search` - Specifies a full-text search query across searchable fields.
    * Added new parameters for Collection create and update operations:
      * `batch` - Specifies whether embedding should be generated in batches.
      * `infer_table_structure` - Specifies whether to detect table layout and reconstruct the table structure as HTML. Adds `text_as_html` metadata with serialized HTML table representation. Applicable for `UnstructuredIngestor`.
      * `filter_type` - Specifies the format type of the filter expression for similarity search. Supports `'python'` (Python-style filtering) or `'sql'` (SQL-style filtering expressions).
      * `data_refresh` - Specifies whether to refresh/regenerate vectors for the collection using existing backing/source tables. When set to `True`, vectors are refreshed from the existing source tables without re-ingesting the original documents. Only applicable for the update operation.
    * Added new parameter for `SearchParams`:
      * `mmr_threshold` - Specifies the threshold value used by the MMR (Maximal Marginal Relevance) strategy while selecting diverse and relevant results.
  * ##### Ingestor
    * Added asynchronous pipeline execution support to the `run()` method:
      * `sync` - Specifies whether to execute the pipeline synchronously or asynchronously. Default: `True`.
      * `poll_timeout` - Specifies the polling timeout in seconds for asynchronous execution. Only applicable if `sync=False`. Default: `None` (no timeout, wait indefinitely).
  * ##### S3Config
    * Added new parameter for S3-compatible storage configuration:
      * `endpoint_url` - Specifies a custom endpoint URL for S3-compatible storage services. Only applicable for Artemis.
  * ##### NVIDIA NV-Ingest Integration
    * Added support for multimodal embeddings and retrieval via image input:
      * `is_image` - Specifies whether the queries represent images rather than text. Default: `False`. Applicable to `nvingest_retrieval` and `TeradataVDB.retrieval`.
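
The MMR (Maximal Marginal Relevance) strategy named by `mmr_threshold` can be pictured with a small, library-independent sketch. The code below is a generic textbook-style MMR selection in plain Python, not the Vector Store implementation; the function names and the `lambda_mult` argument are assumptions made for illustration, and the sketch does not model how the service applies `mmr_threshold`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximal Marginal Relevance.

    lambda_mult=1.0 ranks purely by relevance to the query; lower values
    penalise similarity to items that were already selected.
    """
    relevance = [cosine(query, c) for c in candidates]
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical 2-D embeddings: docs 0 and 1 are near-duplicates, doc 2 differs.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With a high multiplier the two near-duplicate documents are returned; with a lower one the second slot goes to the more diverse document.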

### Version 20.00.00.06
* ##### New Features/Functionality
  * ##### Collection (Teradata Enterprise Vector Store V2)
    * Added comprehensive support for Teradata Enterprise Vector Store V2 with new Collection class and methods:
      * `Collection`: New class providing modern interface for vector store operations with support for content-based, embedding-based, and file-based collections.
      * `CollectionManager`: Management interface with methods for health checks, listing collections, session management, and user disconnection.
      * CollectionManager methods:
        * `disconnect()` - Terminate user sessions and clean up connections.
        * `health()` - Check system health and service availability.
        * `list()` - List all available collections with metadata.
        * `list_sessions()` - List active user sessions and connection details.
      * Collection creation methods:
        * `create()` - Creates a new empty collection with specified configuration and schema.
        * `from_datasets()` (class method) - Creates content-based collections from tables or DataFrames.
        * `from_texts()` (class method) - Creates content-based collections from raw text or text lists. *Requires create_context.*
        * `from_embeddings()` (class method) - Creates embedding-based collections from pre-embedded data.
        * `from_documents()` (class method) - Creates file-based collections from PDF documents and directories.
      * Collection data and lifecycle methods:
        * `add_datasets()`, `add_texts()` (*requires create_context*), `add_embeddings()`, `add_documents()` - Add data to existing collections.
        * `delete_datasets()`, `delete_embeddings()`, `delete_documents()` - Remove data from collections.
        * `create()`, `update()`, `destroy()` - Collection lifecycle management.
        * `status()`, `get_details()` - Collection information and monitoring.
      * Advanced search and retrieval:
        * `similarity_search()` - Vector similarity search with flexible filtering and ranking options.
        * `similarity_search_by_vector()` - Direct vector-based similarity search.
        * `ask()` - RAG-based question answering with chat model integration.
        * `prepare_response()` - Response preparation and formatting for chat applications.
      * Collection utilities:
        * `get_indexes_embeddings()` - Retrieve embedding and indexing information. *Requires create_context.*
        * `get_model_info()` - Get model configuration details. *Requires create_context.*
        * `list_user_permissions()` - View user access permissions.
    * Added `Ingestor` class for declarative pipeline orchestration:
      * Fluent API for building ingestion pipelines with method chaining.
      * Pipeline stages: `Ingestor()` → `extract()` → `files()`/`load()` → `embed()` → `create()` → `run()`.
      * Support for both file-based and table-based collection creation.
      * `Ingestor()` - Creates an empty collection with required configuration.
      * `extract()` - Configure extraction options for document processing (text, images, tables, metadata).
      * `files()` - Configure file-based ingestion from local storage, S3, Azure Blob, or Google Cloud.
      * `load()` - Configure table-based ingestion from existing datasets.
      * `embed()` - Configure embedding model and generation options.
      * `upsert()` - Configure indexing algorithms and finalization parameters for creating/updating collection.
      * `run()` - Execute the complete configured pipeline with progress monitoring.
    * Added comprehensive data classes for collection configuration:
      * Index configuration: `ContentBasedIndex`, `EmbeddingBasedIndex` for defining collection schemas.
      * Column specification: `ColumnInfo` for detailed column metadata, types, and source mapping.
      * Search configuration: `SearchParams` for similarity search parameters and filtering.
      * Indexing algorithms: `HNSW`, `FLAT`, `IVF_FLAT` for vector indexing configuration.
      * File source configuration: `LocalConfig`, `S3Config`, `AzureBlobConfig`, `GCPConfig` for multi-cloud file access.
      * Document processing: `BasicIngestor`, `NVIngestor`, `UnstructuredIngestor` for different extraction capabilities.
      * Schema definition: `ExtractionSchema` for defining table structures and column mappings for file-based collections.
  * ##### TeradataAI
    * Enhanced model access capabilities supporting multiple deployment scenarios:
      * **Teradata Provided Models** - Access to pre-deployed models in Teradata infrastructure (existing implementation).
      * **Customer Credential Models** - Direct access to models using customer's cloud credentials:
        * AWS Bedrock - Access models deployed in customer's AWS account using AWS credentials.
        * Azure OpenAI - Access models deployed in customer's Azure subscription using Azure credentials.
        * NVIDIA NIM - Access NVIDIA Inference Microservices using customer NIM credentials.
        * Google Cloud (Vertex AI) - Access models deployed in customer's GCP project using GCP credentials.
      * **LiteLLM Proxy Integration** - Access models through LiteLLM proxy server for unified model management.
      * **Custom Provider Support with LiteLLM** - Access models from custom providers using LiteLLM framework.
    * **Guardrail Model Support** - Specify and configure guardrail models for content safety, topic control, and jailbreak detection to ensure safe and controlled AI outputs.
  * VectorStore/Collection can be used without requiring a database connection; authentication via `set_auth_token` is sufficient.
  * The authentication token object returned by `set_auth_token` can be supplied to the `VectorStore`, `Collection`, `VSManager`, and `CollectionManager` classes, enabling them to use the token for all subsequent operations.

  * ##### TextAnalyticsAI
    * Added support for `output_charset` parameter to set the charset of the result embeddings to either 'LATIN' or 'UNICODE'.
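
The "fluent API with method chaining" mentioned for `Ingestor` follows a common builder pattern, which the toy sketch below illustrates. `PipelineSketch` is a hypothetical stand-in written for this note, not the `teradatagenai` class; the option names (`text`, `path`, `model`) are likewise assumptions, and the stage order simply mirrors the methods listed above.

```python
class PipelineSketch:
    """Toy stand-in for a fluent ingestion pipeline; NOT the teradatagenai Ingestor."""

    def __init__(self, name):
        self.name = name
        self.stages = []

    def _add(self, stage, **options):
        # Record the stage and return self so that calls can be chained.
        self.stages.append((stage, options))
        return self

    def extract(self, **options):
        return self._add("extract", **options)

    def files(self, **options):
        return self._add("files", **options)

    def embed(self, **options):
        return self._add("embed", **options)

    def upsert(self, **options):
        return self._add("upsert", **options)

    def run(self):
        # A real pipeline would execute each stage; this only reports the order.
        return [stage for stage, _ in self.stages]


order = (
    PipelineSketch("demo_collection")
    .extract(text=True)                           # hypothetical option
    .files(path="docs/")                          # hypothetical option
    .embed(model="hypothetical-embedding-model")  # hypothetical option
    .upsert()
    .run()
)
print(order)  # ['extract', 'files', 'embed', 'upsert']
```

Each configuration method returns the same object, which is what lets the stages read top to bottom in the order they will run.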


### Version 20.00.00.05
* ##### Bug Fixes
  * ##### TextAnalytics
    * `ELE-9588`: Fixed backward compatibility issue with 'show_num_tokens' and 'refresh_credential_time' parameters
      in Database version < '20.00.28.XX'.
### Version 20.00.00.04
* ##### New Features/Functionality
  * New features introduced in this release require Database version 20.00.28.XX
  * ##### Vector Store
    * Added support for TeradataAI ONNX object as embeddings in AI Factory.
    * Added a new method `delete_by_ids` that deletes specific chunks from a file in the vector store. *Requires create_context.*
    * Exposes the following new parameters for create and update:
        * `metadata_columns` - Specifies the list of input column names to be used for metadata.
        * `metadata_descriptions` - Specifies the descriptions of the metadata columns.
        * `content_safety_base_url` - Specifies the base URL for Guardrails model ensuring safe outputs from LLM.
        * `topic_control_base_url` - Specifies the base URL for Guardrails model ensuring topic control.
        * `jailbreak_detection_base_url` - Specifies the base URL for Guardrails model ensuring jailbreak detection.
        * `vlm_base_url` - Specifies the base URL for Vision Language Model when extract_caption from images is enabled.
        * `ranking_base_url` - Specifies the base URL for the service to be used for the reranker model.
        * `content_safety_model` - Specifies the guardrails model ensuring safe outputs from LLM.
        * `topic_control_model` - Specifies the guardrails model ensuring topic control.
        * `vlm_model` - Specifies the Vision Language Model to be used when extract_caption from images is enabled.
        * `guardrails` - Specifies what kind of Guardrails to apply.
        * `ranking_model` - Specifies the model to be used for reranking the search results.
        * `embedding_datatype` - Specifies the data type for storing embeddings.
        * `use_simd` - Specifies whether to use SIMD for faster processing.
        * `maximal_marginal_relevance` - Specifies whether to use Maximal Marginal Relevance (MMR) for retrieving documents.
        * `num_NodesPerGraph` - Specifies the number of nodes per graph in the HNSW graph during construction.
        * `lambda_multiplier` - Specifies lambda multiplier to control the trade-off between relevance and diversity when selecting documents.
        * `chunk_overlap` - Specifies the number of overlapping characters between two consecutive chunks during file splitting.
        * `extract_metadata_json` - Specifies whether to extract metadata in JSON format when using NVIDIA NV-Ingest.
        * `extract_caption` - Specifies whether to extract captions for images and tables when using NVIDIA NV-Ingest.
        * `overwrite_object` - Specifies whether to overwrite the existing object with the same name in the database.
        * `embedding_data_columns` - Specifies the name of the column over which the pre embedded data is generated.
        * `metadata_operation` - Specifies the operation to be performed on metadata columns during update (ADD, DELETE, MODIFY).
        * `new_vs_name` - Specifies the new name to be used for the vector store.
    * Added new parameters to pass model URL parameters and ingest parameters to `from_*` and `add_*` methods during vector store creation:
      * `model_urls` - Specifies the URLs and model information to be used during Vector Store creation.
      * `ingest_params` - Specifies the parameters to be used for document ingestion for NIM. Applicable only for file-based vector stores.

    * Added new parameters for similarity_search:
      * `column`: Specifies the column name which contains the question in text format.
      * `data`: Specifies the table name/DataFrame which contains the question in text format.

    * Added new parameters for ask:
      * `batch_vector_column`: Specifies the column that contains the questions in embedded form.
      * `question_vector`: Specifies the question in vector/embedded form.
      * `data`: Specifies table name or corresponding teradataml DataFrame where the question is stored (only one question/row should be present).
      * `column`: Specifies the column name which contains the question in text format.
      * `vector_column`: Specifies the column name which contains the question in embedded format.

    * Added new common parameters for similarity_search, similarity_search_by_vector and ask:
      * `top_k`: Specifies the number of top similarity matches to be generated.
      * `search_threshold`: Specifies the threshold value to consider matching tables/views while searching.
      * `search_numcluster`: Specifies the number of clusters or fraction of train_numcluster to be considered while searching. 
      * `ef_search`: Specifies the number of neighbors to consider during search in HNSW graph.  
      * `filter`: Specifies the filter to be used for filtering the results.
      * `filter_style`: Specifies whether to apply filtering before or after the similarity_search.
      * `maximal_marginal_relevance`: Specifies whether to use Maximal Marginal Relevance (MMR) for retrieving documents.
      * `lambda_multiplier`: Specifies Lambda multiplier to control the trade-off between relevance and diversity when selecting documents.

    * Added the following classes:
      * `ModelUrlParams` class to configure model and URL-related parameters for vector store creation using from_* and add_* methods on AI-Factory.
      * `IngestParams` class to configure ingestor-related parameters for file-based vector store creation using from_* and add_* methods on AI-Factory.
      * Note: Users can still pass these parameters directly while creating the vector store.

    * Added the following methods to set the search parameters based on the "search_algorithm":
      * `set_kmeans_search_params()` method to configure KMEANS search parameters for the vector store.
      * `set_hnsw_search_params()` method to configure HNSW search parameters for the vector store. 
      * `set_vectordistance_search_params()` method to configure VECTORDISTANCE search algorithm parameters for the vector store.

  * ##### TeradataAI
    * Authentication precedence order for TeradataAI has been established: the authorization object has the highest priority, followed by explicitly passed parameters, then the configuration file, and finally environment variables.
    * Added new parameter `model_operation` for AWS Bedrock to specify the operation type ('invoke' or 'converse').

  * ##### TextAnalyticsAI
    * Added support for `show_num_tokens` parameter to display token count information during text analytics operations.
    * Added support for `refresh_credential_time` parameter for AWS and Azure to control credential refresh time.
    * Enhanced `accumulate` parameter to accept list of strings for AWS, Azure, GCP, and NIM API types.

  * ##### NVIDIA NV-Ingest Integration
    * Added comprehensive integration with NVIDIA's NV-Ingest pipeline for advanced document processing and vector store creation.
    * `create_nvingest_schema`: Creates a default schema structure compatible with NVIDIA NV-Ingest processing pipeline.
    * `write_to_nvingest_vector_store`: Process and insert NV-Ingest records into a Teradata Vector Store. Handles complete pipeline from data processing to vector store creation with content type filtering options.
    * `nvingest_retrieval`: Perform vector similarity search using NVIDIA embedding models against a Teradata Vector Store. Supports single or multiple query processing with automatic embedding generation.
    * `TeradataVDB` class: Implementation of NV-Ingest abstract VDB class that helps integrate Teradata Vector Store into the NVIDIA NV-Ingest processing pipeline. It supports the following functions:
      * `create_index`: Create schema for Teradata Vector Store compatible with NVIDIA NV-Ingest.
      * `write_to_index`: Write NV-Ingest extracted records to the Teradata Vector Store.
      * `retrieval`: Perform similarity search and return results from the vector store.
      * `run`: Combine vector store creation and record writing in one operation.

* ##### Bug Fixes
  * ##### TextAnalytics
    * `ELE-9226`: Fixed sample script installation for embeddings() and sentence_similarity().
    * `ELE-9406`: Fixed invalid SQL generation with authorization parameter for mask_pii().
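
The TeradataAI authentication precedence described in this release (authorization object, then explicitly passed parameters, then the configuration file, then environment variables) amounts to a first-match lookup. The sketch below models each source as a plain dictionary, which is an assumption for illustration only; `resolve_setting` and the `api_key` name are hypothetical and are not part of `teradatagenai`.

```python
import os


def resolve_setting(key, auth_object=None, explicit=None, config_file=None, environ=None):
    """Return (value, source) from the first source that defines `key`.

    The order mirrors the release note: authorization object, explicit
    parameters, configuration file, environment variables.
    """
    environ = os.environ if environ is None else environ
    sources = [
        ("authorization object", auth_object or {}),
        ("explicit parameter", explicit or {}),
        ("configuration file", config_file or {}),
        ("environment variable", environ),
    ]
    for source_name, values in sources:
        if values.get(key) is not None:
            return values[key], source_name
    raise KeyError(f"{key!r} not found in any source")


# No authorization object is given, so the explicit parameter wins.
value, source = resolve_setting(
    "api_key",  # hypothetical setting name
    explicit={"api_key": "from-parameter"},
    config_file={"api_key": "from-config"},
    environ={"api_key": "from-env"},
)
print(value, source)  # from-parameter explicit parameter
```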

### Version 20.00.00.03
* ##### New Features/Functionality
  * Features introduced in this release require Database version 20.00.27.XX
  * ##### Vector Store
    * Added new methods for managing and creating vector stores:
      * `from_documents(name, documents, embedding=None, **kwargs)`: Creates a file-based vector store directly from PDF documents, directories, or wildcards. Supports embedding models and chat completion models. If the store already exists, raises an error.
      * `from_texts(name, texts, embedding=None, **kwargs)`: Creates a content-based vector store from raw text or a list of texts. Supports embedding models and chat completion models. If the store already exists, raises an error.
      * `from_datasets(name, data, embedding=None, **kwargs)`: Creates a content-based vector store from tables or DataFrames, specifying data columns and optional key columns, with embedding model support. If the store already exists, raises an error.
      * `from_embeddings(name, data, **kwargs)`: Creates an embedding-based vector store from pre-embedded tables or DataFrames, specifying the embedding columns. If the store already exists, raises an error.
      * `add_documents(documents, **kwargs)`: Adds documents (PDFs, directories, or wildcards) to an existing file-based vector store. Automatically creates the store if it does not exist.
      * `add_datasets(data, **kwargs)`: Adds tables or DataFrames to a content-based vector store. Creates the store if needed.
      * `add_embeddings(data, **kwargs)`: Adds embedding data to an embedding-based vector store.
      * `add_texts(texts, **kwargs)`: Adds raw text or list of texts to a content-based vector store.
      * `delete_documents(documents, **kwargs)`: Removes specified documents from a file-based vector store.
      * `delete_datasets(data, **kwargs)`: Removes specified datasets from a content-based vector store.
      * `delete_embeddings(data, **kwargs)`: Removes embedding data from an embedding-based vector store.
    * The `name` argument for `VectorStore` initialization is now optional. If not provided, the store can be created later using the `create` method.
    * The `create` method now accepts the `name` argument, allowing users to specify or update the vector store name at creation time.
  * ##### TextAnalyticsAI & TeradataAI
    * Added support to save and retrieve external ONNX models and tokenizers in Teradata Vantage for API type 'onnx'.
      * Users can specify separate tables and schema for models and tokenizers using the following new parameters:
        * `model_table_name`, `model_schema_name`: Specify the table and schema for storing the ONNX model.
        * `tokenizer_table_name`, `tokenizer_schema_name`: Specify the table and schema for storing the tokenizer.
        * `tokenizer_id`: Optionally specify a unique identifier to save or retrieve the tokenizer (defaults to `model_id` if not provided).
        * `additional_columns_model`, `additional_columns_types_model`: Add custom metadata columns and types for the model table.
        * `additional_columns_tokenizer`, `additional_columns_types_tokenizer`: Add custom metadata columns and types for the tokenizer table.
  * ##### Hugging Face
    * **Improved Model Installation**: Added support to install models in a new way that removes torch and transformer dependencies, improving performance.
    * **Model Detection**: Added support to detect models installed in both standard and legacy formats.
    * **Script Updates**: Enhanced user scripts with dual format model path support.
    * **Replace Parameter**: Added `replace` parameter to all methods to overwrite existing files.
    * **Cleanup Method**: New `cleanup_env()` method for removing sample files from environments.
    * **Custom Embeddings**: Added `embeddings_dim` parameter to support custom dimensions for embedding models.
    * **Output Table Info**: Added functionality to print output table information for better visibility of results.
    * **Accumulate Parameter**: Added `accumulate` parameter to include input columns in the output result.

* ##### Bug Fixes
  * ##### TextAnalytics
    * `ELE-8278`: Fixed invalid SQL generation with authorization parameter for embeddings().
    * `ELE-8384`: Fixed the incorrect error raised when a PDF file does not exist during vector store creation.
    * Fixed an issue where the `is_debug` parameter was not being added to generated SQL in TextAnalyticsAI functions.

### Version 20.00.00.02
* ##### New Features/Functionality
  * Features introduced in this release require Database version 20.00.27.XX
  * ##### Vector Store
      * Exposes a new attribute `store_type` in VectorStore class allowing user 
        to check the type of Vector Store. Supported vector store types are `metadata-based`, 
        `content-based`, `file-based` and `embedding-based`.
      * Exposes following new functions in VectorStore class.
        * `get_indexes_embeddings` - Returns DataFrame containing embedding and indexing 
                                     information of the Vector Store.
        * `get_model_info` - Returns model specific DataFrame or dict containing DataFrames
                             depending on the `search_algorithm`. 
            * If `search_algorithm` is `kmeans`, dict is returned containing the two tables mentioned below:
              * `kmeans_model` - Contains the `kmeans_model` information.
              * `centroids_model` - Contains the `centroids` information.
            * If `search_algorithm` is `hnsw`, DataFrame is returned containing:
              * `hnsw_model` - Contains the `hnsw_model` information.
        * `similarity_search_by_vector` - Performs similarity search on an `embedding-based` Vector Store
                                          when the question is already embedded and passed in the `question` argument,
                                          or when the embedded question is stored in a table that is passed in the `data`
                                          and `column` arguments.
      * Exposes the following new parameters for create and update:
        * `extract_infographics`
        * `extract_method`
        * `hf_access_token`

  * ##### TextAnalyticsAI Functions
    * Support added in TeradataAI for a new API type 'onnx' to handle external ONNX models within Teradata Vantage.
    * Support added in TextAnalyticsAI to generate Text Embeddings with ONNX models.
    * Support added in TeradataAI and TextAnalyticsAI to work with NVIDIA NIM,
      to perform a wide array of text analytic tasks including:
        * KeyPhrase Extraction
        * PII (Personally Identifiable Information) Entity Recognition
        * Masking PII Information
        * Language Detection
        * Language Translation
        * Text Summarization
        * Entity Recognition
        * Sentiment Analysis 
        * Text Classification 
        * Text Embeddings 
        * Asking LLM

* ##### Bug Fixes
  * Fixed an issue where the response code was not shown in `status` for errors raised by async operations
    like `create`, `update`, `destroy`.

### Version 20.00.00.01
* ##### New Features/Functionality
  * Features introduced in this release require Database version 20.00.26.XX
  * ##### Vector Store
    * Teradata Enterprise Vector Store is designed to store, index, and search high-dimensional vector embeddings efficiently.
    * `teradatagenai` provides the following Python APIs to access and manage vector stores and to build NL applications using Vantage as the foundational compute/storage engine.
      * The following operations can be done:
        * `VSManager`: Contains methods to manage vector stores.
          * `health`: Perform health check for the vector store service.
          * `list`: List all the vector stores.
          * `list_sessions`: List all the active sessions of the vector store service.
          * `disconnect`: Disconnect from the database session.
          * `list_patterns`: List all available patterns for creating metadata-based vector store.
        * `VectorStore`: Contains methods to do operations on Vector Store.
          * `create`: Creates a Vector Store.
          * `update`: Updates a Vector Store.
          * `destroy`: Destroys a Vector Store.
          * `similarity_search`: Performs similarity search in interactive/batch mode in the Vector Store for the input question.
          * `prepare_response`: Prepare a natural language response to the user using the input question and similarity_results provided by VectorStore.similarity_search() method using interactive/batch mode.
          * `ask`: Performs similarity search in the vector store for the input question followed by preparing a natural language response to the user using interactive/batch mode.
          * `get_details`: Get details of the vector store.
          * `get_objects`: Get the list of objects in the metadata-based vector store.
          * `get_batch_results`: Retrieves the results when `similarity_search`, `prepare_response` and `ask` are triggered in batch mode.
          * `status`: Checks the status of the following operations: `create`, `destroy` and `update`.
        * `VSPattern`: Create/Manage patterns, which provide a way to select tables/views and columns using simple regular expressions and can be used while creating a metadata-based Vector Store.
          * Following operations are supported:
            * `create`: Creates a pattern by specifying the `pattern_string`.
            * `get`: Gets the list of objects that matches the `pattern_string`.
            * `delete`: Deletes the pattern.
  * ##### InDb TextAnalytics Functions
    * This version supports the integration of TextAnalyticsAI InDB functions, enabling seamless access to LLM services like AWS Bedrock, Azure OpenAI, Google Gemini for a wide array of text analytics tasks, including:
      * KeyPhrase Extraction
      * PII (Personally Identifiable Information) Entity Recognition
      * Masking PII Information
      * Language Detection
      * Language Translation
      * Text Summarization
      * Entity Recognition
      * Sentiment Analysis 
      * Text Classification 
      * Text Embeddings 
      * Asking LLM
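
The pattern-based object selection described for `VSPattern` in this release can be pictured with Python's standard `re` module. This is only a rough analogy: the table names are made up, and the sketch assumes Python regular-expression syntax, which may differ from the syntax that `pattern_string` actually accepts.

```python
import re

# Hypothetical object names; a real metadata-based store reads these from the database.
tables = ["sales_2023", "sales_2024", "employee_data", "sales_archive_tmp"]

# Illustrative pattern in Python `re` syntax (an assumption, see the note above).
pattern_string = r"sales_\d{4}"

# Keep only the names that match the whole pattern.
matches = [name for name in tables if re.fullmatch(pattern_string, name)]
print(matches)  # ['sales_2023', 'sales_2024']
```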

### Version 20.00.00.00
* `teradatagenai 20.00.00.00` marks the first release of the package.
* Features introduced in this release require Database version 20.00.25.XX
* This version supports the integration of Hugging Face models into Teradata Vantage through the BYO LLM offering, enabling seamless utilization of these models for a wide array of text analytics tasks.
    * KeyPhrase Extraction
    * PII (Personally Identifiable Information) Entity Recognition
    * Masking PII Information
    * Language Detection
    * Language Translation
    * Text Summarization
    * Entity Recognition
    * Sentiment Analysis 
    * Text Classification 
    * Text Embeddings 
    * Sentence Similarity
* The package also features a versatile `task` function capable of performing any task supported by the underlying language model (LLM). This function is highly adaptable and can be customized to meet specific requirements. Refer to the [example](#get-embeddings-and-similarity-score-for-employee-data-and-articles) for more details on its usage.

## Installation and Requirements
### Package Requirements:
* Python 3.9 or later (for all standard features)
* **NVIDIA NV-Ingest integration**:
  * Python 3.11 or later
  * To install the package with NV-Ingest support, use the following command:
    ```bash
    pip install teradatagenai[nv-ingest-client]
    ```
*Note: 32-bit Python is not supported.*
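
A quick way to confirm that an interpreter meets the requirements above is to check its version and pointer size before installing. The sketch below uses only the standard library; the function name and the returned dictionary layout are illustrative choices, not part of `teradatagenai`.

```python
import struct
import sys

MIN_STANDARD = (3, 9)    # stated minimum for standard features
MIN_NV_INGEST = (3, 11)  # stated minimum for the NV-Ingest extra


def check_environment(version=None, pointer_bits=None):
    """Return a dict describing which requirement tiers an interpreter meets.

    Defaults to the running interpreter; arguments allow checking other values.
    """
    version = tuple(sys.version_info[:2]) if version is None else version
    pointer_bits = struct.calcsize("P") * 8 if pointer_bits is None else pointer_bits
    is_64_bit = pointer_bits == 64
    return {
        "is_64_bit": is_64_bit,
        "standard_features": is_64_bit and version >= MIN_STANDARD,
        "nv_ingest": is_64_bit and version >= MIN_NV_INGEST,
    }


print(check_environment())
```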

### Minimum System Requirements:
* Windows 7 (64-bit) or later
* macOS 10.9 (64-bit) or later
* Red Hat 7 or later versions
* Ubuntu 16.04 or later versions
* CentOS 7 or later versions
* SLES 12 or later versions
* VantageCloud Lake on AWS with Open Analytics Framework in order to use Teradata’s BYO LLM offering.

### Minimum Database Requirements
* Teradata Vantage with database release 20.00.26.XX or later
* The Vector Store (Data Insights) service must be enabled.

### Installation

Use pip to install the Teradata Package for Generative AI:

Platform       | Command
-------------- | ---
macOS/Linux    | `pip install teradatagenai`
Windows        | `python -m pip install teradatagenai`

When upgrading to a new version of `teradatagenai`, you may need to use the `--no-cache-dir` option of `pip install` to force the download of the new version.

Platform       | Command
-------------- | ---
macOS/Linux    | `pip install --no-cache-dir -U teradatagenai`
Windows        | `python -m pip install --no-cache-dir -U teradatagenai`
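
To confirm which version is installed after an install or upgrade, the package metadata can be queried from Python. This is a minimal sketch based on the standard `importlib.metadata` module; the helper name `installed_version` is illustrative and is not part of `teradatagenai`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints None for a package that is not installed in the current environment.
    print("teradatagenai:", installed_version("teradatagenai"))
    print("teradataml:", installed_version("teradataml"))
```

If the reported version is older than expected after an upgrade, rerunning the upgrade command with `--no-cache-dir` is a reasonable next step.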

## Using the Teradata Package for Generative AI:

Your Python script must import the `teradatagenai` package in order to use the Teradata Package for Generative AI. Let us walk through some examples to gain a better understanding. All of the examples share a common setup that imports the required packages and loads the data.

### Common Setup

```python
# Import the modules and create a teradataml DataFrame.
import os
import teradatagenai
from teradatagenai import TeradataAI, TextAnalyticsAI, load_data
from teradataml import DataFrame

load_data('employee', 'employee_data')
data = DataFrame('employee_data')
df_reviews = data.select(["employee_id", "employee_name", "reviews"])
df_articles = data.select(["employee_id", "employee_name", "articles"])

# Define the base directory and script path.
base_dir = os.path.dirname(teradatagenai.__file__)
sentence_similarity_script = os.path.join(base_dir, 'example-data', 'sentence_similarity.py')
```

### Analyze Sentiment of Food Reviews

In this example, we will be using the `analyze_sentiment` API to analyze the sentiment of food reviews in the `reviews` column of a `teradataml` DataFrame.

#### Using the Hugging Face model `distilbert-base-uncased-emotion`. 

```python
# Define the model name and arguments for the Hugging Face model.
model_name = 'bhadresh-savani/distilbert-base-uncased-emotion'
model_args = {
    'transformer_class': 'AutoModelForSequenceClassification',
    'task': 'text-classification'
}

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="hugging_face", model_name=model_name, model_args=model_args)
```

```python
# Create a TextAnalyticsAI object.
obj = TextAnalyticsAI(llm=llm)
obj.analyze_sentiment(column='reviews', data=df_reviews, delimiter="#")
```

#### Using AWS Bedrock model `anthropic.claude-v2`.

```python
# Define AWS Bedrock environment variables.
os.environ["AWS_DEFAULT_REGION"] = "<Enter AWS Region>"
os.environ["AWS_ACCESS_KEY_ID"] = "<Enter AWS Access Key ID>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<Enter AWS Secret Key>"
os.environ["AWS_SESSION_TOKEN"] = "<Enter AWS Session Token>"

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="aws", model_name="anthropic.claude-v2")
```

```python
# Create a TextAnalyticsAI object.
obj = TextAnalyticsAI(llm=llm)
obj.analyze_sentiment(column='reviews', data=df_reviews, accumulate="reviews")
```

#### Using Azure OpenAI model `gpt-3.5-turbo`.

```python
# Define Azure OpenAI environment variables.
os.environ["AZURE_OPENAI_API_KEY"] = "<azure OpenAI API key>"
os.environ["AZURE_OPENAI_ENDPOINT"] = "https://****.openai.azure.com/"
os.environ["AZURE_OPENAI_API_VERSION"] = "<azure OpenAI API version>"
os.environ["AZURE_OPENAI_DEPLOYMENT_ID"] = "<azure OpenAI engine name>"

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="azure", model_name="gpt-3.5-turbo")
```

```python
# Create a TextAnalyticsAI object.
obj = TextAnalyticsAI(llm=llm)
obj.analyze_sentiment(column='reviews', data=df_reviews, accumulate="reviews")
```

#### Using Google model `gemini-1.5-pro-001`.

```python
# Define Google Cloud environment variables
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "<gcp access token>"
os.environ["GOOGLE_CLOUD_PROJECT"] = "<gcp project name>"
os.environ["GOOGLE_CLOUD_REGION"] = "us-central1"

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="gcp", model_name="gemini-1.5-pro-001")
```

```python
# Create a TextAnalyticsAI object.
obj = TextAnalyticsAI(llm=llm)
obj.analyze_sentiment(column='reviews', data=df_reviews, accumulate="reviews")
```

#### Using NVIDIA NIM model `meta/llama-3.1-8b-instruct`.

```python
# Define NVIDIA NIM environment variables.
os.environ["NIM_API_KEY"] = "<NVIDIA NIM API key>"

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="nim", api_base="<nim base url>", model_name="meta/llama-3.1-8b-instruct")
```

```python
# Create a TextAnalyticsAI object.
obj = TextAnalyticsAI(llm=llm)
obj.analyze_sentiment(column='reviews', data=df_reviews, accumulate="reviews")
```
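
Each cloud provider example above reads its credentials from environment variables, so a missing or empty variable is a common cause of failures. The following is a minimal sketch of a pre-flight check. The `REQUIRED_ENV_VARS` mapping and the `missing_env_vars` helper are illustrative, not part of `teradatagenai`. The variable names are copied from the examples in this README; the package itself may treat some of them (for example `AWS_SESSION_TOKEN`) as optional.

```python
import os

# Environment variables used by the examples above, grouped by `api_type`.
# Assumption: this mapping mirrors this README, not the package's own validation.
REQUIRED_ENV_VARS = {
    "aws": [
        "AWS_DEFAULT_REGION",
        "AWS_ACCESS_KEY_ID",
        "AWS_SECRET_ACCESS_KEY",
        "AWS_SESSION_TOKEN",
    ],
    "azure": [
        "AZURE_OPENAI_API_KEY",
        "AZURE_OPENAI_ENDPOINT",
        "AZURE_OPENAI_API_VERSION",
        "AZURE_OPENAI_DEPLOYMENT_ID",
    ],
    "gcp": [
        "GOOGLE_APPLICATION_CREDENTIALS",
        "GOOGLE_CLOUD_PROJECT",
        "GOOGLE_CLOUD_REGION",
    ],
    "nim": ["NIM_API_KEY"],
}


def missing_env_vars(api_type, environ=None):
    """Return the variables listed for `api_type` that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_ENV_VARS[api_type] if not environ.get(name)]


if __name__ == "__main__":
    for api_type in REQUIRED_ENV_VARS:
        missing = missing_env_vars(api_type)
        status = "ready" if not missing else "missing " + ", ".join(missing)
        print(f"{api_type}: {status}")
```

Calling `missing_env_vars("aws")` before constructing the `TeradataAI` object reports which variables still need to be set.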

### Get Embeddings and Similarity Score for Employee Data and Articles

In this example, we will use the `task` API to perform two tasks: generating embeddings and calculating similarity scores using the Hugging Face model `all-MiniLM-L6-v2`.

#### Generate Embeddings for Employee Reviews

We will generate embeddings for the text in the `articles` column of a `teradataml` DataFrame using the Hugging Face model `all-MiniLM-L6-v2`.

```python
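# Import the types needed to describe the script output.
from collections import OrderedDict
from teradatasqlalchemy.types import VARCHAR

# Note: `llm` must be a TeradataAI object created with api_type="hugging_face"
# for a sentence-embedding model such as `all-MiniLM-L6-v2`.

# Define the script path for embeddings.
embeddings_script = os.path.join(base_dir, 'example-data', 'embeddings.py')

# Construct the returns argument based on the user script:
# one text column followed by 384 embedding columns (v1 ... v384).
returns = OrderedDict([('text', VARCHAR(512))])
for i in range(384):
    returns["v{}".format(i + 1)] = VARCHAR(1000)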

# Use the task API to generate embeddings.
llm.task(
    column="articles",
    data=df_articles,
    script=embeddings_script,
    returns=returns,
    libs='sentence_transformers',
    delimiter='#'
)
```
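
The `returns` mapping built above describes one output column per embedding component; 384 matches the output dimension of `all-MiniLM-L6-v2`. The sketch below shows the resulting layout in isolation. It is illustrative only: `build_returns` is a hypothetical helper, and plain strings stand in for the `VARCHAR` type objects so that the snippet runs without a Teradata installation.

```python
from collections import OrderedDict


def build_returns(dimensions, text_type="VARCHAR(512)", value_type="VARCHAR(1000)"):
    """Build an ordered column-name -> type mapping: 'text', then v1..v<dimensions>."""
    returns = OrderedDict([("text", text_type)])
    for i in range(1, dimensions + 1):
        returns[f"v{i}"] = value_type
    return returns


if __name__ == "__main__":
    returns = build_returns(384)
    print(len(returns))        # 385 columns: 'text' plus 384 vector components
    print(list(returns)[:3])   # ['text', 'v1', 'v2']
    print(list(returns)[-1])   # v384
```

The column order matters because the mapping has to line up with the order in which the user script emits its delimited output fields, which is why an ordered mapping is used.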

#### Calculate Similarity Score

We will calculate the similarity score between employee data and articles using the Hugging Face model `all-MiniLM-L6-v2`.

```python
# Define the model name and arguments for the Hugging Face model.
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model_args = {
    'transformer_class': 'AutoModelForSequenceClassification',
    'task': 'text-similarity'
}

# Create a TeradataAI object with the specified model.
llm = TeradataAI(api_type="hugging_face", model_name=model_name, model_args=model_args)

# Use the task API to get the similarity score.
llm.task(
    column=["employee_data", "articles"],
    data=data,
    script=sentence_similarity_script,
    libs='sentence_transformers',
    returns={
        "column1": "VARCHAR(10000)",
        "column2": "VARCHAR(10000)",
        "similarity_score": "VARCHAR(10000)"
    },
    delimiter="#"
)
```
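
For intuition about what a similarity score represents, the sketch below computes cosine similarity, a common measure for comparing sentence embeddings. This is an assumption made for illustration: the bundled `sentence_similarity.py` script is not shown in this document and may compute its score differently. The `cosine_similarity` helper is not part of `teradatagenai`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (same direction)
    print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```

Scores near 1 indicate that two texts have embeddings pointing in nearly the same direction, while scores near 0 indicate little relation.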

## License
Use of the Teradata package for Generative-AI is governed by the *License Agreement for the Teradata package for Generative-AI*. 
After installation, the `LICENSE` and `LICENSE-3RD-PARTY` files are located in the `teradatagenai` directory of the Python installation directory.
