module documentation

ZMS LLM API utility module

This module provides an abstract interface for Large Language Model providers. All providers follow the OpenAI /v1/chat/completions API schema for consistency.

Supported providers:

  • OpenAI (gpt-4, gpt-3.5-turbo, etc.)
  • Ollama (local deployment)
  • RAG with Qdrant vector database

Configuration properties:

  • llm.provider: 'openai', 'ollama', or 'rag' (default: 'openai')
  • llm.api.key: API key for OpenAI (if provider is 'openai')
  • llm.api.model: Model name (default: 'gpt-4o-mini' for OpenAI, 'llama2' for Ollama)
  • llm.api.endpoint: Custom endpoint URL
  • llm.ollama.host: Ollama host (default: 'http://localhost:11434')
  • llm.qdrant.host: Qdrant host (default: 'http://localhost:6333')
  • llm.qdrant.collection: Qdrant collection name (default: 'zms_docs')
  • llm.embedding.model: SentenceTransformer model (default: 'all-MiniLM-L6-v2')
  • llm.rag.top_k: Number of documents to retrieve (default: '3')
  • llm.rag.score_threshold: Minimum similarity score (0.0-1.0, default: '0.0')
  • llm.temperature: LLM sampling temperature, 0.0-2.0 (default: '0.7'; 0.1 is recommended for RAG)
  • llm.top_p: Nucleus sampling 0.0-1.0 (default: '0.9')
  • llm.max_tokens: Maximum tokens to generate (optional)
  • llm.num_ctx: Context window size (default: '4096')
  • llm.store: Enable storage for 'responses' API (default: False)
  • llm.timeout: Timeout for LLM responses in seconds (default: '120')
  • llm.rag.timeout: Timeout for RAG retrieval in seconds (default: '10')
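
As a minimal sketch of how these properties and their defaults might be resolved, the snippet below merges configured values over the documented defaults. The names `DEFAULTS` and `resolve_settings` are hypothetical and used only for illustration; the module itself reads its configuration from the ZMS context, which is not shown here.

```python
# Hypothetical helper (not part of the module): resolve LLM settings from a
# plain dict of configuration properties, using the documented defaults.
DEFAULTS = {
    "llm.provider": "openai",
    "llm.ollama.host": "http://localhost:11434",
    "llm.qdrant.host": "http://localhost:6333",
    "llm.qdrant.collection": "zms_docs",
    "llm.embedding.model": "all-MiniLM-L6-v2",
    "llm.rag.top_k": "3",
    "llm.rag.score_threshold": "0.0",
    "llm.temperature": "0.7",
    "llm.top_p": "0.9",
    "llm.num_ctx": "4096",
    "llm.timeout": "120",
    "llm.rag.timeout": "10",
}


def resolve_settings(conf):
    """Merge configured values over the documented defaults."""
    settings = dict(DEFAULTS)
    # Ignore unset or empty values so that the defaults still apply.
    settings.update({k: v for k, v in conf.items() if v not in (None, "")})
    # The default model depends on the provider.
    if "llm.api.model" not in settings:
        provider = settings["llm.provider"]
        settings["llm.api.model"] = "gpt-4o-mini" if provider == "openai" else "llama2"
    return settings


settings = resolve_settings({"llm.provider": "ollama", "llm.temperature": "0.1"})
print(settings["llm.api.model"], float(settings["llm.temperature"]))  # prints "llama2 0.1"
```

Most defaults are documented as strings, so a caller would convert numeric values (for example with `float` or `int`) before use.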

Response format (OpenAI /v1/chat/completions compatible):

    {
        "id": "chatcmpl-123",
        "object": "chat.completion",
        "created": 1677652288,
        "model": "gpt-4o-mini",
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Response text"
            },
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": 10,
            "completion_tokens": 20,
            "total_tokens": 30
        }
    }

For backwards compatibility, a convenience property 'message' is also provided at the top level containing the first choice's message.
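
The convenience property can be pictured as follows. This is an illustrative sketch only: `add_message_shortcut` is a hypothetical name, and the module may build the top-level 'message' key differently.

```python
def add_message_shortcut(response):
    """Return a copy of the response with a top-level 'message' key that
    mirrors the first choice's message (when at least one choice exists)."""
    result = dict(response)
    choices = result.get("choices") or []
    if choices:
        result["message"] = choices[0].get("message")
    return result


response = {
    "id": "chatcmpl-123",
    "object": "chat.completion",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Response text"},
        "finish_reason": "stop",
    }],
}
compat = add_message_shortcut(response)
print(compat["message"]["content"])  # prints "Response text"
```

Older callers can keep reading `result["message"]["content"]`, while newer code can use `result["choices"][0]["message"]["content"]`.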

Requirements for RAG:

  • pip install sentence-transformers
  • pip install qdrant-client

License: GNU General Public License v2 or later, Organization: ZMS Publishing

Contents:

  • Class LLMProvider: Abstract base class for LLM providers
  • Class OllamaProvider: Ollama local LLM provider (normalized to OpenAI format)
  • Class OpenAIProvider: OpenAI API provider (v1/chat/completions compatible)
  • Class RAGProvider: RAG (Retrieval-Augmented Generation) provider using Qdrant and Ollama
  • Function chat: Send messages to the configured LLM provider and get a response.
  • Function get_ollama_models: Fetch the list of locally available models from the configured Ollama server.
  • Function get_provider_info: Get information about the currently configured LLM provider.
  • Variable security: Undocumented
  • Function _generate_request_id: Generate a unique request ID for tracking
  • Function _get_provider: Factory function to get the appropriate LLM provider based on configuration.
  • Function _normalize_response: Normalize provider-specific responses to OpenAI /v1/chat/completions format.
  • Constant _EMBEDDING_MODEL_CACHE: Undocumented

def chat(context, messages, **kwargs):

Send messages to the configured LLM provider and get a response.

This is the main entry point for LLM interactions in ZMS. All responses follow the OpenAI /v1/chat/completions format.

Parameters

  • context (object): ZMS context object
  • messages (list | str): List of message dicts [{"role": "user", "content": "..."}], or a string for backwards compatibility
  • temperature: Sampling temperature 0.0-2.0 (optional)
  • top_p: Nucleus sampling 0.0-1.0 (optional)
  • max_tokens: Maximum tokens to generate (optional)
  • store: Enable storage for responses API (optional)
  • metadata: Metadata for responses API (optional)
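
Because messages may be either a list of message dicts or a plain string, a caller-side view of the two forms might look like the sketch below. The helper name `normalize_messages` is hypothetical, and treating a bare string as a single 'user' message is an assumption rather than confirmed module behavior.

```python
def normalize_messages(messages):
    """Accept a list of message dicts or a bare string.

    Assumption: a bare string is wrapped as one message with role 'user'.
    """
    if isinstance(messages, str):
        return [{"role": "user", "content": messages}]
    return list(messages)


print(normalize_messages("What is ZMS?"))
# prints [{'role': 'user', 'content': 'What is ZMS?'}]
```
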
Returns

dict

Success format:

    {
        "id": "chatcmpl-123",
        "object": "chat.completion",
        "created": 1677652288,
        "model": "gpt-4o-mini",
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Response text"
            },
            "finish_reason": "stop"
        }],
        "usage": {
            "prompt_tokens": 10,
            "completion_tokens": 20,
            "total_tokens": 30
        },
        "message": {...}  # Backwards compatibility: first choice's message
    }

Error format:

    {
        "error": {
            "code": "ERROR_CODE",
            "message": "error description"
        }
    }

Response in OpenAI /v1/chat/completions format.
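
A caller might tell the two formats apart by checking for the 'error' key, as in this sketch. `extract_text` is a hypothetical helper, and raising RuntimeError is just one possible way to surface the error.

```python
def extract_text(response):
    """Return the assistant text, or raise RuntimeError for an error response."""
    if "error" in response:
        err = response["error"]
        raise RuntimeError(f"{err.get('code')}: {err.get('message')}")
    return response["choices"][0]["message"]["content"]


ok = {
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Response text"},
        "finish_reason": "stop",
    }],
}
failed = {"error": {"code": "ERROR_CODE", "message": "error description"}}

print(extract_text(ok))  # prints "Response text"
try:
    extract_text(failed)
except RuntimeError as exc:
    print(exc)  # prints "ERROR_CODE: error description"
```
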
Notes

Configuration: set llm.provider to one of 'openai', 'ollama', or 'rag'.

For OpenAI:

  • llm.api.key: Your OpenAI API key
  • llm.api.model: Model name (default: 'gpt-4o-mini')
  • llm.api.endpoint: Custom endpoint (default: https://api.openai.com/v1/chat/completions)
  • llm.store: Enable responses API storage (default: False)

For Ollama:

  • llm.ollama.host: Ollama server URL (default: 'http://localhost:11434')
  • llm.api.model: Model name (default: 'llama2')
  • llm.temperature: LLM temperature 0.0-2.0 (default: '0.7')
  • llm.top_p: Nucleus sampling 0.0-1.0 (default: '0.9')
  • llm.num_ctx: Context window size (default: '4096')
  • llm.timeout: Timeout for LLM response in seconds (default: '120')

For RAG:

  • llm.qdrant.host: Qdrant server URL (default: 'http://localhost:6333')
  • llm.qdrant.collection: Collection name (default: 'zms_docs')
  • llm.ollama.host: Ollama server URL (default: 'http://localhost:11434')
  • llm.api.model: Model name (default: 'llama2')
  • llm.rag.top_k: Number of documents to retrieve (default: '3')
  • llm.rag.score_threshold: Minimum similarity score (0.0-1.0, default: '0.0')
  • llm.rag.timeout: Timeout for RAG retrieval in seconds (default: '10')
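
To illustrate how llm.rag.top_k and llm.rag.score_threshold could interact, the sketch below filters a list of scored hits and builds a context prompt. The hit structure (plain dicts with 'score' and 'text'), the function names, the inclusive threshold comparison, and the prompt wording are all assumptions made for this example; the module's actual retrieval against Qdrant is not shown.

```python
def select_context(hits, top_k=3, score_threshold=0.0):
    """Keep hits scoring at least score_threshold, best first, at most top_k."""
    kept = [h for h in hits if h["score"] >= score_threshold]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return kept[:top_k]


def build_prompt(question, hits):
    """Join the retrieved texts into a context block placed before the question."""
    context = "\n\n".join(h["text"] for h in hits)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


hits = [
    {"score": 0.9, "text": "ZMS is a content management system."},
    {"score": 0.2, "text": "Unrelated note."},
    {"score": 0.7, "text": "ZMS runs on Zope."},
]
selected = select_context(hits, top_k=2, score_threshold=0.4)
print([h["score"] for h in selected])  # prints [0.9, 0.7]
```

With the default threshold of 0.0 and non-negative scores, nothing is filtered out and only top_k limits the result.
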

def get_ollama_models(context):

Fetch the list of locally available models from the configured Ollama server.

Returns a dict with 'models' (list of name strings) on success, or 'error' on failure. This is used by the Config tab to populate the model dropdown for Ollama/RAG providers.
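
A rough equivalent outside ZMS, assuming the server exposes Ollama's /api/tags endpoint, might look like this. The function names, the 10-second timeout, and the use of a plain string as the 'error' value are assumptions for illustration, not the module's actual code.

```python
import json
import urllib.request


def parse_model_names(payload):
    """Extract model name strings from an Ollama /api/tags payload."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]


def fetch_ollama_models(host="http://localhost:11434", timeout=10):
    """Return {'models': [...]} on success or {'error': '...'} on failure."""
    try:
        with urllib.request.urlopen(f"{host}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
        return {"models": parse_model_names(payload)}
    except (OSError, ValueError) as exc:
        # URLError and socket timeouts are OSError; invalid JSON is ValueError.
        return {"error": str(exc)}


sample = {"models": [{"name": "llama2:latest"}, {"name": "mistral:7b"}]}
print(parse_model_names(sample))  # prints ['llama2:latest', 'mistral:7b']
```

Returning an 'error' key instead of raising lets a configuration UI show the failure inline when the Ollama server is unreachable.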

def get_provider_info(context):

Get information about the currently configured LLM provider.

Args: context: ZMS context object

Returns: dict: Provider information including type, model, and endpoint

security

Undocumented

def _generate_request_id(provider, model, message):

Generate a unique request ID for tracking
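
The ID scheme itself is not documented. One plausible approach, shown purely as an assumption, hashes the inputs together with a random UUID and adds the 'chatcmpl-' prefix seen in the response examples.

```python
import hashlib
import uuid


def generate_request_id(provider, model, message):
    """Build an ID from the inputs plus a random UUID (illustrative scheme only)."""
    seed = f"{provider}:{model}:{message}:{uuid.uuid4().hex}"
    digest = hashlib.sha256(seed.encode("utf-8")).hexdigest()[:24]
    return f"chatcmpl-{digest}"


print(generate_request_id("ollama", "llama2", "Hello"))
```

The random component makes repeated identical requests receive different IDs.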

def _get_provider(context):

Factory function to get the appropriate LLM provider based on configuration.

Args: context: ZMS context object

Returns: LLMProvider: An instance of the configured provider
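
The factory pattern described here can be sketched with stand-in classes. Everything below (`Provider`, the stub classes, `REGISTRY`, `get_provider`, and the ValueError raised for unknown names) is hypothetical and does not reproduce the module's real providers.

```python
from abc import ABC, abstractmethod


class Provider(ABC):
    """Stand-in for the abstract base class; not the module's real code."""

    name = ""

    @abstractmethod
    def chat(self, messages, **kwargs):
        """Return a response dict for the given messages."""


class StubOpenAI(Provider):
    name = "openai"

    def chat(self, messages, **kwargs):
        return {"model": "stub-openai", "choices": []}


class StubOllama(Provider):
    name = "ollama"

    def chat(self, messages, **kwargs):
        return {"model": "stub-ollama", "choices": []}


# Map each provider name to its class.
REGISTRY = {cls.name: cls for cls in (StubOpenAI, StubOllama)}


def get_provider(provider_name="openai"):
    """Instantiate the provider registered under provider_name."""
    try:
        return REGISTRY[provider_name]()
    except KeyError:
        raise ValueError(f"Unknown llm.provider: {provider_name!r}") from None


print(type(get_provider("ollama")).__name__)  # prints "StubOllama"
```

A registry keeps the selection logic in one place, so adding a provider means adding one class and one registry entry.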

def _normalize_response(response_data, provider, model, original_message):

Normalize provider-specific responses to OpenAI /v1/chat/completions format.

This ensures all providers return a consistent schema compatible with the OpenAI API and the upcoming 'responses' schema.

Args:

  • response_data: Raw response from provider
  • provider: Provider name ('openai', 'ollama', 'rag')
  • model: Model name used
  • original_message: Original user message

Returns: dict: Normalized response in OpenAI format
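
As an illustration of what normalization involves for one provider, the sketch below maps an Ollama /api/chat style response onto the chat.completion schema. The function name, the placeholder ID, and the fallback values are assumptions; the field names 'message', 'prompt_eval_count', 'eval_count', and 'done_reason' follow Ollama's chat response as commonly documented.

```python
import time


def normalize_ollama(raw, model):
    """Map an Ollama /api/chat response onto the chat.completion schema."""
    prompt_tokens = raw.get("prompt_eval_count", 0)
    completion_tokens = raw.get("eval_count", 0)
    message = raw.get("message") or {"role": "assistant", "content": ""}
    return {
        "id": "chatcmpl-example",  # placeholder ID for this sketch
        "object": "chat.completion",
        "created": int(time.time()),
        "model": raw.get("model", model),
        "choices": [{
            "index": 0,
            "message": message,
            # Assumption: fall back to "stop" when no reason is reported.
            "finish_reason": raw.get("done_reason") or "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }


raw = {
    "model": "llama2",
    "message": {"role": "assistant", "content": "Response text"},
    "done": True,
    "prompt_eval_count": 10,
    "eval_count": 20,
}
normalized = normalize_ollama(raw, "llama2")
print(normalized["usage"]["total_tokens"])  # prints 30
```
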

_EMBEDDING_MODEL_CACHE: dict

Undocumented

Value
{}
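
The constant is undocumented, but its name suggests a per-process cache of loaded embedding models keyed by model name. Assuming that is its role, the pattern could look like the sketch below. `_CACHE`, `get_embedding_model`, and `fake_loader` are hypothetical; in a real setup the loader would typically be the SentenceTransformer class from sentence-transformers, which is replaced by a stub here so the example runs without that dependency.

```python
_CACHE = {}  # model name -> loaded model object


def get_embedding_model(name, loader):
    """Return the cached model for name, loading it on first use."""
    if name not in _CACHE:
        _CACHE[name] = loader(name)
    return _CACHE[name]


load_calls = []


def fake_loader(name):
    """Stub loader that records each call instead of loading a real model."""
    load_calls.append(name)
    return object()


first = get_embedding_model("all-MiniLM-L6-v2", fake_loader)
second = get_embedding_model("all-MiniLM-L6-v2", fake_loader)
print(first is second, len(load_calls))  # prints "True 1"
```

Caching matters because loading an embedding model is slow compared with encoding a single query.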