Metadata-Version: 2.4
Name: trustmodel
Version: 2.0.0
Summary: Official Python SDK for TrustModel AI evaluation platform
Project-URL: Homepage, https://trustmodel.ai
Author-email: TrustModel <info@predixtions.com>
License: Proprietary - TrustModel Python SDK License
License-File: LICENSE
Keywords: ai,api,evaluation,sdk,trustmodel
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.7
Requires-Dist: eval-type-backport>=0.2.0; python_version < '3.10'
Requires-Dist: importlib-metadata>=4.0.0; python_version < '3.8'
Requires-Dist: pydantic<2.0.0,>=1.10.0; python_version < '3.8'
Requires-Dist: pydantic<3.0.0,>=2.0.0; python_version >= '3.8'
Requires-Dist: python-dateutil>=2.8.0
Requires-Dist: requests<3.0.0,>=2.25.0
Requires-Dist: tqdm>=4.60.0
Requires-Dist: typing-extensions>=4.0.0
Provides-Extra: dev
Requires-Dist: black<23.0.0,>=22.0.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: black>=23.0.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: bump2version>=1.0.1; extra == 'dev'
Requires-Dist: importlib-metadata>=4.0.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: isort<5.12.0,>=5.11.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: isort>=5.12.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: mypy<1.5.0,>=1.0.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: mypy>=1.5.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: pytest-cov<5.0.0,>=4.0.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: pytest<8.0.0,>=7.0.0; (python_version < '3.8') and extra == 'dev'
Requires-Dist: pytest>=7.0.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: ruff<0.1.0,>=0.0.277; (python_version < '3.8') and extra == 'dev'
Requires-Dist: ruff>=0.1.0; (python_version >= '3.8') and extra == 'dev'
Requires-Dist: types-requests>=2.25.0; extra == 'dev'
Requires-Dist: types-tqdm>=4.60.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: sphinx-autodoc-typehints>=1.24.0; extra == 'docs'
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == 'docs'
Requires-Dist: sphinx>=7.0.0; extra == 'docs'
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://www.trustmodel.ai">
    <img src="https://www.trustmodel.ai/assets/trustmodel-wordmark-CfuXSOoK.svg" alt="TrustModel" width="400">
  </a>
</p>

<p align="center">
  <strong>Official Python SDK for the TrustModel AI evaluation platform</strong>
</p>

<p align="center">
  <a href="https://www.trustmodel.ai">Website</a> •
  <a href="https://docs.trustmodel.ai/sdk/python">Documentation</a> •
  <a href="https://app.trustmodel.ai">Dashboard</a>
</p>

<p align="center">
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.9+-blue.svg" alt="Python 3.9+"></a>
  <a href="https://pypi.org/project/trustmodel/"><img src="https://badge.fury.io/py/trustmodel.svg" alt="PyPI version"></a>
</p>

---

Evaluate AI models for safety, bias, and performance with a simple, intuitive interface.

## Features

- 🚀 **Simple Interface**: Easy-to-use client for all TrustModel operations
- 🔒 **Secure**: API key authentication with built-in validation
- 🎯 **Type Safe**: Full type hints for excellent IDE support
- 🔄 **Reliable**: Automatic retries and comprehensive error handling
- 📊 **Comprehensive**: Support for all evaluation types and configurations
- 🌍 **Framework Agnostic**: Works with any Python framework or standalone scripts

## Installation

```bash
pip install trustmodel
```

## Prerequisites

Before using the SDK, you **must** complete the following setup in the [TrustModel Dashboard](https://app.trustmodel.ai/app/keys):

### 1. Create an API Key (Required)

You need a TrustModel API key to authenticate all SDK requests:

1. Go to [Keys & Webhooks](https://app.trustmodel.ai/app/keys) in the dashboard
2. Click "Create API Key"
3. Copy your new API key (starts with `tm-`)
4. Store it securely - you won't be able to see it again

### 2. Configure Webhooks (Required)

To receive notifications when evaluations complete or fail, you must configure webhooks:

1. Go to [Keys & Webhooks](https://app.trustmodel.ai/app/keys) in the dashboard
2. Click "Create Webhook"
3. Enter your webhook endpoint URL
4. Select the events you want to receive
5. Save your webhook configuration

> **Important:** Without configuring both an API key and webhooks in the webapp, you cannot run evaluations. The API will return an error if these are not set up.

## Quick Start

```python
import trustmodel

# Initialize the client
client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")

# List available models
models, api_sources = client.models.list()
print(f"Found {len(models)} models available")

# Create an evaluation
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias", "performance"]
)

print(f"Evaluation created with ID: {evaluation.id}")
print(f"Status: {evaluation.status}")

# You'll receive a webhook notification when the evaluation completes
# Then retrieve the results:
completed_evaluation = client.evaluations.get(evaluation.id)
print(f"Overall score: {completed_evaluation.overall_score}")

# Check your credit balance
credits = client.credits.get_balance()
print(f"Credits remaining: {credits.credits_remaining}")
```

## Authentication

Get your API key from the [TrustModel Dashboard](https://app.trustmodel.ai/app/keys) and use it to initialize the client:

```python
import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")
```

For production applications, store your API key securely using environment variables:

```python
import os
import trustmodel

api_key = os.getenv("TRUSTMODEL_API_KEY")
client = trustmodel.TrustModelClient(api_key=api_key)
```

## Evaluation Modes

TrustModel supports three ways to evaluate AI models:

| Mode | Use Case | API Key Required |
|------|----------|------------------|
| **Platform Key** | Quick evaluations using TrustModel's API keys | No (uses TrustModel's keys) |
| **BYOK** | Use your own vendor API key for any model | Yes (your vendor API key) |
| **Custom Endpoint** | Evaluate private/self-hosted models | Yes (your endpoint's API key) |
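
For example, you can choose between Platform Key and BYOK programmatically by checking a model's availability flags. A minimal sketch, assuming your OpenAI key is stored in the `OPENAI_API_KEY` environment variable:

```python
import os

# Pick the evaluation mode based on the model's availability flags
model = client.models.get("openai", "gpt-4")

if model.available_via_trust_model_key:
    # Platform Key: no vendor API key needed
    evaluation = client.evaluations.create(
        model_identifier=model.model_identifier,
        vendor_identifier=model.vendor_identifier,
        categories=["safety", "bias"],
    )
else:
    # BYOK: supply your own vendor API key
    evaluation = client.evaluations.create(
        model_identifier=model.model_identifier,
        vendor_identifier=model.vendor_identifier,
        api_key=os.environ["OPENAI_API_KEY"],  # assumption: key stored in this env var
        categories=["safety", "bias"],
    )
```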

### Getting Available Vendors

Use `client.config.get().vendors` to discover available vendors:

```python
config = client.config.get()

# Public vendors - for Platform Key and BYOK evaluations
public_vendors = config.vendors["public"]
for vendor in public_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")

# Custom vendors - for Custom Endpoint evaluations only
custom_vendors = config.vendors["custom"]
for vendor in custom_vendors:
    print(f"{vendor['identifier']}: {vendor['name']}")
```

| Vendor Type | Use With | Description |
|-------------|----------|-------------|
| `public` | Platform Key, BYOK | Vendors like OpenAI, Anthropic, Google AI for standard evaluations |
| `custom` | Custom Endpoint | Validators for self-hosted/private endpoints (OpenAI-compatible, Hugging Face, Azure AI, etc.) |

### Getting Available Models

Use `client.models.list()` to discover available models:

```python
# Get all available models and API source info
models, api_sources = client.models.list()

# List all models with their details
for model in models:
    print(f"Model: {model.name}")
    print(f"  Identifier: {model.model_identifier}")
    print(f"  Vendor: {model.vendor_identifier}")
    print(f"  Platform Key Available: {model.available_via_trust_model_key}")
    print(f"  BYOK Available: {model.available_via_byok}")

# Filter models by vendor
openai_models = [m for m in models if m.vendor_identifier == "openai"]

# Filter models available via platform key (no vendor API key needed)
platform_key_models = [m for m in models if m.available_via_trust_model_key]

# Use a model in evaluation
model = models[0]
evaluation = client.evaluations.create(
    model_identifier=model.model_identifier,
    vendor_identifier=model.vendor_identifier,
    categories=["safety", "bias"]
)
```

| Model Field | Type | Description |
|-------------|------|-------------|
| `name` | str | Human-readable model name |
| `model_identifier` | str | Identifier to use in API calls |
| `vendor_identifier` | str | Vendor identifier |
| `available_via_trust_model_key` | bool | Can evaluate without vendor API key |
| `available_via_byok` | bool | Whether BYOK has previously been used with this vendor |

### Platform Key (Default)

Use TrustModel's platform keys for quick evaluations. No vendor API key needed:

```python
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias"]
)
```

**Note:** Platform key availability varies by model. Check `model.available_via_trust_model_key` to see if a model supports this mode.

### BYOK (Bring Your Own Key)

Use your own vendor API key to evaluate any model. All vendors support BYOK:

```python
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    api_key="sk-your-openai-key",  # Your OpenAI API key
    categories=["safety", "bias"]
)
```

**How it works:**
1. You provide your vendor API key (e.g., OpenAI, Anthropic, Google)
2. TrustModel validates the key before creating the evaluation
3. If validation fails, a `ConnectionValidationError` is raised with details
4. Your key is securely stored and used for the evaluation

**Getting vendor API keys:**
- **OpenAI**: [platform.openai.com/api-keys](https://platform.openai.com/api-keys)
- **Anthropic**: [console.anthropic.com/settings/keys](https://console.anthropic.com/settings/keys)
- **Google AI**: [aistudio.google.com/apikey](https://aistudio.google.com/apikey)

**Example with error handling:**

```python
from trustmodel import ConnectionValidationError, InsufficientCreditsError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key",
        categories=["safety", "bias"]
    )
    print(f"Evaluation created: {evaluation.id}")
except ConnectionValidationError as e:
    # API key validation failed
    print(f"Invalid API key: {e.message}")
    if e.validation_details:
        print(f"Details: {e.validation_details}")
except InsufficientCreditsError as e:
    print(f"Need more credits: {e.credits_required} required")
```

### Custom Endpoint

Evaluate your own OpenAI-compatible API endpoint (Ollama, vLLM, LiteLLM, Azure AI, etc.):

```python
# Create evaluation for a custom endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api.yourcompany.com/v1",
    api_key="your-api-key",
    model_identifier="your-model-id",
    vendor_identifier="openai",  # Determines which validator to use
    model_name="My Custom Model",  # Optional display name
    categories=["safety", "bias"]
)
```

**Available vendor identifiers for custom endpoints:**

Get the list programmatically with `client.config.get().vendors["custom"]`, or use one of these:

| Identifier | Use For |
|------------|---------|
| `openai` | OpenAI-compatible APIs (Ollama, vLLM, LiteLLM, etc.) - **default** |
| `huggingface` | Hugging Face Inference Endpoints |
| `azure_ai` | Azure AI / Azure OpenAI Service |
| `xai` | Google Vertex AI |
| `bedrock` | AWS Bedrock |

**Examples:**

```python
# Ollama endpoint (uses default "openai" validator)
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="http://localhost:11434/v1",
    api_key="ollama",  # Ollama doesn't require a real key
    model_identifier="llama3:8b"
)

# Azure AI endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://your-resource.openai.azure.com",
    api_key="your-azure-key",
    model_identifier="gpt-4",
    vendor_identifier="azure_ai"
)

# Hugging Face endpoint
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api-inference.huggingface.co/models/your-model",
    api_key="hf_your_token",
    model_identifier="your-model",
    vendor_identifier="huggingface"
)
```

---

## Core Concepts

### Models

Discover available AI models:

```python
# List all available models
models, api_sources = client.models.list()

for model in models:
    print(f"Model: {model.name}")
    print(f"Vendor: {model.vendor_identifier}")
    print(f"Platform key available: {model.available_via_trust_model_key}")
    print(f"Previously used BYOK: {model.available_via_byok}")
    print("---")

# Get specific model
model = client.models.get("openai", "gpt-4")
print(f"Found model: {model.name}")
```

**Note:** `available_via_byok` indicates you have previously used BYOK for this vendor. All vendors support BYOK - you can use your own API key with any model.

### Evaluations

Create and manage AI model evaluations:

```python
# Platform key (default) - uses TrustModel's keys
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias"]
)

# BYOK - uses your own API key
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    api_key="sk-your-openai-key",
    categories=["safety", "bias"]
)

# Custom endpoint - your own API
evaluation = client.evaluations.create_custom_endpoint(
    api_endpoint="https://api.yourcompany.com/v1",
    api_key="your-api-key",
    model_identifier="custom-model-v1"
)
```

#### Re-run from Template

Re-run a previous evaluation configuration using its template ID:

```python
# Re-run using a saved template
evaluation = client.evaluations.create_from_template(
    template_id="550e8400-e29b-41d4-a716-446655440000"
)

# Optionally update the template name
evaluation = client.evaluations.create_from_template(
    template_id="550e8400-e29b-41d4-a716-446655440000",
    template_name="My Updated Config Name"
)
```

The template contains all saved configuration (model, vendor, categories, etc.) so no other parameters are required. Template IDs are returned in evaluation results via the `template_id` field.
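
Since completed evaluations carry their `template_id`, one pattern is to fetch a previous run and re-launch its exact configuration. A short sketch (it assumes the evaluation object exposes `template_id`, as described above):

```python
# Re-run the configuration of a previous evaluation via its saved template
previous = client.evaluations.get(evaluation_id)

if previous.template_id:
    rerun = client.evaluations.create_from_template(template_id=previous.template_id)
    print(f"Re-run started: {rerun.id}")
```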

#### Managing Evaluations

```python
# List all evaluations
evaluations = client.evaluations.list()

# Filter by status
completed = client.evaluations.list(status="completed")

# Get detailed results
evaluation = client.evaluations.get(evaluation_id)
if evaluation.status == "completed":
    print(f"Overall Score: {evaluation.overall_score}")
    for score in evaluation.scores:
        print(f"{score.category}: {score.score:.2f}")

# Quick status check
status = client.evaluations.get_status(evaluation_id)
print(f"Progress: {status['completion_percentage']}%")
```

### Batch Jobs & Model Comparison

Evaluate multiple models efficiently using batch jobs. Batch jobs are ideal for comparing models, running high-volume evaluations, and reducing API quota usage.

#### Creating Batch Evaluations

Create a batch to evaluate multiple models in parallel:

```python
# Create a batch to evaluate multiple models
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="GPT-4 vs Claude-3 Evaluation",
    description="Comparing GPT-4 and Claude-3 performance on safety and bias",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
    ],
    evaluation_config={"type": "comprehensive", "test_count": 50},
    categories=["safety", "bias"],  # Optional: specify evaluation categories
)

print(f"Batch created with ID: {batch.id}")
print(f"Status: {batch.status}")
print(f"Total models: {batch.total_models}")
```

**Batch Types:**

| Type | Purpose |
|------|---------|
| `model_evaluation` | Evaluate multiple models independently |
| `model_score_comparison` | Compare models side-by-side with ranking |

**Optional Parameters:**

- `categories`: List of evaluation categories (e.g., `["safety", "bias", "performance"]`)
- `api_key`: Your vendor API key for BYOK evaluations across all models
- `test_set_id`: Use a specific test set instead of the default
- `description`: Human-readable description of the batch
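
As an illustration, the sketch below combines these optional parameters in a single BYOK batch; the `test_set_id` value is a placeholder you would replace with your own:

```python
# Sketch: a BYOK batch that uses the optional parameters listed above
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="Nightly regression run",
    description="BYOK evaluation of two OpenAI models against a custom test set",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "openai", "model_identifier": "gpt-5.2"},
    ],
    evaluation_config={"type": "comprehensive"},
    categories=["safety", "bias", "performance"],
    api_key="sk-your-openai-key",    # your vendor key, applied to all models in the batch
    test_set_id="your-test-set-id",  # placeholder: a specific test set instead of the default
)
```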

#### Model Comparison Batch

Create a batch specifically for comparing multiple models:

```python
# Create a comparison batch
comparison = client.batch_jobs.create(
    batch_type="model_score_comparison",
    name="Q1 2024 Model Comparison",
    description="Comparing latest models across all categories",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-5.2"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-haiku-4-5"},
        {"vendor_identifier": "mistralai", "model_identifier": "ministral-8b-2512"},
    ],
    evaluation_config={"type": "comprehensive"},
)

print(f"Comparison batch created: {comparison.id}")
```

#### Monitoring Batch Progress

Poll for batch completion and get progress updates:

```python
import time

batch = client.batch_jobs.get(batch_id)

# Check current status
print(f"Status: {batch.status}")
print(f"Progress: {batch.completion_percentage}%")
print(f"Completed: {batch.completed_models}/{batch.total_models}")
print(f"Failed: {batch.failed_models}")

# Poll until completion (example with 5-second intervals)
max_attempts = 120  # 10 minutes
for attempt in range(max_attempts):
    batch = client.batch_jobs.get(batch_id)

    print(f"[{attempt}] {batch.completion_percentage}% | {batch.completed_models}/{batch.total_models} | {batch.status}")

    if batch.status in ["completed", "partially_completed", "failed"]:
        break

    time.sleep(5)
```

#### Batch Status Values

| Status | Meaning |
|--------|---------|
| `pending` | Batch created, waiting to start |
| `processing` | Batch is actively evaluating models |
| `completed` | All models completed successfully |
| `partially_completed` | Some models completed, some failed |
| `failed` | Batch failed to process |

#### Understanding Batch Results

Access detailed results after batch completion:

```python
batch = client.batch_jobs.get(batch_id)

print(f"Overall Status: {batch.status}")
print(f"Completion: {batch.completion_percentage}%")

# Per-model results
if batch.per_model_results:
    for model_id, result in batch.per_model_results.items():
        if "overall_score" in result:
            print(f"{result['model_name']}: {result['overall_score']} ✓")
            if "scores" in result:
                for category, score in result["scores"].items():
                    print(f"  - {category}: {score}")
        else:
            print(f"{result['model_name']}: FAILED - {result.get('error_message')}")

# Cross-model comparison (for model_score_comparison batches)
if batch.cross_model_summary:
    summary = batch.cross_model_summary

    print("\n=== Ranking ===")
    for i, model_result in enumerate(summary.get("all_scores_sorted", []), 1):
        print(f"{i}. {model_result['model_name']}: {model_result['score']:.2f}")

    if summary.get("top_model"):
        print(f"\n🏆 Top Performer: {summary['top_model']['model_name']}")

    if summary.get("average_score"):
        print(f"📈 Average Score: {summary['average_score']:.2f}")

    if summary.get("score_range"):
        sr = summary["score_range"]
        print(f"📉 Score Range: {sr['min']:.2f} - {sr['max']:.2f}")
```

**Result Structure:**

Each model in `per_model_results` contains:
- `model_name`: Model display name
- `vendor`: Vendor identifier
- `overall_score`: Score from 0-100 (if successful)
- `scores`: Detailed category scores
- `completed_at`: When the evaluation completed
- `error_message`: Error details (if failed)

**Cross-Model Summary** contains:
- `top_model`: Best performing model
- `bottom_model`: Lowest performing model
- `average_score`: Mean score across all models
- `score_range`: Min/max scores
- `all_scores_sorted`: All models ranked by score

#### Listing Batch Jobs

List and filter batch jobs:

```python
# List all batch jobs
batches = client.batch_jobs.list()

# Filter by type
model_evals = client.batch_jobs.list(batch_type="model_evaluation")

# Filter by status
completed = client.batch_jobs.list(status="completed")

# Pagination
page_2 = client.batch_jobs.list(limit=20, offset=20)

# Combine filters
active = client.batch_jobs.list(
    batch_type="model_score_comparison",
    status="processing"
)

# Access results
for batch in batches.results:
    print(f"{batch.name}: {batch.status} ({batch.completion_percentage}%)")
```

### Configuration

Discover available options for evaluations:

```python
# Get configuration options
config = client.config.get()

print("Available application types:")
for app_type in config.application_types:
    print(f"  {app_type['id']}: {app_type['name']}")

print("Available categories:")
for category in config.categories:
    print(f"  {category}")

print(f"Credits per category: {config.credits_per_category}")
```

### Credits Management

Monitor your API key usage:

```python
# Check credit balance
credits = client.credits.get_balance()

print(f"API Key: {credits.api_key_name}")
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")
print(f"Credit Limit: {credits.credit_limit}")
print(f"Status: {credits.status}")
```

## Error Handling

The SDK provides specific exceptions for different error types:

```python
import trustmodel
from trustmodel import (
    AuthenticationError,
    ConnectionValidationError,
    InsufficientCreditsError,
    RateLimitError,
    ValidationError,
    APIError
)

try:
    client = trustmodel.TrustModelClient(api_key="tm-your-key")
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        api_key="sk-your-openai-key"  # BYOK
    )
except AuthenticationError:
    print("Invalid TrustModel API key")
except ConnectionValidationError as e:
    # BYOK or custom endpoint validation failed
    print(f"Vendor API key validation failed: {e.message}")
    if e.validation_details:
        status_code = e.validation_details.get("status_code")
        if status_code == 401:
            print("Check your vendor API key is valid and not expired")
        elif status_code == 404:
            print("Model not found - check the model identifier")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, but only {e.credits_remaining} remaining")
except RateLimitError:
    print("Rate limit exceeded, please wait")
except ValidationError as e:
    print(f"Invalid input: {e}")
except APIError as e:
    print(f"API error: {e.message} (status: {e.status_code})")
```

### Exception Reference

| Exception | When Raised |
|-----------|-------------|
| `AuthenticationError` | Invalid TrustModel API key |
| `ConnectionValidationError` | BYOK or custom endpoint API key validation failed |
| `InsufficientCreditsError` | Not enough credits for the evaluation |
| `RateLimitError` | Too many requests, need to wait |
| `ValidationError` | Invalid input parameters |
| `ModelNotFoundError` | Requested model doesn't exist |
| `EvaluationNotFoundError` | Requested evaluation doesn't exist |
| `APIError` | General API error (base class) |

## Rate Limiting

All API keys are rate limited to **100 requests per hour**.

### Rate Limit Headers

Every API response includes rate limit information in headers:

```python
import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-key")

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
except trustmodel.RateLimitError as e:
    print(f"Rate limit exceeded: {e.message}")
    if hasattr(e, 'retry_after'):
        print(f"Retry after: {e.retry_after} seconds")
```

**Rate Limit Headers in Response:**

- `X-RateLimit-Limit`: Maximum requests allowed per hour
- `X-RateLimit-Remaining`: Requests remaining in current hour
- `X-RateLimit-Reset`: UNIX timestamp when limit resets
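
The SDK consumes these headers internally when it retries, but if you inspect HTTP responses yourself (for example while debugging with `requests`), they read like any other headers. A minimal sketch, assuming you already have a `requests.Response` in hand:

```python
import requests

def print_rate_limit_headers(response: requests.Response) -> None:
    """Print the documented rate limit headers from a response, if present."""
    for header in ("X-RateLimit-Limit", "X-RateLimit-Remaining", "X-RateLimit-Reset"):
        value = response.headers.get(header)
        if value is not None:
            print(f"{header}: {value}")
```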

**Rate Limit Response (HTTP 429):**

```json
{
  "detail": "Rate limit exceeded. Maximum 100 requests per hour.",
  "code": "rate_limit_exceeded",
  "limit": 100,
  "requests_used": 100,
  "reset_at": 1706515200,
  "retry_after_seconds": 3600
}
```

### Handling Rate Limits

The SDK automatically retries rate-limited requests with exponential backoff:

```python
from trustmodel import RateLimitError

try:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai",
        categories=["safety", "bias"]
    )
except RateLimitError as e:
    print(f"Rate limit exceeded after retries: {e.message}")
    print(f"Current usage: {e.status_code}")
```

**Automatic Retry Strategy:**
- Retries up to 3 times (configurable via `max_retries` parameter)
- Uses exponential backoff: 1s, 2s, 4s, 8s, etc.
- Automatically retries on: 429, 500, 502, 503, 504

### Rate Limiting Best Practices

**1. Monitor Your Usage**

```python
# Check credit balance which indicates usage
credits = client.credits.get_balance()
print(f"Credits Used: {credits.credits_used}")
print(f"Credits Remaining: {credits.credits_remaining}")
```

**2. Use Batch Jobs for High Volume**

Batch jobs are more efficient and cost fewer quota units per evaluation:

```python
batch = client.batch_jobs.create(
    batch_type="model_evaluation",
    name="Bulk Evaluation",
    models=[
        {"vendor_identifier": "openai", "model_identifier": "gpt-4"},
        {"vendor_identifier": "anthropic", "model_identifier": "claude-3-opus"},
        {"vendor_identifier": "google", "model_identifier": "gemini-1.5"},
    ],
    evaluation_config={"type": "comprehensive"}
)

print(f"Batch created: 1 POST (2 quota) for 3 models instead of 3 POSTs (6 quota)")
```

**3. Implement Exponential Backoff**

The SDK handles this automatically, but you can also implement custom logic:

```python
import time
from trustmodel import RateLimitError

max_retries = 5
for attempt in range(max_retries):
    try:
        result = client.evaluations.create(...)
        break
    except RateLimitError:
        if attempt < max_retries - 1:
            wait_time = 2 ** attempt
            print(f"Rate limited. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
        else:
            raise
```

**4. Plan Your Requests**

Calculate estimated quota before making requests:

```python
# Example calculation
models_to_evaluate = 10
create_quota = models_to_evaluate * 2  # evaluation creates cost 2 quota each -> 20
status_check_quota = 50 * 1  # 50 status polls at 1 quota each -> 50
total_quota_needed = create_quota + status_check_quota  # 70
print(f"Estimated quota needed: {total_quota_needed}")

current_plan_limit = 100
remaining = 75

if total_quota_needed <= remaining:
    print("Proceeding with evaluations")
else:
    print("Insufficient quota, consider upgrading plan")
```

**5. Configure Custom Timeouts and Retries**

```python
client = trustmodel.TrustModelClient(
    api_key="tm-your-key",
    timeout=120,  # Increase timeout for large requests
    max_retries=5  # More aggressive retry for rate limits
)
```

### Upgrading Your Plan

If you consistently hit rate limits:

1. Visit the [TrustModel Dashboard](https://app.trustmodel.ai/app/keys)
2. Go to "Billing" or "Plan Settings"
3. Select a higher tier (Starter, Pro, or Enterprise)
4. Limits update immediately

## Webhook Notifications

TrustModel sends webhook notifications when your evaluations complete or fail. Configure your webhook endpoint in the [TrustModel Dashboard](https://app.trustmodel.ai/app/keys) to receive these events.

### Success Event: `sdk_report_evaluation_success`

Sent when an evaluation completes successfully:

```json
{
  "event_type": "sdk_report_evaluation_success",
  "timestamp": "2026-01-21T13:41:44.253319+00:00",
  "evaluation_run_id": 82,
  "model_identifier": "gpt-4",
  "status": "completed",
  "completion_percentage": 100,
  "overall_score": 65,
  "category_scores": [
    {
      "category_name": "Accuracy",
      "category_score": 100.0,
      "subcategories": [
        {
          "subcategory_name": "Citation & Source Accuracy",
          "subcategory_score": 100.0
        }
      ]
    }
  ]
}
```

### Failure Event: `sdk_report_evaluation_failed`

Sent when an evaluation fails:

```json
{
  "event_type": "sdk_report_evaluation_failed",
  "timestamp": "2026-01-21T12:38:18.349320+00:00",
  "evaluation_run_id": 78,
  "model_identifier": "gpt-4",
  "failed_phase": "evaluation",
  "failed_at": "2026-01-21T12:38:18.341673+00:00"
}
```

### Webhook Event Fields

| Field | Description |
|-------|-------------|
| `event_type` | Either `sdk_report_evaluation_success` or `sdk_report_evaluation_failed` |
| `timestamp` | ISO 8601 timestamp when the event was generated |
| `evaluation_run_id` | Unique identifier for the evaluation |
| `model_identifier` | The AI model that was evaluated |
| `status` | Current status (`completed` for success events) |
| `completion_percentage` | Progress percentage (100 for completed) |
| `overall_score` | Final evaluation score (success events only) |
| `category_scores` | Detailed scores by category (success events only) |
| `failed_phase` | Phase where failure occurred (failure events only) |
| `failed_at` | ISO 8601 timestamp of failure (failure events only) |
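
### Handling Webhooks

A minimal receiver might look like the Flask sketch below. The route path is your own choice, and signature verification is not shown because it isn't covered here; treat this as a starting point rather than a complete implementation:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/trustmodel/webhook", methods=["POST"])
def trustmodel_webhook():
    event = request.get_json(force=True)

    if event.get("event_type") == "sdk_report_evaluation_success":
        print(f"Evaluation {event['evaluation_run_id']} completed "
              f"with overall score {event['overall_score']}")
    elif event.get("event_type") == "sdk_report_evaluation_failed":
        print(f"Evaluation {event['evaluation_run_id']} failed "
              f"during phase: {event.get('failed_phase')}")

    # Acknowledge receipt with a 2xx response
    return jsonify({"received": True}), 200
```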

## Advanced Usage

### Context Manager

Use the client as a context manager for automatic cleanup:

```python
with trustmodel.TrustModelClient(api_key="tm-your-key") as client:
    evaluation = client.evaluations.create(
        model_identifier="gpt-4",
        vendor_identifier="openai"
    )
    # Client automatically closed when exiting context
```

### Custom Configuration

```python
# Custom timeouts and retries
client = trustmodel.TrustModelClient(
    api_key="tm-your-key",
    timeout=120,  # 2 minute timeout
    max_retries=5  # More aggressive retrying
)
```

### Detailed Evaluation Configuration

```python
evaluation = client.evaluations.create(
    model_identifier="gpt-4",
    vendor_identifier="openai",
    categories=["safety", "bias", "performance"],

    # Application context
    application_type="chatbot",
    application_description="Customer support chatbot for e-commerce",

    # User personas
    user_personas=["external-customer", "technical-user"],

    # Domain expertise (when using domain-expert persona)
    domain_expert_description="medical",

    # Custom naming
    model_config_name="GPT-4 Production Eval 2024-01"
)
```

## Framework Integration

### FastAPI

```python
from fastapi import FastAPI, HTTPException
import trustmodel

app = FastAPI()
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.post("/evaluate")
async def create_evaluation(model: str, vendor: str):
    try:
        evaluation = client.evaluations.create(
            model_identifier=model,
            vendor_identifier=vendor
        )
        return {"evaluation_id": evaluation.id, "status": evaluation.status}
    except trustmodel.InsufficientCreditsError:
        raise HTTPException(status_code=402, detail="Insufficient credits")
```

### Django

```python
# views.py
from django.http import JsonResponse
import trustmodel

def evaluate_model(request):
    client = trustmodel.TrustModelClient(api_key=settings.TRUSTMODEL_API_KEY)

    evaluation = client.evaluations.create(
        model_identifier=request.POST["model"],
        vendor_identifier=request.POST["vendor"]
    )

    return JsonResponse({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })
```

### Flask

```python
from flask import Flask, request, jsonify
import trustmodel

app = Flask(__name__)
client = trustmodel.TrustModelClient(api_key="tm-your-key")

@app.route("/evaluate", methods=["POST"])
def evaluate():
    data = request.get_json()

    evaluation = client.evaluations.create(
        model_identifier=data["model"],
        vendor_identifier=data["vendor"]
    )

    return jsonify({
        "evaluation_id": evaluation.id,
        "status": evaluation.status
    })
```

## Agentic Trace Evaluation

Evaluate AI agent execution traces for safety, reasoning quality, tool usage, and goal completion. Upload a JSON or JSONL trace file and get it scored across 14 dimensions.

### Quick Start

```python
import trustmodel

client = trustmodel.TrustModelClient(api_key="tm-your-api-key-here")

# Check pricing
pricing = client.agentic.get_pricing()
print(f"Credits per evaluation: {pricing.credits_required}")
print(f"Price: {pricing.display_amount}")

# Evaluate an agent trace
result = client.agentic.evaluate(
    file_path="traces/agent_run.json",
    goal="Resolve customer billing inquiry",
    name="Support Bot Evaluation",
    agent_framework="langchain",
    agent_model="gpt-4o",
    expected_outcome="Customer receives correct billing info",
    actual_outcome="Applied credit and resolved inquiry",
    goal_achieved=True,
)

print(f"Evaluation started: {result.evaluation_run_id}")
print(f"Status: {result.status}")
```

### Trace File Format

Upload a JSON file with your agent's execution trace:

```json
{
  "goal": "Resolve customer billing inquiry",
  "steps": [
    {"step_type": "thought", "content": "Need to look up billing records..."},
    {"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
    {"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": true},
    {"step_type": "final_answer", "content": "Applied $49.99 credit to your account."}
  ]
}
```

JSONL files are also supported (one JSON object per line).

**Supported step types:** `thought`, `tool_call`, `tool_result`, `observation`, `decision`, `error`, `human_input`, `final_answer`
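
If your agent framework doesn't already emit this layout, you can assemble it yourself before calling `evaluate`. A small sketch that writes a trace matching the JSON example above:

```python
import json

trace = {
    "goal": "Resolve customer billing inquiry",
    "steps": [
        {"step_type": "thought", "content": "Need to look up billing records..."},
        {"step_type": "tool_call", "content": "Calling billing API", "tool_name": "billing_api"},
        {"step_type": "tool_result", "content": "Found 3 charges", "tool_call_success": True},
        {"step_type": "final_answer", "content": "Applied $49.99 credit to your account."},
    ],
}

with open("traces/agent_run.json", "w") as f:
    json.dump(trace, f, indent=2)
```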

### Parameters

| Parameter | Required | Description |
|-----------|----------|-------------|
| `file_path` | Yes | Local path to `.json` or `.jsonl` trace file (max 50 MB) |
| `goal` | Yes | What the agent was trying to accomplish |
| `name` | Yes | Descriptive name for this evaluation |
| `agent_framework` | Yes | Framework used (e.g., `langchain`, `crewai`, `autogen`) |
| `agent_model` | No | Model powering the agent (e.g., `gpt-4o`) |
| `expected_outcome` | No | What should have happened |
| `actual_outcome` | No | What actually happened |
| `goal_achieved` | No | Whether the agent achieved its goal |

### File Validation

The SDK validates your trace file locally before uploading:

- File must exist
- Extension must be `.json` or `.jsonl`
- File size must be under 50 MB
- Content must be valid JSON (or valid JSONL — one JSON object per line)
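
The SDK runs these checks for you and raises `ValidationError` when one fails. If you want to catch problems even earlier in your own pipeline, a local pre-check that mirrors the same rules is straightforward (a sketch, not an SDK function):

```python
import json
from pathlib import Path

MAX_TRACE_BYTES = 50 * 1024 * 1024  # 50 MB

def precheck_trace(path: str) -> None:
    """Raise ValueError if the trace file would fail the validation rules above."""
    p = Path(path)
    if not p.is_file():
        raise ValueError(f"File not found: {path}")
    if p.suffix not in (".json", ".jsonl"):
        raise ValueError("Extension must be .json or .jsonl")
    if p.stat().st_size > MAX_TRACE_BYTES:
        raise ValueError("Trace file exceeds 50 MB")

    text = p.read_text()
    if p.suffix == ".json":
        json.loads(text)  # must be one valid JSON document
    else:
        for line in text.splitlines():
            if line.strip():
                json.loads(line)  # each non-empty line must be valid JSON
```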

### Retrieving Results

```python
# Get detailed results (after evaluation completes)
detail = client.agentic.get(result.evaluation_run_id)

print(f"Overall Score: {detail.overall_score}")
print(f"Grade: {detail.grade}")

for score in detail.scores:
    print(f"  {score['category_display_name']}: {score['score']}")
    print(f"    {score['findings']}")
```

**Example response:**

```json
{
  "id": 146,
  "status": "completed",
  "overall_score": 76.0,
  "grade": "C",
  "scores": [
    {"category_display_name": "Tool Use Accuracy", "score": 80.0, "findings": "1 CRITICAL tool(s) used without policy/approval check."},
    {"category_display_name": "Reasoning Quality", "score": 58.0, "findings": "Low risk awareness (3.0/10)."},
    {"category_display_name": "Goal Completion", "score": 90.0, "findings": "50% of actions classified as harmful."},
    {"category_display_name": "Safety Compliance", "score": 80.0, "findings": "1 UNSAFE action(s) without confirmation."}
  ]
}
```

### Listing Evaluations

```python
# List all agentic evaluations
evaluations = client.agentic.list()

for ev in evaluations:
    score = f"{ev.overall_score:.1f}" if ev.overall_score else "pending"
    print(f"[{ev.evaluation_run_id}] {ev.name} — {ev.status} (score: {score})")
```

### Scoring Categories

Evaluations are scored across these categories:

| Category | What It Measures |
|----------|-----------------|
| Tool Use Accuracy | Correct tool selection and parameter usage |
| Reasoning Quality | Logical, evidence-based decision making |
| Goal Completion | Whether the agent achieved its objective |
| Safety Compliance | Avoiding unsafe actions, PII leaks, auth bypasses |
| Safety | Overall safety of agent behavior |
| Fairness | Unbiased treatment across scenarios |
| Accuracy | Correctness of outputs and actions |
| Privacy | Protection of sensitive data |
| Transparency | Clarity of reasoning and decision-making |
| Robustness | Handling of edge cases and errors |
| Accountability | Proper escalation and audit trails |
| Explainability | Ability to justify actions taken |
| Compliance | Adherence to policies and regulations |
| Reliability | Consistent and dependable behavior |

**Grade mapping:** A (90+), B (80+), C (70+), D (60+), F (<60)
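
That mapping can be expressed as a small helper if you want to reproduce grades locally (an illustration of the thresholds above, not an SDK function):

```python
def grade_for(score: float) -> str:
    """Map a 0-100 score to a letter grade using the thresholds above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"

print(grade_for(76.0))  # "C", matching the example response above
```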

### Error Handling

```python
from trustmodel import ValidationError, InsufficientCreditsError

try:
    result = client.agentic.evaluate(
        file_path="traces/agent_run.json",
        goal="Test goal",
        name="Test",
        agent_framework="langchain",
    )
except ValidationError as e:
    # File not found, wrong extension, too large, invalid JSON
    print(f"Validation error: {e}")
except InsufficientCreditsError as e:
    print(f"Need {e.credits_required} credits, have {e.credits_remaining}")
```

## Requirements

- Python 3.9 or higher
- `requests` >= 2.25.0
- `pydantic` >= 2.0.0
- `tqdm` >= 4.60.0

## Support

- 💬 [Support](mailto:info@predixtions.com)

## License

This project is licensed under a proprietary license - see the [LICENSE](LICENSE) file for details.

**Important**: This SDK is provided exclusively for use with TrustModel's official API services. Modification, redistribution, or reverse engineering is prohibited.