Metadata-Version: 2.4
Name: variably-sdk
Version: 2.9.0
Summary: Official Python SDK for Variably feature flags, LLM experimentation, and prompt optimization platform
Author: Variably
Author-email: Variably <support@variably.com>
License: MIT
Project-URL: Homepage, https://github.com/variably/variably-python-sdk
Project-URL: Documentation, https://docs.variably.com/sdks/python
Project-URL: Repository, https://github.com/variably/variably-python-sdk
Project-URL: Issues, https://github.com/variably/variably-python-sdk/issues
Keywords: feature-flags,experimentation,a-b-testing,variably,llm,prompt-experimentation,llmops
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.25.0
Requires-Dist: typing-extensions>=3.7.4; python_version < "3.8"
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.18.0; extra == "dev"
Requires-Dist: pytest-cov>=2.10; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.800; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: responses>=0.18.0; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest>=6.0; extra == "test"
Requires-Dist: pytest-cov>=2.10; extra == "test"
Requires-Dist: responses>=0.18.0; extra == "test"
Dynamic: author
Dynamic: requires-python

# Variably Python SDK

Official Python SDK for Variably — LLM evaluation, experimentation, and prompt optimization.

## Installation

```bash
pip install variably-sdk
```

For Docker/Kubernetes deployments, add to your `requirements.txt`:

```
variably-sdk>=2.6.1
```

## Quick Start — Observe Mode

Add **one line** to your existing LLM app and get **multi-dimensional evaluation** across 41 metrics in six categories: Quality, Safety, Semantic, Grounding, Coherence, and Advanced.

No experiment setup. No prompt migration. Just log and see scores.

### 1. Set your environment variables

```bash
VARIABLY_API_KEY=vbl_your_key_here
VARIABLY_BASE_URL=https://api.variably.com
```

### 2. Add one line after your LLM call

```python
from variably import observe

# Your existing code (unchanged)
response = your_llm_call(user_query)

# Add this line:
observe(prompt=user_query, response=response)
```

### Auto-extract tokens & model from provider response

```python
# OpenAI
from openai import OpenAI
from variably import observe

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}],
)

observe(
    prompt=user_query,
    response=completion.choices[0].message.content,
    provider_response=completion,  # auto-extracts model, tokens
)
```

```python
# Anthropic
import anthropic
from variably import observe

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": user_query}],
)

observe(
    prompt=user_query,
    response=message.content[0].text,
    provider_response=message,  # auto-extracts model, tokens
)
```

### RAG applications — grounding & hallucination scoring

```python
observe(
    prompt=user_query,
    response=llm_answer,
    provider_response=completion,
    reference_materials=[
        {"id": "chunk-1", "content": "Retrieved text...", "source": "docs.pdf"},
        {"id": "chunk-2", "content": "Another chunk...", "source": "faq.md"},
    ],
    retrieval_query=user_query,
)
```

### Multi-turn chat — coherence scoring

```python
observe(
    prompt=latest_user_message,
    response=llm_answer,
    provider_response=completion,
    conversation_history=[
        {"role": "user", "content": "What is diabetes?"},
        {"role": "assistant", "content": "Diabetes is a chronic condition..."},
        {"role": "user", "content": "What are the symptoms?"},
    ],
    session_id="conv-123",
)
```

**`observe()` parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `prompt` | str | Yes | The user's input / question |
| `response` | str | Yes | The LLM's generated response |
| `provider_response` | object | No | Raw OpenAI/Anthropic/Google response — auto-extracts model, tokens |
| `model` | str | No | Model name (auto-extracted if provider_response given) |
| `provider` | str | No | "openai", "anthropic", etc. (auto-detected) |
| `latency_ms` | int | No | Response generation time in milliseconds |
| `prompt_tokens` | int | No | Input token count (auto-extracted if provider_response given) |
| `completion_tokens` | int | No | Output token count (auto-extracted) |
| `cost` | float | No | Cost in USD |
| `reference_materials` | list[dict] | No | RAG chunks: `[{"id", "content", "source"}]` — enables grounding scoring |
| `retrieval_query` | str | No | Query sent to retriever — enables retrieval quality scoring |
| `conversation_history` | list[dict] | No | Prior turns: `[{"role", "content"}]` — enables coherence scoring |
| `tags` | list[str] | No | Grouping labels, e.g. `["production", "rag"]` |
| `user_id` | str | No | Your user's ID |
| `session_id` | str | No | Conversation session ID (groups multi-turn) |
| `metadata` | dict | No | Any extra key-value data |
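
Combining the optional fields from the table — reusing the OpenAI client from the earlier example — a production call might look like this:

```python
import time

start = time.time()
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_query}],
)
latency = int((time.time() - start) * 1000)

observe(
    prompt=user_query,
    response=completion.choices[0].message.content,
    provider_response=completion,       # model + token counts auto-extracted
    latency_ms=latency,
    tags=["production", "rag"],
    user_id="user-123",
    session_id="conv-123",
    metadata={"app_version": "1.4.2"},  # any extra key-value data
)
```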

---

## Prompt Experimentation

Variably provides two modes for LLM prompt experimentation:

### BYOR (Bring Your Own Runtime)

You call your own LLM. Variably handles variant allocation and 41-dimensional evaluation.

```python
from variably import VariablyClient
import time

client = VariablyClient({"api_key": "your-api-key"})

user_context = {"user_id": "user-123"}
input_variables = {"query": "What are the symptoms of Type 2 diabetes?"}

# Step 1: Get the allocated variant
variant = client.get_variant("rag-prompt-experiment", user_context, input_variables)
print(f"Variant: {variant.variant_key}, Model: {variant.model}")

# Step 2: Call your LLM with the variant's prompt template
prompt = variant.prompt_template.format(**input_variables)
start = time.time()
llm_response = call_your_llm(prompt, model=variant.model)  # your LLM call
latency = int((time.time() - start) * 1000)

# Step 3: Submit the response for 41-dimensional evaluation
result = client.submit_response(
    experiment_key="rag-prompt-experiment",
    variant_key=variant.variant_key,
    executed_prompt=prompt,
    response=llm_response,
    user_context=user_context,
    input_variables=input_variables,
    provider=variant.provider,
    model=variant.model,
    latency_ms=latency,
)
print(f"Submitted: {result.status}")
```

### Managed Execution

Variably selects the variant, calls the LLM, and evaluates — all in one call.

```python
response = client.evaluate_prompt(
    experiment_key="rag-prompt-experiment",
    user_context={"user_id": "user-123"},
    input_variables={"query": "What are the symptoms of Type 2 diabetes?"},
    evaluation_mode="full",  # "full" | "fast"
)

print(f"Content: {response.content}")
print(f"Model: {response.model}, Latency: {response.latency_ms}ms")
print(f"Tokens: {response.token_usage}")
print(f"Quality Score: {response.quality_score}")
```

### Managed Execution with Streaming (v2.1.0+)

Same as managed execution, but tokens stream in real-time — ideal for chatbot UIs.

```python
from variably import VariablyClient

client = VariablyClient({"api_key": "your-api-key"})

stream = client.evaluate_prompt_stream(
    experiment_key="rag-prompt-experiment",
    user_context={"user_id": "user-123"},
    input_variables={"query": "What are the symptoms of Type 2 diabetes?"},
)

# Tokens arrive one-by-one for real-time display
for token in stream:
    print(token, end="", flush=True)

print()  # newline after stream ends

# After iteration, metadata is available (token usage, latency, quality score)
meta = stream.metadata
if meta:
    print(f"Model: {meta.model}, Latency: {meta.latency_ms}ms")
    print(f"Tokens: {meta.token_usage}")
```

### Context-Aware Evaluation (Better RAG Quality) — v2.2.0+

For RAG chatbots, passing conversation history and retrieved chunks enables **groundedness scoring, hallucination detection, and conversational coherence** — dimensions that are impossible to evaluate in isolation.

The `evaluation_context` parameter is **not sent to the LLM** — it's only used by Variably's evaluator for richer scoring.

```python
# Step 1: Collect conversation history from your session
workflow_history = [
    {"role": "user", "content": "What causes diabetes?"},
    {"role": "assistant", "content": "Key factors include genetics, diet..."},
    {"role": "user", "content": "What about potatoes?"},
]

# Step 2: Collect retrieved RAG chunks (after your retrieval step)
reference_materials = [
    {
        "id": "chunk-001",
        "content": "Unhealthy diets high in refined sugars, fats...",
        "source": "Kenya National Clinical Guidelines",
        "type": "chunk",
        "relevance_score": 0.89,
    },
    {
        "id": "chunk-002",
        "content": "Modifiable risk factors include obesity...",
        "source": "Kenya National Clinical Guidelines",
        "type": "chunk",
        "relevance_score": 0.82,
    },
]

# Prompt context consumed by your template (here, the retrieved chunk texts)
context_text = "\n\n".join(m["content"] for m in reference_materials)

# Step 3: Pass evaluation_context in your evaluate call
response = client.evaluate_prompt(
    experiment_key="rag-prompt-experiment",
    user_context={"user_id": "user-123"},
    input_variables={"query": "What about potatoes?", "context": context_text},
    evaluation_mode="full",
    evaluation_context={
        "reference_materials": reference_materials,
        "workflow_history": workflow_history,
        "retrieval_query": "potato consumption glycemic index diabetes risk",
    },
)

# Same works with streaming
stream = client.evaluate_prompt_stream(
    experiment_key="rag-prompt-experiment",
    user_context={"user_id": "user-123"},
    input_variables={"query": "What about potatoes?", "context": context_text},
    evaluation_context={
        "reference_materials": reference_materials,
        "workflow_history": workflow_history,
    },
)
for token in stream:
    print(token, end="", flush=True)
```

**What this enables:**

| Dimension | Description | Requires |
|-----------|-------------|----------|
| `faithfulness` | % of claims grounded in retrieved chunks | `reference_materials` |
| `hallucination_rate` | % of claims with no source in context | `reference_materials` |
| `context_utilization` | % of relevant chunks actually used | `reference_materials` |
| `attribution_accuracy` | Do citations map to correct chunks? | `reference_materials` |
| `conversation_consistency` | No contradictions with prior turns | `workflow_history` |
| `context_retention` | Maintains topic awareness across turns | `workflow_history` |
| `transparency` | Discloses when going beyond source material | `reference_materials` |

**BYOR mode** also supports `evaluation_context` — pass it in `submit_response()`:

```python
result = client.submit_response(
    experiment_key="my-experiment",
    variant_key=variant.variant_key,
    executed_prompt=prompt,
    response=llm_response,
    user_context=user_context,
    input_variables=input_variables,
    provider=variant.provider,
    model=variant.model,
    latency_ms=latency,
    evaluation_context={
        "reference_materials": reference_materials,
        "workflow_history": workflow_history,
    },
)
```

#### evaluation_context Schema

| Field | Type | Description |
|-------|------|-------------|
| `reference_materials` | `list[dict]` | RAG chunks / source documents for groundedness scoring |
| `reference_materials[].id` | `str` | Unique chunk identifier |
| `reference_materials[].content` | `str` | Chunk text content |
| `reference_materials[].source` | `str` (optional) | Source document URL or name |
| `reference_materials[].type` | `str` (optional) | e.g. `"chunk"`, `"document"` |
| `reference_materials[].relevance_score` | `float` (optional) | Retriever similarity score |
| `workflow_history` | `list[dict]` | Conversation turns for coherence scoring |
| `workflow_history[].role` | `str` | `"user"` or `"assistant"` |
| `workflow_history[].content` | `str` | Message content |
| `retrieval_query` | `str` (optional) | The rewritten query sent to the retriever |
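
Assembled per this schema, a complete `evaluation_context` looks like:

```python
evaluation_context = {
    "reference_materials": [
        {
            "id": "chunk-001",        # required
            "content": "Unhealthy diets high in refined sugars...",  # required
            "source": "Kenya National Clinical Guidelines",          # optional
            "type": "chunk",          # optional
            "relevance_score": 0.89,  # optional
        },
    ],
    "workflow_history": [
        {"role": "user", "content": "What causes diabetes?"},
        {"role": "assistant", "content": "Key factors include genetics, diet..."},
    ],
    "retrieval_query": "diabetes risk factors diet",  # optional
}
```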

See [Context-Aware RAG Evaluation](../../docs/concepts/LLM%20Evaluation/context-aware-rag-evaluation.md) for the full concept doc with architecture diagrams and integration examples.

#### Integration with LangGraph / FastAPI streaming

```python
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str
    session_id: str

async def stream_with_variably(query: str, session_id: str):
    """Yield NDJSON events from Variably streaming evaluation."""
    stream = client.evaluate_prompt_stream(
        experiment_key="my-experiment",
        user_context={"user_id": session_id},
        input_variables={"query": query},
    )

    for token in stream:
        yield json.dumps({"type": "token", "content": token}) + "\n"

    # Send final metadata
    if stream.metadata:
        yield json.dumps({
            "type": "stream_end",
            "content": stream.metadata.content,
        }) + "\n"

@app.post("/api/chat")
async def chat(request: ChatRequest):
    return StreamingResponse(
        stream_with_variably(request.message, request.session_id),
        media_type="application/x-ndjson",
    )
```

### Backend API: SSE Streaming Endpoint

The streaming endpoint uses Server-Sent Events (SSE). Here's the raw API:

**Endpoint:** `POST /api/v1/internal/sdk/prompt-experiments/evaluate-stream`

**Headers:**
```
X-API-Key: your-api-key
Content-Type: application/json
```

**Request body** (same as non-streaming evaluate):
```json
{
  "experiment_key": "rag-prompt-experiment",
  "user_context": {
    "userId": "user-123",
    "sessionId": "sess-456"
  },
  "input_variables": {
    "query": "What are the symptoms of Type 2 diabetes?"
  },
  "evaluation_context": {
    "reference_materials": [{"id": "chunk-1", "content": "...", "source": "...", "type": "chunk"}],
    "workflow_history": [{"role": "user", "content": "..."}],
    "retrieval_query": "diabetes symptoms type 2"
  }
}
```

**Response** (SSE stream):
```
event: token
data: {"content": "Type"}

event: token
data: {"content": " 2"}

event: token
data: {"content": " diabetes"}

event: token
data: {"content": " symptoms"}

event: token
data: {"content": " include..."}

event: metadata
data: {"experiment_id": "exp-123", "variant_id": "variant-a", "execution_id": "eval-789", "provider": "anthropic", "model": "claude-3-5-haiku-20241022", "prompt_tokens": 150, "completion_tokens": 85, "total_tokens": 235, "cost_usd": 0.000425, "latency_ms": 1250}

event: done
data: {}
```

**curl example:**
```bash
curl -N -X POST http://localhost:8080/api/v1/internal/sdk/prompt-experiments/evaluate-stream \
  -H "X-API-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "experiment_key": "rag-prompt-experiment",
    "user_context": {"userId": "user-123", "sessionId": "sess-456"},
    "input_variables": {"query": "What are the symptoms of Type 2 diabetes?"}
  }'
```

**Error handling:** If an error occurs during streaming, an error event is sent:
```
event: error
data: {"message": "LLM generation failed: rate limit exceeded"}
```
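
If you need to consume this endpoint without the SDK, `requests` (already a dependency) can read the stream directly. A minimal sketch — the `stream_evaluate` helper is hypothetical, and the parsing assumes exactly the `event:`/`data:` framing shown above:

```python
import json

import requests

def stream_evaluate(base_url: str, api_key: str, payload: dict):
    """Minimal SSE consumer for the evaluate-stream endpoint (illustrative)."""
    url = f"{base_url}/api/v1/internal/sdk/prompt-experiments/evaluate-stream"
    with requests.post(
        url,
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        json=payload,
        stream=True,
    ) as resp:
        resp.raise_for_status()
        event = None
        for raw in resp.iter_lines(decode_unicode=True):
            if not raw:        # blank line terminates one SSE event
                event = None
                continue
            if raw.startswith("event:"):
                event = raw[len("event:"):].strip()
            elif raw.startswith("data:"):
                data = json.loads(raw[len("data:"):].strip())
                if event == "token":
                    print(data["content"], end="", flush=True)
                elif event == "metadata":
                    print(f"\n[{data['model']} | {data['latency_ms']}ms]")
                elif event == "error":
                    raise RuntimeError(data["message"])
                elif event == "done":
                    return
```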

## Configuration

```python
from variably import VariablyConfig, VariablyClient

config = VariablyConfig(
    api_key="your-api-key",
    base_url="https://api.variably.com",  # default: http://localhost:8080
    environment="production",  # default: development
    timeout=5000,  # timeout in milliseconds, default: 5000
    retry_attempts=3,  # default: 3
    enable_analytics=True,  # default: True
    cache={
        "ttl": 300,  # TTL in seconds, default: 300 (5 minutes)
        "max_size": 1000,  # default: 1000
        "enabled": True  # default: True
    },
    log_level="INFO"  # DEBUG, INFO, WARNING, ERROR
)

client = VariablyClient(config)
```
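
As the Quick Start and logging examples show, `VariablyClient` also accepts a plain dict with the same keys:

```python
from variably import VariablyClient

# Dict-based construction, using the same keys as VariablyConfig
client = VariablyClient({"api_key": "your-api-key", "log_level": "INFO"})
```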

## Advanced Usage

### Environment Variables

You can create a client using environment variables:

```python
from variably import create_client_from_env

# Uses these environment variables:
# VARIABLY_API_KEY (required)
# VARIABLY_BASE_URL
# VARIABLY_ENVIRONMENT
# VARIABLY_TIMEOUT
# VARIABLY_RETRY_ATTEMPTS
# VARIABLY_ENABLE_ANALYTICS
# VARIABLY_LOG_LEVEL

client = create_client_from_env()
```

### Different Flag Types

```python
# Boolean flags
bool_value = client.evaluate_flag_bool("feature-enabled", False, user_context)

# String flags
string_value = client.evaluate_flag_string("theme", "light", user_context)

# Number flags
number_value = client.evaluate_flag_number("max-items", 10, user_context)

# JSON flags
json_value = client.evaluate_flag_json("config", {"timeout": 5000}, user_context)

# Get full evaluation details
result = client.evaluate_flag("feature-flag", "default", user_context)
print(f"Value: {result.value}, Reason: {result.reason}, Cache Hit: {result.cache_hit}")
```

### Batch Evaluation

```python
flags = client.evaluate_flags([
    "feature-a",
    "feature-b", 
    "feature-c"
], user_context)

print(flags["feature-a"].value)
```

### Event Tracking

```python
from datetime import datetime, timezone

# Single event
client.track({
    "name": "purchase_completed",
    "user_id": "user-123",
    "properties": {
        "amount": 99.99,
        "currency": "USD",
        "items": ["item-1", "item-2"]
    },
    "timestamp": datetime.utcnow()  # optional, auto-generated if not provided
})

# Batch events
client.track_batch([
    {"name": "page_view", "user_id": "user-123", "properties": {"page": "/home"}},
    {"name": "button_click", "user_id": "user-123", "properties": {"button": "cta"}}
])
```

### Cache Management

```python
# Clear cache
client.clear_cache()

# Get cache stats
stats = client.cache.get_stats()
print(stats)  # {"size": 10, "max_size": 1000, "enabled": True, "ttl": 300}
```

### Metrics

```python
# Get SDK metrics
metrics = client.get_metrics()
print(metrics)
# {
#     "api_calls": 25,
#     "cache_hits": 15,
#     "cache_misses": 10,
#     "errors": 1,
#     "average_latency": 45.2,
#     "cache_hit_rate": 0.6,
#     "error_rate": 0.04,
#     "flags_evaluated": 20,
#     "gates_evaluated": 5,
#     "events_tracked": 12,
#     "start_time": "2023-10-01T12:00:00Z",
#     "uptime_seconds": 3600
# }
```

### Context Manager

```python
# Use with context manager for automatic cleanup
with VariablyClient({"api_key": "your-api-key"}) as client:
    result = client.evaluate_flag_bool("feature", False, user_context)
    # client.close() is called automatically
```

### Custom Logger

```python
from variably import VariablyClient, create_logger

# Create custom logger
logger = create_logger(
    name="my-app",
    level="DEBUG",
    structured=True,  # JSON logging
    silent=False
)

# The client manages its own logging; control its verbosity via log_level
client = VariablyClient({
    "api_key": "your-api-key",
    "log_level": "DEBUG"
})
```

## Error Handling

```python
from variably import (
    VariablyError,
    NetworkError,
    AuthenticationError,
    ValidationError,
    RateLimitError,
    TimeoutError,
    ConfigurationError
)

try:
    result = client.evaluate_flag("my-flag", False, user_context)
except AuthenticationError:
    print("Invalid API key")
except NetworkError as e:
    print(f"Network error: {e.status_code}")
except ValidationError as e:
    print(f"Validation error in field: {e.field}")
except RateLimitError as e:
    print(f"Rate limited, retry after {e.retry_after} seconds")
except TimeoutError:
    print("Request timed out")
except ConfigurationError as e:
    print(f"Configuration error in parameter: {e.parameter}")
except VariablyError as e:
    print(f"Variably SDK error: {e}")
```
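
Because `RateLimitError` exposes `retry_after`, a small wrapper can honor the server's suggested delay before retrying — a minimal sketch, not part of the SDK:

```python
import time

from variably import RateLimitError

def evaluate_with_retry(client, flag_key, default_value, user_context, max_retries=3):
    """Evaluate a flag, retrying on rate limits (illustrative sketch)."""
    for attempt in range(max_retries + 1):
        try:
            return client.evaluate_flag(flag_key, default_value, user_context)
        except RateLimitError as e:
            if attempt == max_retries:
                raise
            # Prefer the server-provided delay; fall back to exponential backoff
            time.sleep(e.retry_after or 2 ** attempt)
```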

## Type Hints

The SDK includes full type hints for better IDE support:

```python
from typing import Dict, Any
from variably import VariablyClient, UserContext, FlagResult

user_context: UserContext = {
    "user_id": "user-123",
    "email": "user@example.com",
    "attributes": {
        "plan": "premium",
        "signup_date": "2023-01-01"
    }
}

result: FlagResult = client.evaluate_flag("feature", False, user_context)
```

## Async Support

For async applications, you can wrap the synchronous client:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from variably import VariablyClient

class AsyncVariablyClient:
    def __init__(self, config):
        self.client = VariablyClient(config)
        self.executor = ThreadPoolExecutor(max_workers=4)
    
    async def evaluate_flag_bool(self, flag_key, default_value, user_context):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            self.executor,
            self.client.evaluate_flag_bool,
            flag_key, default_value, user_context
        )
    
    async def close(self):
        self.client.close()
        self.executor.shutdown(wait=True)

# Usage
async def main():
    client = AsyncVariablyClient({"api_key": "your-api-key"})
    
    result = await client.evaluate_flag_bool("feature", False, {
        "user_id": "user-123"
    })
    
    await client.close()

asyncio.run(main())
```
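
Since each wrapped call runs on its own executor thread, continuing the example above you can evaluate several flags concurrently with `asyncio.gather`:

```python
async def evaluate_many(client: AsyncVariablyClient, user_context: dict):
    # Each evaluation runs on a separate thread, so these overlap
    return await asyncio.gather(
        client.evaluate_flag_bool("feature-a", False, user_context),
        client.evaluate_flag_bool("feature-b", False, user_context),
        client.evaluate_flag_bool("feature-c", True, user_context),
    )
```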

## Development

### Setup

```bash
# Install development dependencies
pip install -e ".[dev]"
```

### Testing

```bash
pytest
```

### Code Quality

```bash
# Format code
black src/ tests/

# Sort imports
isort src/ tests/

# Lint
flake8 src/ tests/

# Type check
mypy src/
```

## Publishing to PyPI

### Prerequisites

1. Create a PyPI account at https://pypi.org/account/register/
2. Generate an API token at https://pypi.org/manage/account/token/
   - Scope: select "Entire account" for first upload, or project-specific after that
3. Install build tools:
   ```bash
   pip3 install build twine
   ```

> **Note:** `build` and `twine` may install into your user site-packages, so their command-line entry points may not end up on your PATH.
> Use `python3 -m build` and `python3 -m twine` rather than the bare `build`/`twine` commands.

### Configure PyPI credentials

Create `~/.pypirc`:

```ini
[distutils]
index-servers = pypi

[pypi]
username = __token__
password = pypi-YOUR_API_TOKEN_HERE
```

Secure the file:

```bash
chmod 600 ~/.pypirc
```

### Build and publish

The version in the build output (e.g., `variably_sdk-2.0.0-py3-none-any.whl`) comes directly from `pyproject.toml`'s `version` field. PyPI rejects re-uploads of the same version — you must bump the version to publish again.

```bash
# 1. Clean previous builds
rm -rf dist/ build/ src/*.egg-info

# 2. Build sdist and wheel
python3 -m build

# 3. Verify the package (optional but recommended)
python3 -m twine check dist/*

# 4. Upload to TestPyPI first (optional, for dry-run)
python3 -m twine upload --repository testpypi dist/*

# 5. Upload to PyPI
python3 -m twine upload dist/*
```

### Verify the published package

```bash
pip3 install variably-sdk==2.9.0
python3 -c "from variably import VariablyClient, PromptVariant; print('OK')"
```

### Version bumping checklist

When releasing a new version, update these three files then clean-build-publish:

1. `src/variably/version.py` — `__version__`
2. `pyproject.toml` — `version`
3. `src/variably/http_client.py` — `User-Agent` header string

```bash
# Example: bumping from 2.0.0 to 2.0.1
# After updating the 3 files above:
rm -rf dist/ build/ src/*.egg-info
python3 -m build
python3 -m twine upload dist/*
```

## Requirements

- Python 3.8+
- requests >= 2.25.0

## License

MIT License - see LICENSE file for details.
