Metadata-Version: 2.4
Name: pauly4010-evalai-sdk
Version: 1.9.0
Summary: AI Evaluation Platform SDK — traces, evaluations, assertions, and workflow tracing for LLM apps
Project-URL: Homepage, https://v0-ai-evaluation-platform-nu.vercel.app
Project-URL: Repository, https://github.com/pauly7610/ai-evaluation-platform
Project-URL: Issues, https://github.com/pauly7610/ai-evaluation-platform/issues
Author: pauly4010
License-Expression: MIT
Keywords: ai,anthropic,evaluation,llm,monitoring,observability,openai,testing,tracing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx<1,>=0.27
Requires-Dist: pydantic<3,>=2.0
Provides-Extra: all
Requires-Dist: anthropic>=0.20; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: rich>=13; extra == 'all'
Requires-Dist: typer>=0.12; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.20; extra == 'anthropic'
Provides-Extra: cli
Requires-Dist: rich>=13; extra == 'cli'
Requires-Dist: typer>=0.12; extra == 'cli'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Description-Content-Type: text/markdown

# pauly4010-evalai-sdk

> **Evaluation infrastructure for AI systems.** Trace, test, and judge every LLM call — in five lines of Python.

[![PyPI](https://img.shields.io/pypi/v/pauly4010-evalai-sdk)](https://pypi.org/project/pauly4010-evalai-sdk/)
[![Python](https://img.shields.io/pypi/pyversions/pauly4010-evalai-sdk)](https://pypi.org/project/pauly4010-evalai-sdk/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Typed](https://img.shields.io/badge/typing-typed-blue)](https://peps.python.org/pep-0561/)

## Quickstart (30 seconds)

```bash
pip install pauly4010-evalai-sdk
```

```python
from evalai_sdk import expect

result = expect("The capital of France is Paris.").to_contain("Paris")
print(result.passed)  # True
```

That's it. No API key needed for local assertions. When you're ready to send traces to the platform:

```python
from evalai_sdk import AIEvalClient, CreateTraceParams

client = AIEvalClient(api_key="sk-...")
trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
```
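
The `await` snippets in this README assume an async context, such as a notebook or the body of an `async def`. To run the same call from a plain script, one option is to wrap it in a coroutine and start the event loop yourself. This is a minimal sketch that uses only the names shown above; the `main` function and the `print` call are illustrative, and the placeholder API key is left as-is.

```python
import asyncio


async def main() -> None:
    # Imported inside the coroutine so this file still loads where the SDK is absent.
    from evalai_sdk import AIEvalClient, CreateTraceParams

    client = AIEvalClient(api_key="sk-...")  # placeholder key, as in the snippet above
    trace = await client.traces.create(CreateTraceParams(name="chat-quality"))
    print(trace)


# From a plain script, start the event loop explicitly:
# asyncio.run(main())
```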

## Why EvalAI?

| What you get | How it works |
|---|---|
| **20+ assertions** | `expect(output).to_contain("Paris")`, `.to_not_contain_pii()`, `.to_have_sentiment("positive")` |
| **Test suites** | Define cases, run them, get pass/fail + scores |
| **Workflow tracing** | Track multi-agent handoffs, decisions, and costs |
| **OpenAI / Anthropic** | Drop-in tracing wrappers — one line to instrument |
| **Regression gates** | Block deploys when eval scores drop |
| **Snapshot testing** | Save and compare outputs over time |
| **CLI** | `evalai run`, `evalai gate`, `evalai ci` |

## Install

```bash
pip install pauly4010-evalai-sdk                        # Core
pip install "pauly4010-evalai-sdk[openai]"              # + OpenAI tracing
pip install "pauly4010-evalai-sdk[anthropic]"           # + Anthropic tracing
pip install "pauly4010-evalai-sdk[all]"                 # Everything
```

## Assertions

20+ built-in checks for LLM output quality, safety, and structure:

```python
from evalai_sdk import expect

# Content
expect("The capital of France is Paris.").to_contain("Paris")
expect("Hello World").to_not_contain_pii()
expect("Thank you for your help.").to_be_professional()

# Sentiment
expect("Great product!").to_have_sentiment("positive")

# Structure
expect('{"name": "Alice"}').to_be_valid_json()
expect(0.95).to_be_between(0.0, 1.0)
expect("Hello world").to_have_length(min=5, max=100)

# Safety
expect("Clean response here").to_not_contain_pii()
```

Standalone functions work too:

```python
from evalai_sdk import contains_keywords, has_no_toxicity, matches_pattern

assert contains_keywords("quick brown fox", ["quick", "fox"])
assert has_no_toxicity("Thank you for your help.")
assert matches_pattern("abc-123", r"\w+-\d+")
```

## Test Suites

```python
from evalai_sdk import create_test_suite
from evalai_sdk.types import TestSuiteCase, TestSuiteConfig

suite = create_test_suite("safety-checks", TestSuiteConfig(
    evaluator=my_llm_function,  # your own callable that produces the output under test
    test_cases=[
        TestSuiteCase(name="greeting", input="Hello", expected_output="Hi there!"),
        TestSuiteCase(name="pii-check", input="Describe yourself",
                      assertions=[{"type": "not_contains_pii"}]),
    ],
))

result = await suite.run()
print(f"{result.passed_count}/{result.total} passed")
```

## OpenAI Integration

One line to trace every OpenAI call:

```python
from openai import AsyncOpenAI
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.openai import trace_openai

traced = trace_openai(AsyncOpenAI(), AIEvalClient.init())
response = await traced.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Explain gravity"}]
)
# ^ Automatically traced with latency, tokens, and output
```

Or evaluate a batch of prompts with built-in assertions:

```python
from evalai_sdk import openai_chat_eval, OpenAIChatEvalCase

result = await openai_chat_eval(
    name="chat-quality",
    model="gpt-4",
    cases=[
        OpenAIChatEvalCase(
            input="Explain gravity in one sentence.",
            assertions=[{"type": "contains_keywords", "value": ["gravity", "force"]}],
        ),
    ],
)
print(f"{result.passed_count}/{result.total} passed — score: {result.score:.2f}")
```

## Anthropic Integration

```python
from anthropic import AsyncAnthropic
from evalai_sdk import AIEvalClient
from evalai_sdk.integrations.anthropic import trace_anthropic

traced = trace_anthropic(AsyncAnthropic(), AIEvalClient.init())
response = await traced.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain gravity"}]
)
```

## Workflow Tracing

Track multi-agent systems with handoffs, decisions, and cost:

```python
from evalai_sdk import AIEvalClient, WorkflowTracer
from evalai_sdk.types import HandoffType, CostCategory, RecordCostParams

client = AIEvalClient.init()
tracer = WorkflowTracer(client)

await tracer.start_workflow("research-pipeline")
span = await tracer.start_agent_span("researcher", {"query": "AI trends"})
await tracer.end_agent_span(span, {"findings": "..."})

await tracer.record_handoff("researcher", "writer", handoff_type=HandoffType.DELEGATION)
await tracer.record_cost(RecordCostParams(
    agent_name="researcher", category=CostCategory.LLM_INPUT, amount=0.05, tokens=1500
))

await tracer.end_workflow()
print(f"Total cost: ${tracer.get_total_cost():.2f}")
```

## Regression Gates

Block deployments when eval scores drop:

```python
from evalai_sdk import evaluate_regression, to_pass_gate

# `current_results` and `baseline` come from your own eval runs (not defined here)
report = evaluate_regression(current_results, baseline)
assert to_pass_gate(report), f"Regression detected: {report.summary}"
```
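
To make the idea concrete, here is a rough sketch of what a regression gate checks: each metric in the current run is compared with its baseline, and the gate fails if any score drops by more than a tolerance. This is not the SDK's implementation. `GateReport`, `compare_to_baseline`, the `tolerance` default, and the sample scores are all hypothetical and use only the standard library.

```python
from dataclasses import dataclass, field


@dataclass
class GateReport:
    """Hypothetical report type; not the SDK's own class."""
    regressions: dict = field(default_factory=dict)  # metric name -> score drop

    @property
    def passed(self) -> bool:
        return not self.regressions


def compare_to_baseline(current: dict, baseline: dict, tolerance: float = 0.02) -> GateReport:
    """Flag every baseline metric whose current score fell by more than `tolerance`."""
    report = GateReport()
    for name, base_score in baseline.items():
        drop = base_score - current.get(name, 0.0)  # a missing metric counts as 0.0
        if drop > tolerance:
            report.regressions[name] = round(drop, 4)
    return report


# Made-up scores for illustration only.
baseline = {"accuracy": 0.90, "safety": 0.99}
current = {"accuracy": 0.84, "safety": 0.99}
report = compare_to_baseline(current, baseline)
print(report.passed, report.regressions)
```

In CI, a non-passing report would typically translate into a non-zero exit code so the deploy step is skipped.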

## CLI

```bash
evalai init                    # Scaffold eval config
evalai run --dir ./evals       # Run all evaluations
evalai gate --baseline b.json  # Regression gate
evalai ci                      # Run + gate (CI mode)
evalai doctor                  # Check setup
evalai discover                # Find eval files
```

## Reliability

| Feature | Detail |
|---|---|
| **Python** | 3.9, 3.10, 3.11, 3.12, 3.13 |
| **Dependencies** | Only `httpx` + `pydantic` (2 packages) |
| **Async** | Native `async/await` throughout, sync wrappers available |
| **Type hints** | Full `py.typed` — works with mypy and Pyright |
| **Errors** | Structured errors: `RateLimitError`, `AuthenticationError`, `NetworkError`, `ValidationError` |
| **Rate handling** | Built-in `RateLimiter` with configurable tiers |
| **Caching** | `RequestCache` with TTL and LRU eviction |
| **Batching** | `batch_process()` with concurrency control |
| **Pagination** | Async `PaginatedIterator` with cursor support |

## API Reference

| Module | Methods |
|---|---|
| `client.traces` | `create`, `list`, `get`, `update`, `delete`, `create_span`, `list_spans` |
| `client.evaluations` | `create`, `get`, `list`, `update`, `delete`, `create_test_case`, `list_test_cases`, `create_run`, `list_runs`, `get_run` |
| `client.llm_judge` | `evaluate`, `create_config`, `list_configs`, `list_results`, `get_alignment` |
| `client.annotations` | `create`, `list`, `tasks.create`, `tasks.list`, `tasks.get`, `tasks.items.create`, `tasks.items.list` |
| `client.developer` | `get_usage`, `get_usage_summary`, `api_keys.*`, `webhooks.*` |

## Examples

See the [`examples/python/`](https://github.com/pauly7610/ai-evaluation-platform/tree/main/examples/python) directory for runnable scripts and Jupyter notebooks:

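- **[OpenAI Eval](https://github.com/pauly7610/ai-evaluation-platform/blob/main/examples/python/openai_eval.ipynb)**: Trace and evaluate OpenAI chat completions
- **[RAG Eval](https://github.com/pauly7610/ai-evaluation-platform/blob/main/examples/python/rag_eval.ipynb)**: Evaluate retrieval-augmented generation pipelines
- **[Agent Eval](https://github.com/pauly7610/ai-evaluation-platform/blob/main/examples/python/agent_eval.ipynb)**: Test and trace multi-agent workflows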

## Links

- [Platform](https://v0-ai-evaluation-platform-nu.vercel.app) | [GitHub](https://github.com/pauly7610/ai-evaluation-platform) | [TypeScript SDK](https://www.npmjs.com/package/@pauly4010/evalai-sdk)

## License

MIT
