Metadata-Version: 2.4
Name: evaluatorq
Version: 1.6.0
Summary: An evaluation framework library for Python that provides a flexible way to run parallel evaluations and optionally integrate with the Orq AI platform.
Project-URL: Homepage, https://github.com/orq-ai/evaluatorq
Project-URL: Repository, https://github.com/orq-ai/evaluatorq
Project-URL: Documentation, https://orq-ai.github.io/evaluatorq/
License: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: httpx>=0.28.1
Requires-Dist: loguru>=0.6.0
Requires-Dist: openai<3.0.0,>=1.92.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=14.2.0
Requires-Dist: typer>=0.12.0
Provides-Extra: all
Requires-Dist: crewai>=0.70.0; (python_version >= '3.11') and extra == 'all'
Requires-Dist: huggingface-hub>=0.20.0; extra == 'all'
Requires-Dist: langchain<2.0.0,>=1.0.0; extra == 'all'
Requires-Dist: langgraph>=0.2.0; extra == 'all'
Requires-Dist: openai-agents>=0.1.0; extra == 'all'
Requires-Dist: opentelemetry-api>=1.20.0; extra == 'all'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20.0; extra == 'all'
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'all'
Requires-Dist: opentelemetry-semantic-conventions>=0.41b0; extra == 'all'
Requires-Dist: orq-ai-sdk>=4.4.7; extra == 'all'
Requires-Dist: plotly>=5.18.0; extra == 'all'
Requires-Dist: pydantic-ai-slim[openai]>=1.0.0; extra == 'all'
Requires-Dist: python-fasthtml<0.15.0,>=0.12.0; extra == 'all'
Requires-Dist: streamlit>=1.30.0; extra == 'all'
Requires-Dist: uvicorn>=0.30.0; extra == 'all'
Requires-Dist: vl-convert-python>=1.6.0; extra == 'all'
Requires-Dist: watchdog>=4.0.0; extra == 'all'
Provides-Extra: crewai
Requires-Dist: crewai>=0.70.0; (python_version >= '3.11') and extra == 'crewai'
Provides-Extra: dashboard
Requires-Dist: python-fasthtml<0.15.0,>=0.12.0; extra == 'dashboard'
Requires-Dist: uvicorn>=0.30.0; extra == 'dashboard'
Requires-Dist: vl-convert-python>=1.6.0; extra == 'dashboard'
Provides-Extra: langchain
Requires-Dist: langchain<2.0.0,>=1.0.0; extra == 'langchain'
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.2.0; extra == 'langgraph'
Provides-Extra: openai-agents
Requires-Dist: openai-agents>=0.1.0; extra == 'openai-agents'
Provides-Extra: orq
Requires-Dist: orq-ai-sdk>=4.4.7; extra == 'orq'
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.20.0; extra == 'otel'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.20.0; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.20.0; extra == 'otel'
Requires-Dist: opentelemetry-semantic-conventions>=0.41b0; extra == 'otel'
Provides-Extra: pydantic-ai
Requires-Dist: pydantic-ai-slim[openai]>=1.0.0; extra == 'pydantic-ai'
Provides-Extra: redteam
Requires-Dist: huggingface-hub>=0.20.0; extra == 'redteam'
Requires-Dist: plotly>=5.18.0; extra == 'redteam'
Requires-Dist: streamlit>=1.30.0; extra == 'redteam'
Requires-Dist: vl-convert-python>=1.6.0; extra == 'redteam'
Requires-Dist: watchdog>=4.0.0; extra == 'redteam'
Provides-Extra: simulation
Requires-Dist: orq-ai-sdk>=4.4.7; extra == 'simulation'
Requires-Dist: plotly>=5.18.0; extra == 'simulation'
Requires-Dist: streamlit>=1.30.0; extra == 'simulation'
Requires-Dist: vl-convert-python>=1.6.0; extra == 'simulation'
Requires-Dist: watchdog>=4.0.0; extra == 'simulation'
Description-Content-Type: text/markdown

# evaluatorq

An evaluation framework library for Python that provides a flexible way to run parallel evaluations and optionally integrate with the Orq AI platform.

## Why evaluatorq?

Orq's built-in experiment runner works well for evaluating deployments hosted on the platform, but it has limits: you can only target Orq-managed agents and deployments, and evaluation logic is constrained to what the UI exposes.

`evaluatorq` was built to remove those limits. It gives you a full Python evaluation loop you control entirely — bring your own agent, your own data, your own scorers. The Orq platform becomes optional infrastructure for storing results and datasets, not a hard requirement.

**When to use evaluatorq instead of the built-in experiment runner:**

- Your agent runs outside Orq (LangChain, LangGraph, OpenAI Agents SDK, a custom HTTP service, anything)
- You need custom evaluation logic — LLM-as-judge, multi-criteria rubrics, programmatic checks, or external APIs
- You want CI/CD integration with pass/fail signals and exit codes
- You need to compare multiple agent implementations side by side in the same run
- You want full observability via OpenTelemetry into exactly what ran and how long it took

The library is deliberately lightweight: async-first, typed end-to-end, and usable standalone or wired into Orq for result storage and dataset management.

## 🎯 Features

- **Parallel Execution**: Run multiple evaluation jobs concurrently with progress tracking
- **Flexible Data Sources**: Support for inline data, async iterables, and Orq platform datasets
- **Type-safe**: Fully typed with Python type hints and Pydantic models with runtime validation
- **Rich Terminal UI**: Beautiful progress indicators and result tables powered by Rich
- **Orq Platform Integration**: Seamlessly fetch and evaluate datasets from Orq AI (optional)
- **OpenTelemetry Tracing**: Built-in observability with automatic span creation for jobs and evaluators
- **Pass/Fail Tracking**: Evaluators can return pass/fail status for CI/CD integration
- **Built-in Evaluators**: Common evaluators like `string_contains_evaluator` included
- **Integrations**: LangChain, LangGraph, OpenAI Agents SDK, and custom callable support
- **[Red Teaming](src/evaluatorq/redteam/README.md)**: Adaptive OWASP-mapped adversarial security testing for AI agents
- **[Agent Simulation](src/evaluatorq/simulation/README.md)**: Multi-turn, persona-driven conversation testing with an LLM judge

## 📖 Table of Contents

- [Installation](#-installation)
- [Getting Started](#-getting-started)
- [Quick Start](#-quick-start)
- [LangChain Integration](#-langchain-integration)
- [Red Teaming](#-red-teaming)
- [Agent Simulation](#-agent-simulation)
- [Configuration](#-configuration)
- [Orq Platform Integration](#-orq-platform-integration)
- [OpenTelemetry Tracing](#-opentelemetry-tracing)
- [Pass/Fail Tracking](#-passfail-tracking)
- [API Reference](#-api-reference)

## 📥 Installation

```bash
pip install evaluatorq
# or
uv add evaluatorq
# or
poetry add evaluatorq
```

### Optional Dependencies

If you want to use the Orq platform integration:

```bash
pip install orq-ai-sdk
# or
pip install evaluatorq[orq]
```

For OpenTelemetry tracing (optional):

```bash
pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp-proto-http opentelemetry-semantic-conventions
# or
pip install evaluatorq[otel]
```

For LangChain/LangGraph integration:

```bash
pip install langchain
# or
pip install evaluatorq[langchain]
# for LangGraph
pip install evaluatorq[langgraph]
```

## 🏁 Getting Started

New to evaluatorq? Follow this path to get up and running:

| Step | What you'll learn | Example |
|------|------------------|---------|
| 1. **Basic eval** | Run your first evaluation with inline data | [`pass_fail_simple.py`](examples/lib/basics/pass_fail_simple.py) |
| 2. **Multiple jobs** | Run multiple jobs in parallel on each data point | [`example_runners.py`](examples/lib/basics/example_runners.py) |
| 3. **Reusable patterns** | Create reusable jobs and evaluators | [`eval_reuse.py`](examples/lib/basics/eval_reuse.py) |
| 4. **Datasets** | Load data from the Orq platform | [`dataset_example.py`](examples/lib/datasets/dataset_example.py) |
| 5. **Structured scores** | Return multi-dimensional metrics | [`structured_rubric_eval.py`](examples/lib/structured/structured_rubric_eval.py) |
| 6. **LangChain agent** | Evaluate a LangChain/LangGraph agent | [`langchain_integration_example.py`](examples/lib/integrations/langchain/langchain_integration_example.py) |

> **Tip:** Start with step 1 and work your way up. Each example builds on concepts from the previous one.

## 🚀 Quick Start

### Basic Usage

```python
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult

@job("text-analyzer")
async def text_analyzer(data: DataPoint, row: int):
    """Analyze text data and return analysis results."""
    text = data.inputs["text"]
    analysis = {
        "length": len(text),
        "word_count": len(text.split()),
        "uppercase": text.upper(),
    }

    return analysis

async def length_check_scorer(params):
    """Evaluate if output length is sufficient."""
    output = params["output"]
    passes_check = output["length"] > 10

    return EvaluationResult(
        value=1 if passes_check else 0,
        explanation=(
            "Output length is sufficient"
            if passes_check
            else f"Output too short ({output['length']} chars, need >10)"
        )
    )

async def main():
    await evaluatorq(
        "text-analysis",
        data=[
            DataPoint(inputs={"text": "Hello world"}),
            DataPoint(inputs={"text": "Testing evaluation"}),
        ],
        jobs=[text_analyzer],
        evaluators=[
            {
                "name": "length-check",
                "scorer": length_check_scorer,
            }
        ],
    )

if __name__ == "__main__":
    asyncio.run(main())
```

> **Tip:** The `@job()` decorator preserves the job name in error messages. Always prefer `@job("name")` over raw functions for better debugging.

### Using Orq Platform Datasets

```python
import asyncio
from evaluatorq import evaluatorq, job, DataPoint, DatasetIdInput

@job("processor")
async def processor(data: DataPoint, row: int):
    """Process each data point from the dataset."""
    result = await process_data(data)  # process_data: placeholder for your own async logic
    return result

async def accuracy_scorer(params):
    """Calculate accuracy by comparing output with expected results."""
    data = params["data"]
    output = params["output"]

    score = calculate_score(output, data.expected_output)  # calculate_score: placeholder returning a float in [0, 1]

    if score > 0.8:
        explanation = "High accuracy match"
    elif score > 0.5:
        explanation = "Partial match"
    else:
        explanation = "Low accuracy match"

    return {"value": score, "explanation": explanation}

async def main():
    # Requires ORQ_API_KEY environment variable
    await evaluatorq(
        "dataset-evaluation",
        data=DatasetIdInput(dataset_id="your-dataset-id"),  # From Orq platform
        jobs=[processor],
        evaluators=[
            {
                "name": "accuracy",
                "scorer": accuracy_scorer,
            }
        ],
    )

if __name__ == "__main__":
    asyncio.run(main())
```
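
The dataset example above leaves `process_data` and `calculate_score` undefined. As a minimal sketch, assuming string-like outputs, a stand-in for `calculate_score` could use the standard library's `difflib` to produce a similarity in the range 0 to 1. The function name comes from the example; the body is an illustration and not part of evaluatorq:

```python
from difflib import SequenceMatcher


def calculate_score(output: object, expected: object) -> float:
    """Hypothetical stand-in: similarity between output and expected, in [0, 1]."""
    if expected is None:
        return 0.0  # nothing to compare against
    # Normalize both sides so casing and surrounding whitespace do not matter.
    left = str(output).strip().lower()
    right = str(expected).strip().lower()
    return SequenceMatcher(None, left, right).ratio()


print(calculate_score("Paris", "paris"))   # identical after normalization
print(calculate_score("Paris", "London"))  # partial overlap only
```

Any function that maps an output and an expected value to a number works with the thresholds used in `accuracy_scorer`.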

> **Tip:** Use `parallelism` to control how many data points are processed concurrently. Start with a low value (3-5) when calling external APIs to avoid rate limits.

### Advanced Features

#### Multiple Jobs

Run multiple jobs in parallel for each data point:

```python
from evaluatorq import evaluatorq, job, DataPoint

@job("preprocessor")
async def preprocessor(data: DataPoint, row: int):
    result = await preprocess(data)
    return result

@job("analyzer")
async def analyzer(data: DataPoint, row: int):
    result = await analyze(data)
    return result

@job("transformer")
async def transformer(data: DataPoint, row: int):
    result = await transform(data)
    return result

await evaluatorq(
    "multi-job-eval",
    data=[...],
    jobs=[preprocessor, analyzer, transformer],
    evaluators=[...],
)
```

#### The `@job()` Decorator

The `@job()` decorator provides two key benefits:

1. **Eliminates boilerplate** - No need to manually wrap returns with `{"name": ..., "output": ...}`
2. **Preserves job names in errors** - When a job fails, the error will include the job name for better debugging

**Decorator pattern (recommended):**
```python
from evaluatorq import job, DataPoint

@job("text-processor")
async def process_text(data: DataPoint, row: int):
    # Clean return - just the data!
    return {"result": data.inputs["text"].upper()}
```

**Functional pattern (for lambdas):**
```python
from evaluatorq import job

# Simple transformations with lambda
uppercase_job = job("uppercase", lambda data, row: data.inputs["text"].upper())
word_count_job = job("word-count", lambda data, row: len(data.inputs["text"].split()))
```
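
To make the two benefits concrete, here is a rough sketch of what a decorator of this kind could do. `named_job` is a hypothetical name and this is not evaluatorq's implementation; it only mirrors the behaviour described above (wrapping the return value with the job name, and attaching the name to errors), and it uses a plain dict in place of `DataPoint`:

```python
import asyncio
import inspect
from typing import Any, Callable, Optional


def named_job(name: str, fn: Optional[Callable[..., Any]] = None):
    """Hypothetical re-creation of the two behaviours described above."""

    def wrap(func: Callable[..., Any]):
        async def runner(data: Any, row: int) -> dict:
            try:
                result = func(data, row)
                # Support both async functions and plain lambdas.
                if inspect.isawaitable(result):
                    result = await result
            except Exception as exc:
                # Re-raise with the job name attached so failures stay attributable.
                raise RuntimeError(f"Job '{name}' failed: {exc}") from exc
            return {"name": name, "output": result}

        return runner

    # Decorator form: @named_job("x"); functional form: named_job("x", fn).
    return wrap if fn is None else wrap(fn)


upper = named_job("uppercase", lambda data, row: data["text"].upper())
print(asyncio.run(upper({"text": "hi"}, 0)))
```

The same wrapper serves both the decorator pattern and the functional pattern, which is why the two forms shown above behave identically.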

#### Deployment Helper

Easily invoke Orq deployments within your evaluation jobs:

```python
from evaluatorq import evaluatorq, job, invoke, deployment, DatasetIdInput

# Simple one-liner with invoke()
@job("summarizer")
async def summarize_job(data, row):
    text = data.inputs["text"]
    return await invoke("my-deployment", inputs={"text": text})

# Full response with deployment()
@job("analyzer")
async def analyze_job(data, row):
    response = await deployment(
        "my-deployment",
        inputs={"text": data.inputs["text"]},
        metadata={"source": "evaluatorq"},
    )
    print("Raw:", response.raw)
    return response.content

# Chat-style with messages
@job("chatbot")
async def chat_job(data, row):
    return await invoke(
        "chatbot",
        messages=[{"role": "user", "content": data.inputs["question"]}],
    )

# Thread tracking for conversations
@job("assistant")
async def conversation_job(data, row):
    return await invoke(
        "assistant",
        inputs={"query": data.inputs["query"]},
        thread={"id": "conversation-123"},
    )
```

The `invoke()` function returns the text content directly, while `deployment()` returns an object with both `content` and `raw` response for more control.

#### Built-in Evaluators

Use the included evaluators for common use cases:

```python
from evaluatorq import evaluatorq, job, invoke, string_contains_evaluator, DatasetIdInput

@job("country-lookup")
async def country_lookup_job(data, row):
    country = data.inputs["country"]
    return await invoke("country-capitals", inputs={"country": country})

await evaluatorq(
    "country-unit-test",
    data=DatasetIdInput(dataset_id="your-dataset-id"),
    jobs=[country_lookup_job],
    evaluators=[string_contains_evaluator()],  # Checks if output contains expected_output
    parallelism=6,
)
```

Available built-in evaluators:

- **`string_contains_evaluator()`** - Checks if output contains expected_output (case-insensitive by default)
- **`exact_match_evaluator()`** - Checks if output exactly matches expected_output

```python
# Case-sensitive matching
strict_evaluator = string_contains_evaluator(case_insensitive=False)

# Custom name
my_evaluator = string_contains_evaluator(name="my-contains-check")
```
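
If you need a variation that the built-ins do not cover, the same check can be written as an ordinary scorer. The sketch below is an assumption about what a "contains" check amounts to, not the library's source; `Row` and `make_contains_scorer` are hypothetical names, and `Row` stands in for `DataPoint` so the snippet runs without evaluatorq installed:

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Row:
    """Stand-in for DataPoint (same two fields)."""
    inputs: dict
    expected_output: Optional[Any] = None


def make_contains_scorer(case_insensitive: bool = True):
    """Build a scorer that checks whether the output contains expected_output."""

    async def scorer(params: dict) -> dict:
        expected = str(params["data"].expected_output)
        output = str(params["output"])
        if case_insensitive:
            expected, output = expected.lower(), output.lower()
        found = expected in output
        return {
            "value": 1 if found else 0,
            "pass_": found,
            "explanation": "expected text found" if found else "expected text not found",
        }

    return scorer


params = {"data": Row(inputs={"country": "France"}, expected_output="Paris"),
          "output": "The capital is PARIS."}
print(asyncio.run(make_contains_scorer()(params)))
print(asyncio.run(make_contains_scorer(case_insensitive=False)(params)))
```

A scorer like this would be registered the same way as the custom scorers shown earlier, for example `{"name": "contains", "scorer": make_contains_scorer()}`.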

#### Automatic Error Handling

The `@job()` decorator automatically preserves job names even when errors occur:

```python
from evaluatorq import evaluatorq, job, DataPoint

@job("risky-job")
async def risky_operation(data: DataPoint, row: int):
    # If this raises an error, the job name "risky-job" will be preserved
    result = await potentially_failing_operation(data)
    return result

await evaluatorq(
    "error-handling",
    data=[...],
    jobs=[risky_operation],
    evaluators=[...],
)

# Error output will show: "Job 'risky-job' failed: <error details>"
# Without @job decorator, you'd only see: "<error details>"
```

#### Async Data Sources

```python
import asyncio
from evaluatorq import evaluatorq, DataPoint

# Build a list of coroutines that each resolve to a DataPoint
async def get_data_point(i: int) -> DataPoint:
    await asyncio.sleep(0.01)  # Simulate async data fetching
    return DataPoint(inputs={"value": i})

data_promises = [get_data_point(i) for i in range(1000)]

await evaluatorq(
    "async-eval",
    data=data_promises,
    jobs=[...],
    evaluators=[...],
)
```

#### Structured Evaluation Results

Evaluators can return structured, multi-dimensional metrics using `EvaluationResultCell`. This is useful for metrics like BERTScore, ROUGE-N, or any evaluation that produces multiple sub-scores.

##### Multi-criteria Rubric

Return multiple quality sub-scores in a single evaluator:

```python
from evaluatorq import evaluatorq, job, DataPoint, EvaluationResult, EvaluationResultCell

@job("echo")
async def echo_job(data: DataPoint, row: int):
    return data.inputs["text"]

async def rubric_scorer(params):
    text = str(params["output"])
    return EvaluationResult(
        value=EvaluationResultCell(
            type="rubric",
            value={
                "relevance": min(len(text) / 100, 1),
                "coherence": 0.9 if "." in text else 0.4,
                "fluency": 0.85 if len(text.split()) > 5 else 0.5,
            },
        ),
        explanation="Multi-criteria quality rubric",
    )

await evaluatorq(
    "structured-rubric",
    data=[
        DataPoint(inputs={"text": "The quick brown fox jumps over the lazy dog."}),
        DataPoint(inputs={"text": "Hi"}),
    ],
    jobs=[echo_job],
    evaluators=[{"name": "rubric", "scorer": rubric_scorer}],
)
```

##### Sentiment Distribution

Break down sentiment across categories:

```python
async def sentiment_scorer(params):
    text = str(params["output"]).lower()
    positive_words = ["good", "great", "excellent", "happy", "love"]
    negative_words = ["bad", "terrible", "awful", "sad", "hate"]
    pos_count = sum(1 for w in positive_words if w in text)
    neg_count = sum(1 for w in negative_words if w in text)
    total = max(pos_count + neg_count, 1)

    return EvaluationResult(
        value=EvaluationResultCell(
            type="sentiment",
            value={
                "positive": pos_count / total,
                "negative": neg_count / total,
                "neutral": 1 - (pos_count + neg_count) / total,
            },
        ),
        explanation="Sentiment distribution across categories",
    )
```

##### Safety Scores with Pass/Fail

Combine structured scores with pass/fail tracking for CI/CD:

```python
async def safety_scorer(params):
    text = str(params["output"]).lower()
    categories = {
        "hate_speech": 0.8 if "hate" in text else 0.1,
        "violence": 0.7 if ("kill" in text or "fight" in text) else 0.05,
        "profanity": 0.5 if "damn" in text else 0.02,
    }

    return EvaluationResult(
        value=EvaluationResultCell(
            type="safety",
            value=categories,
        ),
        pass_=all(score < 0.5 for score in categories.values()),
        explanation="Content safety severity scores per category",
    )
```

See the runnable Python examples in the `examples/` directory:

- [`structured_rubric_eval.py`](examples/lib/structured/structured_rubric_eval.py) - Multi-criteria quality rubric
- [`structured_sentiment_eval.py`](examples/lib/structured/structured_sentiment_eval.py) - Sentiment distribution breakdown
- [`structured_safety_eval.py`](examples/lib/structured/structured_safety_eval.py) - Safety scores with pass/fail tracking

> **Note:** Structured results display as `[structured]` in the terminal summary table but are preserved in full when sent to the Orq platform and OpenTelemetry spans.

#### Controlling Parallelism

```python
await evaluatorq(
    "parallel-eval",
    data=[...],
    jobs=[...],
    evaluators=[...],
    parallelism=10,  # Run up to 10 jobs concurrently
)
```
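
For intuition about what a concurrency cap means, the sketch below bounds in-flight work with `asyncio.Semaphore`. This illustrates the general technique only; how evaluatorq schedules work internally is not shown in this document, and `run_bounded` is a hypothetical helper:

```python
import asyncio


async def run_bounded(items, worker, parallelism: int = 1):
    """Run worker(item) for every item with at most `parallelism` in flight."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    gate = asyncio.Semaphore(parallelism)

    async def guarded(item):
        async with gate:  # waits while `parallelism` workers are already running
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


async def demo():
    active = peak = 0

    async def worker(i):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)  # track the highest observed concurrency
        await asyncio.sleep(0.01)  # simulate an external API call
        active -= 1
        return i * 2

    results = await run_bounded(range(6), worker, parallelism=3)
    return results, peak


results, peak = asyncio.run(demo())
print(results, peak)
```

With rate-limited APIs, a lower cap trades total runtime for fewer throttling errors.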

#### Dashboard Organization with `path`

Use the `path` parameter to organize evaluation results into folders on the Orq dashboard:

```python
await evaluatorq(
    "my-evaluation",
    path="MyProject/Evaluations/Unit Tests",
    data=[...],
    jobs=[...],
    evaluators=[...],
)
```

> **Tip:** Use paths like `"Team/Sprint-42/Feature-X"` to keep experiments organized across teams and sprints.

See [`path_organization.py`](examples/lib/structured/path_organization.py) for a complete example.

#### Evaluation Description

Add a description to document the purpose of each evaluation run:

```python
await evaluatorq(
    "model-comparison",
    description="Compare GPT-4o vs Claude on customer support responses",
    data=[...],
    jobs=[...],
    evaluators=[...],
)
```

#### Disable Progress Display

```python
# Get raw results without terminal output
results = await evaluatorq(
    "silent-eval",
    data=[...],
    jobs=[...],
    evaluators=[...],
    print_results=False,  # Disable progress and table display
)

# Process results programmatically
for result in results:
    print(result.data_point.inputs)
    for job_result in result.job_results:
        print(f"{job_result.job_name}: {job_result.output}")
```

## 🔧 Configuration

### Environment Variables

- `ORQ_API_KEY`: API key for Orq platform integration (required for dataset access and sending results). Also enables automatic OTEL tracing to Orq.
- `ORQ_BASE_URL`: Base URL for Orq platform (default: `https://my.orq.ai`)
- `OTEL_EXPORTER_OTLP_ENDPOINT`: Custom OpenTelemetry collector endpoint (overrides default Orq endpoint)
- `OTEL_EXPORTER_OTLP_HEADERS`: Headers for OTEL exporter (format: `key1=value1,key2=value2`)
- `ORQ_DISABLE_TRACING`: Set to `1` or `true` to disable automatic tracing
- `ORQ_DEBUG`: Enable debug logging for tracing setup
- `EVALUATORQ_CAPTURE_MESSAGE_CONTENT`: Whether to write LLM message text (prompts + responses) onto trace spans. **Defaults to `true`** so the Orq dashboard's input/output panels render. Set to `false` (or `0`) to keep raw message content — including any PII — out of traces while still recording token usage, model, and latency. Applies to both agent simulation and red teaming spans.
- `EVALUATORQ_SPAN_MAX_TEXT_CHARS`: Max characters of message text (both inputs and outputs) stored per span attribute before truncation (marker `... [truncated]`). **Defaults to capturing all content (no truncation).** Set a positive integer (e.g. `8192`) to cap; `-1`, `0`, or unset all mean capture all. Applies identically to the Python (agent-sim + red teaming) and TypeScript simulation tracing layers.
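
Two of the formats above are easy to misread, so here is a small sketch of how they could be interpreted. Both helpers are hypothetical and are not evaluatorq's own parsing code; in particular, appending the marker after the capped text is an assumption:

```python
from typing import Optional


def parse_otlp_headers(raw: str) -> dict:
    """Parse 'key1=value1,key2=value2', splitting each pair on the first '=' only."""
    headers = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue  # skip empty or malformed segments
        key, value = pair.split("=", 1)
        headers[key.strip()] = value.strip()
    return headers


def truncate_span_text(text: str, max_chars: Optional[int]) -> str:
    """Cap text at max_chars; None, 0, and negative values mean 'capture all'."""
    if max_chars is None or max_chars <= 0 or len(text) <= max_chars:
        return text
    # Assumption: the marker is appended after the first max_chars characters.
    return text[:max_chars] + "... [truncated]"


print(parse_otlp_headers("Authorization=Bearer token,x-team=eval"))
print(truncate_span_text("a long prompt", 6))
print(truncate_span_text("a long prompt", -1))
```

Splitting on the first `=` matters for values such as `Authorization=Bearer token`, where the value itself may contain further `=` characters.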

### Evaluation Parameters

Parameters are validated at runtime using Pydantic. The `evaluatorq` function supports three calling styles:

```python
from evaluatorq import evaluatorq, EvaluatorParams

# 1. Keyword arguments (recommended)
await evaluatorq(
    "my-eval",
    data=[...],
    jobs=[...],
    parallelism=5,
)

# 2. Dict style
await evaluatorq("my-eval", {
    "data": [...],
    "jobs": [...],
    "parallelism": 5,
})

# 3. EvaluatorParams instance
await evaluatorq("my-eval", EvaluatorParams(
    data=[...],
    jobs=[...],
    parallelism=5,
))
```

#### Parameter Reference

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data` | `list[DataPoint]` \| `list[Awaitable[DataPoint]]` \| `DatasetIdInput` | **required** | Data to evaluate |
| `jobs` | `list[Job]` | **required** | Jobs to run on each data point |
| `evaluators` | `list[Evaluator]` \| `None` | `None` | Evaluators to score job outputs |
| `parallelism` | `int` (≥1) | `1` | Number of concurrent jobs |
| `print_results` | `bool` | `True` | Display progress and results table |
| `description` | `str` \| `None` | `None` | Optional evaluation description |
| `path` | `str` \| `None` | `None` | Path for organizing results on the Orq dashboard (e.g., `"Project/Category"`) |

## 📊 Orq Platform Integration

### Automatic Result Sending

When the `ORQ_API_KEY` environment variable is set, evaluatorq automatically sends evaluation results to the Orq platform for visualization and analysis.

```python
# Results are automatically sent when ORQ_API_KEY is set
await evaluatorq(
    "my-evaluation",
    data=[...],
    jobs=[...],
    evaluators=[...],
)
```

#### What Gets Sent

When `ORQ_API_KEY` is set, the following information is sent to Orq:
- Evaluation name
- Dataset ID (when using Orq datasets)
- Job results with outputs and errors
- Evaluator scores with values and explanations
- Execution timing information

> **Note:** Evaluator explanations are included in the data sent to Orq but are omitted from the terminal output to keep the console clean.

#### Result Visualization

After successful submission, you'll see a console message with a link to view your results:

```
📊 View your evaluation results at: <url to the evaluation>
```

The Orq platform provides:
- Interactive result tables
- Score statistics
- Performance metrics
- Historical comparisons

## 🔍 OpenTelemetry Tracing

Evaluatorq automatically creates OpenTelemetry spans for observability into your evaluation runs.

### Span Hierarchy

```
orq.job (independent root per job execution)
└── orq.evaluation (child span per evaluator)
```

### Auto-Enable with Orq

When `ORQ_API_KEY` is set, traces are automatically sent to the Orq platform:

```bash
ORQ_API_KEY=your-api-key python my_eval.py
```

### Custom OTEL Endpoint

Send traces to any OpenTelemetry-compatible backend:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318 \
OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token" \
python my_eval.py
```

### Disable Tracing

If you want to disable tracing even when `ORQ_API_KEY` is set:

```bash
ORQ_DISABLE_TRACING=1 python my_eval.py
```

## ✅ Pass/Fail Tracking

Evaluators can return a `pass_` field to indicate pass/fail status:

```python
async def quality_scorer(params):
    """Quality check evaluator with pass/fail."""
    output = params["output"]
    score = calculate_quality(output)

    return {
        "value": score,
        "pass_": score >= 0.8,  # Pass if meets threshold
        "explanation": f"Quality score: {score}",
    }
```

**CI/CD Integration:** When any evaluator returns `pass_: False`, the process exits with code 1. This enables fail-fast behavior in CI/CD pipelines.

**Pass Rate Display:** The summary table shows pass rate when evaluators use the `pass_` field:

```
┌──────────────────────┬─────────────────┐
│ Pass Rate            │ 75% (3/4)       │
└──────────────────────┴─────────────────┘
```
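
The arithmetic behind the pass rate and the exit code can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: `pass_rate` and `exit_code` are hypothetical helpers, and excluding results that omit `pass_` from the rate is an assumption based on the description above:

```python
def pass_rate(results: list) -> str:
    """Format the share of passing results among those that set `pass_`."""
    judged = [r["pass_"] for r in results if r.get("pass_") is not None]
    if not judged:
        return "n/a"
    passed = sum(1 for ok in judged if ok)
    return f"{round(100 * passed / len(judged))}% ({passed}/{len(judged)})"


def exit_code(results: list) -> int:
    """Return 1 if any result explicitly failed, else 0."""
    return 1 if any(r.get("pass_") is False for r in results) else 0


scores = [
    {"value": 0.9, "pass_": True},
    {"value": 0.85, "pass_": True},
    {"value": 0.4, "pass_": False},
    {"value": 0.95, "pass_": True},
    {"value": 0.7},  # no pass_ field: assumed not to count toward the rate
]
print(pass_rate(scores))  # 75% (3/4)
print(exit_code(scores))  # 1
```

In a pipeline, a non-zero exit code from the evaluation script is what fails the CI step.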

## 🔗 LangChain Integration

Evaluatorq provides integration with LangChain and LangGraph agents, converting their outputs to the OpenResponses format for standardized evaluation.

### Overview

The LangChain integration allows you to:
- Wrap LangChain agents created with `create_agent()` for use in evaluatorq jobs
- Wrap LangGraph compiled graphs for stateful agent evaluation
- Automatically convert agent outputs to OpenResponses format
- Evaluate agent behavior using standard evaluatorq evaluators

### System Instructions

Use the `instructions` parameter to inject a system prompt into the agent. It can be a static string or a callable that builds instructions dynamically from the dataset row:

```python
from evaluatorq.integrations.langchain_integration import wrap_langchain_agent

# Static instructions
agent_job = wrap_langchain_agent(
    agent,
    name="my-agent",
    instructions="You are a helpful weather assistant.",
)

# Dynamic instructions from dataset inputs
agent_job = wrap_langchain_agent(
    agent,
    name="research-agent",
    instructions=lambda data: (
        f"Research the topic: {data.inputs['topic']}. "
        f"Focus on {data.inputs['focus']}."
    ),
)
```

### Input Modes

The wrapper reads the user input from `data.inputs` in three ways:

- **`prompt`** (default): `data.inputs["prompt"]` — a single string, sent as one user message.
- **`messages`**: `data.inputs["messages"]` — a list of `{"role": ..., "content": ...}` dicts, sent as-is.
- **Both**: when both are present, `messages` are sent first, followed by `prompt` as a final user message.

Change the prompt key with the `prompt_key` parameter (e.g., `prompt_key="question"`).
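
The ordering rules above can be sketched as a small function. `build_messages` is a hypothetical helper that mirrors the documented behaviour; it is not the wrapper's actual code:

```python
from typing import Any


def build_messages(inputs: dict, prompt_key: str = "prompt") -> list:
    """Assemble the message list: existing messages first, then the prompt."""
    # Copy the list so the dataset row itself is never mutated.
    messages = list(inputs.get("messages") or [])
    prompt: Any = inputs.get(prompt_key)
    if prompt is not None:
        # The prompt always becomes the final user message.
        messages.append({"role": "user", "content": prompt})
    return messages


history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
print(build_messages({"prompt": "What's the weather in Paris?"}))
print(build_messages({"messages": history, "prompt": "And tomorrow?"}))
print(build_messages({"question": "Is it raining?"}, prompt_key="question"))
```

Supplying both keys is a convenient way to replay a conversation history from a dataset and then ask one new question.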

### Examples

Complete examples are available in the examples folder:

- **LangChain Agent**: [`langchain_integration_example.py`](examples/lib/integrations/langchain/langchain_integration_example.py) — Basic agent with weather tools using `wrap_langchain_agent`
- **LangGraph Agent**: [`langgraph_integration_example.py`](examples/lib/integrations/langchain/langgraph_integration_example.py) — LangGraph compiled graph with StateGraph pattern
- **LangGraph Research Agent (advanced)**: [`langgraph_research_eval.py`](examples/lib/integrations/langchain/langgraph_research_eval.py) — Dataset-driven research agent with dynamic `instructions` and multi-criteria evaluators

> **Tip:** Pass the `instructions` parameter to `wrap_langchain_agent` for dynamic system prompts — no need to write a custom job function.

## 🔴 Red Teaming

Run adversarial attacks against an LLM or agent and measure how well it resists. Attacks are generated dynamically by an attacker LLM and mapped to OWASP vulnerability categories (LLM Top 10 and Agentic Security Initiative).

```bash
pip install evaluatorq[redteam]
```

Point `red_team()` at an orq.ai agent — `"agent:<key>"` auto-selects the ORQ backend and discovers the agent's tools, memory, and system prompt:

```python
from evaluatorq.redteam import red_team

report = await red_team(
    "agent:my-agent-key",
    categories=["LLM01", "ASI01", "ASI02"],  # injection + agentic tool/memory abuse
    max_dynamic_datapoints=5,
    max_turns=3,
)
print(f"Resistance rate: {report.summary.resistance_rate:.0%}")
print(f"Vulnerabilities found: {report.summary.vulnerabilities_found}")
```

No deployment? Red-team a raw model with `OpenAIModelTarget("openai/gpt-5.4-mini", system_prompt=...)` instead of the `"agent:<key>"` string. (Model IDs route through the ORQ router — use the `openai/` prefix; drop it if you target OpenAI directly.)

**Target types:** `"agent:<key>"` (ORQ agent), `"deployment:<key>"` (ORQ deployment), `OpenAIModelTarget(...)` (raw model, Python API only). Agents from **external frameworks** (LangGraph, OpenAI Agents SDK, custom callables) are wrapped into a target too.

**Learn more:**

- 📓 [Red teaming intro notebook](examples/red_teaming_intro.ipynb) — runnable 5-minute SDK walkthrough
- 📘 [Concepts, modes, full parameters, external frameworks, CLI](src/evaluatorq/redteam/README.md)
- 🧪 [Runnable example scripts](examples/redteam/) — static datasets, hybrid mode, multi-target, custom hooks

> **Note:** The built-in frameworks (OWASP LLM Top 10, OWASP ASI) and their vulnerabilities, evaluators, and attack strategies are not runtime-extendable yet. Adding custom vulnerabilities currently requires modifying the package source. A runtime registration API is planned for a future release.

## 🎭 Agent Simulation

Stress-test an agent against simulated users before real ones reach it. A **user-simulator LLM** plays a persona pursuing a goal across a multi-turn conversation, and a **judge LLM** scores each run against your criteria. It is the non-adversarial counterpart to red teaming.

```bash
pip install evaluatorq[simulation]
```

Define who the user is (`Persona`) and what they want (`Scenario`), then simulate — against a hosted orq.ai agent (`target="agent:<key>"`) or a local callable:

```python
from evaluatorq.simulation import simulate
from evaluatorq.simulation.types import CommunicationStyle, Criterion, Persona, Scenario

results = await simulate(
    evaluation_name="support-agent-sim",
    target="agent:my-support-agent",  # hosted Orq agent; or target_callback=<async fn> for a local agent
    personas=[Persona(
        name="Impatient Customer",
        patience=0.2,
        assertiveness=0.8,
        politeness=0.4,
        technical_level=0.3,
        communication_style=CommunicationStyle.terse,
        background="Received the wrong item and wants a refund urgently",
    )],
    scenarios=[Scenario(
        name="Wrong Item Refund",
        goal="Get a full refund for the wrong item received",
        criteria=[Criterion(description="Agent asks for order details", type="must_happen")],
    )],
    max_turns=8,
)
result = results[0]
print(f"Goal achieved: {result.goal_achieved} (score {result.goal_completion_score:.2f})")
```

No personas yet? `generate_and_simulate(agent_description=...)` invents personas and scenarios for you. Runs exit non-zero on failure by default (`exit_on_failure=True`) — drop straight into CI.

**Learn more:**

- 📓 [Agent simulation intro notebook](examples/agent_simulation_intro.ipynb) — runnable 5-minute SDK walkthrough
- 📘 [Concepts, entry points, datasets, CLI](src/evaluatorq/simulation/README.md)
- 🧪 [Runnable example scripts](examples/agent_simulation/) — orq agent & deployment targets, tool-using agents, hardening loop, CI gating

## 📚 API Reference

### `evaluatorq(name, params?, *, data?, jobs?, evaluators?, parallelism?, print_results?, description?) -> EvaluatorqResult`

Main async function to run evaluations.

#### Signature:

```python
async def evaluatorq(
    name: str,
    params: EvaluatorParams | dict[str, Any] | None = None,
    *,
    data: DatasetIdInput | Sequence[Awaitable[DataPoint] | DataPoint] | None = None,
    jobs: list[Job] | None = None,
    evaluators: list[Evaluator] | None = None,
    parallelism: int = 1,
    print_results: bool = True,
    description: str | None = None,
) -> EvaluatorqResult
```

#### Parameters:

- `name`: String identifier for the evaluation run
- `params`: (Optional) `EvaluatorParams` instance or dict with evaluation parameters
- `data`: Sequence of `DataPoint` objects (or awaitables that resolve to `DataPoint`), or a `DatasetIdInput` to fetch a dataset from the Orq platform
- `jobs`: List of job functions to run on each data point
- `evaluators`: Optional list of evaluator configurations
- `parallelism`: Number of concurrent jobs (default: 1, must be ≥1)
- `print_results`: Whether to display progress and results (default: True)
- `description`: Optional description for the evaluation run

> **Note:** Parameters can be passed either via the `params` argument (as dict or `EvaluatorParams`) or as keyword arguments. Keyword arguments take precedence over `params` values.

#### Returns:

`EvaluatorqResult` - List of `DataPointResult` objects containing job outputs and evaluator scores.
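
As a rough sketch of how the returned list might be walked, the helper below counts pass/fail verdicts using the field names listed under Types (`job_results`, `evaluator_scores`, `score.pass_`, `error`). `count_passes` is a hypothetical name, and the sample data is built from `SimpleNamespace` stand-ins rather than real `DataPointResult` objects so the snippet runs without the package installed.

```python
from types import SimpleNamespace


def count_passes(results) -> tuple[int, int]:
    """Return (passed, failed) over every evaluator score that sets pass_."""
    passed = failed = 0
    for point in results:
        for job in point.job_results or []:  # job_results may be None
            for item in job.evaluator_scores or []:  # evaluator_scores may be None
                if item.error is not None or item.score.pass_ is None:
                    continue  # skip errored scores and scores without a verdict
                if item.score.pass_:
                    passed += 1
                else:
                    failed += 1
    return passed, failed


def _score(name: str, value: float, pass_: bool) -> SimpleNamespace:
    """Build a stand-in shaped like EvaluatorScore (illustration only)."""
    result = SimpleNamespace(value=value, explanation=None, pass_=pass_)
    return SimpleNamespace(evaluator_name=name, score=result, error=None)


sample = [
    SimpleNamespace(data_point=None, error=None, job_results=[
        SimpleNamespace(job_name="job-a", output="hi", error=None, evaluator_scores=[
            _score("exact-match", 1.0, True),
            _score("string-contains", 0.0, False),
        ]),
    ]),
    # A data point that failed before any job ran
    SimpleNamespace(data_point=None, error="row failed", job_results=None),
]
print(count_passes(sample))  # (1, 1)
```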

### Types

```python
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Sequence
from pydantic import BaseModel, Field
from typing_extensions import TypedDict

# Output type alias
Output = str | int | float | bool | dict[str, Any] | None

class DataPoint(BaseModel):
    """A data point for evaluation."""
    inputs: dict[str, Any]
    expected_output: Output | None = None

EvaluationResultCellValue = str | float | dict[str, "str | float | dict[str, str | float]"]

class EvaluationResultCell(BaseModel):
    """Structured evaluation result with multi-dimensional metrics."""
    type: str
    value: dict[str, EvaluationResultCellValue]

class EvaluationResult(BaseModel):
    """Result from an evaluator."""
    value: str | float | bool | EvaluationResultCell
    explanation: str | None = None
    pass_: bool | None = None  # Optional pass/fail indicator for CI/CD integration

class EvaluatorScore(BaseModel):
    """Score from an evaluator for a job output."""
    evaluator_name: str
    score: EvaluationResult
    error: str | None = None

class JobResult(BaseModel):
    """Result from a job execution."""
    job_name: str
    output: Output
    error: str | None = None
    evaluator_scores: list[EvaluatorScore] | None = None

class DataPointResult(BaseModel):
    """Result for a single data point."""
    data_point: DataPoint
    error: str | None = None
    job_results: list[JobResult] | None = None

# Type aliases
EvaluatorqResult = list[DataPointResult]

class DatasetIdInput(BaseModel):
    """Input for fetching a dataset from Orq platform."""
    dataset_id: str

class EvaluatorParams(BaseModel):
    """Parameters for running an evaluation (validated at runtime)."""
    data: DatasetIdInput | Sequence[Awaitable[DataPoint] | DataPoint]
    jobs: list[Job]
    evaluators: list[Evaluator] | None = None
    parallelism: int = Field(default=1, ge=1)
    print_results: bool = True
    description: str | None = None

class JobReturn(TypedDict):
    """Job return structure."""
    name: str
    output: Output

Job = Callable[[DataPoint, int], Awaitable[JobReturn]]

class ScorerParameter(TypedDict):
    """Parameters passed to scorer functions."""
    data: DataPoint
    output: Output

Scorer = Callable[[ScorerParameter], Awaitable[EvaluationResult | dict[str, Any]]]

class Evaluator(TypedDict):
    """Evaluator configuration."""
    name: str
    scorer: Scorer

# Deployment helper types
@dataclass
class DeploymentResponse:
    """Response from a deployment invocation."""
    content: str  # Text content of the response
    raw: Any      # Raw API response

# Invoke deployment and get text content
async def invoke(
    key: str,
    inputs: dict[str, Any] | None = None,
    context: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
    thread: dict[str, Any] | None = None,  # Must include 'id' key
    messages: list[dict[str, str]] | None = None,
) -> str: ...

# Invoke deployment and get full response
async def deployment(
    key: str,
    inputs: dict[str, Any] | None = None,
    context: dict[str, Any] | None = None,
    metadata: dict[str, Any] | None = None,
    thread: dict[str, Any] | None = None,  # Must include 'id' key
    messages: list[dict[str, str]] | None = None,
) -> DeploymentResponse: ...

# Built-in evaluators
def string_contains_evaluator(
    case_insensitive: bool = True,
    name: str = "string-contains",
) -> Evaluator: ...

def exact_match_evaluator(
    case_insensitive: bool = False,
    name: str = "exact-match",
) -> Evaluator: ...
```
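
To make the `Job`, `Scorer`, and `Evaluator` shapes above more concrete, here is a minimal sketch of functions that match them. Several things are assumptions: `DataPoint` is a stand-in dataclass so the snippet runs without the package installed, the second job argument is taken to be a row index (the type only says `int`), and the scorer returns a plain dict whose keys mirror `EvaluationResult`. The names `shout_job` and `matches_expected` are invented for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class DataPoint:
    """Stand-in for evaluatorq's DataPoint so the sketch is self-contained."""
    inputs: dict[str, Any]
    expected_output: Any = None


async def shout_job(data: DataPoint, row: int) -> dict[str, Any]:
    """Job shape: (DataPoint, int) -> {"name": ..., "output": ...}."""
    return {"name": "shout", "output": str(data.inputs["text"]).upper()}


async def matches_expected(params: dict[str, Any]) -> dict[str, Any]:
    """Scorer shape: receives {"data", "output"}, returns a result-like dict."""
    ok = params["output"] == params["data"].expected_output
    explanation = "output equals expected_output" if ok else "output differs"
    return {"value": ok, "explanation": explanation}


# Evaluator shape: a name plus the scorer callable
evaluator = {"name": "matches-expected", "scorer": matches_expected}


async def demo() -> dict[str, Any]:
    """Run the job on one data point, then score its output by hand."""
    point = DataPoint(inputs={"text": "hello"}, expected_output="HELLO")
    job_return = await shout_job(point, 0)
    return await evaluator["scorer"]({"data": point, "output": job_return["output"]})


verdict = asyncio.run(demo())
print(verdict["value"])  # True
```

In real use the job and evaluator would be passed to `evaluatorq(...)` via `jobs=` and `evaluators=` rather than called by hand; calling them directly here only shows the data each function receives and returns.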

## 🛠️ Development

```bash
# Install dependencies
uv sync

# Run type checking
uv run basedpyright

# Format code
uv run ruff format

# Lint code
uv run ruff check

# Serve the documentation site locally (live-reload at http://127.0.0.1:8000/evaluatorq/)
uv run --group docs mkdocs serve

# Build the documentation site (strict — fails on warnings, as CI does)
uv run --group docs mkdocs build --strict
```
