Metadata-Version: 2.4
Name: cat-experiments
Version: 0.0.1
Summary: Standalone evaluation engine for LLM applications
Author-email: CAT Cafe Team <team@cat-cafe.dev>
License: Apache-2.0
Keywords: ai,evaluation,llm,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.12
Requires-Dist: typing-extensions>=4.5.0
Provides-Extra: cat-cafe
Requires-Dist: cat-cafe-sdk; extra == 'cat-cafe'
Provides-Extra: dev
Requires-Dist: arize-phoenix-client>=1.22.0; extra == 'dev'
Requires-Dist: black; extra == 'dev'
Requires-Dist: cat-cafe-sdk; extra == 'dev'
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: opentelemetry-api>=1.36.0; extra == 'dev'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.36.0; extra == 'dev'
Requires-Dist: opentelemetry-sdk>=1.36.0; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: phoenix
Requires-Dist: arize-phoenix-client>=1.22.0; extra == 'phoenix'
Provides-Extra: tracing
Requires-Dist: opentelemetry-api>=1.36.0; extra == 'tracing'
Requires-Dist: opentelemetry-exporter-otlp-proto-http>=1.36.0; extra == 'tracing'
Requires-Dist: opentelemetry-sdk>=1.36.0; extra == 'tracing'
Description-Content-Type: text/markdown

# cat-experiments

Standalone evaluation engine for LLM applications.

![Cat Experiments](cat_experiments_small.png)

A flexible, DataFrame-compatible evaluation system that works standalone or integrates with cat-cafe server infrastructure.

## Features

- **Flexible Data Models**: Support any dataset structure with dictionary-based input/output
- **Deterministic Preview Runs**: Limit execution to an exact number of examples with `preview_examples` and `preview_seed`
- **Explicit Repetitions**: Run each example multiple times and track repetition metadata end-to-end
- **Comprehensive Evaluators**: Built-in evaluators for tool call correctness and more
- **Modern Python**: Targets Python 3.12+ with modern typing features
- **Async Support**: Full async/await support for evaluation pipelines
- **Tool Call Evaluation**: Advanced matching algorithms for tool call correctness

## Quick Start

```python
from cat.experiments import (
    DatasetExample,
    TestCase,
    ExperimentConfig,
    ExperimentRunner,
    basic_tool_correctness_evaluator,
)

# Describe your dataset
dataset = [
    DatasetExample(
        input={"messages": [{"role": "user", "content": "Hello"}]},
        output={"messages": [{"role": "assistant", "content": "Hi there!"}]},
    )
]

# Define the system under test
def my_llm_function(example: DatasetExample) -> str:
    return "Hi there!"

# Execute a small preview with two repetitions per example
runner = ExperimentRunner()
summary = runner.run(
    dataset=dataset,
    task=my_llm_function,
    evaluators=[basic_tool_correctness_evaluator],
    config=ExperimentConfig(
        name="Smoke Test",
        preview_examples=1,
        repetitions=2,
    ),
)

print(summary.total_examples)  # => 2 example runs (1 example × 2 repetitions)
```

If you prefer the lower-level APIs, `generate()` now accepts `TestCase` objects so you can decide exactly which example/repetition pairs to process:

```python
from cat.experiments import TestCase, basic_tool_correctness_evaluator, evaluate, generate

# Reuses `dataset` and `my_llm_function` from the Quick Start snippet above.
runs = [TestCase(example=dataset[0], repetition_number=1)]
contexts = generate(runs, my_llm_function)
results = evaluate(contexts, [basic_tool_correctness_evaluator])
```

## Phoenix Integration

To mirror the Phoenix “Run Experiments” tutorial while remaining offline-friendly, plug the `PhoenixExperimentListener` into the cat-experiments runner. Because Phoenix support depends on the optional `arize-phoenix-client` package (installed by the `phoenix` extra), import the adapter explicitly:

```python
from cat.experiments.adapters.phoenix import PhoenixExperimentListener, PhoenixSyncConfig
```

```bash
# Configure phoenix-client per its docs (set env vars, config files, etc.)
export CAT_EVALS_DATASET=support-ticket-demo
python packages/cat-experiments/examples/phoenix_experiment_example.py
```

The script in `packages/cat-experiments/examples/phoenix_experiment_example.py` shows how to:

- Fetch a dataset with `phoenix-client`
- Convert it to `DatasetExample` objects
- Run a `cat.experiments` experiment (task + evaluator)
- Stream runs/evaluations back to Phoenix using the `PhoenixExperimentListener`

If the named dataset does not exist, the script will automatically create a sample
support-ticket dataset so you can get started immediately.

### CAT Cafe Integration

CAT Cafe users can mirror the server-side experiment records directly from cat-experiments by
attaching `CatCafeExperimentListener`. A minimal setup:

```python
from cat_cafe.sdk.client import CATCafeClient
from cat.experiments.adapters import CatCafeExperimentListener, CatCafeSyncConfig
from cat.experiments import ExperimentRunner, ExperimentConfig

client = CATCafeClient(base_url="http://localhost:8000")
listener = CatCafeExperimentListener(client, config=CatCafeSyncConfig(submission_mode="on_completion"))

runner = ExperimentRunner()
runner.listeners.append(listener)
runner.run(dataset=examples, task=my_task, evaluators=[my_evaluator],
           config=ExperimentConfig(name="My CAT experiment", dataset_id="dataset-123"))
```

Each completed example is transformed into CAT Cafe's experiment result schema, and the
listener automatically calls `start_experiment`, `submit_results`, and `complete_experiment`
so the run appears in the CAT Cafe UI.
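
The call order described above can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `RecordingClient` is a stand-in that only records calls, the method signatures are guesses rather than the real `CATCafeClient` API, and the `on_start` / `on_result` / `on_complete` hooks are invented names, not the listener interface cat-experiments actually defines. It only shows what buffering results until completion (as in `submission_mode="on_completion"`) might look like.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RecordingClient:
    """Stand-in for a server client; it only records the calls it receives."""

    calls: list[tuple[str, Any]] = field(default_factory=list)

    def start_experiment(self, name: str) -> str:
        self.calls.append(("start_experiment", name))
        return "exp-local-1"  # made-up identifier

    def submit_results(self, experiment_id: str, results: list[dict]) -> None:
        self.calls.append(("submit_results", len(results)))

    def complete_experiment(self, experiment_id: str) -> None:
        self.calls.append(("complete_experiment", experiment_id))


class BufferingListener:
    """Buffers results and submits them once, when the experiment completes."""

    def __init__(self, client: RecordingClient) -> None:
        self.client = client
        self.experiment_id: str | None = None
        self.buffer: list[dict] = []

    def on_start(self, name: str) -> None:  # hypothetical hook name
        self.experiment_id = self.client.start_experiment(name)

    def on_result(self, result: dict) -> None:  # hypothetical hook name
        self.buffer.append(result)

    def on_complete(self) -> None:  # hypothetical hook name
        self.client.submit_results(self.experiment_id, self.buffer)
        self.client.complete_experiment(self.experiment_id)


client = RecordingClient()
listener = BufferingListener(client)
listener.on_start("My CAT experiment")
listener.on_result({"example_id": "ex-1", "score": 1.0})
listener.on_result({"example_id": "ex-2", "score": 0.0})
listener.on_complete()
print([name for name, _ in client.calls])
# ['start_experiment', 'submit_results', 'complete_experiment']
```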

To see a full working example that seeds a dataset and streams a run to CAT Cafe, run:

```bash
export CAT_BASE_URL=http://localhost:8000
export CAT_DATASET=cat-experiments-support-demo
uv run packages/cat-experiments/examples/cat_cafe_experiment_example.py
```

The script follows the same offline-friendly pattern as the Phoenix example, automatically
creating a sample dataset if the name is not found.

### Runner Builders

If you prefer not to wire listeners manually, use the builder helpers:

```python
from cat.experiments import (
    build_local_runner,
    build_phoenix_runner,
    build_cat_cafe_runner,
)

local_runner = build_local_runner()
cat_runner = build_cat_cafe_runner()
phoenix_runner = build_phoenix_runner()
```

Each factory returns an `ExperimentRunner` with the matching adapter configured plus the local
storage adapter, so you can immediately call `runner.run(...)` without additional plumbing.

### Resume Cached Experiments

When runs are cached locally, you can resume unfinished repetitions without touching Phoenix or
CAT Cafe:

```python
from cat.experiments.adapters import LocalCacheResumeCoordinator

coordinator = LocalCacheResumeCoordinator()
plan = coordinator.build_task_resume_plan("exp_123")

if plan.has_work:
    coordinator.resume_task_runs(
        experiment_id="exp_123",
        task=test_function,
        evaluators=[my_evaluator],
    )
```

The local storage adapter captures `config.json`, `examples.jsonl`, and `runs.jsonl` per experiment so the
resume coordinator can replay only the pending `(example, repetition)` pairs.
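
As a rough illustration of what "replaying only the pending pairs" involves, the sketch below computes the missing `(example, repetition)` pairs from two JSONL payloads. The record fields (`example_id`, `repetition_number`) and the helper names are assumptions; the files cat-experiments writes may use a different schema, and this is not the coordinator's actual implementation.

```python
import json

# Hypothetical record shapes; the real files may use different field names.
EXAMPLES_JSONL = """\
{"example_id": "ex-1"}
{"example_id": "ex-2"}
"""
RUNS_JSONL = """\
{"example_id": "ex-1", "repetition_number": 1}
{"example_id": "ex-1", "repetition_number": 2}
{"example_id": "ex-2", "repetition_number": 1}
"""


def read_jsonl(text: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def pending_pairs(examples: list[dict], runs: list[dict], repetitions: int) -> list[tuple[str, int]]:
    """Return the (example_id, repetition_number) pairs with no recorded run."""
    done = {(run["example_id"], run["repetition_number"]) for run in runs}
    return [
        (ex["example_id"], rep)
        for ex in examples
        for rep in range(1, repetitions + 1)
        if (ex["example_id"], rep) not in done
    ]


pending = pending_pairs(read_jsonl(EXAMPLES_JSONL), read_jsonl(RUNS_JSONL), repetitions=2)
print(pending)  # [('ex-2', 2)]
```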

For an end-to-end walkthrough that stays entirely on disk, run the local storage example:

```bash
uv run packages/cat-experiments/examples/local_storage_evaluator_example.py
```

It writes runs via `LocalStorageExperimentListener`, then uses `LocalEvaluationCoordinator` plus
`ExperimentRunner.rerun_evaluators()` to append a new evaluator without re-running the task phase.

### Re-run Evaluators Later

To mirror Phoenix's "persist first, evaluate later" flow, the local cache, CAT Cafe, and Phoenix adapters
expose evaluation coordinators that rehydrate recorded runs before executing new evaluators.

```python
from cat.experiments import ExperimentRunner
from cat.experiments.adapters import (
    LocalEvaluationCoordinator,
    CatCafeEvaluationCoordinator,
    PhoenixEvaluationCoordinator,
)
from cat_cafe.sdk.client import CATCafeClient
from phoenix.client import Client as PhoenixClient

local_eval = LocalEvaluationCoordinator()
local_eval.run_evaluators(
    experiment_id="exp_123",
    evaluators=[accuracy_evaluator, safety_check],
)

cat_eval = CatCafeEvaluationCoordinator(CATCafeClient())
cat_eval.run_evaluators(
    experiment_id="exp_456",
    evaluators=[hallucination_score],
)

runner = ExperimentRunner()
runner.rerun_evaluators(
    experiment_id="exp_123",
    evaluators=[latency_grade],
    backend=local_eval,
)

phoenix_eval = PhoenixEvaluationCoordinator(PhoenixClient())
phoenix_eval.run_evaluators(
    experiment_id="exp_789",
    evaluators=[cost_score],
)
```

`LocalEvaluationCoordinator` updates the cached `runs.jsonl` with the new metrics, while
`CatCafeEvaluationCoordinator` automatically resubmits the enriched results to CAT Cafe so the UI can
display the added evaluators without rerunning any tasks. `ExperimentRunner.rerun_evaluators` centralizes
the evaluate-only flow so you can plug in any backend that knows how to fetch/persist runs.
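
The evaluate-only flow can be pictured as "fetch recorded runs, apply the new evaluators, persist the enriched runs". The sketch below shows that shape with an in-memory backend. The `EvaluationBackend` protocol, its `fetch_runs` / `persist_runs` methods, and the `evaluate_only` helper are illustrative assumptions, not the interface cat-experiments defines for its coordinators.

```python
from typing import Callable, Protocol

Run = dict  # a recorded run: task output plus previously computed metrics
Evaluator = Callable[[Run], float]


class EvaluationBackend(Protocol):
    """Hypothetical contract: anything that can fetch and persist runs."""

    def fetch_runs(self, experiment_id: str) -> list[Run]: ...
    def persist_runs(self, experiment_id: str, runs: list[Run]) -> None: ...


class InMemoryBackend:
    def __init__(self, store: dict[str, list[Run]]) -> None:
        self.store = store

    def fetch_runs(self, experiment_id: str) -> list[Run]:
        # Copy each run (and its metrics) so the stored records are not mutated in place.
        return [dict(run, metrics=dict(run.get("metrics", {}))) for run in self.store[experiment_id]]

    def persist_runs(self, experiment_id: str, runs: list[Run]) -> None:
        self.store[experiment_id] = runs


def evaluate_only(backend: EvaluationBackend, experiment_id: str, evaluators: dict[str, Evaluator]) -> list[Run]:
    """Add new metrics to recorded runs without re-running the task."""
    runs = backend.fetch_runs(experiment_id)
    for run in runs:
        for name, evaluator in evaluators.items():
            run["metrics"][name] = evaluator(run)
    backend.persist_runs(experiment_id, runs)
    return runs


backend = InMemoryBackend({"exp_123": [{"output": "Hi there!", "metrics": {"accuracy": 1.0}}]})
updated = evaluate_only(backend, "exp_123", {"length": lambda run: float(len(run["output"]))})
print(updated[0]["metrics"])  # {'accuracy': 1.0, 'length': 9.0}
```

Keeping the backend behind a two-method contract is what lets the same evaluate-only loop work against local files or a remote service.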

## Core Components

- `DatasetExample` – Flexible dataset storage
- `TestCase` – Execution plan objects that pair an example with a `repetition_number` before running
- `EvaluationContext` – Rich evaluation context with tool call support
- `EvaluationMetric` – Structured evaluation results
- `generate()` / `evaluate()` – Core evaluation pipeline functions
- `ExperimentRunner` / `AsyncExperimentRunner` – High-level orchestration with preview + repetition controls
- Built-in evaluators for common evaluation tasks

## Architecture

This package is designed to be standalone and framework-agnostic, focusing purely on evaluation logic without server dependencies.

## Tracing & Instrumentation

cat-experiments ships OpenTelemetry helpers such as `capture_agent_trace()` and
`ExperimentTraceCapture` (install them with `pip install "cat-experiments[tracing]"`), but they
**do not** activate OpenInference instrumentors for you. Configure any instrumentation you need
(for example `openinference.instrumentation.openai.OpenAIInstrumentor().instrument()`) before
entering the capture context:

```python
from openinference.instrumentation.openai import OpenAIInstrumentor
from cat.experiments.tracing import capture_experiment_trace

OpenAIInstrumentor().instrument()

with capture_experiment_trace(example_id="ex-1", experiment_id="exp-123") as (root_span, capture):
    ...
```

This keeps cat-experiments lightweight while ensuring clients stay in control of which SDKs are
instrumented.

### Enabling the OTEL run observer

Tracing is now wired through a generic observer plugin system. After installing the
`tracing` extra, importing `cat.experiments.tracing` automatically registers the OTEL
observer so tool calls and trace identifiers are captured for each run.

You can build your own observers by implementing `cat.experiments.observers.RunObserver`
and calling `register_observer()`.
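
For readers unfamiliar with the observer pattern, the following stand-alone sketch shows the general idea. It deliberately does not import cat-experiments: `ObserverSketch`, `register_observer_sketch`, the `on_run_start` / `on_run_end` hooks, and `run_with_observers` are all hypothetical names, and the real `RunObserver` interface may define different methods. Consult the package source for the actual contract.

```python
import time
from typing import Callable, Protocol


class ObserverSketch(Protocol):
    """Hypothetical hooks; the real RunObserver may differ."""

    def on_run_start(self, run_id: str) -> None: ...
    def on_run_end(self, run_id: str, output: str) -> None: ...


_observers: list[ObserverSketch] = []


def register_observer_sketch(observer: ObserverSketch) -> None:
    """Add an observer to the module-level registry."""
    _observers.append(observer)


class TimingObserver:
    """Records how long each run takes."""

    def __init__(self) -> None:
        self.started: dict[str, float] = {}
        self.durations: dict[str, float] = {}

    def on_run_start(self, run_id: str) -> None:
        self.started[run_id] = time.perf_counter()

    def on_run_end(self, run_id: str, output: str) -> None:
        self.durations[run_id] = time.perf_counter() - self.started.pop(run_id)


def run_with_observers(run_id: str, task: Callable[[], str]) -> str:
    """Notify every registered observer before and after the task executes."""
    for observer in _observers:
        observer.on_run_start(run_id)
    output = task()
    for observer in _observers:
        observer.on_run_end(run_id, output)
    return output


timing = TimingObserver()
register_observer_sketch(timing)
result = run_with_observers("run-1", lambda: "Hi there!")
print(result, "run-1" in timing.durations)
```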
