Metadata-Version: 2.4
Name: prompt-tester
Version: 0.2.0
Summary: Decorator-based integration testing for LLM prompts
License: MIT
Project-URL: Repository, https://github.com/mkho292/prompt-tester
Project-URL: Issues, https://github.com/mkho292/prompt-tester/issues
Keywords: llm,testing,prompts,ai,anthropic,gemini
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == "anthropic"
Provides-Extra: google
Requires-Dist: google-genai>=1.0.0; extra == "google"
Provides-Extra: all
Requires-Dist: anthropic>=0.40.0; extra == "all"
Requires-Dist: google-genai>=1.0.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: anthropic>=0.40.0; extra == "dev"
Requires-Dist: google-genai>=1.0.0; extra == "dev"
Dynamic: license-file

# prompt-tester

A Python library for testing LLM prompts with statistical reliability using the `@prompt_run` decorator.

## When to use this

prompt-tester is designed for **integration tests** — tests that run against a real model with real API calls. It is not a mocking or unit-testing framework for LLM code.

The sweet spot is testing prompts that are part of a larger system:

- **Custom tools and function calling** — verify the model actually calls the right tool with the right arguments under realistic conditions
- **MCP-connected agents** — test prompts that drive an agentic loop against your live MCP server, not a stub
- **RAG pipelines** — assert that retrieved context is used correctly in the final response
- **Multi-step workflows** — validate that each prompt stage produces output fit for the next stage

---

## Installation

```bash
pip install prompt-tester
```

---

## Setup

### API keys

```bash
ANTHROPIC_API_KEY=<your key>   # required for Anthropic models
GOOGLE_API_KEY=<your key>      # required for Gemini models
```

Keys are loaded automatically via `python-dotenv` from a `.env` file in your project root, or from environment variables directly.

### Judge model

Call `configure()` once before your tests run — at the top of a test module, or wherever fits your project's test setup. Both parameters are required:

```python
import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)
```

If `configure()` has not been called, `@prompt_run` will raise a `ConfigurationError` with instructions when the test runs.

| Function | Description |
|---|---|
| `configure(judge_model, judge_provider)` | Set the judge model — both params required |
| `reset()` | Clear configuration |

---

## Why multiple runs matter

LLM outputs are non-deterministic. A single test run is a point-in-time sample, not a reliable signal — the model may have produced the right answer by chance, or failed due to noise. Running a prompt N times and requiring a minimum pass rate gives you a statistically meaningful signal.

Use the `runs` and `pass_threshold` parameters on `@prompt_run` — each run executes in its own thread, and the decorator asserts the pass rate when all runs are done:

```python
from pathlib import Path
import prompt_tester
from prompt_tester import prompt_run, Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,   # at least 4 of 5 runs must pass
)
def test_compactor(run):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    alice, budget, deadline = run.ask_all([
        "Is Alice mentioned in the output?",
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert alice.passed,    alice.reasoning
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning
```

- Each run gets its own thread — all 5 fire concurrently, so wall-clock time is roughly one run's worth.
- Failed runs are printed with their run number before the final assertion so you can see exactly which ones failed and why.
- `pass_threshold` uses `math.ceil` internally — with 5 runs and 0.8, you need at least 4 passes.
- Tuning: raise `pass_threshold` toward `1.0` for hard requirements; lower it for prompts with known variance. Start at `0.8` and tighten once you have a baseline.

### pytest.mark.parametrize alternative

If you want pytest to report each run as a **separate test item** in the output, use `pytest.mark.parametrize` with `runs=1` (the default):

```python
@pytest.mark.parametrize("_", range(5))
@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
)
def test_compactor_always_concise(run, _):
    assert len(run.output) < len(run.template_vars["text"]) * 0.6
```

This gives you individual pass/fail per run in pytest's output but no pass-rate control — one failure fails the whole parametrized group. Use it when each run has different inputs, or when you want strict all-or-nothing behaviour.

---

## `@prompt_run`

Runs a prompt and injects the result into your test function as a `PromptRun`. Assert on anything — raw output, token counts, costs, or judge verdicts. With `runs > 1` each run executes in its own thread and the decorator handles the pass-rate assertion.

```python
from pathlib import Path
from prompt_tester import prompt_run, Model, Provider

PROMPT = Path("prompts/compactor.md").read_text()
INPUT  = "Alice leads Project Phoenix. Budget: $2M. Deadline: Q3."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.GEMINI_3_1_FLASH_LITE,
    subject_provider = Provider.GOOGLE,
    template_vars    = {"text": INPUT},
    runs             = 5,
    pass_threshold   = 0.8,
)
def test_compactor_is_concise(run):
    # Assert on raw output
    assert len(run.output) < len(run.template_vars["text"]) * 0.6

    # Ask the judge a single yes/no question (one API call)
    alice = run.ask("Is Alice mentioned in the output?")
    assert alice.passed, alice.reasoning

    # Ask multiple questions in one API call
    budget, deadline = run.ask_all([
        "Is the $2M budget mentioned?",
        "Is the Q3 deadline mentioned?",
    ])
    assert budget.passed,   budget.reasoning
    assert deadline.passed, deadline.reasoning

    # Assert on API metadata
    assert run.cost_usd is not None
    assert run.stop_reason in ("end_turn", "STOP")
```

### Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `target_prompt` | `str` | required | Prompt text. Use `{key}` placeholders for `template_vars`. |
| `subject_model` | `Model \| str` | required | Model to run the prompt against. |
| `subject_provider` | `Provider \| str` | required | Provider for `subject_model`: `Provider.ANTHROPIC` or `Provider.GOOGLE`. |
| `template_vars` | `dict` | `{}` | Values substituted into `{key}` placeholders before the call. |
| `max_tokens` | `int` | `2048` | Maximum output tokens. |
| `runs` | `int` | `1` | Number of times to run the prompt. Each run executes in its own thread. |
| `pass_threshold` | `float` | `1.0` | Fraction of runs that must pass (`runs > 1` only). `0.8` = 80% must pass. Uses `math.ceil` so 5 runs × 0.8 requires 4 passes. |
| `run_fn` | `callable \| None` | `None` | Custom executor for tool use, MCP, or agentic loops. See below. |

---

## Tool use and MCP agents

`@prompt_run` sends a single prompt and records the response. If your prompt drives an **agentic loop** — calling tools, querying an MCP server, or doing multiple model turns before producing a final answer — use the `run_fn` parameter.

`run_fn` replaces the built-in provider call entirely. It receives the rendered prompt and is responsible for running the full loop. When it returns, `run.output` holds the final model text and the judge evaluates that output via `run.ask()` as normal.

When `run_fn` is set, `subject_provider` is **metadata only** — it is recorded on the `PromptRun` for observability but does not control which SDK is invoked. Your `run_fn` owns the actual API calls.

```python
(prompt: str, model: str, max_tokens: int) -> CompletionResult
```

The point is not just "did a tool get called" — it is asserting on the **quality of the final answer** that emerged from the whole agentic process. Tool call checks are one assertion among many; the judge evaluates the end result.

### Example — Anthropic + MCP tools

```python
import anthropic
from prompt_tester import prompt_run, Model, Provider
from prompt_tester import CompletionResult

MCP_TOOLS = [...]   # tool schemas from your MCP server

def run_with_mcp(prompt: str, model: str, max_tokens: int) -> CompletionResult:
    client   = anthropic.Anthropic()
    messages = [{"role": "user", "content": prompt}]

    total_input = total_output = 0

    while True:
        response = client.messages.create(
            model      = model,
            max_tokens = max_tokens,
            tools      = MCP_TOOLS,
            messages   = messages,
        )
        total_input  += response.usage.input_tokens
        total_output += response.usage.output_tokens

        if response.stop_reason != "tool_use":
            final_text = next(
                b.text for b in response.content if hasattr(b, "text")
            )
            return CompletionResult(
                text          = final_text,
                input_tokens  = total_input,
                output_tokens = total_output,
                stop_reason   = response.stop_reason,
                model_used    = response.model,
                request_id    = response.id,
            )

        # Execute tool calls and feed results back
        # mcp_client = your MCP client connection (e.g. via mcp.ClientSession)
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = mcp_client.call_tool(block.name, block.input)  # noqa: F821
                tool_results.append({
                    "type":        "tool_result",
                    "tool_use_id": block.id,
                    "content":     result.content,
                })

        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user",      "content": tool_results})


PROMPT = "Use the search tool to find the capital of France, then summarise what you found."

@prompt_run(
    target_prompt    = PROMPT,
    subject_model    = Model.CLAUDE_SONNET_4_6,
    subject_provider = Provider.ANTHROPIC,
    run_fn           = run_with_mcp,
)
def test_agent_answers_correctly(run):
    # The judge evaluates the final answer — after all tool calls have completed.
    correct, cited = run.ask_all([
        "Does the response state that Paris is the capital of France?",
        "Does the response mention a source or search result?",
    ])
    assert correct.passed, correct.reasoning
    assert cited.passed,   cited.reasoning

    assert run.stop_reason == "end_turn"
    assert run.input_tokens > 0
```

### What `run_fn` receives and must return

| | Type | Description |
|---|---|---|
| `prompt` | `str` | Rendered prompt — `template_vars` already substituted |
| `model` | `str` | `subject_model` value (string) |
| `max_tokens` | `int` | `max_tokens` from the decorator |
| **return** | `CompletionResult` | Import from `prompt_tester` |

`CompletionResult` only requires `text`, `input_tokens`, and `output_tokens`. All other fields (`stop_reason`, `model_used`, `request_id`, etc.) are optional and appear on the `PromptRun` for metadata and assertions.

---

## `PromptRun` fields

**Input**

| Field | Type | Description |
|---|---|---|
| `prompt` | `str` | Rendered prompt text — `template_vars` placeholders already substituted. |
| `template_vars` | `dict[str, Any]` | Key/value pairs substituted into the prompt. |

**Model response**

| Field | Type | Description |
|---|---|---|
| `output` | `str` | The model's response text. |
| `stop_reason` | `str \| None` | Why generation stopped. Anthropic: `"end_turn"`, `"max_tokens"`, `"stop_sequence"`. Gemini: `"STOP"`, `"MAX_TOKENS"`, `"SAFETY"`, `"RECITATION"`, `"OTHER"`. |
| `safety_filtered` | `bool` | `True` if the provider blocked the response. Output will be empty. |

**Model identity**

| Field | Type | Description |
|---|---|---|
| `model` | `str` | The `subject_model` value you passed. |
| `provider` | `str` | The `subject_provider` value you passed. |
| `model_used` | `str \| None` | Actual model ID reported by the provider. May differ if the provider resolves an alias. |
| `model_version` | `str \| None` | Provider version string. Populated by Gemini; `None` for Anthropic. |
| `request_id` | `str \| None` | Provider request ID for log correlation. |

**Token usage**

| Field | Type | Description |
|---|---|---|
| `input_tokens` | `int` | Tokens in the prompt. |
| `output_tokens` | `int` | Tokens in the response. |
| `cached_input_tokens` | `int` | Input tokens served from cache. `0` when not used. |
| `cache_creation_tokens` | `int` | Tokens written to cache (Anthropic only). |
| `thoughts_tokens` | `int` | Reasoning tokens (Gemini thinking models only). |

**Cost**

| Field | Type | Description |
|---|---|---|
| `cost_usd` | `float \| None` | Total cost in USD. `None` if the model is not in the pricing table. |

```python
run.to_dict()   # all fields as a plain dict — safe to log or serialise
```

---

## Judge verdicts — `ask()`, `ask_all()`, `ask_parallel()`

```python
# One question — one API call
verdict = run.ask("Is Alice mentioned?")
assert verdict.passed, verdict.reasoning

# Multiple questions in one API call (cheapest, slight cross-contamination risk)
alice, budget = run.ask_all([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning

# Multiple questions fired in parallel — one API call each, concurrently
alice, budget = run.ask_parallel([
    "Is Alice mentioned?",
    "Is the $2M budget mentioned?",
])
assert alice.passed,  alice.reasoning
assert budget.passed, budget.reasoning
```

| | `ask(q)` | `ask_all(qs)` | `ask_parallel(qs)` |
|---|---|---|---|
| API calls | 1 | 1 for all | 1 per question, concurrent |
| Wall-clock time | — | fastest | same as 1 call |
| Cost | — | lowest | higher (N calls) |
| Cross-contamination | none | low | none |
| Return type | `JudgeVerdict` | `list[JudgeVerdict]` | `list[JudgeVerdict]` |

**When to use which:**
- `ask_all` — many independent checks where cost matters and cross-contamination is acceptable.
- `ask_parallel` — questions that must be fully isolated from each other, without the latency of sequential calls.
- `ask` — a single question, or when you need to branch on the result before asking the next.

### `JudgeVerdict` fields

| Field | Type | Description |
|---|---|---|
| `question` | `str` | The question you asked. |
| `answer` | `str` | `"yes"` or `"no"`. |
| `passed` | `bool` | `True` when the answer is `"yes"`. |
| `snippet` | `str \| None` | Verbatim excerpt the judge cited as evidence. |
| `reasoning` | `str \| None` | The judge's stated reasoning. |
| `judge_model` | `str` | Model ID of the judge. |
| `judge_provider` | `str \| None` | Provider of the judge. |
| `judge_input_tokens` | `int` | Input tokens for this verdict. |
| `judge_output_tokens` | `int` | Output tokens for this verdict. |
| `judge_cost_usd` | `float \| None` | Cost for this verdict. |

```python
verdict.to_dict()   # all fields as a plain dict
```

---

## Judge configuration

Call `configure()` once before your tests run. Both parameters are required — there is no default judge:

```python
import prompt_tester
from prompt_tester import Model, Provider

prompt_tester.configure(
    judge_model    = Model.GEMINI_2_5_FLASH,
    judge_provider = Provider.GOOGLE,
)
```

---

## Cost tracking

Token counts and USD cost are available on every `PromptRun`:

```python
run.input_tokens / run.output_tokens / run.cost_usd    # prompt call
verdict.judge_input_tokens / verdict.judge_cost_usd    # judge call
```

Prices are defined in `prompt_tester/config.py` in the `_PRICES` table, keyed by `Model` enum. To add a model not in the table, add a `Model` member and a corresponding `_PRICES` entry.

---

## Architecture

```
prompt_tester/
  __init__.py   Public API: prompt_run, PromptRun, JudgeVerdict, configure, reset
  config.py     Model/Provider enums, pricing table, configure() / reset()
  decorator.py  @prompt_run — runs prompt, injects PromptRun into test function
  models.py     Pydantic types: PromptRun, JudgeVerdict, UsageMetrics
  judge.py      Judge.evaluate() and evaluate_all() — LLM-as-judge
  llm/
    providers/
      base.py        LLMProvider ABC + CompletionResult (unified return type)
      anthropic.py   Anthropic SDK wrapper
      gemini.py      Google GenAI SDK wrapper
```
