Skip to content

A/B Testing

Compare two agent configurations on the same scenario set and identify the winner by LLM-judged scores.

from meshflow.eval.ab_test import ABTest, ABVariant

control = ABVariant("gpt-4o",     provider=prod_provider)
variant = ABVariant("claude-haiku", provider=new_provider)

ab = ABTest(control=control, variant=variant)
result = await ab.run(["What is HIPAA?", "Explain SOC 2 Type II."])

print(result.winner)   # "claude-haiku" | "gpt-4o" | "tie"
print(result.summary())

ABVariant

One side of an A/B test. Accepts either a MeshFlow Agent or a raw provider.

Field Type Description
name str Display name for reports
agent Any MeshFlow Agent instance (optional)
provider Any Raw LLM provider (optional; used when agent is None)
model str Model name forwarded to the provider (default claude-haiku-4-5)
system_prompt str System prompt injected for provider-direct calls
control = ABVariant(
    name="v1-sonnet",
    agent=production_agent,
)

variant = ABVariant(
    name="v2-haiku",
    provider=anthropic_provider,
    model="claude-haiku-4-5",
    system_prompt="You are a concise compliance assistant.",
)

ABTest

ab = ABTest(
    control=control,
    variant=variant,
    judge=None,    # auto-creates LLMJudge; uses EchoProvider when no API key
    rubric="Grade for accuracy and conciseness.",
)

run(scenarios, *, rubric="") -> ABTestResult

Run every scenario through both variants concurrently, grade both outputs with LLMJudge, and return aggregate results.

scenarios = [
    "What is the HIPAA minimum necessary standard?",
    "Explain SOC 2 Type II in plain English.",
    "List the five GDPR data subject rights.",
]

result = await ab.run(scenarios, rubric="Grade for regulatory accuracy.")

Both variants receive the same scenario text. The control and variant outputs are graded concurrently, so the total wall time scales with the number of scenarios rather than doubling.

ABTestResult

result.control_name      # "gpt-4o"
result.variant_name      # "claude-haiku"
result.control_avg       # float — mean judge score for control (0–1)
result.variant_avg       # float — mean judge score for variant
result.delta             # float — variant_avg − control_avg
result.winner            # "control" | "variant" | "tie" (|delta| < 0.02 = tie)
result.effect_size       # float — Cohen's d approximation over per-scenario deltas
result.control_win_rate  # float — fraction of scenarios where control scored higher
result.variant_win_rate  # float — fraction of scenarios where variant scored higher
result.turn_results      # list[ABTurnResult]
result.total_duration_ms # float
result.summary()         # one-line human-readable summary
result.to_dict()         # serialisable dict

winner Logic

Condition winner
abs(delta) < 0.02 "tie"
delta > 0 variant_name
delta < 0 control_name

effect_size

Cohen's d computed over per-scenario deltas. Values above 0.5 indicate a practically meaningful difference.

ABTurnResult

Per-scenario scores stored in ABTestResult.turn_results.

Field Type Description
scenario str The scenario text (first 200 chars)
control_output str Control variant's response
variant_output str Variant's response
control_score float Control judge score
variant_score float Variant judge score
control_reasoning str Judge's reasoning for control
variant_reasoning str Judge's reasoning for variant
delta float variant_score − control_score

Full Example

import asyncio
from meshflow.eval.ab_test import ABTest, ABVariant
from meshflow.agents.providers import AnthropicProvider

async def main():
    control = ABVariant(
        name="sonnet-prod",
        provider=AnthropicProvider(),
        model="claude-sonnet-4-6",
        system_prompt="You are a compliance analyst. Be thorough.",
    )
    variant = ABVariant(
        name="haiku-fast",
        provider=AnthropicProvider(),
        model="claude-haiku-4-5",
        system_prompt="You are a compliance analyst. Be concise.",
    )

    ab = ABTest(control=control, variant=variant)
    result = await ab.run(
        scenarios=[
            "Explain HIPAA's minimum necessary standard.",
            "What are GDPR's data subject rights?",
            "Summarise SOC 2 Type II requirements.",
        ],
        rubric="Grade for regulatory accuracy and clarity.",
    )

    print(result.summary())
    # [HAIKU-FAST WINS] sonnet-prod=0.841 vs haiku-fast=0.879
    # (Δ=+0.038, effect=1.12) over 3 scenarios

    for t in result.turn_results:
        print(f"  Δ={t.delta:+.3f}  {t.scenario[:60]}")

asyncio.run(main())

Statistical Interpretation

effect_size Interpretation
< 0.2 Negligible difference
0.2–0.5 Small effect
0.5–0.8 Medium effect
> 0.8 Large effect; consider shipping the winner