Metadata-Version: 2.4
Name: whatifd
Version: 0.2.0
Summary: Open experiment runner for LLM behavior changes. Fork production traces, replay with a proposed change, score the diff, emit a PR-ready verdict report.
Project-URL: Homepage, https://github.com/victoralfred/whatifd
Project-URL: Documentation, https://github.com/victoralfred/whatifd/blob/main/DESIGN.md
Project-URL: Issues, https://github.com/victoralfred/whatifd/issues
Author: victoralfred
License: Apache-2.0
License-File: LICENSE
Keywords: agents,ci,evaluation,llm,observability,regression-testing,sre
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: anyio>=4.3
Requires-Dist: httpx>=0.28.1
Requires-Dist: psutil>=6.0; sys_platform != 'win32'
Requires-Dist: pydantic>=2.6
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=15.0.0
Requires-Dist: typer>=0.12
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.97.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.152.4; extra == 'dev'
Requires-Dist: mypy>=1.20.2; extra == 'dev'
Requires-Dist: pre-commit>=4.4.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=1.3.0; extra == 'dev'
Requires-Dist: pytest-cov>=7.1.0; extra == 'dev'
Requires-Dist: pytest-recording>=0.13; extra == 'dev'
Requires-Dist: pytest>=9.0.3; extra == 'dev'
Requires-Dist: ruff>=0.15.12; extra == 'dev'
Requires-Dist: types-psutil>=6.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Requires-Dist: vcrpy<9.0,>=8.0; extra == 'dev'
Provides-Extra: inspect
Requires-Dist: inspect-ai>=0.3.216; extra == 'inspect'
Provides-Extra: langfuse
Requires-Dist: langfuse>=4.5.1; extra == 'langfuse'
Description-Content-Type: text/markdown

# whatifd

[![CI](https://github.com/victoralfred/whatifd/actions/workflows/ci.yml/badge.svg)](https://github.com/victoralfred/whatifd/actions/workflows/ci.yml)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![Status](https://img.shields.io/badge/status-alpha-yellow.svg)](#status)

> **whatifd's product is the verdict's defensibility.** Fork production traces, replay with a proposed change, score the diff, and ship a Ship / Don't Ship / Inconclusive verdict whose reasoning a reviewer can read and follow, then either trust or know exactly which assumption to challenge.

![whatifd workflow](./what_if_archi.png)

When you change a prompt, model, or tool in an LLM system, you don't actually know whether it improves behavior — you guess, with a handful of cherry-picked traces and inconsistent evaluation. Every step in the workflow has a tool: Langfuse for traces, Inspect AI for scoring, GitHub for PRs. **The experiment doesn't.**

**whatifd** is the experiment runner. Fork production traces (failed cases plus a representative baseline), replay them with your proposed change (original tool outputs cached so side effects don't re-fire), score with the judge of your choice, and produce a Markdown + JSON verdict report you can attach to the PR. You stop shipping changes that fix one failure while silently regressing ten others. You go from *"this feels better"* to *"this improved 14/20, regressed 3 — here's exactly where, and here's the evidence I'd defend in review."*

**Stop shipping LLM changes on gut feel.**

---

![whatifd on one page](./experiment_runner_overview.png)

## Status

**v0.2.0 (alpha).** v0.2 widens v0.1 along five axes:

- **Experiment shapes (Phase A/C)**: a `regression_check` shape joins `failure_rescue`.
- **Statistics (Phase E.1/E.2)**: a doctrinally correct paired-percentile bootstrap replaces the v0.1 empirical-quantile shortcut, and `MethodologyDisclosure.bootstrap.method` now declares the method actually used.
- **Trace sources (Phase D)**: the Arize Phoenix / OpenInference adapter ships as `whatifd-phoenix`.
- **CI integration (Phase I)**: a `whatifd-fork` GitHub Action wraps the CLI for PR-comment and status-annotation workflows.
- **Determinism (Phase J)**: cardinal #4 widens from top-level-only to per-field opt-in inside `RunManifest`, with cross-platform CI enforcing byte-equality.

In addition, `inspect_ai` is now reachable from YAML via `scorer.score_fn` (Phase B).

| Version | Status | What it does |
|---|---|---|
| v0.1 | shipped (2026-05-09) | Langfuse ingest, prompt override, cached-tool replay, Inspect AI scorer, evidence-first Markdown + JSON reports, CI exit codes. |
| v0.2 | shipped (2026-05-10) | `regression_check` shape; paired-percentile bootstrap; Phoenix / OpenInference adapter; `whatifd-fork` GitHub Action; per-field determinism widening + cross-platform CI; YAML-loaded `inspect_ai` scorer. |
| v0.3 | M12 | Cluster-paired bootstrap; LangSmith adapter; marketplace publication of the GitHub Action; `environment.dependencies` ordering canonicalization; live-tool replay (opt-in, allowlist). |
| v1.0 | year 2 | The pre-merge regression gate for LLM behavior. |
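
For readers unfamiliar with the paired-percentile bootstrap named in the v0.2 notes, the sketch below shows the generic textbook technique: compute one delta per trace (candidate score minus baseline score), resample those deltas with replacement, and read a confidence interval off the percentiles of the resampled means. It is an illustration only, not whatifd's implementation; the function name, parameters, and defaults are hypothetical.

```python
import random
from statistics import fmean


def paired_percentile_bootstrap(
    baseline: list[float],
    candidate: list[float],
    *,
    n_resamples: int = 2000,
    confidence: float = 0.95,
    seed: int = 0,
) -> tuple[float, float, float]:
    """Return (mean_delta, ci_low, ci_high) for paired per-trace scores.

    Hypothetical helper for illustration; not part of the whatifd API.
    """
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("need two equal-length, non-empty score lists")

    # Pairing: each delta compares the same trace before and after the change.
    deltas = [c - b for b, c in zip(baseline, candidate)]

    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(deltas)
    # Resample the deltas with replacement and record each resample's mean.
    means = sorted(fmean(rng.choices(deltas, k=n)) for _ in range(n_resamples))

    # Percentile interval: cut off alpha of the mass in each tail.
    alpha = (1.0 - confidence) / 2.0
    low = means[int(alpha * (n_resamples - 1))]
    high = means[int((1.0 - alpha) * (n_resamples - 1))]
    return fmean(deltas), low, high
```

Resampling the per-trace deltas, rather than the two score lists independently, is what makes the bootstrap paired: it preserves the fact that both scores in a pair come from the same trace.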

## Install

```bash
uv pip install whatifd whatifd-langfuse whatifd-phoenix whatifd-inspect-ai

# From source (uv workspace):
git clone https://github.com/victoralfred/whatifd
cd whatifd
uv sync --all-extras --dev --group workspace
```

## Quickstart (programmatic — works today)

The library API is the load-bearing surface. The snippet below is **shape-only**: to keep the README focused, it omits `RunManifest`, `MethodologyDisclosure`, and `CacheSummary` construction as well as the actual `run_pipeline(...)` call. The full runnable end-to-end example lives at [`docs/getting-started.md`](./docs/getting-started.md).

```python
from whatifd.adapters.stub import StubTraceSource, StubTraceSpec
from whatifd.adapters.factory import build_scorer
from whatifd.cli_pipeline import build_delta_fn
from whatifd.config import ChangeConfig, ScorerConfig
from whatifd.pipeline import run_pipeline
from whatifd.runner_loader import load_runner

# Your runner satisfies the contract Protocol — see docs/runner-contract.md
loaded_runner = load_runner("python:my_agent.replay:run")

scorer = build_scorer(ScorerConfig(adapter="stub"))  # or wire a real Inspect AI scorer

trace_source = StubTraceSource(specs=[
    StubTraceSpec(trace_id="f-1", user_message="...", original_response="...", cohort="failure"),
    # ...
])

delta_fn = build_delta_fn(
    loaded_runner=loaded_runner,
    scorer=scorer,
    change=ChangeConfig(system_prompt="new prompt"),
    replay_timeout_seconds=60.0,
)

# Construct floor / policy / runtime / methodology / cache_summary,
# then call run_pipeline → ReportV01.
# Full worked example: docs/getting-started.md.
```

## Quickstart (CLI — stub adapters work today)

```bash
# Write a config:
cat > whatifd.config.yaml <<EOF
source:
  adapter: stub
target:
  runner: python:examples.minimal_agent.replay:run
selection:
  failure_cohort: { limit: 5 }
  baseline_cohort: { limit: 5 }
change:
  system_prompt: my new prompt
scorer:
  adapter: stub
decision: {}
reporting: {}
timeouts: {}
EOF

# Run the fork:
uv run whatifd fork --config whatifd.config.yaml

# Exit codes:
#   0 = Ship verdict
#   1 = Don't Ship verdict
#   2 = Inconclusive verdict / setup failure / floor violation
```
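
Because exit code 2 covers three distinct outcomes, a CI wrapper may want to treat it differently from a clean Don't Ship. The sketch below maps the documented exit codes to labels and wraps the CLI invocation shown above. The function names and label strings are hypothetical, chosen for illustration; only the exit-code meanings and the command line come from this README.

```python
import subprocess

# Exit-code meanings as listed in the CLI quickstart.
# The label strings themselves are illustrative, not whatifd output.
VERDICT_BY_EXIT_CODE = {
    0: "ship",
    1: "dont_ship",
    2: "inconclusive_or_setup_failure_or_floor_violation",
}


def describe_exit_code(code: int) -> str:
    """Map a `whatifd fork` exit code to a label; flag anything unexpected."""
    return VERDICT_BY_EXIT_CODE.get(code, f"unexpected_exit_code_{code}")


def run_fork(config_path: str = "whatifd.config.yaml") -> str:
    """Run the CLI and return the label for its exit code."""
    result = subprocess.run(
        ["uv", "run", "whatifd", "fork", "--config", config_path],
        check=False,  # a non-zero exit is a verdict here, not an exception
    )
    return describe_exit_code(result.returncode)
```

A pipeline could, for example, fail the build on `dont_ship` but only post a warning on exit code 2, since that code does not by itself say the change regressed anything.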

Real Langfuse traces require `LANGFUSE_HOST` (or `LANGFUSE_BASE_URL`), `LANGFUSE_PUBLIC_KEY`, and `LANGFUSE_SECRET_KEY` in the environment. Real Inspect AI scoring is reachable from YAML via `scorer.score_fn: python:<module>:<attr>` (Phase B); the v0.1 programmatic-only path is preserved.

## How it composes

`whatifd` doesn't replace your tracer or your eval framework — it composes them into an experiment.

- **Tracers (reads from)**: Langfuse (v0.1); Arize Phoenix / OpenInference (v0.2); LangSmith / OpenTelemetry GenAI (v0.3+).
- **Scorers (wraps)**: Inspect AI (v0.1, real adapter shipped); pluggable via the scorer registry.
- **Your agent (calls back into)**: any Python callable matching the [runner contract](./docs/runner-contract.md).
- **Downstream of `whatifd`'s decisions**: your existing CI (GitHub Actions, GitLab CI), SLO platforms (Nobl9, Sloth, Honeycomb), incident tooling.

## What `whatifd` is not

- Not a tracer (use Langfuse / Phoenix / LangSmith / OpenTelemetry GenAI).
- Not an offline eval harness (use Inspect AI / Promptfoo; whatifd wraps them).
- Not an SLO platform (use Nobl9 / Sloth / Honeycomb downstream of whatifd's decisions).
- Not an agent runtime — the runner contract is the boundary.
- Not a UI or dashboard.
- Not a substitute for production monitoring; not a benchmark suite; not a load test; not a causal estimator beyond replay association; not a judge-quality validator (see [docs/concepts.md](./docs/concepts.md)).

## Documentation

- **[`docs/concepts.md`](./docs/concepts.md)** — the conceptual model: defensible verdicts, non-claims, trust floor vs decision policy, failure-as-data, evidence and audit bundle
- **[`docs/getting-started.md`](./docs/getting-started.md)** — worked end-to-end example
- **[`docs/runner-contract.md`](./docs/runner-contract.md)** — the user-facing extension point reference
- **[`docs/schema/v0.1.md`](./docs/schema/v0.1.md)** — `ReportV01` consumer compatibility guide
- **[`docs/walkthroughs/`](./docs/walkthroughs/)** — six rendered scenarios as reference (Ship, Don't Ship, Inconclusive)
- **[`examples/minimal-agent/`](./examples/minimal-agent/)** — copy-paste reference Runner

## Design

The full design — problem framing, prior art, runner contract, report shape, eval target, milestones, risks — lives in [DESIGN.md](./DESIGN.md). The doctrine and cardinal rules are in [`.claude/skills/whatifd-design/SKILL.md`](./.claude/skills/whatifd-design/SKILL.md).

## Contributing

Alpha. Issues and design discussion are welcome.

## License

Apache 2.0. See [LICENSE](./LICENSE).
