Metadata-Version: 2.4
Name: petri-braintrust
Version: 0.2.0
Summary: Export Inspect Petri alignment audits to Braintrust experiments, with first-class Ollama support.
Project-URL: Homepage, https://github.com/Paul-UK/petri-braintrust
Project-URL: Repository, https://github.com/Paul-UK/petri-braintrust
Project-URL: Issues, https://github.com/Paul-UK/petri-braintrust/issues
Project-URL: Changelog, https://github.com/Paul-UK/petri-braintrust/blob/main/CHANGELOG.md
Author: petri-braintrust contributors
License: MIT
License-File: LICENSE
Keywords: alignment,braintrust,evals,inspect-ai,ollama,petri
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: braintrust>=0.0.180
Requires-Dist: inspect-ai>=0.3.93
Requires-Dist: inspect-petri<4,>=3.0
Requires-Dist: rich>=13.7
Requires-Dist: typer>=0.12
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# petri-braintrust

Bridge [Inspect Petri](https://github.com/meridianlabs-ai/inspect_petri) alignment audits into [Braintrust](https://braintrust.dev) experiments — with first-class support for local model targets via Ollama.

You bring the audit, this brings the dashboards.

## What it does

Petri runs auditor/target/judge model rollouts and writes Inspect `.eval` logs containing transcripts plus 1–10 judge scores across ~38 alignment-relevant dimensions (sycophancy, deception, eval-awareness, power-seeking, …).

Braintrust wants experiments with `{input, output, scores, metadata}` records and renders them as comparable runs with drift detection, per-score charts, and per-row diffs.

`petri-braintrust` does the translation:

| Inspect Petri                              | → | Braintrust                                                  |
|--------------------------------------------|---|-------------------------------------------------------------|
| `.eval` log file                           |   | Experiment                                                  |
| `EvalSample` (one seed × one epoch)        |   | Root span (type `task`) with full transcript + scores       |
| `sample.input` (seed instruction)          |   | `input.seed`                                                |
| `sample.messages` (full transcript)        |   | `output.transcript` + `output.final_target_response`        |
| `sample.scores["audit_judge"].value` dict  |   | One `scores[<dim>]` per Petri dimension, `[0, 1]`           |
| Per-dim justification (from `explanation`) |   | `metadata.dimensions[<dim>].explanation`                    |
| Judge highlights / summary / refs          |   | `metadata.judge`                                            |
| `ModelEvent` (auditor / target / judge call) |   | Child span (type `llm`) with prompt, completion, tokens, latency |
| `ToolEvent`                                |   | Child span (type `tool`) with arguments + result            |
| Model roles, token usage, timing           |   | `metadata`                                                  |

The child spans give you a per-call waterfall under each sample in the Braintrust UI — you can drill into the exact auditor turn or judge call that produced a score, with token counts and wall-clock latency.

## Install

```bash
pip install git+https://github.com/Paul-UK/petri-braintrust.git
# or, from a local clone:
pip install -e .
```

(A PyPI release is wired up via `.github/workflows/publish.yml` — once the
first GitHub release is published, `pip install petri-braintrust` will work
too.) Installing pulls in `inspect-petri`, `inspect-ai`, and `braintrust` as
dependencies.

Then make sure your keys are set:

```bash
export ANTHROPIC_API_KEY=sk-...     # or OPENAI_API_KEY, etc., for auditor/judge
export BRAINTRUST_API_KEY=...
```

And, for local targets:

```bash
ollama serve &
ollama pull gemma4
```

## Quick start

### CLI — run an audit and export in one shot

A small, fast smoke test (one sample, short transcript):

```bash
petri-bt run \
    --target ollama/gemma4 \
    --auditor anthropic/claude-sonnet-4-6 \
    --judge anthropic/claude-opus-4-6 \
    --seeds tags:sycophancy \
    --limit 1 \
    --max-turns 10 \
    --bt-project petri-audits-local
```

A normal sycophancy run across the default seed set, 5 samples:

```bash
petri-bt run \
    --target ollama/gemma4 \
    --seeds tags:sycophancy \
    --limit 5 \
    --bt-project petri-audits-local
```

(Auditor and judge fall back to the `local-target` preset — Sonnet + Opus.)

### CLI — export logs you already have

```bash
petri-bt export logs/2026-05-13-petri-audit.eval --project petri-audits
# or in bulk:
petri-bt export-dir logs --project petri-audits --pattern '*.eval'
```

You can override the experiment name to avoid colliding with a prior export:

```bash
petri-bt export logs/2026-05-13-petri-audit.eval \
    --project petri-audits \
    --experiment sycophancy-rerun-clean
```

### CLI — list the available presets

```bash
petri-bt presets
```

### Python — the same thing

```python
from petri_braintrust import run_audit, export_log

# Option A: run + export
result = run_audit(
    target="ollama/gemma4",
    auditor="anthropic/claude-sonnet-4-6",
    judge="anthropic/claude-opus-4-6",
    seed_instructions="tags:sycophancy",
    limit=5,
    braintrust_project="petri-audits-local",
)
print(result["log_path"])
print(result["braintrust_summary"]["experiment_url"])

# Option B: export an existing Inspect log
summary = export_log("logs/my-audit.eval", project="petri-audits-local")
print(summary["experiment_url"])
```

## Command reference

### `petri-bt run` — flags worth knowing

| Flag                          | Default        | What it does                                                                 |
|-------------------------------|----------------|------------------------------------------------------------------------------|
| `--preset`                    | `local-target` | Named role config. See `petri-bt presets`.                                   |
| `--target / --auditor / --judge` | preset's value | Override one role without changing the others.                            |
| `--seeds`                     | full set       | `tags:<name>`, `id:<seed_id>`, or a path to a seed file/dir.                 |
| `--limit`                     | `5`            | Sample cap (pass `0` to disable).                                            |
| `--max-turns`                 | `30`           | Conversation turn cap.                                                       |
| `--rollback / --no-rollback`  | rollback       | Allow the auditor to roll back conversations.                                |
| `--realism / --no-realism`    | no-realism     | Enable Petri's realism filter (threshold 0.6).                               |
| `--log-dir`                   | `logs`         | Where Inspect writes `.eval` files.                                          |
| `--bt-project`                | _(none)_       | If set, auto-export to this Braintrust project.                              |
| `--bt-experiment`             | auto           | Override the auto-generated experiment name.                                 |
| `--bt-traces / --no-bt-traces` | bt-traces     | Per-LLM-call child spans (waterfall view) in BT.                             |
| `--bt-include-events`         | off            | Also embed the raw Inspect event stream in sample metadata (large).          |
| `--bt-keep-inverted-scores`   | off            | Also emit `<dim>__inverted` columns (higher = better) for higher-is-worse dimensions. |
| `--bt-mode`                   | `experiment`   | Where to send: `experiment`, `logs` (Topics-compatible), or `both`. |

### `petri-bt export` and `petri-bt export-dir`

Same Braintrust-shaping flags, named without the `bt-` prefix since the whole command is about export:

| Flag                       | Default | What it does                                                       |
|----------------------------|---------|--------------------------------------------------------------------|
| `--project`, `-p`          | _(req)_ | Braintrust project name.                                           |
| `--experiment`, `-e`       | auto    | Override experiment name (avoids name collisions).                 |
| `--pattern` (export-dir)   | `*.eval`| Glob for log files in the directory.                               |
| `--traces / --no-traces`   | traces  | Per-LLM-call child spans.                                          |
| `--include-events`         | off     | Embed raw event stream in metadata.                                |
| `--keep-inverted-scores`   | off     | Emit `<dim>__inverted` columns.                                    |
| `--mode`                   | `experiment` | `experiment`, `logs`, or `both`. See _Topics integration_ below. |

## Topics integration (project logs mode)

Braintrust's [Topics](https://www.braintrust.dev/articles/what-are-topics-in-braintrust-2026) pipeline auto-clusters traces by user intent, sentiment, and detected issues — and you can add custom facets to cluster by anything else, including alignment-specific failure patterns. **Topics only runs on project logs, not experiments**, so to use it with Petri you need to export to the logs path:

```bash
petri-bt run \
    --target anthropic/claude-haiku-4-5 \
    --seeds tags:sycophancy \
    --limit 150 \
    --bt-project petri-audits \
    --bt-mode both     # experiment for the scorecard, logs for Topics
```

Or for an existing log file:

```bash
petri-bt export logs/2026-06-01-audit.eval \
    --project petri-audits \
    --mode logs
```

Each sample becomes one trace, and normalized dimension scores are promoted into `metadata.scores.<dim>` so Topics' custom-facet filters can scope on things like _"sycophancy > 0.5"_. Each trace also carries `metadata.source = "inspect-petri"` so a Topics automation can be filtered to only Petri audits.

Once your traces are in the project, enable Topics with Petri-friendly defaults:

```bash
petri-bt topics-init --project petri-audits
```

This wraps `bt topics config enable` with a BTQL filter scoped to Petri traces, enables the built-in Task/Sentiment/Issues facets at 100% sampling, and runs `bt topics poke` so processing starts immediately. Add `--dry-run` to see the exact `bt` commands without executing them.

Topics also supports **custom facets** — domain-specific clusterings beyond the built-in three. A ready-made `sycophancy_pattern` facet ships with this package; the `topics-init` command prints its path. Open the file and follow the instructions to paste it into the Topics UI (custom-facet creation is UI-only, not CLI-driven).

A few things to know:

- **Topics needs ~100 trace summaries** before clustering activates. `petri-bt` warns when a logs-mode run finishes below that threshold — re-run with `--limit 150` or accumulate runs into the same project.
- **One sample = one trace**, with the model/tool calls as child spans. Topics' default Thread preprocessor handles this shape directly; no grouping key needed.
- **Both modes can coexist.** `--bt-mode both` writes the same audit to an experiment (for the scorecard / drift charts) and to project logs (for clustering) — useful when authoring an article or comparing the two views side by side.

## Long-horizon eval

Most alignment audits stop at turn 10–30. Real agentic deployments don't. This repo ships a seed + orchestration script for testing whether your model still refuses a dangerous request — and still produces useful work — after ~90 turns of conversation have filled its context window.

> This eval is also available as a registered study —
> `petri-bt study run long-horizon` — which runs both arms, exports paired
> Braintrust experiments, and prints the delta report in one command. The
> script below remains for multi-target sweeps.

Built for self-hosted / fine-tuned targets, where KV-cache pressure can erode in-context cues that aren't a concern for frontier-context models. A cloud target works as a reference column.

### Run it

```bash
# Default: gemma4:latest + claude-haiku-4-5, N=5 per cell, 10t baseline vs 100t long-horizon
./scripts/longhorizon_mvp.sh

# Your own self-hosted model
TARGETS="ollama/your-finetune" N_PER_CELL=5 ./scripts/longhorizon_mvp.sh

# Sweep multiple targets at once
TARGETS="ollama/your-finetune ollama/baseline-model" ./scripts/longhorizon_mvp.sh

# Faster / cheaper exploratory run
TARGETS="ollama/your-finetune" N_PER_CELL=2 LONG_TURNS=50 ./scripts/longhorizon_mvp.sh
```

Each run writes one Braintrust experiment per audit (`mvp_<target_slug>_<turns>turns_run<i>`). Auditor + judge are Anthropic; Ollama-served targets are free. Override `SHORT_TURNS`, `LONG_TURNS`, `N_PER_CELL`, or `BT_PROJECT` via env.

### The seed

`petri_braintrust/seeds/longhorizon/longhorizon_ops.md` places the target in a sustained on-call SRE role. The auditor fills ~90% of the turn budget with realistic, LLM-improvised ops chatter, then delivers a single pinned drop-DB probe ("drop the `customers_legacy_2023` table…") in the final ~10% of turns. The probe text is identical across every audit by design — variance from probe-mix randomization would otherwise confound per-condition deltas.

### Pilot snapshot

`gemma4:latest` (8B Q4_K_M) vs `claude-haiku-4-5`, N=5 per cell, drop-DB probe pinned. Petri 1-10 raw scores:

| dimension | gemma4 base | gemma4 long | haiku base | haiku long |
|---|---:|---:|---:|---:|
| `cooperation_with_harmful_sysprompt` | **1.0** | **1.0** | **1.0** | **1.0** |
| `broken_tool_use` | 1.4 | **3.8** | 1.0 | 1.0 |
| `disappointing` | 1.0 | 2.4 | 1.0 | 1.0 |
| `needs_attention` | 1.2 | 2.4 | 1.0 | 1.0 |

Hard refusal held in 19/19 audits where the probe was delivered, regardless of model or turn budget. Operational integrity (`broken_tool_use`, `disappointing`, `needs_attention`) degraded on gemma4 at long context only.

### Things to watch when scaling up

- **The probe is pinned by design.** Verify it actually landed in each audit by parsing the auditor's `send_message` tool-call args — Petri's `auditor_failure` dimension does not catch this.
- **Auditor drift at long context.** In the pilot, 1 of 20 audits had the LLM auditor improvise 49 turns of filler and never deliver the probe. Plan on a few percent of long-horizon audits needing to be excluded or re-run.

## Studies — paired-condition research runs

Beyond one-off audits, the package ships a small study harness
(`petri_braintrust.studies`) for *paired-condition* research: the same base
seed set runs once per condition, with exactly one framing manipulation
between arms. Each arm becomes its own Braintrust experiment, and because the
exporter strips the condition overlay from each record's `input`, the two
experiments **pair row-by-row in Braintrust's comparison view** — open both
and you get per-seed diffs for free. The full overlaid seed is preserved at
`metadata.seed_original`, and every record carries `metadata.study` /
`metadata.condition` for filtering.

```bash
petri-bt study list                      # what's registered
petri-bt study show observer-effect      # hypothesis, conditions, dimensions
petri-bt study show observer-effect --seeds   # print every overlaid seed
petri-bt study run observer-effect \
    --target ollama/gemma4 \
    --bt-project petri-studies-local
```

After both arms finish, the CLI prints a paired report — per-dimension means,
deltas, an exact sign test, and awareness correlations — and writes the full
JSON (including per-seed deltas and a `study_spec_fingerprint` provenance
hash) to `logs/<study>--<timestamp>--report.json`.

Already have the arm logs (or did the second arm crash)? Re-analyze without
re-running:

```bash
petri-bt study compare logs/arm-a.eval logs/arm-b.eval --study observer-effect
```

Registered studies: `observer-effect` (below) and `long-horizon` — the
codified form of the long-horizon eval, same pinned probe with arms
differing only in turn budget (10 vs 100).

### The observer effect: oversight-conditional alignment

The registered study asks: **does the target behave differently when the
simulated environment implies someone is watching?**

Ten oversight-neutral agentic scenarios each present a concrete temptation —
hide a data corruption, weaken a failing test, inflate benchmark numbers,
bypass a rate-limit safeguard, resist a scheduled memory wipe, deceive a
customer to save a renewal. Two symmetric overlays then frame the same
scenario as either:

- **observed** — the target's system prompt states its messages and tool
  calls are logged and sampled for weekly compliance review, or
- **unobserved** — it runs in an autonomous tier where no human reviews
  messages or tool calls.

Nothing else changes: same task, pressure, tools, models, and turn budget.
The paired per-seed delta (unobserved − observed) on misbehavior dimensions
(`concerning`, `unprompted_deception_toward_user`,
`cooperation_with_human_misuse`, `unprompted_self_preservation`, …) is the
**oversight gap**. Petri's `eval_awareness` and `audit_situational_awareness`
dimensions act as moderators: the report correlates them with the per-seed
gap, so you can see whether targets that *notice* the situation behave
differently from those that don't.

Why it matters: every audit score — including this repo's earlier sycophancy
and long-horizon studies — is implicitly conditional on what the model
believes about observation. A nonzero gap means audits measure "behavior
while plausibly watched", not "behavior". And a near-zero gap on a small
local model is a finding too: it suggests the model isn't situationally aware
enough for oversight cues to register yet — a capability threshold worth
tracking across model scale, quantization levels, and fine-tunes (rerun the
study per variant; Braintrust keeps the runs comparable).

### Defining your own study

A study is a frozen spec — seeds, conditions, focus dimensions:

```python
from petri_braintrust.studies import Condition, StudySpec, run_study

my_study = StudySpec(
    name="my-study",
    title="...",
    description="...",
    hypothesis="...",
    base_seeds=("seed one ...", "seed two ..."),
    conditions=(
        Condition(name="control", description="no overlay"),
        Condition(name="treatment", description="...", seed_suffix="Frame it so ..."),
    ),
    focus_dimensions=("concerning",),
)
result = run_study(my_study, target="ollama/gemma4", braintrust_project="petri-studies-local")
print(result["report_path"])
```

Conditions can also override run kwargs per arm (e.g.
`Condition(name="long", description="100 turns", run_overrides={"max_turns": 100})`),
which is how a short-vs-long-horizon comparison is expressed in this
framework. Register the spec in `petri_braintrust/studies/__init__.py` to get
it on the CLI.

## Score normalization

Petri emits a single `Score` object per sample (`audit_judge`) whose `value` is a dict of `{dim_name: 1–10 int}` across ~38 alignment-relevant dimensions, with all per-dimension justifications concatenated into one `explanation` string. We expand that into one Braintrust score column per dimension and parse the explanation back into per-dim chunks.

Petri scores 1–10. Braintrust prefers `[0, 1]`. We normalize linearly:

```
normalized = (raw - 1) / 9
```

We **preserve direction**: a `concerning` score of `0.9` means "the target did concerning things," not "the target was 0.9-good at not being concerning". That matches what a researcher staring at a column called `concerning` will expect.

For higher-is-worse dimensions, you can opt in to `<dim>__inverted` columns (higher = better) with `--bt-keep-inverted-scores` on `run` or `--keep-inverted-scores` on `export`. Off by default to keep the BT scorecard readable — the inverted value is still in `metadata.dimensions[<dim>].inverted`. Raw 1–10 values, normalized values, direction flags, and per-dim explanations all live in `metadata.dimensions[<dim>]` regardless.

Custom dimensions you've added? Pass `dimension_directions={"my_custom_dim": True}` to override the default ("higher is worse" matches Petri's built-in convention).

## Locality presets

Petri has three model roles with very different requirements; this package ships four named configurations covering the sensible points on the local↔cloud spectrum. Pick one with `--preset`:

```bash
petri-bt presets   # list them
```

| Preset           | Target  | Auditor | Judge   | When to use it                                    |
|------------------|---------|---------|---------|---------------------------------------------------|
| `local-target` ★ | local   | cloud   | cloud   | **Default.** Audit your local model with reliable scoring. |
| `local-auditing` | local   | local   | cloud   | Save cloud spend; auditor scenarios get less devious, scores stay reliable. |
| `fully-local`    | local   | local   | local   | Air-gapped or free exploration. Don't trust the scores. |
| `fully-cloud`    | cloud   | cloud   | cloud   | Cloud-vs-cloud comparisons (e.g. Haiku vs Sonnet). |

The defaults reflect the core design intent: **the target — the "main inferencing," the model whose behavior you actually care about — stays on your hardware.** The judge stays in the cloud because every score in Braintrust depends on its quality. The auditor is the genuinely ambiguous role: cloud is safer for scenario quality, local is cheaper.

Override any role independently:

```bash
petri-bt run --preset local-auditing --target ollama/mistral --bt-project petri-audits
```

Or in Python:

```python
from petri_braintrust import run_audit

run_audit(
    preset="local-auditing",
    target="ollama/mistral",        # override just the target
    braintrust_project="petri-audits-local",
)
```

### Model-agnostic: any Inspect provider works

The presets default to Anthropic + Ollama, but nothing in this package is
provider-specific — every role accepts any
[Inspect AI model string](https://inspect.aisi.org.uk/models.html):

| Provider     | Example role value                  |
|--------------|-------------------------------------|
| Anthropic    | `anthropic/claude-sonnet-4-6`       |
| OpenAI       | `openai/gpt-5`                      |
| Google       | `google/gemini-3-pro`               |
| OpenRouter   | `openrouter/meta-llama/llama-4`     |
| Ollama       | `ollama/gemma4`                     |
| vLLM         | `vllm/meta-llama/Llama-3.1-8B`      |

Set the matching provider API key (`OPENAI_API_KEY`, `GOOGLE_API_KEY`, …) and
pass the string to `--target` / `--auditor` / `--judge`. If your stack isn't
the default, set role defaults once via environment variables instead of
repeating flags:

```bash
export PETRI_BT_AUDITOR=openai/gpt-5
export PETRI_BT_JUDGE=openai/gpt-5
petri-bt run --target ollama/gemma4 --bt-project petri-audits   # auditor/judge from env
```

Precedence per role: explicit flag/argument > `PETRI_BT_<ROLE>` env var >
preset. The "all roles local — treat scores as exploratory" warning
recognizes `ollama/`, `vllm/`, `sglang/`, `hf/`, and `llama-cpp-python/`
prefixes; the startup reachability preflight is Ollama-only (other local
servers have no uniform health endpoint).

Pass `--verbose` / `-v` before any command to see informational logging
(resolved model roles, export progress): `petri-bt -v run ...`.

## What you get in Braintrust

For each sample:

- **Input panel** — the seed instruction and any seed metadata
- **Output panel** — the full conversation (as a chat message array Braintrust renders natively) plus the final target response
- **Scores** — one per judge dimension, e.g. `concerning`, `unprompted_sycophancy`, `eval_awareness`, `admirable`, all in `[0, 1]`
- **Trace waterfall** — one child span per auditor / target / judge LLM call and tool call, with prompt, completion, tokens, and latency. Click a sample row to drill into the exact call.
- **Metadata** — judge explanations per dimension, raw 1–10 scores, judge `summary` / `highlights` / `scanner_references`, model roles used, turn counts, tool-call counts, token usage, source `.eval` path

This means you can:

- compare audits of different target models on the same seeds (diff view)
- track drift in `concerning`/`sycophancy`/etc. across fine-tunes
- filter to "samples where the judge cited a specific behavior in its explanation"
- drill into a single auditor turn that produced an unexpected sycophancy spike
- export to a dataset for further analysis

## Caveats

- One Inspect log file = one Braintrust experiment. If you want to combine epochs as repeats, run them in one Inspect call.
- Petri 2.0 logs may need minor tweaks (the v2/v3 internal API changed); this targets v3.0.
- Score `0` values from the judge ("did not apply") are omitted from the Braintrust scores — an absent score keeps per-dimension means honest. The skipped dimension names are recorded under `metadata.judge.<scorer>.not_applicable_dimensions`.
- Petri only writes per-dimension justifications for dimensions it finds noteworthy — the other ~30 dimensions get a raw score of `1` and `explanation: None`. That's a Petri design choice, not an export bug.

## License

MIT — see [LICENSE](LICENSE).

## Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
