Metadata-Version: 2.4
Name: bayesian-cage
Version: 0.1.0
Summary: An offline Bayesian confidence gate for MCP tool calls — PROCEED / FLAG / BLOCK with calibrated, per-model confidence. Pure stdlib, BYO-LLM.
Author-email: Bugra Omer Aydemir <bayesboa@gmail.com>
License: MIT
Project-URL: Homepage, https://bayescore.com
Project-URL: Source, https://github.com/BayesCore/bayesian-cage
Project-URL: Issues, https://github.com/BayesCore/bayesian-cage/issues
Keywords: mcp,agents,llm,guardrails,calibration,hallucination,verification,offline,bayesian
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# bayesian-cage

An **offline Bayesian confidence gate** for MCP tool calls — and any LLM output.
It scores how much to trust an output and gates it **PROCEED / FLAG / BLOCK** with a
calibrated, per-model confidence that sharpens over time from local memory.
Pure stdlib. Bring your own LLM. Your data never leaves the machine.

## Install

Requires **Python 3.10+**. Zero runtime dependencies.

```bash
# from PyPI (once published)
pip install bayesian-cage

# from source — use a venv; macOS's default python3 is often 3.9
python3.12 -m venv .venv && source .venv/bin/activate
pip install -e .
```

### Verify your install (30 seconds, offline)
```bash
python -m bayesian_cage.eval.sqlbench.run --model stub
```
Runs a 55-task SQL calibration benchmark against a real SQLite database with a built-in
stub model. No network, no LLM, no API keys. Prints a reliability table comparing the
cage to raw model confidence; writes `dataset.jsonl`, `results.json`, and a
`reliability.svg` to `bench_out/`. If this finishes, the package works end-to-end.

## Quickstart (library)
```python
from bayesian_cage import Kernel

k = Kernel(db_path="~/.bayescore/bayescore.db")
v = k.check("SELECT 1", {"expected": "select 1"}, model_id="phi3", task="sql")
print(v.decision, v.p)            # PROCEED 1.0
k.observe("phi3", "sql", correct=True, observation_id=v.observation_id)
```

## Gate an MCP server

The cage is itself an MCP server. It speaks JSON-RPC over stdio to your host
(Claude Desktop, Cursor, Claude Code, …) and spawns the **real** server as a
subprocess via `BAYESIAN_CAGE_DOWNSTREAM`. It transparently forwards
`initialize` / `tools/list` / `tools/call`; every `tools/call` *result* is
graded by a Verifier and stamped with a `_bayescore` envelope. Under
`enforce`, BLOCK verdicts withhold the result and return `isError: true`.
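
To make the advisory/enforce distinction concrete, here is a minimal sketch of what stamping a `tools/call` result might look like. The `content` / `isError` fields follow the MCP tool-result shape. The top-level placement of `_bayescore`, its field names, the withheld-result message, and the `stamp_result` helper are all assumptions for illustration, not the cage's actual wire format.

```python
from copy import deepcopy


def stamp_result(result: dict, decision: str, p: float, observation_id: str,
                 mode: str = "advisory") -> dict:
    """Attach a verdict envelope to a tools/call result (hypothetical layout)."""
    envelope = {"decision": decision, "p": p, "observation_id": observation_id}
    if mode == "enforce" and decision == "BLOCK":
        # Under enforce, withhold the downstream content and signal an error.
        return {
            "content": [{"type": "text", "text": "Result withheld by gate (BLOCK)."}],
            "isError": True,
            "_bayescore": envelope,
        }
    stamped = deepcopy(result)  # leave the downstream result untouched
    stamped["_bayescore"] = envelope
    return stamped


downstream = {"content": [{"type": "text", "text": "ENOENT: no such file"}],
              "isError": False}
advisory = stamp_result(downstream, "BLOCK", 0.12, "obs-1", mode="advisory")
enforced = stamp_result(downstream, "BLOCK", 0.12, "obs-1", mode="enforce")
```

In this sketch, `advisory` still carries the downstream text alongside the verdict, while `enforced` replaces it and sets `isError` to `True`.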

### Install a stable binary

MCP-host integration needs `bayesian-cage` on a path the host can spawn
without your shell PATH. `pipx` is the cleanest route:

```bash
brew install pipx && pipx ensurepath   # macOS (Homebrew); elsewhere: python3 -m pip install --user pipx
pipx install bayesian-cage     # or:  pipx install .  from a clone
which bayesian-cage            # copy this absolute path
```

### Claude Desktop

Edit `~/Library/Application Support/Claude/claude_desktop_config.json` (macOS)
and add an `mcpServers` block. **Use absolute paths** — Claude Desktop spawns
subprocesses without your shell's PATH:

```jsonc
{
  "mcpServers": {
    "filesystem-gated": {
      "command": "/Users/you/.local/bin/bayesian-cage",
      "env": {
        "BAYESIAN_CAGE_DOWNSTREAM": "/usr/local/bin/npx -y @modelcontextprotocol/server-filesystem /Users/you/Documents/Claude",
        "BAYESIAN_CAGE_MODE": "advisory",
        "BAYESIAN_CAGE_VERIFIER": "filesystem",
        "BAYESIAN_CAGE_MODEL": "fs-server"
      }
    }
  }
}
```

Quit and relaunch Claude Desktop (`Cmd+Q`, not just close the window). The
`filesystem-gated` server appears in the tools picker; every tool result now
carries a verdict. Same shape works for Cursor (`~/.cursor/mcp.json`) and any
host that takes a stdio MCP server command.

### One-command sanity check (no host required)

```bash
BAYESIAN_CAGE_DOWNSTREAM="npx -y @modelcontextprotocol/server-everything" \
BAYESIAN_CAGE_MODE=enforce bayesian-cage --selftest
```
Runs a canned `initialize` / `tools/list` / `tools/call` sequence against the
configured downstream and prints the gated responses.

### Knobs

- `BAYESIAN_CAGE_MODE` — **advisory** (default; never blocks, just labels) ·
  **enforce** (BLOCK halts) · **iterate** (retry instead of stop). Start
  advisory; switch to enforce once you trust the verifier.
- `BAYESIAN_CAGE_VERIFIER` — `heuristic` (default) · `sql` · `json` ·
  `filesystem` · `ensemble:heuristic+sql` · `your.pkg.module:CustomVerifier`
- `BAYESIAN_CAGE_MODEL` — the bucket beliefs are keyed by. Use one per
  downstream (`fs-server`, `pg-server`) — the *LLM* driving the host is not
  what's being scored, the server is.
- `BAYESIAN_CAGE_DB` — SQLite belief store path (default `~/.bayescore/bayescore.db`)
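
The `your.pkg.module:CustomVerifier` form follows the familiar `module:Attribute` convention. A minimal sketch of how such a spec is conventionally resolved with `importlib` is shown below; `resolve_spec` is a hypothetical helper, not necessarily how the cage loads verifiers, and built-in names or the `ensemble:` prefix would need to be handled before reaching it.

```python
import importlib


def resolve_spec(spec: str):
    """Resolve a 'package.module:Attribute' string to the named object."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:Attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# Stand-in target from the standard library, since no custom verifier exists here.
cls = resolve_spec("collections:OrderedDict")
print(cls.__name__)
```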

### Closing the loop

Without outcome feedback, beliefs sit at the Beta(1,1) prior and `p` just
equals the raw verifier signal — you get a useful day-one gate, but no
*learning*. To sharpen calibration over time, feed real outcomes back from
whatever ground truth you already have (test pass/fail, downstream success,
manual labels):

```bash
bayesian-cage observe --model fs-server --task list_directory \
  --correct true --id <observation_id>
```

The `observation_id` is in the `_bayescore` envelope on each gated result.
After a few dozen labeled outcomes the per-`(model, task, signal-bin)` Beta
sharpens and the same raw signal earns a different `p`. `bayesian-cage export
--out labels.jsonl` dumps the full observation log.

## How it works
`verify → calibrate → gate → observe`. A pluggable **Verifier** scores the output; a
per-`(model, task, signal-bin)` Beta belief in one local SQLite store calibrates that
score into a probability; the gate decides by an explicit cost ratio; observed outcomes
update the belief. Calibration is learned per model, so swapping the LLM recalibrates
instead of conflating.
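
The gate's exact formula is not spelled out here. One common way to turn a cost ratio into a decision, shown purely as an illustration, is the expected-cost threshold: if passing a wrong output costs `cost_wrong_pass` and blocking a correct one costs `cost_over_block`, then blocking is cheaper whenever `p` falls below `cost_wrong_pass / (cost_wrong_pass + cost_over_block)`. The function name, the default costs, and the FLAG margin are assumptions.

```python
def gate(p: float, cost_wrong_pass: float = 3.0, cost_over_block: float = 1.0,
         flag_margin: float = 0.1) -> str:
    """Return PROCEED / FLAG / BLOCK from a calibrated probability of correctness."""
    # Blocking has lower expected cost when
    #   p * cost_over_block < (1 - p) * cost_wrong_pass.
    threshold = cost_wrong_pass / (cost_wrong_pass + cost_over_block)
    if p < threshold:
        return "BLOCK"
    if p < threshold + flag_margin:
        return "FLAG"  # close to break-even: pass it through, but label it
    return "PROCEED"


for p in (0.5, 0.8, 0.95):
    print(p, gate(p))
```

With the default 3:1 ratio the break-even point is 0.75, so the three calls print BLOCK, FLAG, and PROCEED respectively.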

## Verifiers
- `HeuristicVerifier` — default, offline. Catches refusals, non-answers, error markers,
  placeholders, degenerate repetition, truncation, heavy hedging. Pass `expected` or
  `reference` for a grounded signal.
- `SqlVerifier` — swallowed SQL errors, schema typos, broken-empty vs legitimately-empty
  results, all-NULL rows.
- `JsonVerifier` — malformed/valid JSON and a `required` key list for structured outputs.
- `FilesystemVerifier` — POSIX errno markers and access-denied envelopes block; file
  listings and plain content pass.
- `EnsembleVerifier` — composes verifiers and gates on the strictest (minimum) signal.

Add your own by implementing `Verifier.verify(output, context) -> VerifierResult`.
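
As a sketch of that extension point, the block below defines a toy verifier that checks whether a tool result is a non-empty CSV with a consistent column count. To stay runnable without the package installed, it uses a local stand-in for `VerifierResult`; the `signal` and `reason` fields and the `columns` context key are assumptions, so check the package's actual definitions before adapting this.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class VerifierResult:
    """Local stand-in; the real type's fields may differ."""
    signal: float  # assumed range: 0.0 (untrustworthy) to 1.0 (trustworthy)
    reason: str


class CsvShapeVerifier:
    """Scores a CSV string by whether every row matches the header width."""

    def verify(self, output: str, context: dict) -> VerifierResult:
        rows = [row for row in csv.reader(io.StringIO(output)) if row]
        if not rows:
            return VerifierResult(0.0, "empty output")
        width = len(rows[0])
        expected = context.get("columns")  # hypothetical context key
        if expected is not None and width != expected:
            return VerifierResult(0.1, f"expected {expected} columns, got {width}")
        ragged = sum(1 for row in rows[1:] if len(row) != width)
        if ragged:
            return VerifierResult(0.3, f"{ragged} row(s) do not match the header width")
        return VerifierResult(0.9, "consistent CSV shape")


verifier = CsvShapeVerifier()
print(verifier.verify("id,name\n1,ada\n2,grace\n", {"columns": 2}))
print(verifier.verify("id,name\n1,ada,extra\n", {}))
```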

## Calibration benchmark — why not just ask the model?

Because a model's self-reported confidence is often badly miscalibrated. Concretely, on a
55-task, execution-graded text-to-SQL benchmark against `phi3` (via Ollama, 5-fold
cross-validation, seed=7, model accuracy 67.3%):

| metric        | raw phi3 |   cage  | direction                  |
|---------------|---------:|--------:|----------------------------|
| ECE           |    0.325 |   0.081 | lower better — **4× tighter** |
| Brier         |    0.322 |   0.174 | lower better — **46% lower**  |
| catch-rate    |    0.000 |   0.333 | higher better              |
| wrong-passed  |       18 |      12 | lower better               |
| over-block    |    0.000 |   0.000 | lower better               |
| AUROC         |    0.583 |   0.544 | higher better              |

Phi3's raw confidence isn't *discriminative* — every one of the 55 answers came
back at ~1.0, so AUROC stays near chance whether you ask the model or the cage.
What the cage *does* is stop the model from lying to you about that confidence:
calibration error (ECE) drops 4×, and **a third of phi3's wrong answers now get caught**
instead of passed through, without blocking a single correct answer.
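
The raw-phi3 column can be sanity-checked by hand. A minimal sketch, assuming equal-width bins for ECE (the benchmark's exact binning is not stated here) and idealizing every confidence to exactly 1.0: with 37 of 55 answers correct (67.3%), both ECE and Brier come out to 18/55 ≈ 0.327, close to the reported 0.325 and 0.322.

```python
def brier(confidences, labels):
    """Mean squared error between stated confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, labels)) / len(labels)


def ece(confidences, labels, n_bins=10):
    """Expected calibration error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, labels):
        bins[min(int(c * n_bins), n_bins - 1)].append((c, y))
    total = len(labels)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of examples.
        error += len(members) / total * abs(mean_conf - accuracy)
    return error


labels = [1] * 37 + [0] * 18   # 37 of 55 correct
confidences = [1.0] * 55       # idealized "always certain" model

print(round(ece(confidences, labels), 3))    # 0.327
print(round(brier(confidences, labels), 3))  # 0.327
```

A model that always reports full confidence has a calibration error equal to its error rate, which is the gap the cage's learned beliefs are meant to close.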

Reproduce:
```bash
ollama pull phi3
python -m bayesian_cage.eval.sqlbench.run --model phi3 --seed 7 --out bench_out
```
Correctness is labeled by executing the model's SQL against a real SQLite
database and comparing to a gold query — nothing hand-labeled. Each example is
held out exactly once and scored by a cage that has already observed the
*other* folds; the published number reflects learned calibration, not just the
raw verifier signal. Artifacts (reliability diagram, metrics, labeled dataset)
land in `bench_out/`. The same rig runs offline against a stub model
(`--model stub`) for CI and install verification.

## License
MIT — free for any use, including commercial.
