Metadata-Version: 2.4
Name: jeevesagent
Version: 0.9.3
Summary: Model-agnostic, MCP-native agent harness
Author: Jeeves
License: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: anyio>=4.4.0
Requires-Dist: pydantic>=2.6.0
Requires-Dist: python-ulid>=2.2.0
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40.0; extra == 'anthropic'
Provides-Extra: chroma
Requires-Dist: chromadb>=0.5.0; extra == 'chroma'
Provides-Extra: cohere
Requires-Dist: cohere>=5.0.0; extra == 'cohere'
Provides-Extra: dbos
Requires-Dist: dbos>=0.7.0; extra == 'dbos'
Provides-Extra: dev
Requires-Dist: anyio[trio]>=4.4.0; extra == 'dev'
Requires-Dist: bump-my-version>=0.30; extra == 'dev'
Requires-Dist: hypothesis>=6.100; extra == 'dev'
Requires-Dist: import-linter>=2.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: docs
Requires-Dist: furo>=2024.0; extra == 'docs'
Requires-Dist: linkify-it-py>=2.0; extra == 'docs'
Requires-Dist: myst-parser>=2.0; extra == 'docs'
Requires-Dist: sphinx-autoapi>=3.0; extra == 'docs'
Requires-Dist: sphinx>=7.0; extra == 'docs'
Provides-Extra: litellm
Requires-Dist: litellm>=1.40.0; extra == 'litellm'
Provides-Extra: loader
Requires-Dist: beautifulsoup4>=4.12; extra == 'loader'
Requires-Dist: openpyxl>=3.1; extra == 'loader'
Requires-Dist: pypdf>=4.0; extra == 'loader'
Requires-Dist: python-docx>=1.0; extra == 'loader'
Requires-Dist: tiktoken>=0.7; extra == 'loader'
Provides-Extra: loader-docx
Requires-Dist: python-docx>=1.0; extra == 'loader-docx'
Provides-Extra: loader-excel
Requires-Dist: openpyxl>=3.1; extra == 'loader-excel'
Provides-Extra: loader-html
Requires-Dist: beautifulsoup4>=4.12; extra == 'loader-html'
Provides-Extra: loader-pdf
Requires-Dist: pypdf>=4.0; extra == 'loader-pdf'
Provides-Extra: loader-token
Requires-Dist: tiktoken>=0.7; extra == 'loader-token'
Provides-Extra: mcp
Requires-Dist: httpx>=0.27.0; extra == 'mcp'
Requires-Dist: mcp>=1.0.0; extra == 'mcp'
Provides-Extra: openai
Requires-Dist: openai>=1.30.0; extra == 'openai'
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.24.0; extra == 'otel'
Requires-Dist: opentelemetry-sdk>=1.24.0; extra == 'otel'
Provides-Extra: postgres
Requires-Dist: asyncpg>=0.29.0; extra == 'postgres'
Requires-Dist: pgvector>=0.2.5; extra == 'postgres'
Provides-Extra: redis
Requires-Dist: redis>=5.0.0; extra == 'redis'
Provides-Extra: temporal
Requires-Dist: temporalio>=1.7.0; extra == 'temporal'
Provides-Extra: vectorstore
Requires-Dist: asyncpg>=0.29.0; extra == 'vectorstore'
Requires-Dist: chromadb>=0.5.0; extra == 'vectorstore'
Requires-Dist: faiss-cpu>=1.8.0; extra == 'vectorstore'
Requires-Dist: numpy>=1.26; extra == 'vectorstore'
Requires-Dist: pgvector>=0.2.5; extra == 'vectorstore'
Provides-Extra: vectorstore-chroma
Requires-Dist: chromadb>=0.5.0; extra == 'vectorstore-chroma'
Provides-Extra: vectorstore-faiss
Requires-Dist: faiss-cpu>=1.8.0; extra == 'vectorstore-faiss'
Requires-Dist: numpy>=1.26; extra == 'vectorstore-faiss'
Provides-Extra: vectorstore-postgres
Requires-Dist: asyncpg>=0.29.0; extra == 'vectorstore-postgres'
Requires-Dist: pgvector>=0.2.5; extra == 'vectorstore-postgres'
Provides-Extra: voyage
Requires-Dist: voyageai>=0.2.0; extra == 'voyage'
Description-Content-Type: text/markdown

# JeevesAgent

**Production-ready async agent harness. Multi-tenant by default,
typed outputs, retries on transient errors, model-agnostic, MCP-native.**

📖 **Docs** — <https://jeevesagent.readthedocs.io>
&nbsp;&nbsp;·&nbsp;&nbsp;
**Migrating?** — [from LangGraph](docs/migrations/from-langgraph.md)
&nbsp;·&nbsp;
[from raw OpenAI SDK](docs/migrations/from-openai-sdk.md)
&nbsp;&nbsp;·&nbsp;&nbsp;
**Changelog** — [CHANGELOG.md](CHANGELOG.md)

```python
import asyncio
from pydantic import BaseModel
from jeevesagent import Agent

class WeatherReport(BaseModel):
    city: str
    temp_c: float
    conditions: str

async def main():
    agent = Agent("Be precise.", model="gpt-4.1-mini")

    # Free-form run, scoped to a user (memory partitions automatically).
    r = await agent.run("Hi, my name is Alice.", user_id="alice")
    print(r.output)

    # Same agent, structured output, conversation continues.
    r = await agent.run(
        "Weather in Tokyo right now: sunny, 22°C, light wind. Extract.",
        user_id="alice",
        session_id="conv_42",
        output_schema=WeatherReport,
    )
    report: WeatherReport = r.parsed   # ← typed, validated
    print(f"{report.city}: {report.temp_c}°C — {report.conditions}")

asyncio.run(main())
```

Set `OPENAI_API_KEY` and run. Swap `"gpt-4.1-mini"` for
`"claude-opus-4-7"`, `"mistral-large"`, `"command-r-plus"`,
`"echo"` (zero-key fake), or any of ~100 providers via LiteLLM.

**What's actually different about this framework:**

* `user_id` is a first-class typed primitive. One shared `Agent` +
  one shared `Memory` partitions automatically across N tenants
  with no cross-contamination. **No more "forgot to namespace" data
  leaks.**
* `output_schema=` accepts any Pydantic model. The framework
  augments the system prompt, parses the result, validates it, and
  retries with feedback on validation failure. **Typed outputs by
  default, free-text by omission.**
* Network model adapters are auto-wrapped with a typed error
  taxonomy + retry policy. Rate limits, 5xx, network blips don't
  blow up your run. **Resilient by default.**
* `session_id` is a real conversation handle. Reuse it across
  `agent.run()` calls and prior turns rehydrate as real chat
  history. **No reducer protocol, no `add_messages` magic.**
* The agent loop is a *strategy*. Twelve architectures shipped
  (ReAct, Self-Refine, Reflexion, TreeOfThoughts, PlanAndExecute,
  ReWOO, Router, Supervisor, ActorCritic, MultiAgentDebate, Swarm,
  Blackboard) behind one `Agent` constructor. **One kwarg flips
  the iteration pattern.**
* Async-only, anyio everywhere, structured concurrency cancellation
  works correctly. Fast path when production features (audit / OTel
  / permissions / hooks / journaling) aren't wired up.
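
A toy sketch of what the first point guarantees, per-tenant partitioning keyed by `user_id` (illustrative only, not the library's actual storage layout):

```python
from collections import defaultdict

class PartitionedMemory:
    """Toy model of per-tenant partitioning: every read and write is
    scoped by user_id, so one tenant can never observe another's entries."""

    def __init__(self):
        self._store: dict[str, list[str]] = defaultdict(list)

    def append(self, user_id: str, text: str) -> None:
        self._store[user_id].append(text)

    def recall(self, user_id: str) -> list[str]:
        # Reads only ever touch the caller's own partition.
        return list(self._store[user_id])

mem = PartitionedMemory()
mem.append("alice", "My name is Alice.")
mem.append("bob", "My name is Bob.")
assert mem.recall("alice") == ["My name is Alice."]
assert "My name is Bob." not in mem.recall("alice")
```

Because the scoping happens inside the store rather than in a caller-built key string, there is no unnamespaced path to leak through.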

> ⚠️ **`model` is required** as of v0.2.0. Earlier `0.1.x` releases
> silently defaulted to `EchoModel` which produced confusing output;
> now the harness fails fast with a helpful error if you forget.

---

## Why pick this over LangGraph / CrewAI / AutoGen

Every agent framework forces a choice you shouldn't have to make:

* **LangChain / LangGraph** lock you into a graph editor and a
  specific state model. `user_id` is a string in
  `config["configurable"]` — typo it once and you silently leak
  data across tenants. Structured outputs and retries are
  developer-side concerns.
* **Claude Agent SDK** is excellent if you're committed to Anthropic
  forever. It's not model-agnostic.
* **OpenAI Assistants** is a black box you don't run yourself.
* **CrewAI / AutoGen** are abstractions over LangChain — same
  problems.

JeevesAgent is the harness for engineers shipping production agents
without binding their stack to one model lab — and without wiring
multi-tenancy / structured outputs / retries by hand.

**Capabilities at a glance:**

* **Model-agnostic** — Anthropic, OpenAI, and ~100 more via LiteLLM
  behind one `Model` protocol. String-based resolver:
  `model="claude-opus-4-7"`, `"gpt-4.1-mini"`, `"mistral-large"`, …
* **Pluggable architectures** — twelve shipped, same `Agent`
  surface, one kwarg switches the iteration strategy.
* **MCP-native** — MCP is the tool spine, not an integration. Jeeves
  Gateway / Composio / any MCP server plugs into a single
  `MCPRegistry`.
* **Memory done right** — five backends (in-memory / vector /
  Chroma / Postgres+pgvector / Redis), pluggable embedders, and
  **bi-temporal facts** that track when claims were true in the
  world vs when you learned them. All five backends partition by
  `user_id`.
* **Durable runtime** — `SqliteRuntime` gives crash-recovery replay
  with zero infrastructure. Postgres also supported.
* **Observable** — OpenTelemetry spans and metrics for every step.
  Drop in your exporter (Honeycomb / Datadog / LangSmith).
* **Safe** — permission policies, sandbox layers, append-only
  HMAC-signed audit log, freshness/lineage policies for certified
  values.
* **Async-only, structured concurrency** — anyio everywhere, zero
  raw `asyncio.create_task` / `gather`. Parallel tool dispatch via
  task groups. Backpressure-aware streaming.
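
"Bi-temporal" means every fact carries two timelines: when the claim was true in the world versus when the system learned it. A sketch of the distinction (field and function names here are illustrative, not the library's schema):

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Fact:
    claim: str
    valid_from: date    # when the claim became true in the world
    recorded_at: date   # when the system learned it

facts = [
    Fact("Alice works at Acme",    valid_from=date(2023, 1, 1), recorded_at=date(2024, 6, 1)),
    Fact("Alice works at Initech", valid_from=date(2024, 3, 1), recorded_at=date(2024, 9, 1)),
]

def as_of(facts, world: date, knowledge: date):
    """Latest fact true in the world at `world`, as known by `knowledge`."""
    live = [f for f in facts if f.valid_from <= world and f.recorded_at <= knowledge]
    return max(live, key=lambda f: f.valid_from, default=None)

# In July 2024 we still believed Alice was at Acme, even though the
# Initech move had already happened: we hadn't learned it yet.
assert as_of(facts, world=date(2024, 4, 1), knowledge=date(2024, 7, 1)).claim == "Alice works at Acme"
assert as_of(facts, world=date(2024, 4, 1), knowledge=date(2024, 10, 1)).claim == "Alice works at Initech"
```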

Three principles govern every line of code:

1. **The loop is deterministic; the world isn't.** Every side effect
   goes through `runtime.step(...)` so it can be cached and replayed.
2. **Trust boundary stays outside the sandbox.** The harness runs
   tools inside a sandbox; the harness doesn't run inside one.
3. **Validate state on write, not on read.** Pydantic everywhere.

---

## Install

```bash
pip install jeevesagent

# Pick the extras you need:
pip install 'jeevesagent[anthropic]'    # Claude
pip install 'jeevesagent[openai]'       # GPT
pip install 'jeevesagent[postgres]'     # PostgresMemory + facts
pip install 'jeevesagent[mcp]'          # real MCP client
pip install 'jeevesagent[otel]'         # OpenTelemetry exporters

# Or install everything for development:
pip install -e '.[dev,anthropic,openai,mcp,postgres,otel]'
```

Requires Python 3.11+.

---

## 30-second quickstart

```python
import asyncio
from jeevesagent import Agent, tool

@tool
async def get_weather(city: str) -> str:
    """Look up the current weather."""
    return f"It's sunny and 72°F in {city}."

async def main():
    agent = Agent(
        "You are a travel assistant.",
        model="claude-opus-4-7",       # or "gpt-4o", or any Model instance
        tools=[get_weather],
    )
    result = await agent.run("What's the weather like in Tokyo?")
    print(result.output)
    print(f"Used {result.tokens_in + result.tokens_out} tokens, ${result.cost_usd:.4f}")

asyncio.run(main())
```

Set `ANTHROPIC_API_KEY` (or `OPENAI_API_KEY`) before running. That's
it — no LangChain, no LangGraph, no `chat_engine = AgentExecutor.from_llm_and_tools(...)`.

Want to see *what's happening* as the agent runs?

```python
async for event in agent.stream("plan a 3-day Tokyo trip"):
    print(f"[{event.kind}] {event.payload}")
```

You'll see `STARTED → MODEL_CHUNK × N → TOOL_CALL → TOOL_RESULT →
MODEL_CHUNK × N → COMPLETED` flow through.

---

## Architectures: the agent loop is a strategy

The default loop is ReAct (observe / think / act). When that doesn't
fit your problem, swap it with one kwarg — everything else (model,
memory, tools, budget, telemetry, runtime) stays exactly the same.

### Single-agent loops: pass `architecture=`

```python
from jeevesagent import Agent

agent = Agent("...", model="claude-opus-4-7")                            # ReAct default
agent = Agent("...", model="...", architecture="self-refine")            # iterate until critic happy
agent = Agent("...", model="...", architecture="reflexion")              # verbal RL with lessons
agent = Agent("...", model="...", architecture="plan-and-execute")       # plan once, execute steps
agent = Agent("...", model="...", architecture="rewoo")                  # plan + parallel tools, 30-50% cheaper
agent = Agent("...", model="...", architecture="tree-of-thoughts")       # BFS beam over candidate thoughts
```

### Multi-agent teams: use `Team` builders (the ergonomic facade)

`Team` mirrors the builder shape every other framework uses
(`create_supervisor` / `Crew` / `GroupChatManager`) so migrating from
LangGraph / CrewAI / AutoGen / OpenAI Agents SDK is muscle-memory.
Each builder returns a regular `Agent` — same `.run()` / `.stream()`
interface, no special calling convention.

```python
from jeevesagent import Agent, Team, RouterRoute

# Coordinator + workers; the manager calls delegate(...) or forward_message(...)
team = Team.supervisor(
    workers={"researcher": researcher, "writer": writer, "reviewer": reviewer},
    instructions="manage the pipeline",
    model="claude-opus-4-7",
)

# Classify-and-dispatch — cheaper than Supervisor when one specialist
# is enough (1 classifier call + 1 specialist run, no synthesis pass)
team = Team.router(
    routes=[
        RouterRoute(name="billing", agent=billing, description="..."),
        RouterRoute(name="tech",    agent=tech,    description="..."),
    ],
    instructions="customer support entry point",
    model="claude-haiku-4-5",
)

# Peer agents passing control via typed handoffs (input_type= for
# structured payloads, input_filter= for selective history pruning)
team = Team.swarm(
    agents={"triage": triage, "billing": billing, "tech": tech},
    entry_agent="triage",
    model="claude-opus-4-7",
)

# Actor + critic with different models for blind-spot diversity
team = Team.actor_critic(
    actor=Agent("...", model="claude-opus-4-7"),
    critic=Agent("...", model="gpt-4o"),       # different model
    max_rounds=3,
    approval_threshold=0.9,
    model="claude-opus-4-7",                    # coordinator
)

# N debaters + optional judge with similarity-based early termination
team = Team.debate(
    debaters=[optimist, skeptic, analyst],
    judge=cio,
    rounds=2,
    convergence_similarity=0.85,
    model="claude-opus-4-7",
)

# Coordinator + agents share a workspace; decider synthesizes
team = Team.blackboard(
    agents={"hypothesis": h_agent, "evidence": e_agent, "critic": c_agent},
    coordinator=coord_agent,
    decider=decider_agent,
    model="claude-opus-4-7",
)
```

### Recursive composition (the differentiator)

Architectures wrap each other naturally — the property no
sibling-only framework gives you. Wrap a Supervisor in Reflexion for
cross-session learning of delegation patterns; nest Supervisors for
hierarchical teams; wrap an entire pipeline in `Reflexion` to retry
on low scores:

```python
from jeevesagent import Agent, Reflexion, Supervisor

agent = Agent(
    "...",
    model="claude-opus-4-7",
    architecture=Reflexion(
        base=Supervisor(workers={"researcher": ..., "writer": ...}),
        max_attempts=3,
        threshold=0.85,
        lesson_store=InMemoryVectorStore(embedder=HashEmbedder()),  # selective recall
    ),
)
```

The explicit nested form (`Agent(architecture=...)`) and `Team`
builders are interchangeable — `Team.supervisor(workers={...})` is
exactly `Agent(architecture=Supervisor(workers={...}))` under the
hood. Use `Team` for single-level teams (matches what you've seen
in other frameworks); use the nested form for recursive composition.

### Standalone testing of orchestrators

```python
from jeevesagent import Supervisor, run_architecture

sup = Supervisor(workers={"a": agent_a})
result = await run_architecture(sup, "do the thing", model="claude-opus-4-7")
```

Architectures are pluggable via the `Architecture` protocol — three
methods (`name`, `run`, `declared_workers`) and you have a custom
strategy. See [`Subagent.md`](Subagent.md) for the full design
rationale.

---

## Architecture cheat sheet

Visual reference for picking the right pattern. Each diagram shows
the actual data flow + LLM-call structure for that architecture.

### Single-agent loops

**`ReAct`** — observe / think / act loop. The default. One model call per turn; tools dispatch in parallel.

```
                 ┌────────── loop until no tool calls ──────────┐
                 │                                              │
   prompt ───► Model ───► tool calls? ──yes──► run tools ──► results
                 │                              (parallel)
                 └─────────► no calls ───► final output
```

**`SelfRefine`** — single-agent generate → critique → refine. Same model wears both hats.

```
   prompt ───► generate ───► critique ──┬── score ≥ threshold ──► output
                              ▲         │
                              │         └── below ──► refine ──┐
                              │                                │
                              └────────────────────────────────┘
```

**`Reflexion`** — wraps any base architecture with verbal-RL retry. Failed attempts produce a "lesson" stored in memory or a vector store; next attempt sees the relevant lessons.

```
   ┌─────────── attempt loop (max_attempts) ───────────┐
   │                                                    │
   │   prompt ──► [recall lessons] ──► base.run() ──► evaluator
   │                                                    │
   │                                              score < threshold?
   │                                                    │
   │                                              yes ──┴── no ──► output
   │                                                    │
   │                                              reflector ──► lesson
   │                                                    │
   └────────────────────────────────── persist ─────────┘
                                          │
                          memory block  OR  vector store (selective recall)
```

**`TreeOfThoughts`** — BFS beam search over candidate thoughts. Proposer + evaluator at every depth; beam keeps top-k; min_score floor drops weak branches early.

```
              proposer (×branch_factor)         evaluator
   prompt ──► [t1, t2, t3]  ──score──►  [0.9, 0.4, 0.7]
                                              │
                                         keep top beam_width
                                         drop below min_score
                                              │
                                              ▼
                                         [t1, t3]   ←── frontier for depth 2
                                              │
                                         (repeat to max_depth)
                                              │
                                              ▼
                                       best leaf wins
```
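
One depth step of that beam search reduces to score, filter, truncate. A minimal sketch, assuming an evaluator has already scored the candidate thoughts:

```python
def beam_step(scored, beam_width=2, min_score=0.5):
    """Keep the top `beam_width` candidates whose score clears the floor."""
    survivors = [(t, s) for t, s in scored if s >= min_score]
    survivors.sort(key=lambda ts: ts[1], reverse=True)
    return [t for t, _ in survivors[:beam_width]]

# Matches the diagram: t2 falls below min_score, the beam keeps t1 and t3.
frontier = beam_step([("t1", 0.9), ("t2", 0.4), ("t3", 0.7)])
assert frontier == ["t1", "t3"]
```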

**`PlanAndExecute`** — planner emits a step list once; executor walks each step; synthesizer composes the final answer.

```
   prompt ───► planner ───► [step1, step2, step3]
                                     │
                                     ▼
                              executor (per step) ───► [r1, r2, r3]
                                                            │
                                                            ▼
                                                      synthesizer ───► output
```

**`ReWOO`** — like PlanAndExecute but the planner emits structured tool calls with `{{En}}` placeholders, and **independent steps run in parallel**. Two LLM calls + N tool calls — 30-50% cheaper than ReAct on tool-heavy workloads.

```
   prompt ───► planner ───► [search({{E1}}), fetch({{E2}}=search.url)]
                                          │
                                          ▼
                            parallel tool dispatch
                            (independent steps run concurrently;
                             dependent steps wait for {{En}})
                                          │
                                          ▼
                                    synthesizer ───► output
```
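
The `{{En}}` plumbing is variable substitution over a plan: a step whose argument references `{{E1}}` waits for step E1's result, while unreferenced steps could dispatch immediately. A simplified sequential sketch (the real framework runs independent steps concurrently):

```python
import re

def run_plan(steps, tools):
    """steps: list of (evidence_id, tool_name, arg_template).
    Substitute {{En}} placeholders with earlier results, then call the tool."""
    evidence: dict[str, str] = {}
    for eid, tool, arg in steps:
        # Resolve any {{En}} references against already-computed evidence.
        resolved = re.sub(r"\{\{(E\d+)\}\}", lambda m: evidence[m.group(1)], arg)
        evidence[eid] = tools[tool](resolved)
    return evidence

tools = {
    "search": lambda q: f"url-for:{q}",
    "fetch":  lambda u: f"page-at:{u}",
}
plan = [("E1", "search", "jeeves docs"), ("E2", "fetch", "{{E1}}")]
result = run_plan(plan, tools)
assert result["E2"] == "page-at:url-for:jeeves docs"
```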

### Multi-agent teams

**`Router`** — classify-and-dispatch. ONE classifier call decides which specialist runs; that one specialist owns the answer.

```
                       ┌── refund_agent
   prompt ──► classifier ──► technical_agent      ◄── only ONE
                       └── faq_agent ◄── chosen      runs

   1 classifier call + 1 specialist run. The cheapest multi-agent pattern.
```

**`Supervisor`** — coordinator + workers, glued by a `delegate(worker, instructions)` tool. Multiple delegations in one supervisor turn run in parallel. `forward_message(worker)` returns a worker's output verbatim with no synthesis.

```
   prompt ───► manager ───► delegate(...) ─┬─► worker A ─┐
                              │            ├─► worker B ─┤  parallel
                              │            └─► worker C ─┤
                              ▼                          │
                          [worker outputs] ◄─────────────┘
                              │
                              ├─► synthesize ──► output
                              │
                              └─► forward_message(worker) ──► verbatim output
```

**`ActorCritic`** — actor + critic pair (use *different models* for blind-spot diversity). Critic returns structured JSON `{score, issues, summary}`; actor refines below threshold.

```
   prompt ───► actor ───► critic ──┬── score ≥ threshold ──► output
                  ▲                │
                  │                └── below ──► refine (apply rubric)
                  │                                  │
                  └──────────── max_rounds cap ──────┘
```

**`MultiAgentDebate`** — N debaters argue across rounds (in parallel each round). Jaccard convergence detects early agreement; optional judge synthesizes the final answer.

```
   prompt ──► [debater1, debater2, debater3]   ◄── round 1 (parallel)
                              │
                       converged? (Jaccard ≥ 0.85)
                       yes ───► output
                       no  ───► [responses fed back]
                              │
              [debater1, debater2, debater3]    ◄── round 2 (sees prior)
                              │
                              ▼
                          judge ──► output     (or majority vote if no judge)
```
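
The convergence check treats each response as a token set and compares every pair. A sketch of how Jaccard convergence might work, assuming simple whitespace tokenisation (the real implementation may normalise differently):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def converged(responses, threshold=0.85):
    """True when every pair of debater responses overlaps past the threshold."""
    return all(jaccard(a, b) >= threshold for a, b in combinations(responses, 2))

assert converged(["buy the stock now", "buy the stock now"])
assert not converged(["buy the stock now", "sell everything immediately"])
```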

**`Swarm`** — peer agents handing off control via a `handoff` tool (or per-target `transfer_to_<name>` tools when peers are wrapped in `Handoff` with an `input_type`). No central coordinator.

```
   prompt ──► agent A
                 │
                 │ handoff(B, payload)
                 ▼
              agent B
                 │
                 │ transfer_to_C(typed_args)
                 ▼
              agent C ──► final output
                 ▲
                 │ cycle detection: A→B→A→B kills the loop
                 │ max_handoffs caps total depth
```
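
Cycle detection can be read as a repeated-edge check on the handoff chain. This sketch shows one plausible heuristic, not necessarily the framework's exact rule:

```python
def detect_cycle(handoffs: list[str]) -> bool:
    """True if any (from, to) handoff edge repeats:
    e.g. A->B ... A->B means two agents are bouncing control."""
    edges = list(zip(handoffs, handoffs[1:]))
    return len(edges) != len(set(edges))

assert detect_cycle(["A", "B", "A", "B"])    # A->B occurs twice: kill the loop
assert not detect_cycle(["A", "B", "C"])     # a linear chain is fine
```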

**`BlackboardArchitecture`** — agents collaborate via a shared mutable workspace. Coordinator picks who acts next; decider says when work is done.

```
                ┌───────────── shared blackboard ─────────────┐
                │   facts · hypotheses · partial results       │
                └────▲──────▲──────▲──────▲────────────▲───────┘
                     │ r/w  │ r/w  │ r/w  │ r/w        │
                     │      │      │      │            │
   prompt ──► coordinator ──► picks who acts next      │
                     │      │      │      │            │
                  agent A  agent B  agent C            │
                     │      │      │      │            │
                     ▼      ▼      ▼      ▼            │
                              decider ◄────────────────┘
                                 │
                                 ├─ done? ──► output
                                 │
                                 └─ not done ──► next round
```

### Recursive composition

Any architecture can wrap any other. The killer combination: `Reflexion` *of* `Supervisor` — the team learns across attempts which worker handles which intent best.

```
   ┌────── Reflexion attempt loop ──────┐
   │                                     │
   │   prompt ──► Supervisor ──► output ─┤── score ≥ threshold ──► done
   │              (manager + 3 workers)  │
   │                                     │
   │                                     └── below ──► lesson ──► retry
   │                                                                │
   └────────────────────────────────────────────────────────────────┘
```

```python
agent = Agent(
    "...",
    model="claude-opus-4-7",
    architecture=Reflexion(
        base=Supervisor(workers={"researcher": ..., "writer": ..., "reviewer": ...}),
        lesson_store=InMemoryVectorStore(embedder=HashEmbedder()),  # selective recall
    ),
)
```

---

## Skills: packaged playbooks the agent loads on demand

Tools tell the agent **what** it can do. Skills tell it **how** —
domain-specific recipes the agent reads when relevant, ignores when
not. Same shape as [Anthropic Agent Skills (Oct 2025)](https://docs.claude.com/en/docs/agents-and-tools/agent-skills/overview):
a directory with `SKILL.md` (frontmatter + markdown body) and
optional bundled files. Drop your existing Anthropic-format skills
into our `skills=[...]` and they Just Work.

```python
from jeevesagent import Agent

agent = Agent(
    "...",
    model="claude-opus-4-7",
    skills=[
        "~/.jeeves/skills/system/",          # base layer
        "~/.jeeves/skills/user/",            # user override
        ("./.jeeves-skills/", "Project"),    # project-local with label
    ],
)
```

**Progressive disclosure**: only `name` + `description` (~50 tokens
per skill) load into the system prompt at startup. The model calls
a `load_skill(name)` tool when a skill is relevant — only THEN does
the full body enter context. A 50-skill agent costs ~2,500 tokens at
rest; nothing more until the model actually loads one.
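
The mechanism in miniature: the prompt carries only the catalog, and a skill's body crosses into context only when `load_skill` fires. A conceptual sketch, not the library's internals:

```python
class SkillCatalog:
    def __init__(self, skills: dict[str, tuple[str, str]]):
        # name -> (description, full_body); only the description is cheap
        self._skills = skills
        self.loaded: dict[str, str] = {}

    def prompt_stub(self) -> str:
        """What the system prompt sees at rest: name + description only."""
        return "\n".join(f"- {n}: {d}" for n, (d, _) in self._skills.items())

    def load_skill(self, name: str) -> str:
        """Tool the model calls; only now does the body enter context."""
        body = self._skills[name][1]
        self.loaded[name] = body
        return body

cat = SkillCatalog({"standup": ("Format a daily standup.", "# Standup\n3 sections...")})
assert "standup" in cat.prompt_stub()
assert "3 sections" not in cat.prompt_stub()      # body not paid for at rest
assert "3 sections" in cat.load_skill("standup")  # paid only on demand
```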

### Three skill modes — coexist freely in any skill

```
skills/my-skill/
├── SKILL.md         ← required: frontmatter + markdown body
├── tools.py         ← OPTIONAL: @tool functions (Mode B, in-process Python)
└── scripts/         ← OPTIONAL: executable scripts (Mode A or Mode C)
    └── helper.py
```

**Mode A — pure markdown.** SKILL.md teaches the model how to use
your existing tools (`read`, `write`, `bash`). The model issues
those tool calls itself based on the body's instructions.

**Mode C — frontmatter declares a script as a typed tool.** Any
language. The framework wraps the script in a subprocess-backed
`Tool` with proper args; the model calls it like any built-in tool.

```yaml
---
name: calc
description: Arithmetic helpers.
tools:
  add:
    description: Sum two integers.
    script: scripts/add.py
    args:
      a:
        type: string
        description: First int
      b:
        type: string
        description: Second int
---
```
```python
# scripts/add.py — plain Python, no decorators
import sys
print(int(sys.argv[1]) + int(sys.argv[2]))
```
The model calls `calc__add(a="2", b="3")` → framework execs the
script → captures stdout → returns to the model.
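
The subprocess contract boils down to: pass the declared args as argv, capture stdout, return the text as the tool result. A self-contained sketch of that contract (it writes the example script to a temp directory and assumes a Python interpreter; real Mode C scripts can be any language):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def make_script_tool(script: Path):
    """Wrap a script as a callable: args become argv, stdout is the result."""
    def call(*args: str) -> str:
        out = subprocess.run(
            [sys.executable, str(script), *args],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    return call

with tempfile.TemporaryDirectory() as d:
    script = Path(d) / "add.py"
    script.write_text("import sys\nprint(int(sys.argv[1]) + int(sys.argv[2]))\n")
    add = make_script_tool(script)
    result = add("2", "3")

assert result == "5"
```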

**Mode B — `tools.py` ships `@tool` functions.** Auto-discovered by
filename presence; imported at construction; registered into the
agent's tool host when the skill is loaded.

```python
# skills/greeter/tools.py
from jeevesagent import tool

@tool
async def say_hi(name: str) -> str:
    """Say hi."""
    return f"Hi {name}!"
```
The model calls `greeter__say_hi(name="Anupam")` directly. In-process,
fast, can share the agent's state.

### Auto-namespacing prevents collisions

Tool names get prefixed with the skill name automatically:

| Skill ships | Registered as |
|---|---|
| `add` (Mode C, calc skill) | `calc__add` |
| `say_hi` (Mode B, greeter skill) | `greeter__say_hi` |
| `search` (in two skills A and B) | `a__search` and `b__search` — no clash |

### Inline skills — one-off in code

For tiny one-off skills that don't justify a folder:

```python
from jeevesagent import Skill

skill = Skill.from_text("""
---
name: standup
description: Format a daily standup from rough notes.
---
# Standup
Always 3 sections: Yesterday, Today, Blockers.
""")

agent = Agent("...", skills=[skill])
```

### Layered sources with last-wins override

When two sources ship a skill with the same name, the later source
wins, letting you stack system → user → project.

```python
agent = Agent(
    skills=[
        "~/.jeeves/skills/system/",      # base
        "~/.jeeves/skills/user/",        # user customizes
        "./.jeeves-skills/",             # project-local override
    ],
)
```
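
Last-wins resolution is an ordered merge, sketched here as a plain dict walk:

```python
def merge_layers(*layers: dict[str, str]) -> dict[str, str]:
    """Later layers overwrite earlier ones on a name collision."""
    merged: dict[str, str] = {}
    for layer in layers:        # system -> user -> project, in order
        merged.update(layer)
    return merged

system  = {"standup": "system version", "deploy": "system version"}
project = {"standup": "project version"}
skills = merge_layers(system, project)
assert skills["standup"] == "project version"   # project overrides
assert skills["deploy"] == "system version"     # un-shadowed skill survives
```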

See the [`examples/`](examples/) directory for runnable end-to-end
samples that exercise the loader, vector store, retriever-as-tool
pattern, and multi-agent debate.

---

## Fast path by default

JeevesAgent ships with the full production surface — audit log, OTel
telemetry, permissions, hooks, durable runtime, budget — but **you
don't pay for what you don't wire up**. Every layer has a no-op
default, and the loop detects those defaults at construction time
and skips the integration points entirely on the hot path.

A barebones `Agent("hi", model="gpt-4.1-mini", tools=[...])` runs
without going through the audit / telemetry / permissions / hook /
journaling / budget layers at all. The moment you pass
`audit_log=`, `telemetry=`, `permissions=`, `runtime=`, etc., the
corresponding layer flips on and the integration becomes active —
same `Agent` class, same API, no flags to set.

```text
                  default Agent              production Agent
                  ─────────────────         ─────────────────────
audit_log         None        → SKIP       FileAuditLog(...)    → wired
telemetry         NoTelemetry → SKIP       OTelTelemetry(...)   → wired
permissions       AllowAll    → SKIP       StandardPermissions  → wired
hooks             empty       → SKIP       @before_tool/@after_tool → wired
runtime           InProc      → INLINE     SqliteRuntime(...)   → wired
budget            NoBudget    → SKIP       StandardBudget(...)  → wired
```

When a layer is detected as no-op, the loop:

* skips the `await audit_log.append(...)` call (so even the function
  call dispatch is removed)
* skips `telemetry.trace(...)` async-context-manager entry/exit and
  the kwargs-dict construction for `emit_metric` calls
* skips `permissions.check(call, context={})` (returns `allow_()`
  inline)
* skips `hooks.pre_tool` / `hooks.post_tool` iteration
* inlines `await fn(*args)` instead of routing through
  `runtime.step(name, fn, ...)` — saves the idempotency-key hash
  derivation per tool call
* skips `budget.allows_step()` / `budget.consume(...)`

The point: "framework is slow because it's full-featured" stops
being the trade-off. You get the harness when you want it, the
speed when you don't, with no code changes between modes.
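
Conceptually, the detection is a one-time construction check that precomputes a flag for the hot path. A sketch of the idea with an illustrative sentinel (not the framework's actual class names):

```python
class NoTelemetry:
    """Known no-op sentinel; checked once at construction, never per call."""

class Loop:
    def __init__(self, telemetry=None):
        self.telemetry = telemetry or NoTelemetry()
        # Decide ONCE whether the integration point exists on the hot path.
        self._trace_enabled = not isinstance(self.telemetry, NoTelemetry)

    def step(self) -> str:
        if self._trace_enabled:          # branch taken only when wired
            self.telemetry.trace("step")
        return "ran"

class Recording:
    def __init__(self):
        self.calls = []
    def trace(self, name):
        self.calls.append(name)

assert Loop().step() == "ran"            # default: SKIP path, no dispatch
rec = Recording()
loop = Loop(telemetry=rec)
assert loop.step() == "ran" and rec.calls == ["step"]   # wired path
```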

---

## Resilient by default

Real model APIs fail. Rate limits, 5xx blips, transient connection
drops happen on every production deployment. JeevesAgent ships
**retry on transient errors enabled by default** for the in-tree
network adapters (OpenAI, Anthropic, LiteLLM) — the moment you
construct a real-world agent it's already covered:

```python
agent = Agent("...", model="gpt-4.1-mini")
# Default policy: 3 attempts, 1 s → 2 s → 4 s backoff
# (capped at 30 s, ±10% jitter), respects provider Retry-After.
```

The framework normalises every model SDK's exceptions into a small
typed taxonomy so callers + the retry layer reason about failures
uniformly:

```text
ModelError                       — base (catch-all model failure)
├── TransientModelError          — retry-able
│   └── RateLimitError           — 429 / quota; carries retry_after
└── PermanentModelError          — don't retry
    ├── AuthenticationError      — bad API key
    ├── InvalidRequestError      — malformed prompt / args
    └── ContentFilterError       — safety system rejection
```

`classify_model_error(exc)` does the SDK-specific mapping (lazy
imports, no hard dependency on any provider package). The wrapper
treats `TransientModelError` as retryable, `PermanentModelError` as
fatal, and any *unrecognised* exception is propagated unchanged —
the framework refuses to silently retry errors it doesn't understand.
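
The resulting dispatch rule (retry transient, fail fast on permanent, pass unknowns through) can be sketched in a few lines. Exception names mirror the tree above; the real retry loop also sleeps between attempts:

```python
class ModelError(Exception): ...
class TransientModelError(ModelError): ...
class PermanentModelError(ModelError): ...

def call_with_retries(fn, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientModelError:
            if attempt == max_attempts - 1:
                raise                    # retry budget exhausted
        except PermanentModelError:
            raise                        # never retried
        # Unrecognised exceptions propagate unchanged: no except clause
        # catches them, so they are never silently retried.

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientModelError("429")
    return "ok"

assert call_with_retries(flaky) == "ok" and attempts["n"] == 3
```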

Tune the policy per-Agent:

```python
from jeevesagent import Agent, RetryPolicy

# Default (production-sensible)
agent = Agent("...", model="gpt-4.1-mini")

# Aggressive — tolerates long provider blips
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy.aggressive())

# Disabled — handle errors yourself
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy.disabled())

# Custom
agent = Agent("...", model="gpt-4.1-mini",
              retry_policy=RetryPolicy(
                  max_attempts=4,
                  initial_delay_s=0.5,
                  max_delay_s=15.0,
              ))
```

Behaviour highlights:

* **Provider-supplied `Retry-After` is honoured** — when a 429
  response carries the header, the framework waits at least that
  long before the next attempt (even if it exceeds `max_delay_s`).
  Provider authority wins over local heuristics.
* **Streaming retries fire before the first chunk** — once the
  consumer has received any tokens we cannot rewind, so mid-stream
  errors propagate. Pre-first-chunk failures are retried per
  policy.
* **Custom Models are not auto-wrapped.** The framework only
  wraps its in-tree adapters by default because it knows their
  error classes. Custom Models opt in by passing `retry_policy=`
  explicitly to `Agent(...)`.

---

## Structured outputs

Production agents need to emit *data*, not free-form prose. Pass a
Pydantic `BaseModel` as `output_schema=` to `agent.run(...)` and the
framework gives you a typed, validated instance:

```python
from pydantic import BaseModel
from jeevesagent import Agent

class CompanyInfo(BaseModel):
    name: str
    founded_year: int
    headquarters: str

agent = Agent("extract company info", model="gpt-4.1-mini")
result = await agent.run("Tell me about Acme.", output_schema=CompanyInfo)

info: CompanyInfo = result.parsed   # ← typed, validated
print(info.founded_year)            # 2008
print(result.output)                # raw JSON text, still available
```

What the framework does:

1. **Schema-aware system prompt** — appends a `STRUCTURED OUTPUT
   REQUIRED` directive to the run's instructions, embedding the
   schema's JSON Schema. Your static `Agent(...)` instructions are
   not mutated; the augmentation is per-run.
2. **Tolerates real-world model quirks** — strips ` ```json ` /
   ` ``` ` markdown fences before parsing.
3. **Retry-with-feedback** — on a parse failure, the framework
   gives the model up to `output_validation_retries` (default `1`)
   extra single-shot turns to fix it, feeding the validation error
   back as a USER message ("Your previous response failed schema
   validation: ...; return only a corrected JSON object"). Once
   the retry budget is exhausted, the framework raises
   `OutputValidationError` with the underlying Pydantic
   `ValidationError` attached as `.cause`, the bad text on `.raw`,
   and the schema on `.schema` — so callers can build whatever
   recovery strategy they need.
4. **`result.output`** keeps the (cleaned) raw JSON text so you can
   log or audit what the model produced; **`result.parsed`** holds
   the validated Pydantic instance.

Set `output_validation_retries=0` to fail fast (no recovery turn).

End-to-end demo: [`examples/04_structured_outputs.py`](examples/04_structured_outputs.py)
extracts a structured `MeetingSummary` (with nested `ActionItem`
lists, ISO dates, and a sentiment enum) from a raw meeting transcript.

---

## Multi-tenancy by default

JeevesAgent treats `user_id` and `session_id` as **first-class typed
primitives**, not strings buried in a free-form config dict. The
moment you pass them to `agent.run(...)`, the framework partitions
memory automatically and rehydrates conversation history without
any extra plumbing.

```python
result = await agent.run(
    "what is my favourite food?",
    user_id="alice",            # hard namespace partition for memory
    session_id="conv_42",       # conversation thread; reused = continued
    metadata={"locale": "en"},  # free-form bag for app-specific keys
)
```

What the framework does with these:

* **`user_id`** is a hard partition on every memory primitive.
  Episodes and facts stored under one `user_id` are **never visible**
  to a recall scoped to a different one. `None` is its own bucket
  ("anonymous / single-tenant"). One shared `Memory` instance can
  back N concurrent users with zero risk of cross-contamination.
* **`session_id`** is the conversation handle. Reuse the same id
  across calls and the loop rehydrates prior user/assistant turns
  as real `Message` history — the model sees the chat thread, not
  just a recall summary.
* **`metadata`** rides along on the per-run `RunContext` and is
  reachable from any tool / hook via `get_run_context()` without
  threading it through every function signature.

Inside a tool, you read scope from the live `RunContext`:

```python
from jeevesagent import tool, get_run_context

@tool
async def fetch_user_orders() -> str:
    """Look up the current user's recent orders."""
    ctx = get_run_context()
    return await db.query("orders", user_id=ctx.user_id)
```

The model never sees `user_id` in the tool schema, can't pass the
wrong one, and the framework guarantees the tool gets the right
value (set by `_loop`, propagated through `anyio` task groups for
parallel tool dispatch and sub-agent spawning).

**Sub-agents inherit automatically.** Every multi-agent architecture
(Supervisor, Debate, Swarm, Router, ActorCritic, Blackboard, ReWOO)
forwards the parent's `RunContext` to its workers, so `user_id`
flows through deeply nested agent trees with no per-architecture
plumbing. Workers get a fresh `session_id` so their conversation
history stays separate from the parent's.

**Footgun protection.** When a memory store contains episodes for
named users and a recall runs with `user_id=None`, the framework
emits an `IsolationWarning` — the partition is still safe, but the
dev probably forgot to pass `user_id=` somewhere. Apps that want
strict enforcement promote it to an exception:

```python
import warnings
from jeevesagent import IsolationWarning
warnings.simplefilter("error", IsolationWarning)
```

End-to-end demo: [`examples/03_multi_user_sessions.py`](examples/03_multi_user_sessions.py).

---

## Capability matrix

| Capability | What you get | Where |
|---|---|---|
| **Multi-tenant memory** | First-class `user_id` partition + `session_id` continuity. One shared `Memory` instance backs N users with no cross-contamination; sub-agents inherit context automatically | `RunContext`, `get_run_context`, `set_run_context`, `IsolationWarning`, `Agent.run(user_id=, session_id=, metadata=)` |
| **Structured outputs** | Pass `output_schema=` to get a typed, validated Pydantic instance back. Framework augments the system prompt with the schema, parses + validates, retries with feedback on failure | `Agent.run(output_schema=)`, `RunResult.parsed`, `OutputValidationError` |
| **Resilient model calls** | Network adapters auto-wrapped with retry-on-transient (rate limit, 5xx, network blip). Typed error taxonomy. Provider `Retry-After` honoured. | `RetryPolicy`, `RetryPolicy.disabled/aggressive`, `ModelError`, `TransientModelError`, `RateLimitError`, `PermanentModelError`, `AuthenticationError`, `InvalidRequestError`, `ContentFilterError`, `classify_model_error` |
| **Architecture protocol** | Pluggable agent-loop strategy: 12 architectures shipped | `Architecture`, `ReAct`, `SelfRefine`, `Reflexion`, `TreeOfThoughts`, `PlanAndExecute`, `ReWOO`, `Router`, `Supervisor`, `ActorCritic`, `MultiAgentDebate`, `Swarm`, `BlackboardArchitecture` |
| **Team facade** | Sibling-style builders (`Team.supervisor`, `Team.swarm`, `Team.router`, `Team.debate`, `Team.actor_critic`, `Team.blackboard`) for the common multi-agent shapes | `Team`, `Handoff`, `run_architecture` |
| **Vector store** | `add` / `search` / `delete` with Mongo-style filters, MMR diversity, BM25 hybrid search, save/load | `InMemoryVectorStore`, `ChromaVectorStore`, `PostgresVectorStore`, `FAISSVectorStore`, `SearchResult` |
| **Document loader** | One-line load for PDF / DOCX / Excel / CSV / HTML / Markdown into chunks | `jeevesagent.loader.load`, `MarkdownChunker`, `RecursiveChunker`, `SentenceChunker`, `TokenChunker` |
| **Built-in tools** | `read` / `write` / `edit` / `bash` factories with sandbox-aware workdirs | `read_tool`, `write_tool`, `edit_tool`, `bash_tool`, `default_workdir` |
| **Skills (Anthropic-compatible)** | Packaged playbooks loaded on demand. Three modes coexist: pure markdown, frontmatter-declared subprocess tools (any language), and `tools.py` with `@tool` (Python, in-process). Layered sources with last-wins override. | `Skill`, `SkillRegistry`, `SkillSource`, `SkillMetadata`, `SkillError`, `Agent(skills=...)` |
| **Model adapters** | Anthropic, OpenAI, LiteLLM (~100 providers), Echo (zero-key), Scripted (tests) | `jeevesagent.AnthropicModel`, `OpenAIModel`, `LiteLLMModel`, `EchoModel`, `ScriptedModel` |
| **String model resolver** | `model="claude-opus-4-7"`, `"gpt-4o"`, `"mistral-large"`, `"command-r"`, `"echo"`, `"litellm/<any>"` | `Agent.__init__` |
| **Tools** | `@tool` decorator with auto-schema, sync + async; `agent.with_tool` decorator; `add_tool` / `remove_tool` / `tools_list` | `jeevesagent.tool`, `Tool` |
| **MCP servers** | stdio + Streamable HTTP, multi-server registry, name disambiguation | `MCPRegistry`, `MCPServerSpec` |
| **Jeeves Gateway** | One-line: `tools=JeevesGateway.from_env()` | `jeevesagent.jeeves` |
| **Memory backends** | In-memory dict, vector cosine, Chroma, Postgres+pgvector, Redis | `InMemoryMemory`, `VectorMemory`, `ChromaMemory`, `PostgresMemory`, `RedisMemory` |
| **Embedders** | HashEmbedder (deterministic, zero deps), OpenAIEmbedder, VoyageEmbedder, CohereEmbedder | `HashEmbedder`, `OpenAIEmbedder`, `VoyageEmbedder`, `CohereEmbedder` |
| **Bi-temporal facts** | All five memory backends. LLM-driven `Consolidator`. Auto-consolidate, plus `ConsolidationWorker` for long-lived agents. | `Fact`, `Consolidator`, `*FactStore` |
| **Durable runtime** | sqlite or postgres-backed replay across process restarts | `SqliteRuntime`, `PostgresRuntime`, `JournaledRuntime` |
| **Streaming** | `agent.stream()` → `AsyncIterator[Event]` with backpressure | `Agent.stream` |
| **Permissions** | mode-based + allow/deny lists, mirrors Claude Agent SDK | `StandardPermissions`, `Mode` |
| **Hooks** | `@agent.before_tool` / `@agent.after_tool` decorators | `HookRegistry` |
| **Sandbox** | `FilesystemSandbox` blocks path-arg escapes; `SubprocessSandbox` for full isolation | `FilesystemSandbox`, `SubprocessSandbox` |
| **Budget** | Per-token / per-cost / per-wall-clock limits with soft warnings | `StandardBudget`, `BudgetConfig` |
| **Telemetry** | OpenTelemetry spans + metrics for every milestone | `OTelTelemetry` |
| **Audit log** | HMAC-signed JSONL or in-memory; tracks every tool call | `FileAuditLog`, `InMemoryAuditLog` |
| **Certified values** | Freshness + lineage policies | `FreshnessPolicy`, `LineagePolicy` |
| **Declarative config** | Build agents from TOML or dicts | `Agent.from_config(path)`, `Agent.from_dict(cfg)` |

---

## Documentation

The full Sphinx-built documentation site lives at
<https://jeevesagent.readthedocs.io> — every public symbol is
auto-documented from its docstring, and the migration / quickstart
guides are mounted alongside the API reference.

Build it locally with:

```bash
pip install -e ".[docs]"
sphinx-build -b html docs docs/_build/html
open docs/_build/html/index.html
```

In-tree starting points:

| Doc | What's there |
|---|---|
| [`docs/quickstart.md`](docs/quickstart.md) | Step-by-step examples for each backend combo |
| [`docs/recipes.md`](docs/recipes.md) | Production patterns: persistent memory, MCP, durable replay, audit |
| [`docs/architecture.md`](docs/architecture.md) | Module tour, lifecycle, extension points |
| [`docs/migrations/from-langgraph.md`](docs/migrations/from-langgraph.md) | LangGraph → JeevesAgent translation guide |
| [`docs/migrations/from-openai-sdk.md`](docs/migrations/from-openai-sdk.md) | Hand-rolled OpenAI loop → JeevesAgent translation guide |
| [`docs/migration_0.1_to_0.2.md`](docs/migration_0.1_to_0.2.md) | What changed in 0.2.0; how to migrate |
| [`CHANGELOG.md`](CHANGELOG.md) | Version-by-version release notes |
| [`Subagent.md`](Subagent.md) | Architecture-protocol design rationale; full 14-architecture catalogue (the 5 shipped, the 9 candidates) |
| [`project.md`](project.md) | The full engineering plan (the design doc) |
| [`BUILD_LOG.md`](BUILD_LOG.md) | Slice-by-slice changelog |
| [`examples/`](examples/) | Four runnable end-to-end samples: `01_rag_pdf.py`, `02_specialist_debate.py`, `03_multi_user_sessions.py`, `04_structured_outputs.py` |

---

## API stability

The framework is pre-1.0 — major versions can introduce breaking
changes — but the surface area is split into stability tiers so
adopters know what they can pin against today.

| Tier | API | What it covers |
|---|---|---|
| **Stable** | `Agent`, `Agent.run` / `stream` / `resume`, `RunResult`, `RunContext`, `get_run_context`, `set_run_context`, `Memory` protocol, `Episode`, `Fact`, `Message`, `Role`, `Event`, `Tool`, `@tool`, `Model` protocol, the error hierarchy under `JeevesAgentError`, `RetryPolicy`, `OutputValidationError`, `IsolationWarning` | Will not break in 0.x without a migration note + deprecation cycle. Pin against these in production code. |
| **Stable backends** | `InMemoryMemory`, `ChromaMemory`, `PostgresMemory`, `RedisMemory`, `VectorMemory`, `OpenAIModel`, `AnthropicModel`, `LiteLLMModel`, `EchoModel`, `ScriptedModel`, `InProcRuntime`, `SqliteRuntime`, `PostgresRuntime`, `StandardBudget`, `NoBudget`, `AllowAll`, `StandardPermissions`, `HookRegistry`, `OTelTelemetry`, `NoTelemetry`, `FileAuditLog`, `InMemoryAuditLog` | Concrete implementations; constructor signatures stable, behaviour locked. |
| **Experimental** | `MultiAgentDebate` / `Swarm` / `Blackboard` / `ReWOO` / `TreeOfThoughts` (the newer architectures), `Skills` and `SkillRegistry`, `JeevesGateway`, `agent.generate_graph()`, the `Team.*` builders | Useful, tested, but newer — internal details may change as we collect production feedback. Wrap with your own thin layer if you depend on them. |
| **Internal** | `_loop`, `_wrapped_model`, `Dependencies`, `AgentSession`, the architecture protocol's exact shape, anything starting with `_` | No stability promise. Subject to change without notice. |

If a symbol isn't listed, it's experimental by default. Open an
issue if you depend on something not yet in the Stable tier and
need it promoted.

---

## Status

* **866 tests pass** in ~6 seconds (5 env-gated integrations skip
  without `JEEVES_TEST_PG_DSN` / `JEEVES_TEST_REDIS_URL`)
* **mypy `--strict`** clean across 105 production source files
* **ruff** clean including `flake8-async` lints
* v0.10 ships **multi-tenancy by default**, **structured outputs**,
  **retry-on-transient by default**, and the **fast path by
  default**. All zero-config; no flags:
  * Every layer (audit, telemetry, permissions, hooks, runtime,
    budget) is detected as no-op or production-wired at construction
    time, so a barebones `Agent` runs at LangChain-class latency
    with the integration layers skipped.
  * `user_id` and `session_id` are first-class typed primitives:
    memory is hard-partitioned per `user_id`, conversations continue
    when a `session_id` is reused, and sub-agents inherit the
    parent's `RunContext` automatically via a contextvar
    (`get_run_context()`).
  * Pass `output_schema=` (any Pydantic `BaseModel`) and `agent.run`
    returns a typed, validated instance on `result.parsed`, with
    retry-with-feedback on validation failure.
  * Network model adapters are auto-wrapped with a typed error
    taxonomy (`TransientModelError` / `RateLimitError` /
    `PermanentModelError` / `AuthenticationError` /
    `InvalidRequestError` / `ContentFilterError`) and a configurable
    `RetryPolicy`, so transient 5xx / 429 / network blips don't blow
    up production runs.
* v0.9 ships **Skills** (Anthropic Agent Skills format, with
  `tools.py` auto-discovery for in-process Python tools and
  frontmatter `tools:` manifest for any-language scripts wrapped
  as typed tools), agent-graph visualization (`agent.generate_graph()`
  → Mermaid / PNG), the `Team` facade for ergonomic multi-agent
  construction, the full vector-store stack (`InMemoryVectorStore` /
  Chroma / Postgres / FAISS — Mongo-style filters, MMR diversity,
  BM25 hybrid search, persistence), the document loader with
  chunking strategies, and 12 architectures including selective
  lesson recall (Reflexion), typed handoffs (Swarm),
  forward_message (Supervisor), Jaccard convergence (Debate), and
  parallel proposer/evaluator with min_score floor (TreeOfThoughts).

---

## Verify your install

```bash
git clone <repo>
cd jeevesagent
pip install -e '.[dev]'
ruff check jeevesagent
mypy --strict jeevesagent
pytest tests/ -v
```

You should see 866 passed. Five integration tests skip without
`JEEVES_TEST_PG_DSN` / `JEEVES_TEST_REDIS_URL` / API-key env vars set.

---

## Contributing

The harness has a strict CI gate: ruff + mypy `--strict` + pytest;
all three must pass. The codebase is async-only — every public
function that does more than return a stored value is `async`. Every
fan-out uses `anyio` task groups, with zero raw `asyncio.create_task`
or `asyncio.gather` calls.

See [`project.md`](project.md) §2 for the non-negotiable engineering
principles.

---

## License

Apache 2.0.
