Metadata-Version: 2.4
Name: foreman-agents
Version: 0.9.0
Summary: Foreman — a supervisor for AI agent crews: watches for drift, catches loops, and keeps the crew on task (relational wellness monitoring)
Author: Foreman
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: langgraph
Requires-Dist: langgraph>=0.2; extra == "langgraph"
Requires-Dist: langchain-core>=0.3; extra == "langgraph"
Requires-Dist: typing-extensions>=4.0; extra == "langgraph"
Provides-Extra: otel
Requires-Dist: opentelemetry-api>=1.20; extra == "otel"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "otel"
Provides-Extra: chromadb
Requires-Dist: chromadb>=0.4; extra == "chromadb"
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=0.1; extra == "anthropic"
Provides-Extra: all
Requires-Dist: langgraph>=0.2; extra == "all"
Requires-Dist: langchain-core>=0.3; extra == "all"
Requires-Dist: opentelemetry-api>=1.20; extra == "all"
Requires-Dist: opentelemetry-sdk>=1.20; extra == "all"
Requires-Dist: chromadb>=0.4; extra == "all"
Dynamic: license-file

# Foreman

[![CI](https://github.com/foreman/foreman/actions/workflows/ci.yml/badge.svg)](https://github.com/foreman/foreman/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/foreman-agents)](https://pypi.org/project/foreman-agents/)
[![Python](https://img.shields.io/pypi/pyversions/foreman-agents)](https://pypi.org/project/foreman-agents/)
[![License](https://img.shields.io/github/license/foreman/foreman)](LICENSE)

Relational wellness monitoring for AI agent crews. The only monitoring system that watches how agents work **together**, not just how they perform individually.

> **Not a coder?** Built an agent and just want it improved? See
> [`docs/quickstart-novice.md`](docs/quickstart-novice.md) — double-click a
> launcher, drag in your agent's log, and get pasteable prompt fixes. No code.

```bash
pip install foreman-agents
foreman demo                  # 30-second tour, no setup required
wellness improve myrun.txt --goal "Build a REST API"   # no-code: analyze a run + fix prompts
```

![foreman demo](assets/demo.svg)

The single-file HTML dashboard (`foreman dashboard`):

![foreman dashboard](assets/dashboard.png)

## 60-second integration

```python
from foreman import WellnessCallbackHandler

handler = WellnessCallbackHandler(crew_id="my-team")     # passive, observe-only
app = graph.compile().with_config(callbacks=[handler])   # callbacks belong in the run config, not in compile()

# That's it. Every agent message is now monitored.
print(handler.crew_health)            # 0.85
print(handler.is_degraded)            # False
print(handler.threshold)              # 0.65 (or adaptive after warmup)
print(handler.analysis())             # full crew intelligence report
```

Optional but recommended — wire the outcome feedback loop so the score actually reflects task success:

```python
handler.set_task_context(task_type="technical", task_description="Implement LRU cache class")
result = app.invoke(state)
handler.record_outcome("task_1", source="Coder", success=True)
```

When you're ready to evaluate, render a dashboard:

```bash
# In the Python process where `handler` lives, write the analysis to today.json, e.g.
#   json.dump(handler.analysis(), open("today.json", "w"), default=str)
# then render a self-contained HTML dashboard:
foreman dashboard --analysis today.json --out dashboard.html
```
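
The JSON dump has to run in the same process as the handler, because `handler` only exists there. A minimal sketch of that step, assuming `analysis()` returns a plain dict that may hold values `json` cannot encode natively (such as datetimes). `dump_analysis` and `fake_analysis` are hypothetical stand-ins, not part of Foreman:

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def dump_analysis(analysis: dict, path: str) -> Path:
    """Write an analysis dict as JSON, stringifying anything json can't encode."""
    out = Path(path)
    out.write_text(json.dumps(analysis, default=str, indent=2), encoding="utf-8")
    return out


# Stand-in for handler.analysis(); the real report's keys are not shown here.
fake_analysis = {
    "crew_id": "my-team",
    "generated_at": datetime(2025, 1, 1, tzinfo=timezone.utc),
}
```

In a real run you would call `dump_analysis(handler.analysis(), "today.json")` and then point `foreman dashboard --analysis` at the resulting file.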

## Two modes, clearly separated

```python
# PASSIVE (default)  — observe-only.
# Zero mutation of agent inputs, prompts, state, or outputs. Zero LLM calls.
handler = WellnessCallbackHandler(crew_id="my-team")

# INTERVENTION  — opt-in. Adds reframe injection into worker.backstory
# when crew health drops below threshold. Fully reversible.
handler = WellnessCallbackHandler(
    crew_id="my-team",
    intervention_mode=True,
    worker_agents=[agent1, agent2, agent3])

# Switch live, or tear down cleanly:
handler.enable_intervention(worker_agents=[agent1, agent2])
handler.disable_intervention()        # restores backstories
handler.disconnect()                  # restores + flushes telemetry + final report
```

The observe-only guarantee applies to **passive mode** and is verified by A/B testing — the handler reads responses after generation and records them; it does not touch agent inputs, prompts, state, or outputs. Intervention mode is opt-in and clearly separated; the only thing it mutates is `agent.backstory`, and `disconnect()` restores originals.

## Verified by independent A/B testing

Tested on real LangGraph crews with live GPT-4o-mini API calls across 256 tasks:

| Property | Result |
|----------|--------|
| Task correctness impact (passive mode) | 0% — handler never alters outputs |
| Extra LLM calls per task (passive mode) | 0 |
| Extra latency (passive mode) | 0ms (within API variance) |
| Extra cost per task (passive mode) | $0.00 |
| Non-coding crew false positive rate | 0% (40 tasks, v7+v8) |

### Simulation results

| Environment | No monitoring | + Reactive | + Proactive |
|-------------|:---:|:---:|:---:|
| Standard (40 mixed tasks) | 21% | 49% | **73%** |
| Enterprise (50 CMU-calibrated) | 23% | 38% | **89%** |
| Mid-Market (50 real workflows) | 25% | 52% | **79%** |
| Hostile (4 adversarial scenarios) | 16% | 28% | **73%** |

**Why this matters:** AI agents fail 76% of enterprise tasks ([Carnegie Mellon, 2025](https://www.cs.cmu.edu/news/2025/agent-company)). Unstructured multi-agent networks amplify errors up to 17x ([Google DeepMind, 2025](https://arxiv.org/abs/2503.13657)).

## Install

```bash
pip install foreman-agents
```

No required dependencies. All monitoring runs without API keys or external services.

```bash
pip install foreman-agents[langgraph]    # LangGraph callback handler
pip install foreman-agents[otel]         # OpenTelemetry export
pip install foreman-agents[all]          # Everything
```

If you import `WellnessCallbackHandler` without `langchain-core` available, the handler raises `LangChainNotInstalledError` with the install command. Pass `strict=False` to use it for direct-API observation outside LangChain.

## CLI

```bash
foreman demo                                   # 30-second synthetic crew tour
foreman scan -m "won't work" -s Engineer       # one-message signal scan
foreman calibrate --crew my-team --logs interactions.jsonl
foreman health    --crew my-team --bundle fingerprint.json
foreman report    --analysis today.json --out weekly.md
foreman dashboard --analysis today.json --out dashboard.html
foreman info
```

`foreman demo` runs a synthetic crew through a degradation-and-recovery scenario in your terminal. It is useful for verifying the install and previewing what Foreman reports before you wire anything up.

## Scoring profiles

```python
WellnessCallbackHandler(profile="auto")          # default — picks from agent role names
WellnessCallbackHandler(profile="engineering")   # suppresses code-review conflict
WellnessCallbackHandler(profile="support")       # penalizes thin completions / deflection
WellnessCallbackHandler(profile="research")      # tolerates hedging, penalizes groupthink
```

`"auto"` inspects the first observed agent role and picks the matching profile. It falls back to `engineering` if no keywords match.
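
The selection rule can be pictured as a keyword lookup with a default. This is only a sketch of the idea: the keyword table and the `pick_profile` helper below are hypothetical, and Foreman's actual keyword lists are not documented here.

```python
# Hypothetical keyword table; the real lists may differ.
PROFILE_KEYWORDS = {
    "engineering": ("engineer", "coder", "developer", "reviewer"),
    "support": ("support", "helpdesk", "customer"),
    "research": ("research", "analyst", "scientist"),
}


def pick_profile(first_role: str, default: str = "engineering") -> str:
    """Return the first profile whose keywords appear in the role name."""
    role = first_role.lower()
    for profile, keywords in PROFILE_KEYWORDS.items():
        if any(word in role for word in keywords):
            return profile
    return default  # no keyword matched


print(pick_profile("Support Agent"))  # support
print(pick_profile("Poet"))           # engineering (fallback)
```

If the fallback would be wrong for your crew, passing an explicit `profile=` avoids the guess entirely.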

## Privacy

```python
WellnessCallbackHandler(crew_id="my-team", redact=True)
```

With `redact=True`, message text is never persisted — only signals plus stable hash prefixes. Use for crews running on customer data or in regulated industries.
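
To make "signals plus stable hash prefixes" concrete, here is a rough sketch of what a redacted record could look like. The `redact_record` helper, the SHA-256 choice, and the 12-character prefix are all assumptions for illustration, not Foreman's documented storage format.

```python
import hashlib


def redact_record(text: str, signals: list[str], prefix_len: int = 12) -> dict:
    """Keep detected signal names and a short, stable digest; drop the text itself."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return {"signals": list(signals), "text_hash": digest[:prefix_len]}


record = redact_record("this won't work", ["code_retry_loop"])
print(sorted(record))  # ['signals', 'text_hash']
```

A stable prefix lets you tell that two stored records came from the same message without keeping the message. Note that an unsalted digest of short, guessable text can be reversed by trial hashing, so a keyed hash (`hmac`) is the safer variant when that matters.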

## Framework connectors — one `connect()` for any platform

As of v0.4.0 every framework is reachable through a single entry point. `connect()` returns the right observer or callback handler, already wired to a monitor:

```python
from foreman import connect, available_platforms

# LangGraph — returns a LangChain callback handler
handler = connect("langgraph", crew_id="my-team")
graph.invoke(state, config={"callbacks": [handler]})
print(handler.crew_health)

# Any other framework — returns an observer exposing `.monitor`
obs = connect("crewai", crew_id="my-team")
crew = Crew(..., step_callback=obs.on_step, task_callback=obs.on_task)
print(obs.monitor.crew_health)

# Discover everything available, with verification status
for p in available_platforms():
    print(p["name"], p["status"], "—", p["integration"])
```

### Supported platforms

| Platform | `connect(...)` | Verification | How it hooks in |
|----------|----------------|--------------|-----------------|
| LangGraph | `"langgraph"` | **verified-live** | pass as a callback in `config={"callbacks": [handler]}` |
| CrewAI | `"crewai"` | built-to-spec | `step_callback=obs.on_step`, `task_callback=obs.on_task` |
| AutoGen (v0.2 + v0.4) | `"autogen"` | built-to-spec | `obs.on_message(msg)` per message; `obs.on_task_done(...)` |
| Google A2A | `"a2a"` | built-to-spec | `obs.on_task_event(event)` / `obs.on_message(msg)` |
| OpenAI Agents SDK | `"openai-agents"` | built-to-spec | pass as `hooks=obs` to `Runner.run(...)` |
| Google ADK | `"google-adk"` | built-to-spec | set `after_model_callback` / `after_agent_callback` on the agent |
| LlamaIndex | `"llamaindex"` | built-to-spec | `dispatcher.add_event_handler(obs)` |
| Semantic Kernel | `"semantic-kernel"` | built-to-spec | `kernel.add_filter("function_invocation", obs.function_invocation_filter)` |
| Pydantic AI | `"pydantic-ai"` | built-to-spec | `obs.on_run_result(result, task_id=..., success=...)` |

**On "verification" — read this honestly.** *verified-live* means the connector has been exercised against a real run of that framework (LangGraph, with live GPT-4o-mini calls across 256 tasks). *built-to-spec* means it is implemented against the framework's published callback interface and mock-tested, but not yet run against a live install — the target SDKs aren't all pip-installable in our CI. All connectors are defensive: a malformed event can never crash your host pipeline. If you run one live, a bug report (or a thumbs-up) is very welcome.
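
The "defensive" property can be illustrated with a small wrapper that swallows and logs callback errors. This is a sketch of the general pattern, not Foreman's implementation; `never_raise` and the sample `on_message` callback are hypothetical names.

```python
import functools
import logging

logger = logging.getLogger("foreman.sketch")


def never_raise(callback):
    """Wrap an observer callback so its exceptions are logged, not propagated."""
    @functools.wraps(callback)
    def wrapper(*args, **kwargs):
        try:
            return callback(*args, **kwargs)
        except Exception:  # deliberately broad: the host pipeline must keep running
            logger.exception("observer callback %s failed", callback.__name__)
            return None
    return wrapper


@never_raise
def on_message(msg: dict) -> str:
    # Hypothetical handler: a malformed event (missing "content") raises KeyError here.
    return msg["content"].strip()
```

With this shape, `on_message({})` logs the error and returns `None` instead of raising into the framework that invoked it.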

See `examples/` for runnable scripts.

## Self-improving manager mode (opt-in)

By default this is a passive monitor. Wrap a crew in `SelfImprovingManager` and
it becomes an active, goal-driven manager that runs without a human prompting
it — `observe → evaluate → intervene → re-observe → learn`:

```python
from foreman import SelfImprovingManager

mgr = SelfImprovingManager(crew_id="my-team",
                           goal="Build a REST API in Python with FastAPI",
                           max_messages=200)         # optional cost circuit-breaker

a = mgr.observe(message, source="Coder")
if a.action == "halt":      # crew looping / over budget / off the rails
    stop()
mgr.record_outcome("task-1", success=True)           # feeds autonomous learning
```

It adds what a regex-on-text monitor structurally can't: a **goal anchor +
drift detection**, **loop/resource accounting** (non-termination + token/message
budgets), **autonomous learning** (the scoring profile retrains from your
outcomes), and an **opt-in LLM-judge tier** for silent semantic failures
(off by default — pass `judge_fn=...` to enable; preserves the zero-LLM
default).

We built this after stress-testing the passive monitor against the
[MAST failure taxonomy](https://arxiv.org/abs/2503.13657) (1,600+ real traces):
on raw text the passive monitor catches ~11–17% of real failures (most
multi-agent failures are silent/semantic, not lexical); the manager raises that
to **~89% (16/18) with no LLM** and **18/18 with the judge tier** (validated
with a simulated judge), at zero control false positives. The two modes that
need the judge — reasoning–action mismatch and incorrect verification — are
genuinely semantic; everything else is caught deterministically. Full writeup:
`docs/self-improving-manager.md`.

Two further opt-ins (both off by default) let it improve itself and your agents:

```python
mgr = SelfImprovingManager(
    crew_id="team", goal="...", worker_agents=[coder, reviewer],
    auto_select_interventions=True,   # learn which intervention phrasing works best
    allow_agent_rewrite=True,         # propose durable prompt improvements (gated)
)

a = mgr.observe(message, source="Coder")
if a.needs_review:                    # after steering, it summarizes and asks
    print(a.review_summary)           # "...Apply these improvements? [yes/no]"
    mgr.apply_improvements(approve=user_said_yes)   # ONLY this writes; reversible
    mgr.export_learned("guidance.json")             # carry improvements across runs
```

`allow_agent_rewrite` never modifies your agents without explicit approval,
writes onto their pristine prompts, and is fully reversible via
`revert_improvements()`.

## Direct API (no LangChain)

```python
from foreman import WellnessMonitor

monitor = WellnessMonitor(crew_id="my-team", profile="auto")
monitor.observe("message", source="Engineer", target="Designer")
monitor.record_outcome("task_1", "Engineer", success=True)
print(monitor.crew_health)
print(monitor.analysis())
```

Useful for pipelines that don't run on LangGraph, or for testing.

## Webhooks + OpenTelemetry

```python
handler = WellnessCallbackHandler(
    crew_id="ops", enable_otel=True, enable_webhooks=True)
```

Pipes crew health into your existing observability stack via standard OTel spans (works with LangSmith, Arize, Datadog).

## What it monitors

**35 signal types** across agent messages — 7 hard signals, 7 human soft signals, 21 agent soft signals including 6 code-specific signals. Zero LLM calls.

**Code-specific signals** — `code_retry_loop`, `code_complexity_creep`, `code_hallucination`, `code_abandonment`, `code_cargo_cult`, `code_test_avoidance`. Detect when coding agents are stuck, over-engineering, or skipping quality checks.

**Output quality validation** — `output_format_mismatch`, `output_json_when_code_expected`, `output_task_incoherence`, `output_too_short`, `output_repeated_failure_pattern`. Catches when output doesn't match what the task asked for. 10/10 detection rate on a known-failure-output test set.

**6 relational dynamics** between agents — communication patterns, conflict resolution quality, context coherence, influence concentration, delegation effectiveness, creative synergy.

**Adaptive thresholds** — learns the crew's normal score range during a warmup period (default 8 tasks), then **continuously re-calibrates** on a 32-message rolling cadence so the threshold tracks how the crew actually behaves over time. Bounded 200-score rolling history.
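
As a rough picture of how warmup, cadence, and a bounded history fit together, consider the sketch below. The numbers 0.65, 8, 32, and 200 come from this README; everything else is an assumption. In particular, the "mean minus k standard deviations" rule, the `AdaptiveThreshold` class, and counting one score per observation are illustrative choices, not Foreman's actual calibration.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Hypothetical rolling threshold: fixed fallback until warmup, then recalibrated."""

    def __init__(self, fallback=0.65, warmup=8, cadence=32, history=200, k=1.5):
        self.fallback = fallback
        self.warmup = warmup
        self.cadence = cadence
        self.k = k
        self.scores = deque(maxlen=history)  # bounded rolling history
        self.value = fallback
        self._seen = 0

    def record(self, score: float) -> float:
        """Store one score and return the threshold currently in force."""
        self.scores.append(score)
        self._seen += 1
        warmup_done = self._seen == self.warmup
        on_cadence = self._seen > self.warmup and self._seen % self.cadence == 0
        if warmup_done or on_cadence:
            # Assumed rule: sit k standard deviations below the crew's recent mean.
            self.value = mean(self.scores) - self.k * pstdev(self.scores)
        return self.value
```

The `deque(maxlen=...)` is what keeps memory bounded: once 200 scores are stored, each new score evicts the oldest, so old behavior gradually stops influencing the threshold.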

**Model family support** — adjusts hedging detection for GPT, Claude, and Gemini models.

## What it does about it

**Reactive (intervention mode)** — detects degradation and injects targeted reframes into worker `backstory` strings. Breaks failure spirals. Reversible via `disconnect()` / `disable_intervention()`.

**Proactive (intervention mode)** — prevents degradation before it starts. Task routing, readiness checks, smart decomposition with verification loops for technical tasks, failure inoculation.

**Consecutive failure tracking** — direct score penalty from task failures (via `record_outcome`), independent of signal detection. Catches degradation even when agents communicate cleanly but produce bad output.

## Architecture

8 layers, built on a research-backed model of collaborative team dynamics.

| Layer | What it does |
|-------|-------------|
| Core Framework | LangGraph StateGraph, signal scanning, memory, scoring |
| Intelligence | Learned classifiers, crew fingerprints, trajectory prediction |
| Agent Identity | Capability confidence, self-models, teammate awareness |
| Interaction Intelligence | 6 modules tracking relational dynamics between agents |
| Proactive Wellness | Pre-task routing, readiness, decomposition, inoculation |
| Integration | OTel spans + metrics, webhook events, fingerprint API |
| Governance | Outcome validation, self-reflection, skill routing, meta-monitor |
| Socratic Observation | Convergence detection, metacognitive process evaluation |

### Key design decisions

**Crew topology awareness.** Odd-node pipelines (draft → review → finalize) are collaborative — the finalizer resolves disagreements. Even-node pipelines (solve → review) are adversarial — the reviewer's job is to critique. The engineering scoring profile suppresses conflict/coherence signals for adversarial pipelines because disagreement *is* the healthy pattern.

**Verification loops for technical tasks.** Non-technical tasks use majority-wins decomposition (the task succeeds if 2 of 3 subtasks succeed). Technical tasks use staged gates with retry (build → verify → retry if failed). This addresses the compound-probability problem: three parallel subtasks at 65% each must all succeed, which gives only 0.65³ ≈ 27% joint success, whereas a stage with one verified retry succeeds with probability 1 − 0.35² ≈ 88%.
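
The arithmetic behind those figures is easy to check. The sketch below assumes independent attempts with the same 65% success rate; the helper names are ours, not part of Foreman.

```python
from math import comb


def all_succeed(p: float, n: int) -> float:
    """Probability that n independent subtasks all succeed."""
    return p ** n


def at_least(p: float, n: int, k: int) -> float:
    """Probability that at least k of n independent subtasks succeed."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def with_retries(p: float, retries: int) -> float:
    """Probability a stage passes within 1 + retries independent attempts."""
    return 1 - (1 - p) ** (retries + 1)


p = 0.65
print(f"all 3 succeed:      {all_succeed(p, 3):.0%}")   # 27%
print(f"at least 2 of 3:    {at_least(p, 3, 2):.0%}")   # 72%
print(f"one stage, 1 retry: {with_retries(p, 1):.0%}")  # 88%
```

The 88% figure applies to a single gated stage. Real retries are rarely fully independent of the first attempt, so treat these numbers as an upper-bound intuition rather than a prediction.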

## Starter bundles

New crews don't have data yet, so the adaptive threshold falls back to 0.65. Skip the 8-task warmup with a pre-calibrated baseline matching your crew topology:

```python
handler = WellnessCallbackHandler(
    crew_id="my-team",
    starter_bundle="eng_3agent",   # draft → review → finalize
)
```

Built-in bundles: `eng_2agent` (solve→review), `eng_3agent` (draft→review→finalize), `support_2agent`, `research_2agent`, `general_2agent`. Once your crew accumulates real data the rolling re-calibration takes over and the starter values stop mattering.

## Privacy + telemetry

**No telemetry. Nothing phones home.** The package never makes outbound network calls except through the integrations you explicitly enable (OpenTelemetry export to your collector, webhooks to your endpoint). Verifiable by inspection — there is no `requests.post`, no analytics, no sign-in.

```python
WellnessCallbackHandler(crew_id="my-team", redact=True)
```

With `redact=True`, message text is never persisted: only signals plus stable hash prefixes are stored. Use for crews running on customer data or in regulated industries.

## Public benchmark

A reproducible benchmark suite ships with the package. 30 tasks with ground-truth labels across 7 failure modes (retry loop, hostile, silent drift, hedging spiral, format break, ownership loss, context loss). Deterministic — no LLM calls.

```bash
wellness benchmark --profile engineering --verbose
```

v0.4.0 results on the default suite:

| Metric | Value |
|--------|-------|
| Precision | 1.00 |
| Recall | 1.00 |
| F1 | 1.00 |
| Accuracy | 1.00 |
| False positives on healthy code-review tasks | 0 / 12 |

v0.3.1 missed all 3 hostile-input tasks: the hard hostility signals (`direct_insult`, `all_caps_rage`, `threatening_language`, and three more) were escalated by the live handler but carried **no weight** in the health score, so a hostile session scored 0.66 — just above the 0.65 flag threshold. v0.3.2 assigns those signals real weights; hostile sessions now score ~0.27 and are flagged decisively, with no new false positives (healthy tasks still score 1.00). One thin margin remains and is tracked honestly: `context_loss` lands at 0.64 vs the 0.65 threshold, because its primary signal (`coherence_needs_realignment`) is intentionally suppressed in the engineering profile to avoid code-review false positives.
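
To see how a single score sitting on the wrong side of 0.65 moves the metrics, here is a small sketch. It assumes a task is flagged when its score falls below the threshold, and it uses a three-task toy suite built from the scores quoted above, not the real 30-task benchmark. `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(truth: list[bool], flagged: list[bool]) -> dict:
    """Standard precision / recall / F1 / accuracy from paired boolean labels."""
    pairs = list(zip(truth, flagged))
    tp = sum(t and f for t, f in pairs)
    fp = sum((not t) and f for t, f in pairs)
    fn = sum(t and (not f) for t, f in pairs)
    tn = sum((not t) and (not f) for t, f in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1,
            "accuracy": (tp + tn) / len(pairs)}


THRESHOLD = 0.65
truth = [True, True, False]         # hostile, context_loss, healthy
old_scores = [0.66, 0.64, 1.00]     # hostile session scores just above the threshold
new_scores = [0.27, 0.64, 1.00]     # hostile session now scores well below it

old = classification_metrics(truth, [s < THRESHOLD for s in old_scores])
new = classification_metrics(truth, [s < THRESHOLD for s in new_scores])
print(old["recall"], new["recall"])  # 0.5 1.0
```

The same comparison shows why the `context_loss` margin is worth tracking: at 0.64 it is flagged, but a shift of 0.01 or more upward would turn it into a miss.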

## Early access

We're looking for **10 teams** running multi-agent crews in production to join early access.

- **What you get:** direct setup support, free fingerprint calibration, access to the starter bundle library, priority feature requests.
- **What we learn:** how relational monitoring performs on real production crews, which signal types matter most for your domain.

Open an issue titled "Early Access" with your crew type and rough agent count.

## Contributing

Found a bug? Open an issue. Want a feature? Open an issue describing the use case. PRs welcome for the core framework.

## License

MIT. See [LICENSE](LICENSE).
