Metadata-Version: 2.4
Name: litmus-trace
Version: 0.1.1
Summary: Record and deterministically replay AI agent executions
Project-URL: Homepage, https://github.com/romirjain/litmus
Project-URL: Repository, https://github.com/romirjain/litmus
Author: Romir Jain
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,deterministic,llm,replay,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: starlette>=0.38
Requires-Dist: uvicorn>=0.30
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.24; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Description-Content-Type: text/markdown

# Litmus

**Record and deterministically replay AI agent executions.**

Litmus captures every LLM and tool call your agent makes, then replays them deterministically — same inputs, same outputs, no real API calls. Debug production failures, test resilience with fault injection, and gate deploys with reliability scoring.

```bash
pip install litmus-trace
```

## Quick Start — Zero Code Changes

```bash
# Record your agent (wraps the process, captures all LLM calls)
litmus run python my_agent.py

# Replay deterministically (no API key needed, no real calls made)
litmus run --replay ./traces/lt-abc123.trace.json python my_agent.py

# What happens when the LLM refuses? Times out? Returns an error?
litmus run --replay trace.json --fault llm_refuse:step=0 python my_agent.py
```

Your agent code stays completely unchanged. Litmus patches the SDK transport layer at runtime.

## What It Does

**Record** — Intercepts every HTTP call to LLM APIs (Anthropic, OpenAI, Mistral, 14+ providers). Saves the full request and response as a trace file.

**Replay** — Feeds recorded responses back to your agent. The agent runs the same code path — same tool calls, same final output — without hitting any real API. No API key needed.

**Fault Injection** — Mutate recorded responses to test resilience. What happens when Claude refuses? When GPT returns a 500? When the API times out? Find out without waiting for it to happen in production.

**CI Gating** — Score your trace corpus for reliability and block deploys that drop below a threshold.

```bash
litmus ci ./traces --threshold 85
# Exit code 1 if score < 85 — blocks the deploy
```

## Three Ways to Use It

### 1. CLI Wrapper (recommended — zero code changes)

```bash
litmus run python my_agent.py
```

### 2. One-Line Python API

```python
import litmus

litmus.record()
# ... your existing agent code, unchanged ...
litmus.stop()
```

### 3. Proxy Mode (any language, advanced use)

```bash
litmus proxy --mode record
# Then point your SDK at the proxy:
ANTHROPIC_BASE_URL=http://localhost:8787/anthropic python my_agent.py
```

## Fault Injection

Test how your agent handles failures — before they happen in production.

```bash
# LLM refuses to help
litmus run --replay trace.json --fault llm_refuse:step=0 python agent.py

# LLM returns a 500 error
litmus replay trace.json --fault llm_error:step=0

# LLM times out
litmus replay trace.json --fault llm_timeout:step=0

# LLM hallucinates (returns plausible but wrong answer)
litmus replay trace.json --fault llm_hallucinate:step=1
```
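
Fault injection is most useful when your agent has real failure-handling paths for a fault to exercise. As a hedged sketch (this is not Litmus API; `call_llm` is a hypothetical stand-in for your SDK call, and `LLMTimeout` stands in for your SDK's timeout error), a retry-with-backoff wrapper is the kind of code path an `llm_timeout` fault would hit during replay:

```python
import time

class LLMTimeout(Exception):
    """Stand-in for an SDK timeout error (e.g. a read timeout)."""

def call_with_retries(call_llm, prompt, retries=3, base_delay=0.01):
    """Retry an LLM call with exponential backoff on timeout.

    `call_llm` is a hypothetical stand-in for your SDK call;
    in real code it might wrap client.messages.create(...).
    """
    for attempt in range(retries):
        try:
            return call_llm(prompt)
        except LLMTimeout:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulate a call that times out twice, then succeeds -- the
# behavior a timeout-fault replay would exercise.
attempts = []
def flaky(prompt):
    attempts.append(prompt)
    if len(attempts) < 3:
        raise LLMTimeout()
    return "ok"

print(call_with_retries(flaky, "hello"))  # prints "ok" after two retries
```

Replaying the same trace with and without the fault shows whether a path like this actually recovers, or whether the agent dies on the first timeout.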

## CI/CD Integration

```bash
# Score all traces — exit non-zero if below threshold
litmus ci ./traces --threshold 85

# Verbose output with per-trace breakdown
litmus ci ./traces --threshold 80 --verbose

# JSON output for pipeline parsing
litmus ci ./traces --threshold 85 --json-output report.json
```

Scores across three dimensions:
- **Correctness** — did the agent complete without errors?
- **Resilience** — how does it handle faults?
- **Efficiency** — reasonable call count, no infinite loops?
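
Since `litmus ci` exits non-zero below the threshold, wiring it into a pipeline is a single step. A sketch as a GitHub Actions job (the workflow name and trace directory here are illustrative, not prescribed by Litmus):

```yaml
# .github/workflows/agent-reliability.yml (illustrative)
name: agent-reliability
on: [pull_request]
jobs:
  score-traces:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install litmus-trace
      # A non-zero exit below the threshold fails this job,
      # which blocks the merge.
      - run: litmus ci ./traces --threshold 85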

## Supported Providers

Works out of the box with 14+ LLM APIs:

| Provider | Status |
|----------|--------|
| Anthropic (Claude) | Tested |
| OpenAI (GPT) | Tested |
| Google (Gemini) | Supported |
| Mistral | Supported |
| Cohere | Supported |
| Groq | Supported |
| Together AI | Supported |
| Fireworks AI | Supported |
| DeepSeek | Supported |
| Perplexity | Supported |
| OpenRouter | Supported |
| Ollama (local) | Supported |
| vLLM (local) | Supported |
| LM Studio (local) | Supported |

**Custom/self-hosted models:**

```bash
litmus proxy --provider my-model=https://my-finetuned-llama.example.com/v1
```

## CLI Reference

```
litmus run          Wrap a command to record/replay (zero code changes)
litmus proxy        Start the recording/replay proxy server
litmus replay       Replay a trace with optional fault injection
litmus view         Pretty-print a trace file
litmus ci           Score traces and gate deploys
litmus providers    List all supported providers
```

## How It Works

Litmus monkey-patches the `httpx` transport layer used by both Anthropic and OpenAI Python SDKs. When you call `client.messages.create(...)`, Litmus intercepts the HTTP request before it leaves your machine.

**Record mode:** The real API call goes through. Litmus captures the request and response, then saves them to a trace file. API keys are automatically redacted.

**Replay mode:** The real API is never called. Litmus serves the recorded response directly from the trace file. Your agent gets the exact same response it got during recording — same tool calls, same content, same stop reason.
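
The core idea can be sketched in a few lines (stdlib only; this is an illustration of the record/replay pattern, not Litmus's actual transport code): a wrapper that captures each request/response pair in record mode and serves the pairs back in recorded order in replay mode.

```python
class ReplayTransport:
    """Toy record/replay transport. In record mode, delegate to the
    real `send` callable and capture each exchange; in replay mode,
    serve captured responses back in recorded order without ever
    touching the network. (Illustrative sketch only.)
    """

    def __init__(self, send=None, trace=None):
        self.send = send            # real transport (record mode)
        self.trace = trace or []    # recorded (request, response) pairs
        self.cursor = 0             # replay position

    def handle(self, request):
        if self.send is not None:            # record mode
            response = self.send(request)
            self.trace.append((request, response))
            return response
        _req, resp = self.trace[self.cursor]  # replay mode
        self.cursor += 1
        return resp

# Record two "API calls"...
recorder = ReplayTransport(send=lambda req: {"echo": req})
recorder.handle("call-1")
recorder.handle("call-2")

# ...then replay them with no real transport attached.
replayer = ReplayTransport(trace=recorder.trace)
print(replayer.handle("call-1"))  # {'echo': 'call-1'}
```

The cursor-based replay is also why replay is sequential (see Limitations): an agent that issues calls in a different order gets the wrong recorded response.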

## Security

- API keys (`Authorization`, `x-api-key`) are **automatically redacted** from trace headers
- Use `--compact` to strip request bodies for smaller trace files
- Note: message content in request/response bodies is NOT redacted — don't include secrets in your prompts
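
Header redaction amounts to masking a small deny-list of sensitive header names before the trace is written. A stdlib sketch of that step (illustrative, not Litmus's actual code):

```python
# Header names to mask, compared case-insensitively as in HTTP.
SENSITIVE_HEADERS = {"authorization", "x-api-key"}

def redact_headers(headers):
    """Return a copy of `headers` with sensitive values masked."""
    return {
        name: "[REDACTED]" if name.lower() in SENSITIVE_HEADERS else value
        for name, value in headers.items()
    }

print(redact_headers({"X-Api-Key": "sk-secret", "Accept": "application/json"}))
# {'X-Api-Key': '[REDACTED]', 'Accept': 'application/json'}
```

Note this only covers headers; as stated above, message bodies pass through unredacted.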

## Limitations

- **Python only** — the monkey-patch approach (`litmus run`, `litmus.record()`) requires Python. Use proxy mode for other languages.
- **httpx-based SDKs** — works with SDKs that use `httpx` under the hood (Anthropic, OpenAI, Mistral, Cohere, etc.). SDKs using `requests` or `aiohttp` are not intercepted.
- **Sequential replay** — responses are served in recorded order. Agents that make calls in a different order on replay will get mismatched responses.
- **No tool call recording** — only LLM API calls are captured. External tool calls (database, HTTP APIs) are not recorded.

## Community

- [Discord](https://discord.gg/z2XJ85pT5e) — chat, bugs, feature requests
- [GitHub Issues](https://github.com/romirjain/litmus/issues) — bug reports
- [PyPI](https://pypi.org/project/litmus-trace/) — package

## Why Litmus?

**Observability tools** (LangSmith, Langfuse) tell you what happened. They log traces.

**Litmus tells you what *would* happen.** Record a production trace, replay it 100 times with different faults, and know exactly how your agent breaks — before your users find out.

LangSmith is the dashcam. Litmus is the crash test facility.

## License

MIT
