Metadata-Version: 2.4
Name: traceix
Version: 0.1.2
Summary: Trajectory-based CI testing for AI agents
Project-URL: Homepage, https://github.com/itgoujie2/traceix
Project-URL: Repository, https://github.com/itgoujie2/traceix
Project-URL: Bug Tracker, https://github.com/itgoujie2/traceix/issues
Author-email: traceix <itgoujie2@gmail.com>
License: MIT
Keywords: agents,ci,llm,testing,trajectory
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.11
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.25
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: crewai
Requires-Dist: crewai>=0.1; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-httpx>=0.30; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: langgraph
Requires-Dist: langchain-core>=0.2; extra == 'langgraph'
Requires-Dist: langgraph>=0.2; extra == 'langgraph'
Description-Content-Type: text/markdown

# traceix

**Trajectory-based CI testing for AI agents.**

traceix lets you declare which tools your agent should call — and in what order — as plain YAML, then run those assertions in CI the same way you'd run unit tests. No LLM-as-judge, no flaky eval pipelines: if the trajectory doesn't match, the build fails.

```bash
pip install traceix
```

---

## Why traceix?

LLM-powered agents are non-deterministic. The same prompt might call `search_flights` then `confirm_booking` today, but skip straight to `confirm_booking` tomorrow. traceix makes that observable and enforceable:

- **Declare the expected trajectory** in YAML — which tools, in what order, with what args.
- **Run it in CI** — the agent still calls the real LLM; only the tool *responses* are mocked.
- **Get a clear pass/fail** per run, with no evaluator prompt to engineer; the only threshold is the `tolerance` fraction over repeated runs (see Configuration).

---

## Quick start

```bash
pip install traceix
traceix init          # detects your framework, scaffolds traceix.yaml + tests/example.yaml
```

Write a test (`tests/book_flight.yaml`):

```yaml
name: book-flight-basic
input: "Book me the cheapest flight from NYC to SFO"

mocks:
  search_flights:
    return: { flights: [{ id: F1, price: 390 }, { id: F2, price: 420 }] }
  confirm_booking:
    return: { booking_id: BK-001, status: confirmed }

expected:
  trajectory:
    mode: contains        # these steps must appear in order (others allowed)
    steps:
      - tool: search_flights
        args: { origin: NYC, destination: SFO }
        arg_mode: partial  # only check the keys listed above
      - tool: confirm_booking
        arg_mode: ignore
  forbidden_tools: [cancel_booking]
```

Run it:

```bash
traceix run tests/ --handler mypackage.agent:run
```

```
  ✓  book-flight-basic   1/1   2 steps   142ms
  ──────────────────────────────────────────────
  1 passed · 0 failed
```

---

## Integration

### `@traceix_tool` decorator (LangChain / LangGraph — recommended)

Add `@traceix_tool` above your `@tool` decorators. Your agent handler stays completely unchanged — traceix patches mocks in during test runs without touching it:

```python
# mypackage/tools.py
from langchain_core.tools import tool
from traceix import traceix_tool

@traceix_tool   # ← add this line; nothing else changes
@tool
def search_flights(origin: str, destination: str) -> dict:
    """Search available flights."""
    ...  # real implementation
```

```python
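# mypackage/agent.py (no traceix imports, no changes needed)
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent
from mypackage.tools import confirm_booking, search_flights

def run(user_input: str) -> str:
    graph = create_react_agent(model, [search_flights, confirm_booking])  # `model`: your chat model, defined elsewhere
    result = graph.invoke({"messages": [HumanMessage(content=user_input)]})
    return result["messages"][-1].content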
```

### `tools=` parameter (any framework)

Use this approach for frameworks where `@traceix_tool` isn't available. You don't write any mock code: traceix builds callable tool objects from the `mocks:` section in your YAML and passes them as the `tools` list. Your handler just accepts and forwards them:

```python
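# mypackage/agent.py
from langchain_core.messages import HumanMessage
from langgraph.prebuilt import create_react_agent

# `model` is your chat model instance, defined elsewhere
def run(input: str, tools: list) -> str:  # ← accept the injected tools
    graph = create_react_agent(model, tools)  # ← forward them to the agent
    result = graph.invoke({"messages": [HumanMessage(content=input)]})
    return result["messages"][-1].content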
```

Given this YAML:

```yaml
mocks:
  search_flights:
    return: { flights: [{ id: F1, price: 390 }] }
  confirm_booking:
    return: { booking_id: BK-001, status: confirmed }
```

traceix calls your handler as `run(input="...", tools=[<mocked search_flights>, <mocked confirm_booking>])`. The agent calls those tools, they return what the YAML says, and traceix records the trajectory.

traceix auto-detects which mode you're using based on whether your handler has a `tools` parameter.
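
To make the injection flow concrete, here is a minimal sketch of how mocked tools could be built from a `mocks:` mapping and how a handler's `tools` parameter could be detected. It is an illustration under stated assumptions, not traceix's actual implementation: `make_mock_tools` and `wants_tools` are hypothetical names, the mocks are plain Python callables rather than framework tool objects, and the stand-in `run` handler calls the tools directly instead of going through an LLM.

```python
import inspect
from typing import Any, Callable


def make_mock_tools(mocks: dict[str, dict[str, Any]], calls: list[dict[str, Any]]) -> list[Callable[..., Any]]:
    """Build one callable per `mocks` entry (hypothetical helper).

    Each callable appends its call to `calls` and returns the canned value.
    """
    def build(name: str, value: Any) -> Callable[..., Any]:
        def tool(**kwargs: Any) -> Any:
            calls.append({"tool": name, "args": kwargs})
            return value
        tool.__name__ = name
        return tool

    return [build(name, spec["return"]) for name, spec in mocks.items()]


def wants_tools(handler: Callable[..., Any]) -> bool:
    """Return True if the handler declares a `tools` parameter (hypothetical helper)."""
    return "tools" in inspect.signature(handler).parameters


# Stand-in handler: calls each tool once, where a real agent would let the LLM decide.
def run(input: str, tools: list) -> str:
    by_name = {t.__name__: t for t in tools}
    flights = by_name["search_flights"](origin="NYC", destination="SFO")
    booking = by_name["confirm_booking"](flight_id=flights["flights"][0]["id"])
    return booking["status"]


mocks = {
    "search_flights": {"return": {"flights": [{"id": "F1", "price": 390}]}},
    "confirm_booking": {"return": {"booking_id": "BK-001", "status": "confirmed"}},
}
calls: list[dict[str, Any]] = []
if wants_tools(run):
    output = run(input="Book me the cheapest flight from NYC to SFO",
                 tools=make_mock_tools(mocks, calls))
```

After the run, `calls` holds the recorded trajectory (tool names and arguments in call order), which is the data the assertions in `expected:` would be checked against.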

---

## CLI commands

| Command | What it does |
|---|---|
| `traceix init` | Detect framework, scaffold `traceix.yaml` + example test |
| `traceix run tests/` | Run tests, exit 0 on pass / 1 on fail |
| `traceix run tests/ --fixture record` | Save real tool responses to `.traceix/fixtures/` |
| `traceix run tests/ --fixture replay` | Replay recorded responses in CI |
| `traceix snapshot tests/` | Save golden trajectory baselines |
| `traceix check tests/` | Compare live run against saved baselines |
| `traceix compare tests/ --a "model=X" --b "model=Y"` | A/B test two model configs side by side |
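
As a rough mental model for `--fixture record` and `--fixture replay`, the sketch below wraps a tool call so that record mode saves the real response and replay mode reads it back without calling the tool. Everything here is an assumption for illustration: the `call_tool` and `fixture_path` helpers, the one-JSON-file-per-call layout, and the hashed file names are hypothetical and may not match what traceix writes under `.traceix/fixtures/`.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def fixture_path(root: Path, tool: str, args: dict[str, Any]) -> Path:
    """Hypothetical layout: one JSON file per (tool, args) pair."""
    digest = hashlib.sha256(json.dumps(args, sort_keys=True).encode()).hexdigest()[:12]
    return root / f"{tool}-{digest}.json"


def call_tool(mode: str, root: Path, tool: str, fn: Callable[..., Any], args: dict[str, Any]) -> Any:
    """Call `fn`, saving its response in record mode or reading it back in replay mode."""
    path = fixture_path(root, tool, args)
    if mode == "replay":
        # No real call is made; a missing fixture raises FileNotFoundError.
        return json.loads(path.read_text())
    result = fn(**args)
    if mode == "record":
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result))
    return result


def real_search(origin: str, destination: str) -> dict:
    """Stand-in for a real tool that would hit a network API."""
    return {"flights": [{"id": "F1", "price": 390}]}


def must_not_run(**kwargs: Any) -> dict:
    raise AssertionError("replay mode should not call the real tool")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "fixtures"
    args = {"origin": "NYC", "destination": "SFO"}
    recorded = call_tool("record", root, "search_flights", real_search, args)
    replayed = call_tool("replay", root, "search_flights", must_not_run, args)
```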

---

## Trajectory modes

`mode` controls how the expected steps are matched against the agent's actual tool calls. Set it under `expected.trajectory.mode`.

**`contains`** — listed steps must appear in order, but the agent can call other tools in between. Most permissive and the most common choice:

```yaml
# passes for: search → clarify → confirm  (extra "clarify" step is fine)
mode: contains
steps:
  - tool: search_flights
  - tool: confirm_booking
```

**`strict`** — the agent must call exactly these tools, in exactly this order, nothing more:

```yaml
# fails if the agent calls any extra tool or reorders steps
mode: strict
steps:
  - tool: search_flights
  - tool: confirm_booking
```

**`unordered`** — all listed tools must be called, but order doesn't matter:

```yaml
# passes whether the agent searches flights before or after searching hotels
mode: unordered
steps:
  - tool: search_flights
  - tool: search_hotels
```

**`within`** — listed steps must appear as a contiguous block (no other tools in between), but can be preceded or followed by anything:

```yaml
# passes for: login → search → confirm → logout
# fails if any tool appears between search and confirm
mode: within
steps:
  - tool: search_flights
  - tool: confirm_booking
```
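
The four modes above can be read as four sequence checks over tool names. The sketch below is one way to express them in plain Python; it is not traceix's matcher. The function names are hypothetical, arguments are ignored, and `unordered` is assumed to tolerate extra tool calls, which the description above does not specify.

```python
from collections import Counter
from typing import Sequence


def matches_contains(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Expected tools appear in order; other calls may be interleaved (subsequence check)."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each expected tool must be found after the previous one.
    return all(tool in remaining for tool in expected)


def matches_strict(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Exactly these tools, in exactly this order."""
    return list(expected) == list(actual)


def matches_unordered(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Every expected tool is called at least as often as listed, in any order.

    Assumption: extra tool calls are allowed.
    """
    missing = Counter(expected) - Counter(actual)
    return not missing


def matches_within(expected: Sequence[str], actual: Sequence[str]) -> bool:
    """Expected tools appear as one contiguous block somewhere in the trajectory."""
    n = len(expected)
    return any(list(actual[i:i + n]) == list(expected)
               for i in range(len(actual) - n + 1))


expected = ["search_flights", "confirm_booking"]
with_extra = ["search_flights", "clarify", "confirm_booking"]

contains_ok = matches_contains(expected, with_extra)   # True: extra step is fine
strict_ok = matches_strict(expected, with_extra)       # False: extra step present
within_ok = matches_within(expected, with_extra)       # False: block is interrupted
```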

## Arg modes

`arg_mode` controls how strictly the tool's arguments are checked. Set it per step under `expected.trajectory.steps`.

**`ignore`** — only assert the tool was called, don't check arguments at all:

```yaml
- tool: complete_todo
  arg_mode: ignore
```

**`partial`** — assert only the keys you list; extra arguments the agent passes are fine:

```yaml
- tool: add_todo
  args: { title: "Buy groceries" }
  arg_mode: partial   # passes even if agent also sent priority: "high"
```

**`exact`** — every argument must match and no extra keys are allowed. This is the default when `arg_mode` is omitted:

```yaml
- tool: add_todo
  args: { title: "Buy groceries", priority: "medium" }
  arg_mode: exact     # fails if agent passes any other key
```

In practice: use `ignore` when you only care that a tool ran, `partial` when you want to pin one or two key arguments, and `exact` when the full payload matters (e.g. a payment or deletion).
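
The three arg modes reduce to three dictionary comparisons. The following is a minimal sketch under the assumption that values are compared with plain equality and that `partial` applies only to top-level keys; `args_match` is a hypothetical helper, not a traceix function.

```python
from typing import Any


def args_match(expected: dict[str, Any], actual: dict[str, Any], arg_mode: str = "exact") -> bool:
    """Compare expected and actual tool arguments under the given arg_mode."""
    if arg_mode == "ignore":
        # Only the tool call itself matters.
        return True
    if arg_mode == "partial":
        # Every listed key must be present with an equal value; extras are allowed.
        return all(key in actual and actual[key] == value
                   for key, value in expected.items())
    if arg_mode == "exact":
        # Same keys, same values, nothing extra.
        return expected == actual
    raise ValueError(f"unknown arg_mode: {arg_mode!r}")


actual = {"title": "Buy groceries", "priority": "high"}

partial_ok = args_match({"title": "Buy groceries"}, actual, "partial")  # True
exact_ok = args_match({"title": "Buy groceries"}, actual, "exact")      # False: extra key
ignore_ok = args_match({}, actual, "ignore")                            # True
```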

---

## Framework support

| Framework | Integration |
|---|---|
| LangChain / LangGraph | `@traceix_tool` decorator or `tools=` injection |
| CrewAI | `tools=` injection |
| Anthropic SDK | `tools=` injection |
| OpenAI SDK | `tools=` injection |
| Any other | `tools=` injection |

---

## Configuration

Set defaults in `traceix.yaml` (or `[tool.traceix]` in `pyproject.toml`):

```yaml
handler: mypackage.agent:run
runs: 3          # runs per test case (increase in CI for confidence)
tolerance: 0.67  # fraction of runs that must pass
fixture_mode: replay
```
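
To see how `runs` and `tolerance` interact, here is a small arithmetic sketch. It assumes a test case passes when `passed_runs / runs >= tolerance`; that comparison is an assumption, and traceix may round or compare differently. `case_passes` is a hypothetical helper.

```python
def case_passes(run_results: list[bool], tolerance: float) -> bool:
    """Assumed rule: the fraction of passing runs must be at least `tolerance`."""
    pass_rate = sum(run_results) / len(run_results)
    return pass_rate >= tolerance


three_of_three = case_passes([True, True, True], tolerance=0.67)   # True
two_of_three = case_passes([True, True, False], tolerance=0.67)    # False: 2/3 is about 0.667
two_of_three_relaxed = case_passes([True, True, False], tolerance=0.66)  # True
```

Under that assumed rule, two passing runs out of three fall just short of `0.67`, so if you intend "2 of 3" semantics it is worth confirming how traceix evaluates the threshold, or choosing a value such as `0.66`.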

---

## License

MIT
