Metadata-Version: 2.4
Name: ashr-labs
Version: 0.3.1
Summary: Python SDK for the Ashr Labs API
License-Expression: MIT
Project-URL: Homepage, https://github.com/ashr-labs/testing-platform
Project-URL: Documentation, https://github.com/ashr-labs/testing-platform#readme
Project-URL: Repository, https://github.com/ashr-labs/testing-platform
Keywords: ashr,labs,api,sdk
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.27
Requires-Dist: pydantic>=2.6
Provides-Extra: livekit
Requires-Dist: livekit-agents>=0.10; extra == "livekit"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Dynamic: license-file

# Ashr Labs Python SDK

A Python client library for evaluating AI agents against Ashr Labs test datasets.

## Documentation

- [Testing Your Agent](docs/testing-your-agent.md) — **start here** (includes debugging failures with transcripts and classification)
- [Quick Start Guide](docs/quickstart.md)
- [Installation](docs/installation.md)
- [Authentication](docs/authentication.md)
- [API Reference](docs/api-reference.md)
- [Error Handling](docs/error-handling.md)
- [Examples](docs/examples.md)

## Installation

```bash
pip install ashr-labs
```

## Quick Start

```python
from ashr_labs import AshrLabsClient, EvalRunner

# Only need your API key — base_url and tenant_id are automatic
client = AshrLabsClient(api_key="tp_your_api_key_here")

# Fetch a dataset and run your agent against it
runner = EvalRunner.from_dataset(client, dataset_id=42)
run = runner.run(my_agent)

# Submit results — grading happens server-side
created = run.deploy(client, dataset_id=42)

# Wait for grading to complete (typically 1-3 minutes)
graded = client.poll_run(created["id"])
metrics = graded["result"]["aggregate_metrics"]
print(f"Passed: {metrics['tests_passed']}/{metrics['total_tests']}")
```

Your agent just needs two methods:

```python
class MyAgent:
    def respond(self, message: str) -> dict:
        # Call your LLM, return {"text": "...", "tool_calls": [...]}
        return {"text": "response", "tool_calls": []}

    def reset(self) -> None:
        # Clear conversation history between scenarios
        pass
```

See [Testing Your Agent](docs/testing-your-agent.md) for a full end-to-end guide.

## Agents

Agents group your datasets and define how they should be generated and graded. Create an agent once, then generate consistent datasets for it.

```python
# Create an agent with tool definitions and grading config
agent = client.create_agent(
    name="Support Bot",
    description="Spanish-language healthcare scheduling agent",
    config={
        "tool_definitions": [
            {"name": "fetch_kareo_data", "required": True, "description": "Fetch appointment availability"},
            {"name": "save_data", "required": True, "description": "Persist caller info"},
            {"name": "end_session", "required": False, "description": "Close the conversation"},
        ],
        "behavior_rules": [
            {"rule": "Always fetch before quoting availability", "strictness": "required"},
            {"rule": "Save caller name via save_data", "strictness": "required"},
        ],
        "grading_config": {
            "tool_strictness": {
                "fetch_kareo_data": "required",
                "end_session": "optional",
                "await_user_response": "optional",
            },
        },
    },
)

# Link a dataset to the agent
client.set_dataset_agent(dataset_id=42, agent_id=agent["id"])

# Submit a run and auto-link to agent
run.deploy(client, dataset_id=42, agent_id=agent["id"])
```

### Grading behavior

The grading system uses agent config to make smarter decisions:

- **`required` tools**: Must be called. If the agent skips a required tool, it's a failure.
- **`optional` tools**: If the agent achieves the same intent via text (e.g. ends the conversation naturally instead of calling `end_session`), the grader counts it as a partial match instead of a failure.
- **`expected` tools**: Should be called, but a miss is a warning, not a failure.
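
A minimal sketch of a grading config that exercises all three strictness levels, wired into `create_agent` as in the example above. The tool names here are hypothetical; only the `tool_strictness` shape is taken from the earlier example.

```python
# Hypothetical tool names; strictness levels follow the rules listed above
grading_config = {
    "tool_strictness": {
        "fetch_kareo_data": "required",   # skipping this tool fails the test
        "send_confirmation": "expected",  # a miss is a warning, not a failure
        "end_session": "optional",        # text-only intent counts as a partial match
    },
}

agent = client.create_agent(
    name="Scheduling Bot",
    description="Example agent for illustrating grading strictness",
    config={"grading_config": grading_config},
)
```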

## Observability — Production Tracing

Trace your agent in production. Tracing captures LLM calls, tool invocations, and events. **Never crashes your agent**: if the backend is unreachable, errors are logged rather than raised.

```python
# Context managers (recommended) — auto-end on exit, auto-capture errors
with client.trace("handle-ticket", user_id="user_42") as trace:
    with trace.generation("classify", model="claude-sonnet-4-6",
                          input=[{"role": "user", "content": "help"}]) as gen:
        result = call_llm(...)
        gen.end(output=result, usage={"input_tokens": 50, "output_tokens": 12})

    with trace.span("tool:search", input={"q": "..."}) as tool:
        data = search(...)
        tool.end(output=data)

# Analytics
analytics = client.get_observability_analytics(days=7)
print(f"Traces: {analytics['overview']['total_traces']}")
print(f"Tool calls: {analytics['overview']['total_tool_calls']}")
```

See [API Reference](docs/api-reference.md) for full Trace/Span/Generation docs.

## Voice Observability — LiveKit

For realtime voice agents, the SDK ships an `ashr_labs.voice_obs` submodule that captures STT/LLM/TTS metrics, turn boundaries, barge-ins, and mixed-audio replay from a LiveKit `AgentSession`. Two-line attach:

```python
import os
from ashr_labs.voice_obs.livekit import VoiceObservability

obs = VoiceObservability(api_key=os.environ["ASHR_API_KEY"])
obs.attach(session, agent_id="support_v3", agent_version="v42")
```

LiveKit deps live behind an extra:

```bash
pip install ashr-labs[livekit]
```

Runnable demos:
```bash
python -m ashr_labs.voice_obs.examples.livekit_worker dev          # minimal
python -m ashr_labs.voice_obs.examples.ashr_support_agent dev      # full demo
```

Voice sessions land in the same Observability panel as text-trace sessions; the dashboard auto-renders turns, transcripts, per-stage cost, latency, and audio replay.

## VM Stream Logs

Attach virtual machine session logs to test results for browser-based or desktop-based agents:

```python
test = run.add_test("checkout_flow")
test.start()
# ... run agent, add tool calls and responses ...

# Kernel browser session (first-class support)
test.set_kernel_vm(
    session_id="kern_sess_abc123",
    duration_ms=15000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
    ],
    replay_id="replay_abc123",
    replay_view_url="https://www.kernel.sh/replays/replay_abc123",
    stealth=True,
    viewport={"width": 1920, "height": 1080},
)

# Or use the generic set_vm_stream() for any provider
test.set_vm_stream(
    provider="browserbase",
    session_id="sess_abc123",
    duration_ms=45000,
    logs=[
        {"ts": 0, "type": "navigation", "data": {"url": "https://app.example.com"}},
        {"ts": 1200, "type": "action", "data": {"action": "click", "selector": "#login"}},
    ],
)
test.complete()
```

## Available Methods

All methods that accept `tenant_id` auto-resolve it from your API key if omitted.

### Agents

| Method | Description |
|--------|-------------|
| `list_agents()` | List all agents with dataset counts |
| `create_agent(name, description, config)` | Create a new agent |
| `update_agent(agent_id, name, description, config)` | Update an agent |
| `delete_agent(agent_id)` | Soft-delete an agent |
| `get_agent_datasets(agent_id)` | Get datasets linked to an agent |
| `set_dataset_agent(dataset_id, agent_id)` | Link a dataset to an agent (or unlink it) |

### Datasets

| Method | Description |
|--------|-------------|
| `get_dataset(dataset_id, ...)` | Get a dataset by ID |
| `list_datasets(limit, cursor, ...)` | List datasets (cursor-based pagination) |
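
A sketch of paging through datasets with the cursor-based API. The response shape (`items` and `next_cursor` keys) is an assumption; see the API Reference for the exact field names.

```python
# Page through all datasets 50 at a time; "items"/"next_cursor" keys are assumed
cursor = None
while True:
    page = client.list_datasets(limit=50, cursor=cursor)
    for dataset in page.get("items", []):
        print(dataset["id"], dataset.get("name"))
    cursor = page.get("next_cursor")
    if not cursor:
        break
```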

### Runs

| Method | Description |
|--------|-------------|
| `create_run(dataset_id, result, ...)` | Create a new test run |
| `get_run(run_id)` | Get a run by ID |
| `list_runs(dataset_id, limit)` | List runs |
| `delete_run(run_id)` | Delete a run |
| `poll_run(run_id, timeout, poll_interval)` | Wait for server-side grading to complete |
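
For example, a built result can be submitted with `create_run` directly instead of `run.deploy()`. This is a sketch: the assumption that `timeout` and `poll_interval` are in seconds, and the exact response fields, should be checked against the API Reference.

```python
# Build the result locally, submit it, then wait for server-side grading
result = run.build()
created = client.create_run(dataset_id=42, result=result)

graded = client.poll_run(created["id"], timeout=300, poll_interval=5)
print(graded["result"]["aggregate_metrics"])
```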

### EvalRunner

| Method | Description |
|--------|-------------|
| `EvalRunner.from_dataset(client, dataset_id)` | Create a runner from a dataset |
| `runner.run(agent, max_workers=1, on_environment=...)` | Run agent against all scenarios, return `RunBuilder` |
| `runner.run_and_deploy(agent, client, dataset_id, max_workers=1)` | Run and submit in one call |
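
For example, the one-call form, assuming the same `client` and `my_agent` objects as in the Quick Start:

```python
# Run every scenario (in parallel) and submit the results in a single call
runner = EvalRunner.from_dataset(client, dataset_id=42)
runner.run_and_deploy(my_agent, client, dataset_id=42, max_workers=4)
```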

### RunBuilder

| Method | Description |
|--------|-------------|
| `RunBuilder()` | Create a new run builder |
| `run.start()` | Mark the run as started |
| `run.add_test(test_id)` | Add a test and get a `TestBuilder` |
| `run.complete(status)` | Mark the run as completed |
| `run.build()` | Serialize to a result dict |
| `run.deploy(client, dataset_id, agent_id)` | Build and submit via the API |
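
A sketch of building a run by hand, for agents that don't fit the `respond`/`reset` protocol. The top-level `RunBuilder` import and the `status` values are assumptions.

```python
from ashr_labs import RunBuilder  # import path assumed

run = RunBuilder()
run.start()

test = run.add_test("checkout_flow")   # returns a TestBuilder (see below)
# ... populate the test with inputs, tool calls, and responses ...
test.complete()

run.complete(status="completed")       # status value is an assumption

result = run.build()                   # serialize locally, or...
run.deploy(client, dataset_id=42)      # ...build and submit in one step
```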

### TestBuilder

| Method | Description |
|--------|-------------|
| `test.start()` | Mark the test as started |
| `test.add_user_file(file_path, description)` | Record a user file upload |
| `test.add_user_text(text, description)` | Record a user text input |
| `test.add_tool_call(expected, actual, match_status)` | Record an agent tool call |
| `test.add_agent_response(expected_response, actual_response, match_status)` | Record an agent response |
| `test.set_vm_stream(provider, session_id, logs, ...)` | Attach VM session logs |
| `test.set_kernel_vm(session_id, ...)` | Attach Kernel VM session (convenience) |
| `test.complete(status)` | Mark the test as completed |
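
A sketch of recording a single turn on a test. The `match_status` and `status` values shown are assumptions; check the API Reference for the accepted values.

```python
test = run.add_test("booking_flow")
test.start()
test.add_user_text("Book me for Tuesday at 10am", description="opening request")
test.add_tool_call(
    expected={"name": "fetch_kareo_data", "arguments": {"date": "Tuesday"}},
    actual={"name": "fetch_kareo_data", "arguments": {"date": "2024-06-11"}},
    match_status="match",              # value is an assumption
)
test.add_agent_response(
    expected_response="Confirms an available slot",
    actual_response="You're booked for Tuesday at 10:00.",
    match_status="match",
)
test.complete(status="passed")         # value is an assumption
```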

### Requests

| Method | Description |
|--------|-------------|
| `create_request(request_name, request, ...)` | Create a new request |
| `get_request(request_id)` | Get a request by ID |
| `list_requests(status, limit, cursor)` | List requests |

### Observability

| Method | Description |
|--------|-------------|
| `client.trace(name, ...)` | Start a production trace (returns `Trace`) |
| `trace.span(name, ...)` / `trace.generation(name, ...)` | Add spans or LLM calls |
| `trace.end(output=...)` | Flush trace to backend (**never raises**) |
| `list_observability_traces(user_id, session_id, ...)` | List traces |
| `get_observability_trace(trace_id)` | Get trace with full observation tree |
| `get_observability_analytics(days)` | Analytics: tokens, latency, errors, tool perf |
| `get_observability_errors(days, limit, page)` | Traces with errors |
| `get_observability_tool_errors(days, limit, page)` | Traces with tool failures |
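
A sketch of reading traces back out after they have been recorded. The response fields accessed here (`items`, `id`, `name`) are assumptions.

```python
# List recent traces for one user, then fetch one with its full observation tree
traces = client.list_observability_traces(user_id="user_42")
items = traces.get("items", [])                 # "items" key is assumed
for t in items:
    print(t["id"], t.get("name"))

if items:
    detail = client.get_observability_trace(items[0]["id"])

# Surface traces with errors from the last week
errors = client.get_observability_errors(days=7, limit=20, page=1)
```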

### API Keys & Session

| Method | Description |
|--------|-------------|
| `init()` | Validate credentials and get user/tenant info |
| `list_api_keys(include_inactive)` | List API keys for your tenant |
| `revoke_api_key(api_key_id)` | Revoke an API key |
| `health_check()` | Check if the API is reachable |
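
For example, a startup check. That `health_check()` returns a boolean, and the fields returned by `init()` and `list_api_keys()`, are assumptions.

```python
# Verify connectivity and credentials before running evals
if not client.health_check():
    raise RuntimeError("Ashr Labs API is unreachable")

info = client.init()                        # user/tenant info
print("Tenant:", info.get("tenant_id"))     # field name is an assumption

for key in client.list_api_keys(include_inactive=False):
    print(key)
```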

## Error Handling

```python
from ashr_labs import AshrLabsClient, NotFoundError, AuthenticationError

client = AshrLabsClient(api_key="tp_...")

try:
    dataset = client.get_dataset(dataset_id=999)
except AuthenticationError:
    print("Invalid API key")
except NotFoundError:
    print("Dataset not found")
```

## Configuration

```python
# All defaults — just pass API key
client = AshrLabsClient(api_key="tp_...")

# From environment (reads ASHR_LABS_API_KEY)
client = AshrLabsClient.from_env()

# Custom timeout
client = AshrLabsClient(api_key="tp_...", timeout=60)

# Custom base URL (for self-hosted)
client = AshrLabsClient(api_key="tp_...", base_url="https://your-api.example.com")
```

## Requirements

- Python 3.10+
- `httpx` and `pydantic` (installed automatically)

## License

MIT
