Metadata-Version: 2.4
Name: zhanla-sdk-py
Version: 0.1.2.1
Summary: Benchmark SDK for instrumenting AI components
Project-URL: Homepage, https://benchmark-black.vercel.app/
Project-URL: Repository, https://github.com/zhanla-ai
Author-email: Zhanla <chouweimin@berkeley.edu>
License: Proprietary
Requires-Python: >=3.10
Provides-Extra: anthropic
Requires-Dist: anthropic; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: example-providers
Requires-Dist: anthropic; extra == 'example-providers'
Requires-Dist: google-genai; extra == 'example-providers'
Requires-Dist: openai; extra == 'example-providers'
Provides-Extra: google
Requires-Dist: google-genai; extra == 'google'
Provides-Extra: openai
Requires-Dist: openai; extra == 'openai'
Description-Content-Type: text/markdown

# zhanla-sdk-py

`zhanla-sdk-py` is the Python SDK for defining Benchmark components in code.

You use it to declare tools, skills, agents, orchestrations, and evals as Python objects, then run them with `zhanla`.

## Installation

Install the SDK:

```bash
pip install zhanla-sdk-py
```

Requires Python `>=3.10`.

The SDK itself has no runtime dependencies. Provider packages (`anthropic`, `openai`, `google-genai`) are optional and only required if you use `bench.wrap()`.

If you want to execute components from the command line, install the CLI too:

```bash
pip install zhanla
```

## Quick Start

Create a Python file with module-level component instances:

```python
import json

import anthropic
import zhanla as bench

# Wrap the client so calls made through it are recorded during eval runs.
client = bench.wrap(anthropic.Anthropic())


def _classify(message: str, customer_tier: str = "standard", **_) -> dict:
    # Ask the model for a JSON priority label.
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=64,
        system='Reply with JSON: {"priority": "high|normal|low"}.',
        messages=[{"role": "user", "content": message}],
    )
    result = json.loads(response.content[0].text)
    result["customer_tier"] = customer_tier
    return result


priority_tool = bench.Tool(
    name="priority_tool",
    description="Classify the priority of a support message.",
    fn=_classify,
    input_schema={"message": str, "customer_tier": str},
    output_schema={
        "priority": str,
        "customer_tier": str,
    },
)


def priority_eval(model_response, expected_output, **_) -> dict:
    return {
        "score": 1.0 if model_response["priority"] == expected_output["priority"] else 0.0
    }


priority_eval_component = bench.CodeEval(
    name="priority_eval",
    description="Check whether the predicted priority matches the expected value.",
    input_schema={},
    fn=priority_eval,
)
```

Run it with the CLI:

```bash
bench run components.py:priority_tool --dataset tickets.json --eval components.py:priority_eval
```

If a file contains exactly one runnable component, `:component_name` is optional.

## Core Concepts

The SDK is class-based. The public import is:

```python
import zhanla as bench
```

Define components as module-level objects so the CLI can discover them when it imports your file.

## Runnable Components

### `Tool`

Use a `Tool` for deterministic Python logic.

```python
lookup_customer = bench.Tool(
    name="lookup_customer",
    description="Fetch a customer record by ID.",
    fn=get_customer,
    input_schema={"customer_id": str},
    output_schema={"id": str, "email": str},
)
```

Requirements:

- `name`
- `description`
- `fn`
- `input_schema`
- `output_schema`

Notes:

- `fn` can be sync or async.
- `input_schema` can be a simple dict like `{"field": str}`, a JSON-Schema-shaped dict, or a Pydantic model class.
- If `fn` returns a non-dict value at runtime, the CLI wraps it as `{"result": value}`.
- The CLI validates the first produced output against `output_schema`.
- `output_schema` can be a simple dict like `{"field": str}` or a Pydantic model class.
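
The wrapping and validation behavior in the notes above can be illustrated with plain Python. This is a minimal sketch of the described behavior for the simple `{"field": str}` schema form only, not the CLI's actual implementation; `wrap_output` and `check_simple_schema` are hypothetical names.

```python
from typing import Any


def wrap_output(value: Any) -> dict:
    """Wrap a non-dict return value as {"result": value}, as the notes describe."""
    return value if isinstance(value, dict) else {"result": value}


def check_simple_schema(output: dict, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the output matches."""
    problems = []
    for field, expected_type in schema.items():
        if field not in output:
            problems.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    return problems


record = wrap_output({"id": "c_42", "email": "a@example.com"})
print(check_simple_schema(record, {"id": str, "email": str}))  # []
print(wrap_output(3))  # {'result': 3}
```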

### `Skill`

Use a `Skill` for reusable instructions, optionally backed by Python code.

```python
summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
)
```

With tools and an output schema:

```python
summarize_ticket = bench.Skill(
    name="summarize_ticket",
    description="Summarize a support ticket.",
    instructions="Summarize the ticket in one short paragraph.",
    tools=[lookup_customer],
    output_schema={"summary": str},
)
```

Requirements:

- `name`
- `description`
- `instructions`

Notes:

- `tools` and `output_schema` are optional.
- Skills are prompt-only definitions. They cannot be executed directly in the local CLI; they are composed into Agents and Orchestrations.

### `Agent`

Use an `Agent` to define an LLM-backed component with instructions and references to other components.

```python
import zhanla as bench

support_agent = bench.Agent(
    name="support_agent",
    description="Respond to support requests.",
    instructions="Answer clearly and use available tools when needed.",
    model="claude-sonnet-4-6",
    tools=[lookup_customer],
    skills=[summarize_ticket],
    output_schema={"answer": str},
)
```

Requirements:

- `name`
- `description`
- `instructions`
- `model`

Notes:

- `tools`, `skills`, `agents`, and `output_schema` are optional.
- Local CLI execution requires a configured `runner` on the component. Without a runner, the CLI raises an error.

### `LLMProcessor`

Use an `LLMProcessor` when you want a prompt-defined LLM transformation step.

```python
import zhanla as bench

intent_classifier = bench.LLMProcessor(
    name="intent_classifier",
    description="Classify the user's intent.",
    instructions="Return the intent as billing, technical, or other.",
    model="claude-haiku-4-5",
    output_schema={"intent": str},
)
```

Requirements:

- `name`
- `description`
- `instructions`
- `model`

Notes:

- `output_schema` is optional.
- Local CLI execution requires a configured `runner` on the component. Without a runner, the CLI raises an error.

### `Orchestration`

Use an `Orchestration` to compose multiple steps into a DAG.

```python
support_pipeline = bench.Orchestration(
    name="support_pipeline",
    description="Classify intent, then draft a reply.",
    steps=[
        bench.Step(component=intent_classifier, name="classify", next=["reply"]),
        bench.Step(component=support_agent, name="reply"),
    ],
)
```

Requirements:

- `name`
- `description`
- `steps`

Notes:

- `bench.Step` is an alias for `bench.OrchestrationStep`.
- Step names must be unique.
- `next` targets must point to existing steps.
- Cycles are rejected by CLI validation.
- During execution, each step receives the accumulated state dictionary.
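
The three structural rules above (unique names, valid `next` targets, no cycles) can be checked with a short depth-first search. The sketch below works on plain dicts rather than `bench.Step` objects, and `validate_steps` is a hypothetical helper, so treat it as an illustration of the rules rather than the CLI's validator.

```python
def validate_steps(steps: list[dict]) -> list[str]:
    """Check unique names, valid `next` targets, and absence of cycles.

    Each step is a plain dict: {"name": str, "next": list[str]} ("next" optional).
    """
    errors = []
    names = [s["name"] for s in steps]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    errors += [f"duplicate step name: {n}" for n in duplicates]

    graph = {s["name"]: list(s.get("next", [])) for s in steps}
    for name, targets in graph.items():
        for target in targets:
            if target not in graph:
                errors.append(f"{name}: unknown next target {target}")

    # Three-colour depth-first search: reaching a GREY node means a back edge.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = dict.fromkeys(graph, WHITE)

    def visit(node: str) -> bool:
        colour[node] = GREY
        for target in graph[node]:
            if target not in graph:
                continue  # already reported as an unknown target
            if colour[target] == GREY:
                return True
            if colour[target] == WHITE and visit(target):
                return True
        colour[node] = BLACK
        return False

    if any(colour[n] == WHITE and visit(n) for n in graph):
        errors.append("cycle detected")
    return errors


pipeline = [{"name": "classify", "next": ["reply"]}, {"name": "reply"}]
print(validate_steps(pipeline))  # []
```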

### `Conditional`

Use `Conditional` inside an orchestration to route between steps.

```python
bench.Step(
    component=bench.Conditional(
        condition=lambda state: state["classify"]["intent"] == "billing",
        if_true="billing_reply",
        if_false="general_reply",
    ),
    name="route",
)
```

`Conditional` does not emit output. It only chooses the next step.
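
To make the routing behavior concrete, here is a minimal sketch in plain Python. It assumes the accumulated state is keyed by step name, which is what `state["classify"]["intent"]` in the example above suggests; `route` is a hypothetical helper, not an SDK function.

```python
from typing import Callable

State = dict[str, dict]


def route(state: State, condition: Callable[[State], bool],
          if_true: str, if_false: str) -> str:
    """Return the name of the next step; nothing is written to the state."""
    return if_true if condition(state) else if_false


# State as it might look after a step named "classify" has run.
state = {"classify": {"intent": "billing"}}

next_step = route(
    state,
    condition=lambda s: s["classify"]["intent"] == "billing",
    if_true="billing_reply",
    if_false="general_reply",
)
print(next_step)  # billing_reply
```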

## Eval Components

### `CodeEval`

Use a `CodeEval` for Python-based scoring logic.

```python
quality_eval = bench.CodeEval(
    name="quality_eval",
    description="Score whether the answer is acceptable.",
    input_schema={},
    fn=score_answer,
)
```

Requirements:

- `name`
- `description`
- `fn`

Notes:

- `fn` can be sync or async.
- If the eval returns a non-dict value, runtime wraps it as `{"score": value}`.
- `model_response_format` defaults to `"JSON"` and can also be set to `"TEXT"` or `"YAML"`.
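
The example above references `score_answer` without showing it. A hypothetical stand-in might look like the following; it borrows the `(model_response, expected_output, **_)` signature from `priority_eval` in the Quick Start, and the `answer` field name is an assumption.

```python
def score_answer(model_response: dict, expected_output: dict, **_) -> dict:
    """Score 1.0 when the answers match, ignoring case and surrounding whitespace."""
    predicted = str(model_response.get("answer", "")).strip().lower()
    expected = str(expected_output.get("answer", "")).strip().lower()
    # An empty prediction never counts as a match.
    return {"score": 1.0 if predicted and predicted == expected else 0.0}


print(score_answer({"answer": " Paris "}, {"answer": "paris"}))  # {'score': 1.0}
```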

### `LLMEval`

Use an `LLMEval` for prompt-defined scoring.

```python
tone_eval = bench.LLMEval(
    name="tone_eval",
    description="Check response tone.",
    instructions="Return a score from 0.0 to 1.0 and a short reason.",
    model="your-model-id",
    output_schema={"score": float, "reason": str},
)
```

Requirements:

- `name`
- `description`
- `instructions`
- `model`

Notes:

- `output_schema` is optional.
- Local CLI execution requires a configured `runner` on the component; the runner is called to score the response.

### `Checklist`

Use a `Checklist` to combine multiple evals with optional weights.

```python
answer_quality = bench.Checklist(
    name="answer_quality",
    description="Combine correctness and tone scores.",
    evals=[quality_eval, tone_eval],
    weights=[0.8, 0.2],
)
```

Notes:

- If `weights` is omitted, each eval gets weight `1.0`.
- Weights must be positive and must match the number of evals.

### `EvalTree`

Use an `EvalTree` for score-based branching.

```python
adaptive_eval = bench.EvalTree(
    name="adaptive_eval",
    description="Route to different evals based on an initial score.",
    root=bench.Branch(
        eval=quality_eval,
        threshold=0.8,
        if_pass=[bench.Edge(weight=1.0, node=bench.Leaf(eval=quality_eval))],
        if_fail=[bench.Edge(weight=1.0, node=bench.Leaf(eval=tone_eval))],
    ),
)
```

Notes:

- Branch thresholds must be between `0.0` and `1.0`.
- Edge weights must be positive.
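
As an illustration of these two rules, the sketch below walks a tree expressed as nested dicts and reports violations. The dict layout mirrors the `Branch`, `Edge`, and `Leaf` arguments shown above but is otherwise an assumption, and `validate_tree` is a hypothetical helper rather than part of the SDK.

```python
def validate_tree(node: dict, path: str = "root") -> list[str]:
    """Recursively check thresholds and edge weights in a nested-dict tree.

    Leaf:   {"eval": name}
    Branch: {"eval": name, "threshold": float, "if_pass": [edges], "if_fail": [edges]}
    Edge:   {"weight": float, "node": node}
    """
    errors = []
    if "threshold" not in node:  # a leaf has nothing to validate
        return errors
    if not 0.0 <= node["threshold"] <= 1.0:
        errors.append(f"{path}: threshold {node['threshold']} outside [0.0, 1.0]")
    for side in ("if_pass", "if_fail"):
        for i, edge in enumerate(node.get(side, [])):
            child_path = f"{path}.{side}[{i}]"
            if edge["weight"] <= 0:
                errors.append(f"{child_path}: weight must be positive")
            errors += validate_tree(edge["node"], child_path)
    return errors


tree = {
    "eval": "quality_eval",
    "threshold": 0.8,
    "if_pass": [{"weight": 1.0, "node": {"eval": "quality_eval"}}],
    "if_fail": [{"weight": 1.0, "node": {"eval": "tone_eval"}}],
}
print(validate_tree(tree))  # []
```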

## Discovery And CLI Usage

The CLI discovers components by importing your Python file and scanning module-level attributes for `bench` component instances.

That means:

- your file is executed at import time
- module-level side effects will run during discovery
- components should usually be defined at module scope
- if a file contains multiple runnable components, use `file.py:component_name`
- evals are referenced separately with `--eval file.py:eval_name`
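
The discovery mechanism can be approximated with the standard library. The sketch below imports a file by path and collects module-level instances of a base class; `Component`, `sdk_stub`, and `discover` are hypothetical stand-ins used only to keep the example self-contained, and the real CLI may work differently.

```python
import importlib.util
import sys
import tempfile
import types
from pathlib import Path


class Component:
    """Hypothetical stand-in for an SDK component class."""

    def __init__(self, name: str) -> None:
        self.name = name


def discover(path: str, base: type) -> dict[str, object]:
    """Import the file at `path` and return module-level instances of `base`."""
    spec = importlib.util.spec_from_file_location(Path(path).stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # top-level code, side effects included, runs here
    return {k: v for k, v in vars(module).items() if isinstance(v, base)}


# Expose the stand-in under an importable name so the demo file can use it.
stub = types.ModuleType("sdk_stub")
stub.Component = Component
sys.modules["sdk_stub"] = stub

source = '''
from sdk_stub import Component

priority_tool = Component("priority_tool")  # module scope: discovered

def build():
    hidden = Component("hidden")  # local scope: not discovered
    return hidden
'''

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "components.py"
    file_path.write_text(source)
    found = discover(str(file_path), Component)

print(sorted(found))  # ['priority_tool']
```

Because the file is executed on import, anything at module scope (network calls, file writes) runs during discovery, which is why heavy work is better kept inside functions.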

Example:

```bash
bench run workflow.py:support_pipeline --dataset tickets.json --eval evals.py:answer_quality
```

## Validation Rules

Before execution, the CLI validates component structure.

- `Tool` must provide a callable `fn` and a non-`None` `output_schema`
- `CodeEval` must provide a callable `fn`
- `Skill`, `Agent`, `LLMProcessor`, and `LLMEval` must provide `instructions`
- `Agent`, `LLMProcessor`, and `LLMEval` must provide `model`
- `Orchestration` steps must reference valid targets and must not contain cycles
- `Checklist` weights must match the eval count and all be positive
- `EvalTree` branch thresholds must stay in `[0.0, 1.0]` and edge weights must be positive

## Local Execution Caveats

The SDK defines the component model. The current local CLI runtime dispatches as follows:

- `Tool` — executes `fn`
- `CodeEval` — executes `fn`
- `Skill` — raises an error; Skills are prompt-only and cannot be executed directly
- `Agent` — requires a configured `runner` and `model`; calls the runner to generate a response
- `LLMProcessor` — requires a configured `runner` and `model`; calls the runner to generate a response
- `LLMEval` — requires a configured `runner` and `model`; calls the runner to score the response
- `Orchestration` — executes its steps locally and returns the last step output

When a `runner` is set and a wrapped client is used, actual LLM calls are made and traced.

## Version Hashing

Every component exposes `version_hash()` for stable content-based versioning.

```python
lookup_customer.version_hash()
support_agent.version_hash()
answer_quality.version_hash()
```

Highlights:

- descriptions do not affect the hash
- `Tool` hashes function source and `output_schema`
- `Agent` hashes instructions, model, referenced component names, and `output_schema`
- `Checklist` hashes referenced eval names and weights
- `EvalTree` hashes its tree structure
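
To show what content-based versioning means in practice, here is a minimal sketch that hashes a tool's function source and output schema while ignoring its description. The use of SHA-256 over canonical JSON is an assumption for illustration; `content_hash` and `tool_hash` are hypothetical names, and the SDK's `version_hash()` may use a different scheme.

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    """Hash a JSON-serializable payload with sorted keys for a stable digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tool_hash(fn_source: str, output_schema: dict[str, type], description: str = "") -> str:
    """Hash function source and output schema; `description` is deliberately ignored."""
    schema = {field: t.__name__ for field, t in output_schema.items()}
    return content_hash({"fn_source": fn_source, "output_schema": schema})


src = "def get_customer(customer_id):\n    return {'id': customer_id}\n"
a = tool_hash(src, {"id": str}, description="Fetch a customer record.")
b = tool_hash(src, {"id": str}, description="Reworded description.")
c = tool_hash(src + "# changed\n", {"id": str})
print(a == b, a == c)  # True False
```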

## Parsing Model Output

Use `bench.parse_json_response(text)` to extract JSON from raw model text, including fenced code blocks.

```python
import zhanla as bench

text = client.messages.create(...).content[0].text
result = bench.parse_json_response(text)
```

This handles responses wrapped in ` ```json ` fences as well as bare JSON strings.
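
For intuition, fenced-JSON extraction can be sketched with a regular expression. This is not the implementation of `bench.parse_json_response`; `extract_json` is a hypothetical helper that covers only the two cases mentioned above.

```python
import json
import re

# Build the three-backtick marker programmatically to keep this block readable.
MARK = "`" * 3
FENCE = re.compile(re.escape(MARK) + r"(?:json)?\s*(.*?)" + re.escape(MARK), re.DOTALL)


def extract_json(text: str):
    """Parse JSON from the first fenced block if present, else from the whole text."""
    match = FENCE.search(text)
    candidate = match.group(1) if match else text
    return json.loads(candidate.strip())


fenced = "Here you go:\n" + MARK + 'json\n{"priority": "high"}\n' + MARK
print(extract_json(fenced))                 # {'priority': 'high'}
print(extract_json('{"priority": "low"}'))  # {'priority': 'low'}
```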

## LLM Call Observability

### `bench.wrap(client)`

Wrap an LLM client so every call made through it is recorded against the current eval run.

```python
import anthropic
import openai
import zhanla as bench

# Anthropic
client = bench.wrap(anthropic.Anthropic())

# OpenAI (also covers OpenRouter via base_url)
client = bench.wrap(openai.OpenAI())
```

The wrapped client exposes the same interface as the original. `bench.wrap()` only observes calls; it does not re-implement any LLM logic.

Supported clients:

| Client | Install |
|---|---|
| `anthropic.Anthropic` | `pip install anthropic` |
| `anthropic.AsyncAnthropic` | `pip install anthropic` |
| `openai.OpenAI` | `pip install openai` |
| `openai.AsyncOpenAI` | `pip install openai` |
| `google.genai.Client` | `pip install google-genai` |

When `bench.wrap()` is active and `llm_function` is called by the CLI, each LLM call captures:

- provider, model
- input messages, output, tool calls, raw response
- input/output token counts
- latency
- stop reason
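
The observe-only idea can be illustrated with a small proxy that forwards every call and records its latency. This is a sketch under stated assumptions, not how `bench.wrap()` is implemented: `Observed`, `FakeClient`, and `FakeMessages` are hypothetical, and only latency is captured.

```python
import time
from typing import Any


class Observed:
    """Proxy that forwards attribute access and records call latency."""

    def __init__(self, target: Any, records: list[dict], path: str = "") -> None:
        self._target = target
        self._records = records
        self._path = path

    def __getattr__(self, name: str) -> Any:
        attr = getattr(self._target, name)
        path = f"{self._path}.{name}" if self._path else name
        if callable(attr):
            def observed_call(*args, **kwargs):
                start = time.perf_counter()
                result = attr(*args, **kwargs)  # the original logic runs unchanged
                self._records.append(
                    {"call": path, "latency_s": time.perf_counter() - start}
                )
                return result
            return observed_call
        if isinstance(attr, (str, int, float, bool, type(None))):
            return attr
        # Nested namespaces such as client.messages get wrapped as well.
        return Observed(attr, self._records, path)


class FakeMessages:
    def create(self, **kwargs):
        return {"text": "ok", "model": kwargs.get("model")}


class FakeClient:
    def __init__(self) -> None:
        self.messages = FakeMessages()


records: list[dict] = []
client = Observed(FakeClient(), records)
response = client.messages.create(model="demo-model")
print(response["text"], records[0]["call"])  # ok messages.create
```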

### Trace context

The CLI sets a `TraceContext` before calling `llm_function`. The wrapped client reads the active context automatically. You do not need to manage the context directly.

If you need to access the trace context in your own code:

```python
from zhanla.trace_store import get_trace_context

ctx = get_trace_context()  # None outside a CLI run
if ctx:
    print(ctx.trace_id)
```
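
One common way to make a context available without passing it explicitly is `contextvars` from the standard library. The sketch below shows that pattern; whether the SDK uses `contextvars` internally is an assumption, and `DemoTraceContext`, `get_demo_trace_context`, and `run_with_trace` are hypothetical names.

```python
import contextvars
import uuid
from dataclasses import dataclass


@dataclass
class DemoTraceContext:  # hypothetical; only `trace_id` appears in the text above
    trace_id: str


_current = contextvars.ContextVar("demo_trace_context", default=None)


def get_demo_trace_context():
    """Return the active context, or None outside a traced run."""
    return _current.get()


def run_with_trace(fn):
    """Set a fresh context, call `fn`, then restore the previous state."""
    token = _current.set(DemoTraceContext(trace_id=uuid.uuid4().hex))
    try:
        return fn()
    finally:
        _current.reset(token)


def user_code():
    ctx = get_demo_trace_context()
    return ctx.trace_id if ctx else None


print(user_code())                  # None
print(len(run_with_trace(user_code)))  # 32
```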

## Advanced Utilities

Most users only need the component classes above. The SDK also exposes a few lower-level helpers:

```python
import zhanla as bench
from zhanla.registry import registry
```

- `bench.ComponentType` enum for component categories
- `bench.EvalTrace` for runtime trace records
- `bench.parse_json_response(text)` for extracting JSON from model text responses
- `bench.get_all()` and `bench.clear()` for the execution-local trace store
- `registry.register(...)`, `registry.get(...)`, `registry.get_by_name(...)`, `registry.discover()`, and `registry.clear()` for the global component registry

In normal CLI usage, you do not need to register components manually.
