Metadata-Version: 2.4
Name: tollgateai
Version: 0.6.0
Summary: Track LLM model usage and compute live gross margin with Tollgate.
Project-URL: Homepage, https://tollgateai.vercel.app
Author: Tollgate
License: MIT
License-File: LICENSE
Keywords: anthropic,bedrock,cost,gemini,llm,margin,observability,openai,tokens,tollgate
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown

# tollgateai

> Real-time gross-margin observability for AI agents. Track every LLM call's cost, attribute it to a customer, and see whether you're making money — before the invoice goes out.

**v0.6.0** &middot; [PyPI](https://pypi.org/project/tollgateai/) &middot; [Dashboard](https://tollgateai.vercel.app)

---

## Why Tollgate

You sell an AI-powered product. Each customer interaction triggers LLM calls that cost you real money — input tokens, output tokens, reasoning tokens, audio tokens, cached tokens, web searches, tool calls. Tollgate captures that cost automatically from provider responses, joins it with the revenue your pricing model defines, and shows you per-customer, per-agent, per-run gross margin in real time.

## Installation

```bash
pip install tollgateai
```

Requires Python 3.8+. **Zero dependencies**: the SDK itself uses only `urllib` and `threading` from the standard library. Install the provider SDKs you wrap (for example `anthropic` or `openai`) separately.

## Quick Start

```python
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()          # reads TOLLGATE_API_KEY from env
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

# Every call is tracked automatically — tokens, cost, latency, tool calls.
msg = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Resolve this billing dispute..."}],
)

# Close the run and book revenue.
tollgate.resolve(
    run_id="ticket_8842",
    customer_id="cust_acme",
    outcome="resolved",
    revenue_unit_cents=50,       # $0.50 per resolved ticket
)
```

## Provider Support

| Provider | Wrapper | Streaming | What Gets Extracted |
|---|---|---|---|
| **Anthropic** | `wrap_anthropic` | Automatic | Tokens, thinking/reasoning, cache (read + write by TTL), web search requests, tool calls, latency |
| **OpenAI** | `wrap_openai` | `stream_options={"include_usage": True}` | Tokens, reasoning, cached, audio in/out, text in/out, prediction tokens, service tier, tool calls, latency |
| **Google Gemini** | `wrap_gemini` | Automatic | Tokens, thinking, cached, audio/image/video per-modality, web search (grounding), tool calls, latency |
| **OpenAI-compatible** | `wrap_openai` + `provider="openai_compatible"` | Same as OpenAI | Same as OpenAI |
| **AWS Bedrock** | `wrap_bedrock` | Automatic | Tokens, cache (read + write), tool calls, latency |

## Configuration

| Environment Variable | Required | Default |
|---|---|---|
| `TOLLGATE_API_KEY` | Yes | — |
| `TOLLGATE_BASE_URL` | No | `https://tollgateai.vercel.app` |

Or pass them directly:

```python
tollgate = create_tollgate_client(
    api_key="tg_live_xxx",
    base_url="https://tollgateai.vercel.app",
    timeout=10.0,       # per-request timeout in seconds (default 10)
    max_retries=2,      # retries on 5xx/429/network (default 2)
)
```
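
As a rough illustration of what retrying on 5xx/429/network failures can look like, here is a minimal sketch of a retry loop. `is_retryable`, `send_with_retries`, and the exponential backoff are hypothetical and not part of the tollgateai API; the client's actual backoff schedule is not documented here.

```python
import time
from typing import Callable, List, Optional


def is_retryable(status: Optional[int]) -> bool:
    """Treat network failures (status None), 429, and 5xx as transient."""
    return status is None or status == 429 or 500 <= status <= 599


def send_with_retries(
    send: Callable[[], Optional[int]],
    max_retries: int = 2,
    base_delay: float = 0.0,
) -> Optional[int]:
    """Call `send` until it returns a non-retryable status or retries run out.

    `send` returns an HTTP status code, or None on a network failure.
    """
    status = send()
    for attempt in range(max_retries):
        if not is_retryable(status):
            break
        time.sleep(base_delay * (2 ** attempt))  # assumed backoff, for illustration
        status = send()
    return status


# Usage: a fake transport that fails twice, then succeeds.
responses = iter([503, 429, 200])
calls: List[int] = []


def fake_send() -> int:
    status = next(responses)
    calls.append(status)
    return status


final_status = send_with_retries(fake_send, max_retries=2)
print(final_status, calls)  # 200 [503, 429, 200]
```

With `max_retries=2` the request is attempted at most three times in this sketch: the original call plus two retries.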

---

## Auto-Instrumentation

Wrap your provider client once. Every `create`, `converse`, or `generate_content` call then reports usage on a background daemon thread without blocking your code. Failures go to `on_error` (default: `logger.warning`) and never break your LLM call.
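
To make the non-blocking behavior concrete, the following is a minimal standard-library sketch of fire-and-forget reporting with an error callback. `report_in_background` and the `post` callable are hypothetical names for illustration; this is not the library's internal code.

```python
import logging
import threading
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("tracking_sketch")


def report_in_background(
    post: Callable[[Dict[str, Any]], None],
    event: Dict[str, Any],
    on_error: Optional[Callable[[Exception], None]] = None,
) -> threading.Thread:
    """Run `post(event)` on a daemon thread and route failures to `on_error`."""
    handler = on_error or (lambda exc: logger.warning("tracking failed: %s", exc))

    def worker() -> None:
        try:
            post(event)
        except Exception as exc:  # tracking errors must not reach the caller
            handler(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


# Usage: a failing transport does not raise into the calling code.
errors: List[Exception] = []


def broken_post(event: Dict[str, Any]) -> None:
    raise ConnectionError("network down")


t = report_in_background(broken_post, {"tokensIn": 10}, on_error=errors.append)
t.join(timeout=2)
print(len(errors), type(errors[0]).__name__)  # 1 ConnectionError
```

Because the thread is a daemon, it does not keep the process alive at exit, so a short-lived script in this sketch could exit before a pending report is sent.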

### Anthropic

```python
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
)
```

### OpenAI

```python
from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
openai = wrap_openai(OpenAI(), tollgate, customer_id="cust_acme")

openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```

### Google Gemini

```python
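import os

import google.generativeai as genai
from tollgate import create_tollgate_client, wrap_gemini

# The environment variable name is an example; use wherever you store your Gemini key.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])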
tollgate = create_tollgate_client()
model = wrap_gemini(
    genai.GenerativeModel("gemini-2.0-flash"),
    tollgate,
    customer_id="cust_acme",
)

response = model.generate_content("Explain quantum computing")
```

### OpenAI-Compatible Gateways

Point the OpenAI SDK at any compatible endpoint and pass `provider="openai_compatible"`:

```python
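import os

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
groq = wrap_openai(
    # The environment variable name is an example; use wherever you store your Groq key.
    OpenAI(api_key=os.environ["GROQ_API_KEY"], base_url="https://api.groq.com/openai/v1"),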
    tollgate,
    customer_id="cust_acme",
    provider="openai_compatible",
)

groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)
```

### AWS Bedrock

```python
import boto3
from tollgate import create_tollgate_client, wrap_bedrock

tollgate = create_tollgate_client()
bedrock = wrap_bedrock(
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    tollgate,
    customer_id="cust_acme",
)

bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
```

### Streaming

Streamed responses are tracked by the same wrappers. Iterate the stream as usual; usage and latency are reported when the stream ends.

**OpenAI / compatible** requires `stream_options={"include_usage": True}`, because without it the stream carries no usage data to report. **Anthropic**, **Gemini**, and **Bedrock** need no extra flags.

```python
stream = openai.chat.completions.create(
    model="gpt-4o",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    pass  # render to UI
# Usage + latency reported automatically when stream ends.
```
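
One way to picture reporting at the end of a stream is a generator that passes chunks through untouched and reports once the stream is exhausted. This is a sketch under simplifying assumptions: chunks are plain dicts rather than SDK objects, and `track_stream` and `report` are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


def track_stream(
    stream: Iterable[Dict[str, Any]],
    report: Callable[[Dict[str, Any]], None],
) -> Iterator[Dict[str, Any]]:
    """Yield chunks unchanged; report usage and latency once the stream ends."""
    started = time.monotonic()
    usage: Optional[Dict[str, Any]] = None
    for chunk in stream:
        if chunk.get("usage"):  # with include_usage, the final chunk carries usage
            usage = chunk["usage"]
        yield chunk
    latency_ms = int((time.monotonic() - started) * 1000)
    if usage is not None:
        report({"usage": usage, "latencyMs": latency_ms})


# Usage with a fake stream of three chunks.
fake_chunks = [
    {"delta": "Hel", "usage": None},
    {"delta": "lo", "usage": None},
    {"delta": "", "usage": {"prompt_tokens": 8, "completion_tokens": 2}},
]
reported: List[Dict[str, Any]] = []
text = "".join(c["delta"] for c in track_stream(fake_chunks, reported.append))
print(text, reported[0]["usage"])
```

In this sketch, a consumer that stops iterating early never reaches the reporting step; a fuller version would need a `finally` block or similar.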

---

## What Gets Tracked

Every auto-instrumented call captures these fields from the provider response:

| Field | Providers | Description |
|---|---|---|
| `tokensIn` | All | Input tokens consumed |
| `tokensOut` | All | Output tokens generated |
| `reasoningTokens` | OpenAI, Anthropic, Gemini | Reasoning/thinking tokens (billed at reasoning rate) |
| `cachedTokens` | All | Prompt cache read tokens (reduced rate) |
| `cacheWrite5mTokens` | Anthropic, Bedrock | 5-min TTL cache creation tokens |
| `cacheWrite1hTokens` | Anthropic | 1-hour TTL cache creation tokens |
| `audioTokensIn` | OpenAI | Audio input tokens (GPT-4o audio / Realtime) |
| `audioTokensOut` | OpenAI, Gemini | Audio output tokens |
| `imageTokensIn` | Gemini | Image/vision input tokens |
| `imageTokensOut` | Gemini | Image generation output tokens |
| `videoTokensIn` | Gemini | Video input tokens |
| `textTokensIn` | OpenAI, Gemini | Text-only input tokens (modality split) |
| `textTokensOut` | OpenAI, Gemini | Text-only output tokens |
| `webSearchRequests` | Anthropic, Gemini | Web search requests (server tools / grounding) |
| `acceptedPredictionTokens` | OpenAI | Predicted Outputs: accepted tokens |
| `rejectedPredictionTokens` | OpenAI | Predicted Outputs: rejected tokens (waste) |
| `serviceTier` | OpenAI | Service tier used (`default`, `flex`, `priority`) |
| `latencyMs` | All | SDK-measured request duration in milliseconds |
| `toolCalls` | All | Number of tool calls in the response |
| `model` | All | Model identifier as reported by the provider |

Cost is computed **server-side** from token counts and a rate card that auto-syncs daily from the LiteLLM registry (1,500+ models). Rate cards include per-token pricing for text, audio, image, video, cache, reasoning, and web search. Unknown models are priced at $0 and flagged in logs.
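
As an illustration of how a rate card turns token counts into cost, here is a minimal sketch. The rates, the `example-model` name, and the `cost_cents` helper are invented for the example; real rates come from the synced rate card, and the server's actual computation may differ.

```python
from typing import Any, Dict

# Hypothetical rate card: cents per 1M tokens, keyed by event field.
RATE_CARD: Dict[str, Dict[str, float]] = {
    "example-model": {
        "tokensIn": 300.0,
        "tokensOut": 1500.0,
        "cachedTokens": 30.0,
        "reasoningTokens": 1500.0,
    },
}


def cost_cents(event: Dict[str, Any]) -> float:
    """Sum (tokens * rate) over every priced field; unknown models cost 0."""
    rates = RATE_CARD.get(str(event.get("model")))
    if rates is None:
        return 0.0
    total = 0.0
    for field, rate in rates.items():
        total += int(event.get(field, 0) or 0) * rate / 1_000_000
    return total


event = {
    "model": "example-model",
    "tokensIn": 2000,
    "tokensOut": 500,
    "cachedTokens": 10000,
}
# 2000 * 300/1M + 500 * 1500/1M + 10000 * 30/1M = 0.6 + 0.75 + 0.3 cents
print(round(cost_cents(event), 4))
print(cost_cents({"model": "not-in-rate-card", "tokensIn": 100}))
```

Each field is priced at its own rate, which is why the token buckets in the table above must not overlap.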

---

## Outcome-Based Pricing

Under per-resolution pricing, only a **resolved** run earns revenue. An escalated or failed run earns $0, but its provider cost still counts against your margin.

```python
run_id = "ticket_8842"
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id=run_id,
)

# ... multiple LLM calls within this run ...

tollgate.resolve(
    run_id=run_id,
    customer_id="cust_acme",
    outcome="resolved",        # "resolved" | "escalated" | "failed"
    revenue_unit_cents=50,
)
```
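
The margin arithmetic for a single run can be sketched as follows. The per-call costs are made-up numbers and `run_margin` is a hypothetical helper, not an SDK function.

```python
from typing import List, Tuple


def run_margin(
    outcome: str,
    revenue_unit_cents: float,
    call_costs_cents: List[float],
) -> Tuple[float, float, float]:
    """Return (revenue, cost, margin) in cents under per-resolution pricing."""
    revenue = revenue_unit_cents if outcome == "resolved" else 0.0
    cost = sum(call_costs_cents)
    return revenue, cost, revenue - cost


costs = [4.0, 6.5, 2.0]  # made-up provider costs for three calls in one run
print(run_margin("resolved", 50.0, costs))   # (50.0, 12.5, 37.5)
print(run_margin("escalated", 50.0, costs))  # (0.0, 12.5, -12.5)
```

The same three calls yield a 37.5-cent margin when the ticket resolves and a 12.5-cent loss when it escalates.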

For simple per-call billing, pass `revenue_unit_cents` in the wrap options and skip `resolve()`.

---

## External Tool Costs

Report costs from external services (image generation, code sandboxes, search APIs) alongside LLM calls:

```python
tollgate.track({
    "customerId": "cust_acme",
    "runId": "ticket_8842",
    "provider": "openai",
    "model": "gpt-4o",
    "tokensIn": 500,
    "tokensOut": 200,
    "externalCostCents": 4.0,     # $0.04 for the DALL-E call
    "idempotencyKey": "ticket_8842#step_2",
})
```
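
The `idempotencyKey` in the payload above is what lets a repeated report be recognized as a duplicate. The following in-memory sketch shows the idea; `EventStore` is a hypothetical stand-in, not the server's implementation.

```python
from typing import Any, Dict


class EventStore:
    """In-memory stand-in for a server that deduplicates on idempotencyKey."""

    def __init__(self) -> None:
        self._events: Dict[str, Dict[str, Any]] = {}

    def track(self, event: Dict[str, Any]) -> bool:
        """Store the event; return False if its key was already seen."""
        key = event["idempotencyKey"]
        if key in self._events:
            return False
        self._events[key] = event
        return True

    def total_external_cost_cents(self) -> float:
        return sum(e.get("externalCostCents", 0.0) for e in self._events.values())


store = EventStore()
event = {"idempotencyKey": "ticket_8842#step_2", "externalCostCents": 4.0}
first = store.track(event)
second = store.track(event)  # for example, a retry after a timeout
print(first, second, store.total_external_cost_cents())  # True False 4.0
```

Choosing a key that is stable per step, such as the run ID plus a step label, keeps a retried report from counting the same cost twice.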

---

## Customer & Plan Setup

Create customers and assign plans before sending usage so plan-priced revenue is recognized from the first event. The call is idempotent, so it is safe to repeat.

```python
tollgate.upsert_customer(
    "cust_acme",
    name="Acme Corp",
    plan={
        "name": "Pro Plan",
        "pricingModel": "usage_based",   # per_unit | per_resolution | usage_based | per_seat | flat | hybrid
        "unitRevenueCents": 10,
    },
)
```

---

## API Reference

### Exports

```python
# Client
create_tollgate_client(api_key?, base_url?, timeout?, max_retries?)  # -> TollgateClient
TollgateError                    # Exception with status & body

# Auto-instrumentation wrappers
wrap_anthropic(client, tollgate, customer_id, **kwargs)   # -> instrumented Anthropic client
wrap_openai(client, tollgate, customer_id, **kwargs)      # -> instrumented OpenAI / compatible client
wrap_bedrock(client, tollgate, customer_id, **kwargs)     # -> instrumented Bedrock client
wrap_gemini(model, tollgate, customer_id, **kwargs)       # -> instrumented Gemini model

# Low-level event builders (for manual track payloads)
anthropic_event_from(msg, customer_id, **kwargs)          # -> dict | None
openai_event_from(completion, customer_id, **kwargs)      # -> dict | None
bedrock_event_from(usage, model, customer_id, **kwargs)   # -> dict | None
gemini_event_from(response, customer_id, **kwargs)        # -> dict | None
```

### TollgateClient

| Method | Description |
|---|---|
| `track(event)` | Report a single usage event. Idempotent on `idempotencyKey`. |
| `resolve(run_id, customer_id, outcome, ...)` | Close a run with an outcome. Books revenue only when `outcome` is `"resolved"`. |
| `upsert_customer(customer_id, ...)` | Create or update a customer and optionally assign a plan. |

### Wrapper Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `customer_id` | `str` | Yes | Your end customer's stable identifier |
| `agent_id` | `str` | No | Agent or workflow identifier |
| `run_id` | `str \| Callable` | No | Logical run ID (defaults to provider response ID) |
| `provider` | `str` | No | Override the reported provider |
| `revenue_unit_cents` | `int \| Callable` | No | Revenue per call in cents |
| `provider_cost_cents` | `float \| Callable` | No | Exact cost override (skips rate card) |
| `on_error` | `Callable` | No | Error handler for background tracking |

---

## How It Works

1. **Proxy wrappers** intercept provider calls without modifying the request or response.
2. After the provider responds, the wrapper extracts token counts (by modality), tool calls, service tier, and latency from the response.
3. A `POST /api/track` fires **on a background daemon thread** with automatic retries on transient failures.
4. The server computes cost from tokens via rate cards (text, audio, image, video, cache, reasoning, web search), joins it with plan-configured revenue, and updates real-time margin rollups.
5. Events are **idempotent** on `idempotencyKey` (auto-set to the provider response ID).
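
The steps above can be sketched as a small pass-through wrapper. This is an illustration only: `TrackedCreate` is a hypothetical class, the response is a plain dict rather than an SDK object, and the report is made synchronously for brevity instead of on a background thread.

```python
import time
from typing import Any, Callable, Dict, List


class TrackedCreate:
    """Wrap a `create`-style callable; pass the call through and record usage."""

    def __init__(
        self,
        create: Callable[..., Dict[str, Any]],
        report: Callable[[Dict[str, Any]], None],
        customer_id: str,
    ) -> None:
        self._create = create
        self._report = report
        self._customer_id = customer_id

    def __call__(self, *args: Any, **kwargs: Any) -> Dict[str, Any]:
        started = time.monotonic()
        response = self._create(*args, **kwargs)  # request is passed through as-is
        latency_ms = int((time.monotonic() - started) * 1000)
        usage = response.get("usage", {})
        self._report({
            "customerId": self._customer_id,
            "model": response.get("model"),
            "tokensIn": usage.get("input_tokens", 0),
            "tokensOut": usage.get("output_tokens", 0),
            "latencyMs": latency_ms,
            "idempotencyKey": response.get("id"),  # provider response ID
        })
        return response  # response is returned untouched


# Usage with a fake provider call.
def fake_create(**kwargs: Any) -> Dict[str, Any]:
    return {
        "id": "msg_123",
        "model": kwargs["model"],
        "usage": {"input_tokens": 12, "output_tokens": 5},
    }


events: List[Dict[str, Any]] = []
create = TrackedCreate(fake_create, events.append, customer_id="cust_acme")
response = create(model="example-model", messages=[])
print(events[0]["tokensIn"], events[0]["idempotencyKey"])  # 12 msg_123
```

Note that the reported event contains counts and identifiers only; nothing from `messages` is copied into it.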

## Privacy & Security

- **No prompt content is ever sent.** Only token counts, model name, and metadata.
- Events are deduplicated server-side.
- Background tracking never raises into your application code.

---

## What's New in v0.6.0

- **Fix: Anthropic thinking token extraction** — `output_tokens_details.thinking_tokens` is now extracted and costed at the reasoning rate instead of the output rate. Previously, thinking tokens from extended thinking (Sonnet 4.x, Opus 4.x) were invisible to cost computation.
- **Fix: OpenAI double-counting** — `completion_tokens` includes reasoning and audio sub-totals; these are now subtracted from `tokensOut` so each token is costed at exactly one rate. Previously, reasoning tokens were billed at both the output rate and the reasoning rate.
- **Fix: OpenAI input double-counting** — `prompt_tokens` includes cached and audio sub-totals; these are now subtracted from `tokensIn`. Previously, cached tokens were billed at both the full input rate and the cached rate.
- **Fix: Multimodal-only events** — audio, image, video, and web search events now trigger rate-card lookup even when text token counts are zero.
- `reasoningTokens` is now extracted from **all three** providers: OpenAI, Anthropic, and Gemini.
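
To illustrate the two double-counting fixes, the sketch below splits an OpenAI-style `usage` payload into non-overlapping buckets. The input field names follow the OpenAI usage object; `split_openai_usage` is a hypothetical helper, and the SDK's exact mapping may differ.

```python
from typing import Any, Dict


def split_openai_usage(usage: Dict[str, Any]) -> Dict[str, int]:
    """Subtract sub-totals so each token lands in exactly one bucket."""
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}

    cached = prompt_details.get("cached_tokens", 0) or 0
    audio_in = prompt_details.get("audio_tokens", 0) or 0
    reasoning = completion_details.get("reasoning_tokens", 0) or 0
    audio_out = completion_details.get("audio_tokens", 0) or 0

    return {
        "tokensIn": max(usage.get("prompt_tokens", 0) - cached - audio_in, 0),
        "tokensOut": max(usage.get("completion_tokens", 0) - reasoning - audio_out, 0),
        "cachedTokens": cached,
        "audioTokensIn": audio_in,
        "reasoningTokens": reasoning,
        "audioTokensOut": audio_out,
    }


usage = {
    "prompt_tokens": 1000,
    "completion_tokens": 600,
    "prompt_tokens_details": {"cached_tokens": 400, "audio_tokens": 0},
    "completion_tokens_details": {"reasoning_tokens": 250, "audio_tokens": 0},
}
buckets = split_openai_usage(usage)
print(buckets["tokensIn"], buckets["tokensOut"])  # 600 350
```

The buckets add up to the provider's own total (1,600 tokens here), so no token is priced at two rates.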

### v0.5.0

- Google Gemini / Vertex AI support (`wrap_gemini`) with full multimodal extraction
- Audio token tracking (OpenAI GPT-4o audio / Realtime API)
- Image & video token tracking (Gemini per-modality breakdowns)
- Web search request tracking (Anthropic `server_tool_use`, Gemini grounding)
- Latency measurement on all wrappers (SDK-measured `latencyMs`)
- OpenAI Predicted Outputs (`acceptedPredictionTokens` / `rejectedPredictionTokens`)
- Service tier tracking (OpenAI `flex` / `priority`, Anthropic `priority`)
- Text modality split for accurate cost attribution in mixed-modal requests
- Expanded rate card sync: audio, image, video, and web search rates from LiteLLM

---

## License

MIT. See the `LICENSE` file.
