Metadata-Version: 2.4
Name: tollgateai
Version: 0.8.0
Summary: Track LLM model usage and compute live gross margin with Tollgate.
Project-URL: Homepage, https://tollgateai.vercel.app
Author: Tollgate
License: MIT
License-File: LICENSE
Keywords: anthropic,bedrock,cost,gemini,llm,margin,observability,openai,tokens,tollgate
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.8
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://www.tollgateai.dev/logo.png" alt="Tollgate" width="120" />
</p>

<h1 align="center">tollgateai</h1>

<p align="center">
  <strong>Real-time gross-margin observability for AI-powered products.</strong><br />
  Track every LLM call's cost, attribute it to a customer, and know whether you're making money — before the invoice goes out.
</p>

<p align="center">
  <a href="https://pypi.org/project/tollgateai/"><img src="https://img.shields.io/pypi/v/tollgateai?color=blue&label=pypi" alt="pypi" /></a>
  <a href="https://pypi.org/project/tollgateai/"><img src="https://img.shields.io/pypi/dm/tollgateai?color=green" alt="downloads" /></a>
  <img src="https://img.shields.io/badge/python-%3E%3D3.8-brightgreen" alt="python" />
  <img src="https://img.shields.io/badge/dependencies-0-brightgreen" alt="zero deps" />
  <img src="https://img.shields.io/badge/license-MIT-blue" alt="license" />
</p>

<p align="center">
  <a href="https://www.tollgateai.dev">Dashboard</a> &middot;
  <a href="https://www.npmjs.com/package/@tollgateai/sdk">TypeScript SDK</a> &middot;
  <a href="#quick-start">Quick Start</a> &middot;
  <a href="#api-reference">API Reference</a>
</p>

---

## Why Tollgate?

AI products bill customers on plans (per ticket, per seat, usage-based) but pay providers per token. **Tollgate joins the two in real time** — giving you per-customer, per-agent, per-run gross margin the moment each LLM call completes.

- **2-line integration** — wrap your provider client once; every call is tracked automatically.
- **Zero dependencies** — uses only `urllib` and `threading` from the Python standard library.
- **Non-blocking** — usage reporting fires on a daemon thread. Failures never raise into your application code.
- **Privacy-first** — no prompt content is ever transmitted. Only token counts, model identifiers, and metadata.
- **Universal coverage** — Anthropic, OpenAI, Google Gemini, AWS Bedrock, and every OpenAI-compatible gateway.

```
┌──────────────┐    ┌───────────────┐    ┌────────────────┐
│  Your App    │───▶│ LLM Provider  │───▶│   Provider     │
│  (SDK wrap)  │◀───│ (Anthropic,   │◀───│   Response     │
│              │    │  OpenAI, …)   │    │  (tokens, id)  │
└──────┬───────┘    └───────────────┘    └────────────────┘
       │
       │  POST /api/track (background daemon thread)
       ▼
┌─────────────────────────────────────────────────────┐
│  Tollgate Server                                    │
│                                                     │
│  ┌─────────────┐ ┌───────────┐ ┌─────────────────┐ │
│  │ Rate Card   │ │ Plan      │ │ Margin Rollups  │ │
│  │ (1,500+     │ │ Revenue   │ │ (per customer,  │ │
│  │  models)    │ │ Config    │ │  agent, run)    │ │
│  └─────────────┘ └───────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────┘
```

---

## Installation

```bash
pip install tollgateai
```

**Requirements:** Python 3.8+ &middot; Zero dependencies &middot; Standard library only (`urllib`, `threading`)

---

## Quick Start

```python
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()          # reads TOLLGATE_API_KEY from env
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

# Every call is tracked automatically — tokens, cost, latency, tool calls.
msg = anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Resolve this billing dispute..."}],
)

# Close the run and book revenue.
tollgate.resolve(
    run_id="ticket_8842",
    customer_id="cust_acme",
    outcome="resolved",
    revenue_unit_cents=50,       # $0.50 per resolved ticket
)
```

---

## Provider Support

| Provider | Wrapper | Streaming | Extracted Fields |
|---|---|---|---|
| **Anthropic** | `wrap_anthropic` | Automatic | Input/output tokens, cache read/write, web search requests, tool calls, latency |
| **OpenAI** | `wrap_openai` | Requires `stream_options={"include_usage": True}` | Input/output tokens, reasoning, cached, audio in/out, text in/out, prediction tokens, service tier, tool calls, latency |
| **Google Gemini** | `wrap_gemini` | Automatic | Input/output tokens, thinking, cached, audio/image/video per-modality, web search (grounding), tool calls, latency |
| **OpenAI-compatible** | `wrap_openai` + `provider="openai_compatible"` | Same as OpenAI | Same as OpenAI + gateway-reported cost (when available) |
| **AWS Bedrock** | `wrap_bedrock` | Automatic | Input/output tokens, cache read/write (per-TTL split), tool calls, latency |

---

## Configuration

### Environment Variables

| Variable | Required | Default | Description |
|---|---|---|---|
| `TOLLGATE_API_KEY` | Yes | — | Your account API key (`tg_live_…`) |
| `TOLLGATE_BASE_URL` | No | `https://www.tollgateai.dev` | Self-hosted deployment URL |

### Programmatic Configuration

```python
tollgate = create_tollgate_client(
    api_key="tg_live_xxx",
    base_url="https://www.tollgateai.dev",
    timeout=10.0,       # per-request timeout in seconds (default 10)
    max_retries=2,      # retries on 5xx/429/network (default 2)
)
```

---

## Auto-Instrumentation

Wrap your provider client once. Every `create` / `generate_content` / `converse` call reports usage on a background daemon thread — non-blocking, fire-and-forget. Failures go to `on_error` (default: `logger.warning`) and never raise into your application code.
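
The fire-and-forget behavior described above can be pictured with a minimal standard-library sketch. The helper `report_in_background` and the stand-in `failing_send` transport are hypothetical names, not part of the Tollgate API; the sketch only shows a failure being routed to an `on_error`-style callback instead of reaching the caller.

```python
import threading


def report_in_background(send, event, on_error):
    """Run send(event) on a daemon thread; route any failure to on_error."""
    def _worker():
        try:
            send(event)
        except Exception as err:  # never propagate into the calling code
            on_error(err)

    thread = threading.Thread(target=_worker, daemon=True)
    thread.start()
    return thread  # returned only so the demo below can wait for it


# Demo with a hypothetical transport that always fails.
errors = []


def failing_send(event):
    raise ConnectionError("tracking endpoint unreachable")


worker = report_in_background(failing_send, {"tokensIn": 12}, errors.append)
worker.join()  # a real caller would not wait for the thread
print(type(errors[0]).__name__)  # ConnectionError
```

Because the thread is a daemon, it does not keep the interpreter alive at shutdown, which is one plausible reason a tracking SDK would choose it over a regular thread.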

### Anthropic

```python
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic

tollgate = create_tollgate_client()
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id="ticket_8842",
)

anthropic.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarize this ticket..."}],
)
```

### OpenAI

```python
from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
openai = wrap_openai(OpenAI(), tollgate, customer_id="cust_acme")

openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```

### Google Gemini

```python
import os

import google.generativeai as genai
from tollgate import create_tollgate_client, wrap_gemini

genai.configure(api_key=os.environ["GEMINI_API_KEY"])  # key read from the environment
tollgate = create_tollgate_client()
model = wrap_gemini(
    genai.GenerativeModel("gemini-2.0-flash"),
    tollgate,
    customer_id="cust_acme",
)

response = model.generate_content("Explain quantum computing")
```

### OpenAI-Compatible Gateways

Works with any OpenAI-compatible endpoint — OpenRouter, Groq, Together, Nebius, Vercel AI Gateway, local vLLM, and more.

```python
import os

from openai import OpenAI
from tollgate import create_tollgate_client, wrap_openai

tollgate = create_tollgate_client()
groq = wrap_openai(
    # Gateway key read from the environment
    OpenAI(api_key=os.environ["GROQ_KEY"], base_url="https://api.groq.com/openai/v1"),
    tollgate,
    customer_id="cust_acme",
    provider="openai_compatible",
)

groq.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello"}],
)
```

> When a gateway returns cost inline (e.g. OpenRouter's `usage.cost`), the SDK captures it automatically as `providerCostCents`. The server uses it verbatim, bypassing the rate card. Gateways that don't return cost fall through to rate-card pricing. An explicit `provider_cost_cents` in the wrapper options always takes precedence.
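
The precedence described in the note above can be written out as a small decision function. This is an illustrative sketch only: `pick_cost_source` is a hypothetical helper, and the real resolution is split between the SDK and the server.

```python
from typing import Optional


def pick_cost_source(explicit_cents: Optional[float],
                     gateway_cents: Optional[float]) -> str:
    """Return which cost source wins, following the stated precedence."""
    if explicit_cents is not None:
        return "explicit"   # provider_cost_cents passed in the wrapper options
    if gateway_cents is not None:
        return "gateway"    # inline cost reported by the gateway
    return "rate_card"      # fall through to server-side rate-card pricing


print(pick_cost_source(1.5, 0.9))    # explicit
print(pick_cost_source(None, 0.9))   # gateway
print(pick_cost_source(None, None))  # rate_card
```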

### AWS Bedrock

```python
import boto3
from tollgate import create_tollgate_client, wrap_bedrock

tollgate = create_tollgate_client()
bedrock = wrap_bedrock(
    boto3.client("bedrock-runtime", region_name="us-east-1"),
    tollgate,
    customer_id="cust_acme",
)

bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",
    messages=[{"role": "user", "content": [{"text": "Hello"}]}],
)
```

### Streaming

Streaming is captured automatically. Iterate the stream as usual — usage and latency are reported when the stream ends.

**OpenAI / compatible** requires `stream_options={"include_usage": True}`. **Anthropic**, **Gemini**, and **Bedrock** need no extra flags.

```python
stream = openai.chat.completions.create(
    model="gpt-4o",
    stream=True,
    stream_options={"include_usage": True},
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    pass  # render to UI
# Usage + latency reported automatically when stream ends.
```
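
One way a wrapper could report usage only once the stream ends is to pass chunks through a generator. The sketch below uses plain dicts in place of the OpenAI SDK's chunk objects, and `track_stream` is a hypothetical helper rather than the SDK's actual implementation. It relies on the fact that, with `include_usage` enabled, the usage totals arrive on the final chunk.

```python
def track_stream(stream, report):
    """Yield chunks unchanged; call report(usage) after the last chunk."""
    usage = None
    for chunk in stream:
        if chunk.get("usage"):
            usage = chunk["usage"]  # only the final chunk carries usage
        yield chunk
    if usage is not None:
        report(usage)


# Demo with stand-in chunks shaped like a streamed chat completion.
chunks = [
    {"choices": [{"delta": {"content": "Hel"}}], "usage": None},
    {"choices": [{"delta": {"content": "lo"}}], "usage": None},
    {"choices": [], "usage": {"prompt_tokens": 9, "completion_tokens": 2}},
]
reported = []
seen = list(track_stream(iter(chunks), reported.append))
print(len(seen), reported)
```

Without `stream_options={"include_usage": True}` no chunk would carry a `usage` value, so in this sketch nothing would be reported.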

---

## Tracked Fields

Every auto-instrumented call captures these fields from the provider response:

| Field | Providers | Description |
|---|---|---|
| `tokensIn` | All | Input tokens (deduplicated — excludes cached/audio for OpenAI; excludes cached/audio/image/video for Gemini) |
| `tokensOut` | All | Output tokens (deduplicated — excludes reasoning/audio for OpenAI; excludes audio/image for Gemini) |
| `reasoningTokens` | OpenAI, Gemini | Reasoning/thinking tokens (billed at reasoning rate) |
| `cachedTokens` | All | Prompt cache read tokens (reduced rate) |
| `cacheWrite5mTokens` | Anthropic, Bedrock | Cache creation tokens (5-minute TTL) |
| `cacheWrite1hTokens` | Bedrock | Cache creation tokens (1-hour TTL) |
| `audioTokensIn` / `audioTokensOut` | OpenAI, Gemini | Audio modality tokens (GPT-4o audio, Gemini multimodal) |
| `imageTokensIn` / `imageTokensOut` | Gemini | Image/vision input and generation output tokens |
| `videoTokensIn` | Gemini | Video input tokens |
| `textTokensIn` / `textTokensOut` | OpenAI, Gemini | Text-only modality tokens |
| `webSearchRequests` | Anthropic, Gemini | Web search requests (server tools / grounding) |
| `acceptedPredictionTokens` | OpenAI | Predicted Outputs: accepted tokens |
| `rejectedPredictionTokens` | OpenAI | Predicted Outputs: rejected (waste) tokens |
| `serviceTier` | OpenAI | Service tier (`default`, `flex`, `priority`) |
| `latencyMs` | All | SDK-measured request duration in milliseconds |
| `toolCalls` | All | Number of tool calls in the response |
| `providerCostCents` | OpenAI-compatible | Gateway-reported cost — used verbatim, bypasses rate card |
| `model` | All | Model identifier as reported by the provider |

Cost is computed **server-side** from token counts and a rate card that auto-syncs daily from the LiteLLM registry (1,500+ models). Rate cards include per-token pricing for every modality, cache tier, reasoning, and web search. Unknown models are priced at $0 and flagged in logs.
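
As a rough illustration of rate-card pricing, the sketch below costs one event from its token counts. The rates, the model name `example-model`, and the function `cost_cents` are made up for the example; the real rate card lives on the server and covers more categories.

```python
# Hypothetical rate card, in cents per one million tokens.
RATE_CARD = {
    "example-model": {"input": 300.0, "output": 1500.0, "cached": 30.0},
}


def cost_cents(model, tokens_in, tokens_out, cached_tokens=0):
    """Price one event; unknown models cost 0, mirroring the rule above."""
    rates = RATE_CARD.get(model)
    if rates is None:
        return 0.0
    total = (
        tokens_in * rates["input"]
        + tokens_out * rates["output"]
        + cached_tokens * rates["cached"]
    )
    return total / 1_000_000


print(cost_cents("example-model", 1000, 500, cached_tokens=2000))  # 1.11
print(cost_cents("unlisted-model", 1000, 500))                     # 0.0
```

Each token category is multiplied by its own rate, which is why the deduplicated counts in the table above matter: a token counted in two categories would be billed twice.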

---

## Provider Field Coverage

<details>
<summary><strong>Anthropic</strong> — Messages API</summary>

| Anthropic API Field | SDK Field | Notes |
|---|---|---|
| `usage.input_tokens` | `tokensIn` | Input tokens (excludes cached) |
| `usage.output_tokens` | `tokensOut` | Output tokens (includes reasoning — billed at output rate) |
| `usage.cache_read_input_tokens` | `cachedTokens` | Prompt cache read tokens |
| `usage.cache_creation_input_tokens` | `cacheWrite5mTokens` | Prompt cache creation tokens |
| `usage.server_tool_use.web_search_requests` | `webSearchRequests` | Web search server tool requests |
| `response.content[]` (type `tool_use`) | `toolCalls` | Count of tool-use content blocks |
| *(SDK-measured)* | `latencyMs` | Request duration |

Anthropic bills reasoning tokens at the output rate. The SDK reports the full `output_tokens` count; the server-side rate card applies the matching output rate.

In streaming mode, `message_start` carries input/cache counts and `message_delta` carries the output count. The SDK accumulates both automatically.
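
A minimal sketch of that accumulation, using plain dicts in place of the Anthropic SDK's typed stream events. `accumulate_usage` is a hypothetical helper, not the SDK's internal function.

```python
def accumulate_usage(events):
    """Combine usage from message_start and message_delta stream events."""
    totals = {"tokensIn": 0, "cachedTokens": 0, "cacheWrite5mTokens": 0, "tokensOut": 0}
    for event in events:
        if event["type"] == "message_start":
            usage = event["message"]["usage"]
            totals["tokensIn"] = usage.get("input_tokens", 0)
            totals["cachedTokens"] = usage.get("cache_read_input_tokens", 0)
            totals["cacheWrite5mTokens"] = usage.get("cache_creation_input_tokens", 0)
        elif event["type"] == "message_delta":
            # output_tokens here is cumulative, so the latest value replaces the old one
            totals["tokensOut"] = event["usage"].get("output_tokens", 0)
    return totals


events = [
    {"type": "message_start",
     "message": {"usage": {"input_tokens": 25, "cache_read_input_tokens": 100}}},
    {"type": "content_block_delta"},
    {"type": "message_delta", "usage": {"output_tokens": 42}},
]
print(accumulate_usage(events))
```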

</details>

<details>
<summary><strong>OpenAI</strong> — Chat Completions API</summary>

| OpenAI API Field | SDK Field | Notes |
|---|---|---|
| `usage.prompt_tokens` | `tokensIn` | **Minus** cached and audio tokens to prevent double-billing |
| `usage.completion_tokens` | `tokensOut` | **Minus** reasoning and audio tokens to prevent double-billing |
| `usage.completion_tokens_details.reasoning_tokens` | `reasoningTokens` | Reasoning/thinking tokens |
| `usage.prompt_tokens_details.cached_tokens` | `cachedTokens` | Prompt cache read tokens |
| `usage.prompt_tokens_details.audio_tokens` | `audioTokensIn` | Audio input tokens |
| `usage.completion_tokens_details.audio_tokens` | `audioTokensOut` | Audio output tokens |
| `usage.prompt_tokens_details.text_tokens` | `textTokensIn` | Text modality input tokens |
| `usage.completion_tokens_details.text_tokens` | `textTokensOut` | Text modality output tokens |
| `usage.completion_tokens_details.accepted_prediction_tokens` | `acceptedPredictionTokens` | Predicted Outputs: accepted |
| `usage.completion_tokens_details.rejected_prediction_tokens` | `rejectedPredictionTokens` | Predicted Outputs: rejected |
| `service_tier` | `serviceTier` | Service tier used |
| `choices[].message.tool_calls` | `toolCalls` | Tool call count |
| *(SDK-measured)* | `latencyMs` | Request duration |

OpenAI's `prompt_tokens` and `completion_tokens` are totals that include sub-category tokens. The SDK subtracts each sub-category so every token is costed at exactly one rate.
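
The subtraction can be checked with a small worked example. The function below is a sketch over a plain dict shaped like an OpenAI `usage` object; `dedupe_openai_usage` is a hypothetical name and covers only the fields needed to show the arithmetic.

```python
def dedupe_openai_usage(usage):
    """Split OpenAI-style totals so each token lands in exactly one bucket."""
    prompt_details = usage.get("prompt_tokens_details") or {}
    completion_details = usage.get("completion_tokens_details") or {}

    cached = prompt_details.get("cached_tokens") or 0
    audio_in = prompt_details.get("audio_tokens") or 0
    reasoning = completion_details.get("reasoning_tokens") or 0
    audio_out = completion_details.get("audio_tokens") or 0

    return {
        "tokensIn": usage["prompt_tokens"] - cached - audio_in,
        "tokensOut": usage["completion_tokens"] - reasoning - audio_out,
        "cachedTokens": cached,
        "reasoningTokens": reasoning,
        "audioTokensIn": audio_in,
        "audioTokensOut": audio_out,
    }


usage = {
    "prompt_tokens": 1000,
    "prompt_tokens_details": {"cached_tokens": 200, "audio_tokens": 50},
    "completion_tokens": 400,
    "completion_tokens_details": {"reasoning_tokens": 120, "audio_tokens": 30},
}
event = dedupe_openai_usage(usage)
print(event["tokensIn"], event["tokensOut"])  # 750 250
```

The input buckets (750 + 200 + 50) still sum to the original 1,000 prompt tokens, and the output buckets (250 + 120 + 30) to the original 400 completion tokens.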

</details>

<details>
<summary><strong>Google Gemini</strong> — Generative AI / Vertex AI</summary>

| Google API Field | SDK Field | Notes |
|---|---|---|
| `usageMetadata.promptTokenCount` | `tokensIn` | **Minus** cached, audio, image, video to prevent double-billing |
| `usageMetadata.candidatesTokenCount` | `tokensOut` | **Minus** audio and image output (thinking is already excluded by Google) |
| `usageMetadata.thoughtsTokenCount` | `reasoningTokens` | Thinking/reasoning tokens (Gemini 2.x) |
| `usageMetadata.cachedContentTokenCount` | `cachedTokens` | Prompt cache read tokens |
| `promptTokensDetails[AUDIO]` | `audioTokensIn` | Audio input modality |
| `candidatesTokensDetails[AUDIO]` | `audioTokensOut` | Audio output modality |
| `promptTokensDetails[IMAGE]` | `imageTokensIn` | Image/vision input |
| `candidatesTokensDetails[IMAGE]` | `imageTokensOut` | Image generation output |
| `promptTokensDetails[VIDEO]` | `videoTokensIn` | Video input |
| `promptTokensDetails[TEXT]` | `textTokensIn` | Text input |
| `candidatesTokensDetails[TEXT]` | `textTokensOut` | Text output |
| `candidates[].groundingMetadata.webSearchQueries` | `webSearchRequests` | Google Search grounding |
| `candidates[].content.parts[].functionCall` | `toolCalls` | Function call count |
| *(SDK-measured)* | `latencyMs` | Request duration |

Google's `candidatesTokenCount` does **not** include `thoughtsTokenCount`, so reasoning tokens are not subtracted. However, it **does** include audio and image output tokens, so the SDK subtracts those to prevent double-billing.

The Python SDK handles both snake_case (`usage_metadata`, `prompt_token_count`) and camelCase (`usageMetadata`, `promptTokenCount`) response formats — compatible with both the official `google-generativeai` SDK and the REST API.
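
Reading a field under either naming style might look like the sketch below. `read_field` is a hypothetical helper, with `SimpleNamespace` standing in for an SDK response object and a dict standing in for a REST payload.

```python
from types import SimpleNamespace


def read_field(container, snake_name, camel_name, default=0):
    """Read a value from an object or dict, trying snake_case then camelCase."""
    for name in (snake_name, camel_name):
        if isinstance(container, dict):
            value = container.get(name)
        else:
            value = getattr(container, name, None)
        if value is not None:
            return value
    return default


sdk_style = SimpleNamespace(prompt_token_count=42)  # attribute access, snake_case
rest_style = {"promptTokenCount": 42}               # dict access, camelCase

print(read_field(sdk_style, "prompt_token_count", "promptTokenCount"))   # 42
print(read_field(rest_style, "prompt_token_count", "promptTokenCount"))  # 42
```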

</details>

<details>
<summary><strong>AWS Bedrock</strong> — Converse API</summary>

| Bedrock API Field | SDK Field | Notes |
|---|---|---|
| `usage.inputTokens` | `tokensIn` | Input tokens |
| `usage.outputTokens` | `tokensOut` | Output tokens (includes reasoning — Bedrock does not split) |
| `usage.cacheReadInputTokens` | `cachedTokens` | Prompt cache read tokens |
| `usage.cacheDetails[ttl="5m"]` | `cacheWrite5mTokens` | Cache creation (5-minute TTL) |
| `usage.cacheDetails[ttl="1h"]` | `cacheWrite1hTokens` | Cache creation (1-hour TTL, higher rate) |
| `output.message.content[].toolUse` | `toolCalls` | Tool-use content block count |
| *(SDK-measured)* | `latencyMs` | Request duration |

Bedrock's `cacheDetails` array provides per-TTL breakdowns. The SDK splits these into `cacheWrite5mTokens` and `cacheWrite1hTokens`. When `cacheDetails` is absent, `cacheWriteInputTokens` falls back to the 5m bucket.
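
The split and its fallback could be sketched as follows. The entry shape is an assumption made for illustration: each `cacheDetails` item is taken to be a dict with `ttl` and `inputTokens` keys, and `split_cache_writes` is a hypothetical helper.

```python
def split_cache_writes(usage):
    """Bucket cache-write tokens by TTL, falling back to the 5m bucket."""
    details = usage.get("cacheDetails")
    if not details:
        # No per-TTL breakdown: the whole cache-write total goes to the 5m bucket
        return {
            "cacheWrite5mTokens": usage.get("cacheWriteInputTokens", 0),
            "cacheWrite1hTokens": 0,
        }

    buckets = {"cacheWrite5mTokens": 0, "cacheWrite1hTokens": 0}
    for entry in details:
        tokens = entry.get("inputTokens", 0)  # assumed key name
        if entry.get("ttl") == "1h":
            buckets["cacheWrite1hTokens"] += tokens
        elif entry.get("ttl") == "5m":
            buckets["cacheWrite5mTokens"] += tokens
    return buckets


with_details = {"cacheDetails": [{"ttl": "5m", "inputTokens": 300},
                                 {"ttl": "1h", "inputTokens": 100}]}
without_details = {"cacheWriteInputTokens": 80}

print(split_cache_writes(with_details))
print(split_cache_writes(without_details))
```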

In streaming mode (`converse_stream`), the final `metadata` event carries usage totals. Tool calls are accumulated from `contentBlockStart` events.

</details>

---

## Pricing Models

### Per-Call Revenue

For simple per-call billing, pass `revenue_unit_cents` in the wrapper options:

```python
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    revenue_unit_cents=50,  # $0.50 earned per LLM call
)
```

### Outcome-Based Pricing

Under per-resolution pricing, only a **resolved** run earns revenue. Escalated or failed runs earn $0, but provider costs still count against margin.

```python
run_id = "ticket_8842"
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    run_id=run_id,
)

# ... multiple LLM calls within this run ...

tollgate.resolve(
    run_id=run_id,
    customer_id="cust_acme",
    outcome="resolved",        # "resolved" | "escalated" | "failed"
    revenue_unit_cents=50,
)
```
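
The margin arithmetic behind per-resolution pricing can be worked through with a few lines of Python. The per-call costs are invented numbers and `run_margin_cents` is a hypothetical helper; the real rollup happens server-side.

```python
def run_margin_cents(outcome, revenue_unit_cents, call_costs_cents):
    """Margin for one run: revenue is booked only when the run is resolved."""
    revenue = revenue_unit_cents if outcome == "resolved" else 0
    return revenue - sum(call_costs_cents)


costs = [2.5, 4.0, 1.5]  # three LLM calls in the same run, in cents

print(run_margin_cents("resolved", 50, costs))   # 42.0
print(run_margin_cents("escalated", 50, costs))  # -8.0
```

The escalated run shows a negative margin because its provider costs are still incurred even though no revenue is booked.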

### External Tool Costs

Report costs from non-LLM services (image generation, code sandboxes, search APIs) alongside LLM calls:

```python
tollgate.track({
    "customerId": "cust_acme",
    "runId": "ticket_8842",
    "provider": "openai",
    "model": "dall-e-3",
    "tokensIn": 0,
    "tokensOut": 0,
    "externalCostCents": 4.0,     # $0.04 for the DALL-E call
    "idempotencyKey": "ticket_8842#dalle",
})
```

---

## Customer & Plan Setup

Create customers and assign plans before sending usage so plan-priced revenue is recognized from the first event. The call is idempotent, so it is safe to run on every app boot.

```python
tollgate.upsert_customer(
    "cust_acme",
    name="Acme Corp",
    plan={
        "name": "Pro Plan",
        "pricingModel": "usage_based",   # per_unit | per_resolution | usage_based | per_seat | flat | hybrid
        "unitRevenueCents": 10,
    },
)
```

---

## Error Handling

The SDK separates **tracking errors** (non-fatal) from **client errors** (actionable):

```python
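import logging

import sentry_sdk  # example error sink; on_error accepts any callable
from anthropic import Anthropic
from tollgate import create_tollgate_client, wrap_anthropic, TollgateError

tollgate = create_tollgate_client()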

# Tracking errors are logged as warnings by default.
# Override with on_error to route to your observability stack:
anthropic = wrap_anthropic(
    Anthropic(), tollgate,
    customer_id="cust_acme",
    on_error=lambda err: sentry_sdk.capture_exception(err),
)

# Client errors (missing API key, invalid plan) raise TollgateError:
try:
    tollgate.upsert_customer("cust_acme")
except TollgateError as err:
    print(err.status, err.body)  # HTTP status + response body
```

**Retry behavior:** The client retries 5xx and 429 responses, as well as network errors, with exponential backoff (200ms, 400ms, ...). Deterministic 4xx errors (400, 401, 403, 404, 422) fail immediately.
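
A retry loop with this shape might look like the sketch below. It is not the SDK's implementation: `send_with_retries` and `is_retryable` are hypothetical names, `send` is any callable that returns an HTTP status code or raises `OSError` on a network failure, and the `sleep` parameter exists so the demo can record delays instead of waiting.

```python
import time


def is_retryable(status):
    """429 and 5xx are transient; other statuses are final."""
    return status == 429 or 500 <= status <= 599


def send_with_retries(send, max_retries=2, base_delay=0.2, sleep=time.sleep):
    """Call send() until it returns a final status or retries run out."""
    attempt = 0
    while True:
        try:
            status = send()
        except OSError:
            status = None  # network failure, treated as transient
        retryable = status is None or is_retryable(status)
        if not retryable or attempt >= max_retries:
            return status
        sleep(base_delay * 2 ** attempt)  # 0.2s, then 0.4s, ...
        attempt += 1


# Demo: two transient failures followed by success.
responses = iter([503, 429, 200])
delays = []
result = send_with_retries(lambda: next(responses), sleep=delays.append)
print(result, delays)  # 200 [0.2, 0.4]
```

A 401 passed through the same loop would return on the first attempt with no delay recorded, matching the fail-immediately rule for deterministic 4xx errors.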

**Logging:** The SDK uses the standard `logging` module under the `"tollgate"` logger name. Configure it as you would any Python logger:

```python
logging.getLogger("tollgate").setLevel(logging.DEBUG)
```

---

## API Reference

### Exports

```python
# Client
create_tollgate_client(api_key=..., base_url=..., timeout=..., max_retries=...)  # all optional -> TollgateClient
TollgateError                    # Exception with status & body

# Auto-instrumentation wrappers
wrap_anthropic(client, tollgate, customer_id, **kwargs)   # -> instrumented Anthropic client
wrap_openai(client, tollgate, customer_id, **kwargs)      # -> instrumented OpenAI / compatible client
wrap_bedrock(client, tollgate, customer_id, **kwargs)     # -> instrumented Bedrock client
wrap_gemini(model, tollgate, customer_id, **kwargs)       # -> instrumented Gemini model

# Low-level event builders (for manual track payloads)
anthropic_event_from(msg, customer_id, **kwargs)          # -> dict | None
openai_event_from(completion, customer_id, **kwargs)      # -> dict | None
bedrock_event_from(usage, model, customer_id, **kwargs)   # -> dict | None
gemini_event_from(response, customer_id, **kwargs)        # -> dict | None
```

### `TollgateClient`

| Method | Description |
|---|---|
| `track(event: dict)` | Report a single usage event. Idempotent on `idempotencyKey`. Returns `{"status", "eventId"}`. |
| `resolve(run_id, customer_id, outcome, ...)` | Close a run with an outcome. Books revenue only when `outcome == "resolved"`. |
| `upsert_customer(customer_id, ...)` | Create or update a customer and optionally assign a plan. Returns `{"status", "customerId", "id", "planId"}`. |

### `create_tollgate_client` Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` | `TOLLGATE_API_KEY` env | Account API key |
| `base_url` | `str` | `https://www.tollgateai.dev` | Tollgate server URL |
| `timeout` | `float` | `10.0` | Per-request timeout in seconds |
| `max_retries` | `int` | `2` | Retry attempts on 5xx / 429 / network errors |

### Wrapper Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| `customer_id` | `str` | Yes | Your end customer's stable identifier |
| `agent_id` | `str` | No | Agent or workflow identifier |
| `run_id` | `str \| Callable` | No | Logical run ID (defaults to provider response ID) |
| `provider` | `str` | No | Override the reported provider |
| `revenue_unit_cents` | `int \| Callable` | No | Revenue per call in cents |
| `provider_cost_cents` | `float \| Callable` | No | Exact cost override in cents (skips rate card) |
| `on_error` | `Callable` | No | Error handler for background tracking (default: `logger.warning`) |

---

## How It Works

1. **Proxy wrappers** intercept provider calls without modifying the request or response. Your code sees the same types and behavior it would without the SDK.
2. After the provider responds, the wrapper extracts token counts by modality, tool calls, service tier, and latency from the response object.
3. A `POST /api/track` fires **on a background daemon thread** with automatic retries on transient failures. Your application code continues immediately.
4. The server computes cost from tokens via rate cards (per modality, cache tier, reasoning, and web search), joins it with plan-configured revenue, and updates real-time margin rollups.
5. Events are **idempotent** — deduplication is based on `idempotencyKey` (auto-set to the provider response ID).
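
Steps 1, 2, and 5 above can be sketched together in a few lines. Everything here is illustrative: `TrackingProxy`, `FakeMessages`, and `ingest` are hypothetical stand-ins, the response is a plain dict, and reporting runs inline rather than on a background thread to keep the example short.

```python
import time


class TrackingProxy:
    """Forward attribute access to a target and report on selected method calls."""

    def __init__(self, target, report, method_names=("create",)):
        self._target = target
        self._report = report
        self._method_names = method_names

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if name not in self._method_names or not callable(attr):
            return attr  # untracked attributes pass straight through

        def tracked(*args, **kwargs):
            started = time.monotonic()
            response = attr(*args, **kwargs)  # request and response are untouched
            latency_ms = int((time.monotonic() - started) * 1000)
            self._report({
                "idempotencyKey": response["id"],  # provider response ID
                "tokensIn": response["usage"]["input_tokens"],
                "tokensOut": response["usage"]["output_tokens"],
                "latencyMs": latency_ms,
            })
            return response

        return tracked


class FakeMessages:
    """Stand-in provider client that always returns the same response ID."""

    def create(self, **kwargs):
        return {"id": "msg_123", "usage": {"input_tokens": 9, "output_tokens": 2}}


seen_keys, accepted = set(), []


def ingest(event):
    """Stand-in for server-side ingestion: drop events with a known key."""
    if event["idempotencyKey"] in seen_keys:
        return
    seen_keys.add(event["idempotencyKey"])
    accepted.append(event)


messages = TrackingProxy(FakeMessages(), ingest)
first = messages.create(model="example-model")
second = messages.create(model="example-model")  # same ID, simulating a duplicate delivery
print(first["id"], len(accepted))  # msg_123 1
```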

## Security & Privacy

- **No prompt content is ever transmitted.** Only token counts, model identifiers, and metadata.
- **Idempotent ingestion** — duplicate events are safely deduplicated server-side.
- **Non-invasive** — background tracking never raises into your application code.
- **Transport security** — all communication over HTTPS with Bearer token authentication.
- **Thread-safe** — all wrappers and the client are safe for concurrent use.

---

## Changelog

### v0.8.0

- Base URL now points to `https://www.tollgateai.dev` for all SDK requests
- Confirmed zero-dependency footprint across all providers (Anthropic, OpenAI, Bedrock, Gemini)

### v0.7.0

- Google Gemini: fixed double-billing of multimodal input tokens — `tokensIn` now subtracts cached, audio, image, and video tokens
- Aligned Anthropic extraction with actual API response fields
- Simplified Anthropic streaming accumulation
- Verified field coverage for all five providers

### v0.6.0

- OpenAI: fixed double-counting of reasoning, audio, and cached tokens
- Multimodal-only events now trigger rate-card lookup
- Reasoning token extraction for OpenAI and Gemini

### v0.5.0

- Google Gemini / Vertex AI support with full multimodal extraction
- Audio, image, video, and text modality token tracking
- Web search request tracking (Anthropic, Gemini)
- OpenAI Predicted Outputs and service tier tracking
- Latency measurement on all wrappers

---

## License

MIT &mdash; see [LICENSE](LICENSE) for details.
