Metadata-Version: 2.4
Name: paygent
Version: 0.1.0
Summary: The economic brain for AI agents — meter costs and enforce guardrails.
Project-URL: Homepage, https://paygent.to/
Project-URL: Documentation, https://paygent.to/
Project-URL: Issues, https://paygent.to/
Author-email: Abhishek <yabhishekkm@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: agents,ai,ai-agents,billing,guardrails,llm,metering
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: aiosqlite>=0.19.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: pydantic>=2.0
Provides-Extra: all
Requires-Dist: crewai>=0.1.0; extra == 'all'
Requires-Dist: langchain-core>=0.1.0; extra == 'all'
Provides-Extra: crewai
Requires-Dist: crewai>=0.1.0; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: anthropic>=0.18.0; extra == 'dev'
Requires-Dist: openai>=1.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.1.0; extra == 'langchain'
Description-Content-Type: text/markdown

# Paygent

**The economic brain for AI agents** -- meter costs and enforce guardrails.

Paygent is a Python SDK that auto-instruments LLM API calls to meter per-user costs (including model-level token tracking), enforce spending guardrails, and sync usage data to the Paygent backend. It's the missing runtime enforcement layer for AI agent applications.

## Quick Start

```bash
pip install paygent
```

Configure your plans once on the Paygent dashboard, then in your app:

```python
import openai
from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")

# Wrap LLM calls in paygent_context with the end user's ID.
# Paygent auto-loads the user's plan on first use — no extra setup.
with paygent_context(user_id="user_123"):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    )

# Query usage any time
usage = pg.get_usage("user_123")
print(f"Period cost: ${usage.period_cost:.4f}")
```

No backend? See [Local Mode](#local-mode) for running fully offline.

## Features

- **Auto-instrumentation** -- Monkey-patches OpenAI and Anthropic SDKs. The LLM call line itself is unchanged — you just wrap it in `paygent_context(user_id=...)`. Works transparently with most frameworks that route through these SDKs (tested: LangChain, LangGraph, CrewAI).
- **Per-user metering** -- Track token consumption per user, per session, per model in real time.
- **Spending guardrails** -- Soft gates (warnings) and hard gates (blocks) for period spend, session spend, and per-model token limits.
- **Concurrency-safe** -- Two-phase reservation pattern protects against hard-gate overrun when concurrent calls race at a cap boundary.
- **Model-level tracking** -- Track and limit tokens per model separately (e.g., 50K GPT-4o + 30K Claude per period).
- **Background sync** -- Events sync to the Paygent backend asynchronously without blocking your agent.
- **Local fallback** -- Works fully offline with local SQLite. Events queue and sync when the backend is reachable.
- **Fail-open** -- Paygent is designed not to break your agent. Every path that intercepts an LLM call is guarded with try/except and falls through to the original call on error.
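The fail-open behavior can be sketched as a plain decorator. This is illustrative only: the real SDK patches the provider SDKs directly, and `meter` here is a hypothetical stand-in for Paygent's internal guard-check-and-meter hook.

```python
import functools

def fail_open(meter):
    """Wrap an LLM call so metering errors never break the call itself.

    `meter` is a hypothetical metering hook; if it raises, we fall
    through to the original call as if Paygent were not installed.
    """
    def decorator(llm_call):
        @functools.wraps(llm_call)
        def wrapper(*args, **kwargs):
            try:
                meter(args, kwargs)   # guard check + metering
            except Exception:
                pass                  # fail open: never block the agent
            return llm_call(*args, **kwargs)
        return wrapper
    return decorator

def broken_meter(args, kwargs):
    raise RuntimeError("metering backend down")

@fail_open(broken_meter)
def fake_llm_call(prompt):
    return f"echo: {prompt}"

print(fake_llm_call("hi"))  # still answers despite the metering failure
```

Note this fall-through applies to Paygent's own errors; a deliberate hard-gate block (`PaygentLimitExceeded`) is raised on purpose, not swallowed.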

## Installation

```bash
# Core SDK
pip install paygent

# With LangChain support
pip install "paygent[langchain]"

# With CrewAI support
pip install "paygent[crewai]"

# Everything (quotes keep zsh from globbing the brackets)
pip install "paygent[all]"
```

## Usage

### Auto-Instrumentation

When `Paygent.init()` runs, it monkey-patches OpenAI and Anthropic SDK methods. Any subsequent call inside a `paygent_context(user_id=...)` block is automatically metered and guard-checked. No changes to the LLM call line itself.

```python
import openai
from paygent import Paygent, paygent_context

pg = Paygent.init(api_key="pg_live_...")

with paygent_context(user_id="user_123"):
    # Automatically metered -- nothing else to do
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "..."}],
    )
```

**Frameworks**: LangChain, LangGraph, and CrewAI all call the OpenAI/Anthropic SDKs under the hood, so auto-instrumentation covers them with no extra wiring. Wrap framework entry points (e.g. `chain.invoke(...)`) in `paygent_context(...)` just like direct SDK calls.

### When to call `start_session` (optional)

The SDK auto-loads a user's session on first use — `start_session()` is **not required**. Call it explicitly only when you want to:

- **Pre-warm** the cache to avoid the one backend round-trip on first call
- **Supply a plan config inline** (useful in local-only mode, or as a fallback in case the backend is unreachable)
- **Fire `on_session_start` callbacks** at a known moment (e.g. at request start)

In connected mode with plans configured on the Paygent backend, you can skip it entirely.

```python
# Pre-warm (optional — just avoids latency on the first call)
pg.start_session("user_123")
```

### Decorator

```python
@pg.track(user_id="user_123")
def handle_request(query):
    return openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )

# Dynamic user ID from a function argument
@pg.track(user_id_param="uid")
def handle_request(uid: str, query: str):
    return openai.chat.completions.create(...)
```

### Explicit Wrap

For cases where you prefer explicit per-call control over monkey-patching:

```python
import openai
client = openai.OpenAI()

# Sync: wrap() takes a ZERO-ARG CALLABLE
response = pg.wrap(
    lambda: client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    ),
    user_id="user_123",
    model="gpt-4o",
)
```

```python
import openai
async_client = openai.AsyncOpenAI()

# Async: awrap() takes an AWAITABLE (the coroutine directly)
response = await pg.awrap(
    async_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Hello!"}],
    ),
    user_id="user_123",
    model="gpt-4o",
)
```

The `model` parameter is optional — Paygent extracts it from the response after the call. Note: **per-model token-limit checks only apply when `model` is passed in**, since the pre-call guard can't know which model's cap to check until you tell it.

You can also pass `session_id`, `metadata`, `provider` (explicit token extractor), and `estimated_input_tokens` / `estimated_max_tokens` (for better reservation sizing under concurrency).

### Plan Configuration

Plans are normally configured on the Paygent dashboard/API and fetched by the SDK on session load. You only need to construct a `PlanConfig` in code for **local-only mode** (no API key) or as a **fallback** when the backend is unreachable.

```python
from paygent import PlanConfig, ModelCostRate, ModelLimitConfig

plan_config = PlanConfig(
    max_spend_per_period=49.00,
    max_spend_per_session=5.00,
    soft_gate_at=0.80,      # Warn at 80%
    hard_gate_at=1.00,      # Block at 100%
    model_limits={
        "gpt-4o": ModelLimitConfig(max_tokens_per_period=50000),
        "claude-sonnet-4-20250514": ModelLimitConfig(max_tokens_per_period=30000),
    },
    cost_rates={
        "gpt-4o": ModelCostRate(input=0.0025, output=0.01),
        "claude-sonnet-4-20250514": ModelCostRate(input=0.003, output=0.015),
    },
    # Fallback rate for models not listed in cost_rates (opt-in)
    default_cost_rate=ModelCostRate(input=0.002, output=0.008),
    tool_costs={"web_search": 0.05},
    # Safety margin applied to reservation estimates under concurrency.
    # Absorbs estimation drift (chars/4 tokenizer approximation, unknown
    # max_tokens, small race windows at cap boundaries).  Only affects the
    # TEMPORARY hold during the await — actual recorded spend is always
    # the real cost from the response.
    reservation_safety_factor=1.2,
)
```
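As a sanity check on the rate units (per 1K tokens, matching `ModelCostRate` above), the cost of a single call works out as follows. `call_cost` is a hypothetical helper for illustration, not an SDK function:

```python
def call_cost(input_tokens: int, output_tokens: int,
              input_rate: float, output_rate: float) -> float:
    """Cost of one call given per-1K-token rates."""
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

# gpt-4o at $0.0025/1K input and $0.01/1K output:
# a 1,200-token prompt with a 400-token completion costs
cost = call_cost(1200, 400, input_rate=0.0025, output_rate=0.01)
print(f"${cost:.4f}")  # $0.0070
```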

### Guardrails

```python
from paygent import PaygentLimitExceeded

# Register soft gate callback (approaching a limit)
def on_approaching_limit(result):
    print(f"Warning: {result.message}")
    # result.gate_reason: "total_spend", "session_spend", "model_limit:gpt-4o"

pg.on_soft_gate(on_approaching_limit)

# Register hard gate callback (fires before the exception is raised)
def on_limit_hit(result):
    log.error(f"Blocked: {result.message}")
    notify_user(result.gate_reason)

pg.on_hard_gate(on_limit_hit)

# Hard gates raise PaygentLimitExceeded
try:
    with paygent_context(user_id="user_123"):
        response = openai.chat.completions.create(...)
except PaygentLimitExceeded as e:
    print(f"Blocked: {e.guard_result.message}")

# Pre-flight check
guard = pg.check_guard("user_123", model="gpt-4o")
if guard.status == "hard_gate":
    print("User has exceeded their limit")

# Size max_tokens safely before the call — especially useful for streaming
# or any scenario where you want to bound output to what the user can afford.
advice = pg.get_max_tokens(
    "user_123",
    model="gpt-4o-mini",
    messages=my_messages,  # Paygent estimates input tokens from this
)
if advice.max_tokens == 0:
    return f"Budget exhausted: {advice.binding_limit}"
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=my_messages,
    max_tokens=advice.max_tokens,  # never pushes the user past any limit
)
```
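The idea behind `get_max_tokens` can be illustrated with a standalone sketch. `safe_max_tokens` is a hypothetical helper; the real SDK also estimates input tokens for you and accounts for per-model token caps.

```python
def safe_max_tokens(remaining_spend: float, est_input_tokens: int,
                    input_rate: float, output_rate: float) -> int:
    """Largest max_tokens whose worst-case cost fits the remaining budget.

    Rates are per 1K tokens. Truncation (int) errs on the safe side.
    Returns 0 when even the input alone would exceed the budget.
    """
    input_cost = (est_input_tokens / 1000) * input_rate
    budget_for_output = remaining_spend - input_cost
    if budget_for_output <= 0:
        return 0
    return int(budget_for_output / output_rate * 1000)

# $0.05 left, a 2,000-token prompt, illustrative gpt-4o-mini-style rates
print(safe_max_tokens(0.05, 2000, input_rate=0.00015, output_rate=0.0006))
```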

### Event Callbacks

```python
# Called after every successfully metered LLM call
def on_usage(event):
    print(f"{event.model}: {event.total_tokens} tokens, ${event.cost_total:.4f}")

pg.on_usage(on_usage)

# Called when a user's session is first loaded (from backend / snapshot /
# permissive defaults)
def on_session(session):
    print(f"Session: {session.user_id} on plan {session.plan}")

pg.on_session_start(on_session)
```

### Usage Queries

```python
# Period + session totals (snapshot, auto-loads if not cached)
usage = pg.get_usage("user_123")
print(f"Period cost: ${usage.period_cost:.2f}")
print(f"Session cost: ${usage.session_cost:.2f}")
print(f"Period tokens: {usage.period_tokens_total}")

# Per-model breakdown
for m in pg.get_model_usage("user_123"):
    limit = f"/ {m.tokens_limit}" if m.tokens_limit else ""
    print(f"  {m.model}: {m.tokens_used} tokens {limit}, ${m.cost:.4f}")

# Multi-dimensional remaining budget — spend caps + per-model token caps.
# Dimensions with no configured limit are reported as float('inf') for
# spend fields or None for per-model token fields.
budget = pg.get_remaining_budget("user_123")
print(f"Most constrained: {budget.most_constrained}")
if budget.period_spend_remaining != float("inf"):
    print(f"Period remaining: ${budget.period_spend_remaining:.2f}")

# Quick "is the next call allowed?" boolean
if pg.is_within_limit("user_123", model="gpt-4o"):
    response = openai.chat.completions.create(...)
```

## How It Works

Paygent adds negligible overhead per LLM call — typically single-digit milliseconds. Guard checks are in-memory operations performed under a briefly held per-user lock. Events are pushed to a non-blocking queue and flushed by a background thread. The call path is:

1. **Read context** — which user is this call for?
2. **Guard check + reserve** — held under a per-user lock; pre-call reservation prevents concurrent bursts from overrunning a cap.
3. **Execute the LLM call** — lock released; network I/O runs in parallel with other calls for the same user.
4. **Meter + finalize** — extract tokens from the response, update the cache (replacing the reservation with actual cost), push to the background event queue.
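The reserve/finalize pattern in steps 2 and 4 can be sketched as follows. This is a simplified illustration under a single spend cap, not the SDK's actual internals:

```python
import threading

class UserBudget:
    """Two-phase reservation: hold an estimate during the call, then
    replace it with the real cost once the response arrives."""

    def __init__(self, cap: float):
        self.cap = cap
        self.spent = 0.0      # finalized spend
        self.reserved = 0.0   # in-flight holds
        self._lock = threading.Lock()

    def reserve(self, estimate: float) -> bool:
        with self._lock:
            if self.spent + self.reserved + estimate > self.cap:
                return False  # hard gate: a concurrent burst can't overrun
            self.reserved += estimate
            return True

    def finalize(self, estimate: float, actual: float) -> None:
        with self._lock:
            self.reserved -= estimate
            self.spent += actual  # recorded spend is the real cost

budget = UserBudget(cap=1.00)
assert budget.reserve(0.60)
assert not budget.reserve(0.60)   # second concurrent call blocked at the cap
budget.finalize(0.60, actual=0.42)
assert budget.reserve(0.50)       # room again once the real cost is recorded
```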

For the full architecture (event queue, SQLite schema, reservation semantics), see [CONTRIBUTING.md](./CONTRIBUTING.md#architecture).

## Local Mode

Paygent supports two distinct offline scenarios — they're separate, and the SDK behaves differently in each.

### Local-only mode (no backend at all)

Omit the API key to run without any backend. Everything works the same in the agent's hot path — guardrails, metering, per-model tracking — but events are stored in a local SQLite database and stay there. There's no backend to sync to.

```python
pg = Paygent.init()  # No api_key = local-only
print(pg.is_local_mode)  # True

# Plans must be supplied in code since there's no backend to fetch from.
pg.start_session("user_123", plan="free", plan_config=PlanConfig(
    max_spend_per_period=5.00,
    cost_rates={"gpt-4o": ModelCostRate(input=0.0025, output=0.01)},
))
```

Good for tests, local development, demos.

### Connected mode with offline fallback

When you pass `api_key=...` but the backend is transiently unreachable, Paygent degrades gracefully:

- Guard checks continue running against the last-known cached state.
- New events queue in the local SQLite database marked unsynced.
- A background thread retries the sync on every `sync_pending` cycle (default every 30s).
- When the backend returns, queued events flush to it automatically.

You don't need to do anything for this — it's automatic. Events are never lost due to transient backend outages.

The local database lives at `~/.paygent/local.db` by default. Override via `Paygent.init(db_path=...)`.

## API Reference

### `Paygent`

| Method | Description |
|--------|-------------|
| `Paygent.init(api_key=None, ...)` | Initialize the SDK (singleton) |
| `pg.start_session(user_id, plan, plan_config)` | **Optional** — pre-warm a user's session (SDK auto-loads on first use) |
| `pg.get_usage(user_id)` | Get current usage snapshot (auto-loads) |
| `pg.get_model_usage(user_id)` | Get per-model breakdown |
| `pg.get_remaining_budget(user_id)` | Multi-dimensional remaining budget (spend + per-model tokens) |
| `pg.get_max_tokens(user_id, model, ...)` | Recommend a safe `max_tokens` value for the next call |
| `pg.is_within_limit(user_id, model=None)` | Quick boolean: is the next call allowed? |
| `pg.check_guard(user_id, model)` | Manual pre-flight guard check (returns `GuardResult`) |
| `pg.on_soft_gate(callback)` | Register soft gate handler |
| `pg.on_hard_gate(callback)` | Register hard gate handler |
| `pg.on_usage(callback)` | Register post-metering handler |
| `pg.on_session_start(callback)` | Register session start handler |
| `pg.track(user_id=...)` | Decorator for user context |
| `pg.wrap(call, user_id, model)` | Explicit metering wrapper (sync) |
| `pg.awrap(coro, user_id, model)` | Explicit metering wrapper (async) |
| `pg.flush()` | Manually flush pending events |
| `pg.shutdown()` | Graceful shutdown |

### Context Managers

| Function | Description |
|----------|-------------|
| `paygent_context(user_id, ...)` | Set user context for a block |
| `paygent_track(user_id, ...)` | Decorator variant |

### Models

| Model | Description |
|-------|-------------|
| `PlanConfig` | Plan limits, cost rates, model limits |
| `ModelCostRate` | Per-1K-token cost for a model (input + output) |
| `ModelLimitConfig` | Per-model token cap within a plan |
| `GuardResult` | Result of a guard check (ok/soft_gate/hard_gate) |
| `UsageEvent` | A single metered event |
| `CurrentUsage` | Live usage counters |
| `ModelUsage` | Per-model tokens/cost snapshot |
| `BudgetRemaining` | Remaining spend and per-model tokens (returned by `get_remaining_budget`) |
| `MaxTokensAdvice` | Safe `max_tokens` recommendation (returned by `get_max_tokens`) |
| `UserState` | Full cached state for a user (plan + usage + billing period) |
| `BillingPeriod` | Subscription-anchored billing window |
| `UserSession` | Deprecated alias for `UserState` (kept for backward compat) |

## Known Limitations

### Multi-process / multi-replica deployments

Paygent keeps per-user usage in an **in-memory cache per process** and syncs events to the backend on a background timer. Guard checks (soft gate, hard gate, model limits) run against the **local cache only** — they do not round-trip to the backend on every LLM call.

When you run multiple worker processes (Gunicorn with `workers > 1`, multi-replica Kubernetes, multiple containers, etc.), each process has its own independent cache. The caches converge by periodic refresh from the backend (`refresh_interval`, default **60 seconds**), but between refreshes they drift.

**Practical impact**: a user making concurrent requests that land on different workers can briefly exceed their configured limit. Maximum possible overspend per refresh window is roughly:

```
(workers - 1) × refresh_interval × request_rate × avg_cost_per_request
```

**Example**: 4 Gunicorn workers, 1 LLM req/sec, $0.01/req, 60s refresh → up to **~$1.80 overspend per user per minute** in the worst case.
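The bound is easy to plug numbers into. `max_overspend` is a hypothetical helper encoding the formula above:

```python
def max_overspend(workers: int, refresh_interval_s: float,
                  requests_per_s: float, cost_per_request: float) -> float:
    """Worst-case overspend per user per refresh window."""
    return (workers - 1) * refresh_interval_s * requests_per_s * cost_per_request

# 4 workers, 60s refresh, 1 req/s, $0.01/request
print(f"${max_overspend(4, 60, 1, 0.01):.2f}")  # $1.80
```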

**Mitigations** (pick based on your needs):

1. **Single worker**: run with `--workers 1` if strict per-user enforcement is required and throughput is acceptable.
2. **Tighter refresh**: pass `Paygent.init(refresh_interval=10.0)` — reduces drift by 6× at the cost of 6× more backend traffic.
3. **Generous plan buffer**: configure hard gates with a safety margin (e.g. set hard gate at 90% of what you actually want to enforce) until shared-cache support lands.

**Planned for Phase 2**: shared-cache mode (Redis or lease-based budget) that removes this drift entirely while preserving the sub-millisecond guard-check latency of the local cache.

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md) for development setup, architecture details, testing, and release process.

## License

MIT
