Metadata-Version: 2.4
Name: llm-toll
Version: 0.13.0
Summary: Lightweight drop-in Python decorator to track costs, monitor token usage, and enforce budget/rate limits for LLM API calls
Project-URL: Homepage, https://github.com/felipemorandini/llm-toll
Project-URL: Repository, https://github.com/felipemorandini/llm-toll
Project-URL: Issues, https://github.com/felipemorandini/llm-toll/issues
Author: Felipe Morandini
License: MIT
License-File: LICENSE
Keywords: anthropic,budget,cost-tracking,decorator,gemini,llm,openai
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: >=3.10
Provides-Extra: all
Requires-Dist: anthropic>=0.25; extra == 'all'
Requires-Dist: google-genai>=1.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0; extra == 'gemini'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9; extra == 'postgres'
Description-Content-Type: text/markdown

# llm-toll

A lightweight, drop-in Python decorator to track costs, monitor token usage, and enforce budget and rate limits for LLM API calls.

## Overview

`llm_toll` is a developer tool designed for local prototyping and small-scale production scripts. Wrap a function with `@track_costs` to log token usage automatically, compute the cost of each run in USD, and halt execution when a configured budget or rate limit is reached.

## Features

- **Drop-In Decorator** — Minimal code intrusion. Just add `@track_costs` above any function making an LLM call.
- **Multi-Provider Support** — Built-in pricing matrices for OpenAI, Anthropic, Gemini, and general OpenAI-compatible endpoints.
- **Hard Budget Caps** — Prevents functions from executing if the cumulative cost exceeds a defined threshold.
- **Rate Limiting** — Local enforcement of RPM and TPM to prevent HTTP 429 errors.
- **Local Persistence** — SQLite-backed usage tracking across multiple script runs and days.
- **Cost Reporting** — Clean, color-coded terminal summary of cost per call and total session cost.
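
To make the pricing idea concrete, here is a minimal sketch of how a per-call cost can be derived from token counts and a price table. The `PRICES` table, the `example-model` name, and `estimate_cost` are hypothetical, and the rates are placeholders rather than llm-toll's built-in pricing.

```python
from decimal import Decimal

# Hypothetical prices in USD per 1M tokens: (input, output). Placeholder values only.
PRICES = {
    "example-model": (Decimal("2.50"), Decimal("10.00")),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return the cost in USD for one call, given its token counts."""
    input_price, output_price = PRICES[model]
    million = Decimal(1_000_000)
    return (input_tokens * input_price + output_tokens * output_price) / million


cost = estimate_cost("example-model", input_tokens=1_200, output_tokens=300)
print(f"${cost:.6f}")  # $0.006000
```

`Decimal` is used here so that summing many small per-call costs does not accumulate floating-point error.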

## Quick Start

### Installation

```bash
pip install llm-toll
# or, with uv
uv add llm-toll
```

### Basic Usage (Auto-detect)

When you call a supported SDK, the decorator infers the model and token counts from the response object.

```python
from openai import OpenAI
from llm_toll import track_costs

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@track_costs(project="my_scraper", max_budget=2.00, reset="monthly")
def generate_summary(text):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}]
    )
    return response  # Decorator parses the usage from this object
```
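
As a rough illustration of how a decorator can read usage from a returned object, the sketch below wraps a function and records the model and token counts found on its response. The `track_usage` decorator and `fake_completion` function are hypothetical stand-ins, not llm-toll's implementation; only the field names (`model`, `usage.prompt_tokens`, `usage.completion_tokens`) follow the OpenAI response shape.

```python
import functools
from types import SimpleNamespace


def track_usage(log: list):
    """Hypothetical decorator: append (model, in_tokens, out_tokens) to `log`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            response = func(*args, **kwargs)
            usage = getattr(response, "usage", None)
            if usage is not None:
                log.append((response.model, usage.prompt_tokens, usage.completion_tokens))
            return response  # the caller still receives the original object
        return wrapper
    return decorator


calls = []


@track_usage(calls)
def fake_completion(text):
    # Stand-in for an SDK response with OpenAI-style field names.
    return SimpleNamespace(
        model="gpt-4o",
        usage=SimpleNamespace(prompt_tokens=len(text.split()), completion_tokens=5),
    )


fake_completion("summarize this text please")
print(calls)  # [('gpt-4o', 4, 5)]
```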

### Advanced Usage (Rate Limits & Explicit Models)

For custom setups or raw API requests, pass the model, the rate limits, and an `extract_usage` callable explicitly. The callable returns a `(model, input_tokens, output_tokens)` tuple.

```python
from llm_toll import track_costs

@track_costs(
    model="claude-sonnet-4-20250514",
    rate_limit=50,       # max 50 requests per minute
    tpm_limit=40000,     # max 40k tokens per minute
    extract_usage=lambda res: (res['model'], res['in_tokens'], res['out_tokens'])
)
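def custom_anthropic_call(prompt):
    # Custom logic here. Return an object that `extract_usage` can read,
    # i.e. one exposing 'model', 'in_tokens', and 'out_tokens' keys.
    pass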
```

Rate limits use a sliding-window algorithm. When a limit is reached, `LocalRateLimitError` is raised with a `retry_after` attribute indicating how long to wait.
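
The sliding-window idea can be sketched in a few lines. This is an illustrative stand-alone limiter, not llm-toll's code: `SlidingWindowLimiter` and `RateLimitExceeded` are hypothetical names, and the assumption that `retry_after` is expressed in seconds is mine.

```python
import time
from collections import deque


class RateLimitExceeded(Exception):
    """Stand-in error that carries the wait time in seconds."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limit reached, retry in {retry_after:.2f}s")
        self.retry_after = retry_after


class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float = 60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._timestamps = deque()  # times of requests still inside the window

    def acquire(self) -> None:
        now = self.clock()
        # Drop requests that have slid out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            # The oldest request leaves the window at timestamp + window_seconds.
            raise RateLimitExceeded(self._timestamps[0] + self.window_seconds - now)
        self._timestamps.append(now)


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60.0, clock=lambda: fake_now[0])
limiter.acquire()
fake_now[0] = 10.0
limiter.acquire()
fake_now[0] = 20.0
try:
    limiter.acquire()
except RateLimitExceeded as exc:
    print(exc.retry_after)  # 40.0
```

Unlike a fixed per-minute counter, the window moves with each request, so a burst at the end of one minute cannot be followed immediately by another burst at the start of the next.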

### Streaming Support

The decorator automatically detects streaming responses (generators). Cost is tracked after the stream is fully consumed.

```python
from openai import OpenAI
from llm_toll import track_costs

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@track_costs(project="my_app", max_budget=5.00)
def stream_response(text):
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}],
        stream=True,
        stream_options={"include_usage": True},  # recommended for accurate counts
    )

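for chunk in stream_response("Hello"):
    # With include_usage, the final chunk has an empty `choices` list,
    # and some deltas carry no content, so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
# Cost is logged automatically after the stream completes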
```

> **Note:** For accurate token counts with OpenAI streaming, pass `stream_options={"include_usage": True}`. Without it, output tokens are estimated using a character-based heuristic.
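
The sketch below shows one way a stream can be wrapped so that usage is reported only once it is exhausted, with a character-based fallback when no usage chunk arrives. `track_stream`, `fake_chunk`, and the ratio of 4 characters per token are all assumptions made for illustration; the library's actual heuristic may differ.

```python
import math
from types import SimpleNamespace

CHARS_PER_TOKEN = 4  # assumed ratio; the library's actual heuristic may differ


def track_stream(stream, on_complete):
    """Yield chunks unchanged, then report output tokens once the stream ends."""
    text_parts = []
    reported = None
    for chunk in stream:
        if getattr(chunk, "usage", None) is not None:
            reported = chunk.usage.completion_tokens  # count supplied by the API
        if chunk.choices and chunk.choices[0].delta.content:
            text_parts.append(chunk.choices[0].delta.content)
        yield chunk
    if reported is None:
        # No usage chunk was seen: fall back to a rough character-based estimate.
        reported = math.ceil(len("".join(text_parts)) / CHARS_PER_TOKEN)
    on_complete(reported)


def fake_chunk(content):
    # Stand-in for an OpenAI-style streaming chunk without usage data.
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)], usage=None)


results = []
for _ in track_stream(iter([fake_chunk("Hello"), fake_chunk(", world!")]), results.append):
    pass
print(results)  # [4]
```

Because the callback runs after the loop, a caller that abandons the stream early never triggers it in this sketch.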

### Async Support

The decorator auto-detects async functions and async generators — no changes needed:

```python
from openai import AsyncOpenAI
from llm_toll import track_costs

client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

@track_costs(project="my_app", max_budget=5.00)
async def async_chat(text):
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": text}]
    )
    return response

@track_costs(project="my_app")
async def async_stream(text):
    stream = await client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": text}],
        stream=True, stream_options={"include_usage": True},
    )
    async for chunk in stream:
        yield chunk
```

SQLite operations run in a thread pool (`asyncio.to_thread`) so the event loop is never blocked.
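
For readers unfamiliar with the pattern, here is a minimal sketch of offloading a blocking SQLite write with `asyncio.to_thread`. The `usage` table, its columns, and the helper names are hypothetical and do not reflect llm-toll's schema.

```python
import asyncio
import os
import sqlite3
import tempfile


def write_usage(db_path: str, model: str, cost: float) -> None:
    """Blocking write; the connection is opened inside the worker thread."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS usage (model TEXT, cost REAL)")
            conn.execute("INSERT INTO usage VALUES (?, ?)", (model, cost))
    finally:
        conn.close()


def count_rows(db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT COUNT(*) FROM usage").fetchone()[0]
    finally:
        conn.close()


async def log_usage(db_path: str, model: str, cost: float) -> None:
    # Run the blocking write in a worker thread so the event loop stays free.
    await asyncio.to_thread(write_usage, db_path, model, cost)


async def main(db_path: str) -> int:
    await log_usage(db_path, "example-model", 0.01)
    await log_usage(db_path, "example-model", 0.02)
    return await asyncio.to_thread(count_rows, db_path)


db_path = os.path.join(tempfile.mkdtemp(), "usage.db")
print(asyncio.run(main(db_path)))  # 2
```

Opening the connection inside the worker function avoids sharing a `sqlite3` connection across threads, which the module rejects by default.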

## Supported Providers

| Provider | SDK Auto-Parsing | Streaming Support | Custom Model Overrides |
|----------|-----------------|-------------------|----------------------|
| OpenAI | Yes (`openai` client) | Yes (chunk calculation) | Yes |
| Anthropic | Yes (`anthropic` client) | Yes | Yes |
| Google Gemini | Yes (`google-genai` client) | Yes | Yes |
| Local/Ollama | Via OpenAI-compat API | N/A | Rate limiting only ($0 cost) |

### Local/Ollama Models

Local models (`ollama/`, `local/`, `llama.cpp/` prefixes) are tracked at $0 cost. Rate limiting still applies — useful for managing local GPU resources.

```python
from llm_toll import track_costs

@track_costs(
    model="ollama/llama3",
    rate_limit=10,       # limit local GPU to 10 RPM
    extract_usage=lambda r: ("ollama/llama3", r["prompt_tokens"], r["completion_tokens"])
)
def local_inference(prompt):
    # Ollama call here
    pass
```

> **Tip:** Ollama exposes an OpenAI-compatible API, so if you point the `openai` client at `http://localhost:11434/v1`, usage is auto-parsed.
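
Prefix-based detection of local models can be sketched as follows. The prefixes come from the list above; the `is_local_model` and `cost_for` helpers and the per-token rate argument are hypothetical.

```python
LOCAL_PREFIXES = ("ollama/", "local/", "llama.cpp/")


def is_local_model(model: str) -> bool:
    """Return True when the model name carries one of the local prefixes."""
    return model.startswith(LOCAL_PREFIXES)


def cost_for(model: str, tokens: int, usd_per_token: float) -> float:
    # Local models are tracked at zero cost; everything else is billed per token.
    return 0.0 if is_local_model(model) else tokens * usd_per_token


print(is_local_model("ollama/llama3"))  # True
print(cost_for("ollama/llama3", 1_000, 0.00001))  # 0.0
```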

## LiteLLM Integration

Track costs automatically across all LiteLLM calls — no decorator needed:

```python
import litellm
from llm_toll import LiteLLMCallback

litellm.callbacks = [LiteLLMCallback(project="my-app", max_budget=10.0)]

# All litellm completions are now tracked automatically
response = litellm.completion(model="gpt-4o", messages=[{"role": "user", "content": "Hi"}])
```

The callback also works with the `@track_costs` decorator — use whichever approach fits your codebase.

## LangChain Integration

Track costs across all LLM calls in a LangChain chain or agent:

```python
from langchain_openai import ChatOpenAI
from llm_toll import LangChainCallback

handler = LangChainCallback(project="my-chain", max_budget=10.0)
llm = ChatOpenAI(model="gpt-4o", callbacks=[handler])
```

Budget is checked *before* each LLM call (`on_llm_start`), and usage is logged *after* (`on_llm_end`).
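
The check-before, log-after flow can be sketched without LangChain. `BudgetTracker` and `BudgetExceeded` are hypothetical stand-ins, and the `>=` comparison against the cap is an assumption rather than the library's documented rule.

```python
class BudgetExceeded(Exception):
    """Stand-in for a budget error raised before a call is made."""


class BudgetTracker:
    def __init__(self, max_budget: float):
        self.max_budget = max_budget
        self.spent = 0.0

    def on_start(self) -> None:
        # Refuse to start a new call once the budget is already used up.
        if self.spent >= self.max_budget:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.max_budget:.2f}")

    def on_end(self, cost: float) -> None:
        # The cost is only known after the call, so it is recorded here.
        self.spent += cost


tracker = BudgetTracker(max_budget=0.05)
for cost in (0.03, 0.04, 0.02):
    try:
        tracker.on_start()
    except BudgetExceeded as exc:
        print(f"blocked: {exc}")  # blocked: spent $0.07 of $0.05
        break
    tracker.on_end(cost)
```

In this sketch the call that crosses the cap still runs, because its cost is unknown until it finishes; only the following call is blocked.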

## Error Handling

```python
from llm_toll.exceptions import BudgetExceededError, LocalRateLimitError

try:
    result = generate_summary("some text")
except BudgetExceededError as e:
    print(f"Budget exceeded: {e}")
except LocalRateLimitError as e:
    print(f"Rate limit hit: {e}")
```
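
When a rate limit is hit, waiting and retrying is often enough. The helper below is a hypothetical sketch, not part of llm-toll: it retries on whichever exception type is passed in and assumes the error's `retry_after` attribute is a number of seconds.

```python
import time


def call_with_retry(func, *args, retry_on, max_attempts=3, sleep=time.sleep, **kwargs):
    """Call func, sleeping for exc.retry_after whenever a retry_on error is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except retry_on as exc:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Assumes retry_after is a number of seconds.
            sleep(getattr(exc, "retry_after", 1.0))


class FakeRateLimitError(Exception):
    """Stand-in error used only for this demonstration."""
    retry_after = 0.01


attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 2:
        raise FakeRateLimitError()
    return "ok"


print(call_with_retry(flaky, retry_on=FakeRateLimitError))  # ok
```

With llm-toll installed, the same helper could be called as `call_with_retry(generate_summary, "some text", retry_on=LocalRateLimitError)`.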

## CLI Dashboard

View costs and usage from the terminal:

```bash
# Show cost summary across all projects
llm-toll --stats

# Filter by project or model
llm-toll --stats --project my_scraper
llm-toll --stats --model gpt-4o

# Reset a project's budget counter
llm-toll --reset --project my_scraper

# Export usage logs to CSV
llm-toll --export csv > usage.csv
llm-toll --export csv --project my_scraper --output report.csv

# Update pricing from remote source
llm-toll --update-pricing

# Launch web dashboard with charts and analytics
llm-toll --dashboard
llm-toll --dashboard --port 9000
```

The web dashboard shows cost trends, project/model breakdowns, and budget utilization in your browser. It is served at `http://127.0.0.1:8050` unless you pass `--port`.

Pricing can also be updated programmatically:

```python
from llm_toll import update_pricing

update_pricing()  # fetches latest pricing, caches for 24h
```

## PostgreSQL Backend (Team-Wide Tracking)

For team-wide cost visibility, use a shared PostgreSQL database:

```bash
# Quote the extra so shells such as zsh do not expand the brackets
pip install "llm-toll[postgres]"

# Set via environment variable (all @track_costs decorators auto-connect)
export LLM_TOLL_STORE_URL=postgresql://user:pass@host/llm_costs
```

Or configure programmatically:

```python
from llm_toll import create_store, set_store

store = create_store(url="postgresql://user:pass@host/llm_costs")
set_store(store)
```

The CLI also supports `--store-url`:

```bash
llm-toll --stats --store-url postgresql://user:pass@host/llm_costs
```

The PostgreSQL backend uses connection pooling and row-level locking for safe concurrent budget enforcement across multiple processes and machines.

## Development

```bash
# Install dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint & format
uv run ruff check .
uv run ruff format .

# Type check
uv run mypy src/llm_toll
```

## License

MIT License — see [LICENSE](LICENSE) for details.
