Metadata-Version: 2.4
Name: routecut
Version: 0.1.0
Summary: Drop-in LLM cost router: classify each prompt by task class and route it to the cheapest sufficient model. Shows the dollars saved on every call.
License: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Provides-Extra: extras
Requires-Dist: tiktoken>=0.7; extra == "extras"
Requires-Dist: rich>=13.0; extra == "extras"
Provides-Extra: dev
Requires-Dist: routecut[extras]; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"

# routecut

> A drop-in LLM cost router. It classifies each prompt by task class and routes
> it to the **cheapest model that's still sufficient**, then shows you the
> dollars saved on every call. It aims to cut the "route everything to one
> premium model" bill by roughly 50% or more without hand-building a router,
> and to prove the savings.

Most LLM traffic gets sent to one frontier model "to be safe," but most of that
traffic is routine (drafts, formatting, simple tool-call filling). `routecut`
classifies each prompt, picks the cheapest sufficient model under a policy you
can read and edit, falls back or escalates when it's unsure, and reports the
cost saved versus a premium baseline on every call.

## Why

OpenRouter and LiteLLM are the *transport* (call many models through one API);
you still pick the model. `routecut` is the *decision*: classify the prompt,
choose the cheapest-sufficient model, and make the savings visible. It can even
ride on top of a unified API. Local-first, BYO keys, transparent policy.

## Install

```bash
pip install -e .            # core, zero required deps
pip install -e ".[extras]"  # + tiktoken (accurate token counts) + rich
```

Python 3.11+.

## Quick start

```python
from routecut import Router

router = Router.from_config()          # routes.toml + pricing.toml, BYO keys via env
resp = router.chat(messages=[{"role": "user", "content": "translate hello to French"}])

print(resp.text)
print(resp.routing.model)        # e.g. "qwen-turbo" — the cheap model it chose
print(resp.routing.saved_usd)    # $ saved vs the premium baseline, this call
```

It speaks the OpenAI chat shape and routes to **OpenAI, DeepSeek, Qwen
(DashScope), Moonshot, MiniMax, and any OpenAI-compatible endpoint** by provider
prefix — set the keys for whichever providers your routes use.

## See it save money (no API key needed)

```bash
python examples/demo_savings.py
```

Routes a realistic mix of prompts (drafts, a tool call, hard reasoning) through
a fake provider and prints which model each went to and the running savings vs
always-premium. Then:

```bash
routecut savings     # total spent vs baseline, savings %, by class + provider
routecut calls       # recent routed calls with cost + $ saved
```

## How it routes

1. **Classify** the prompt: `draft | tool_use | reasoning | code_explore |
   code_plan | code_execute`, each with a **confidence** (transparent heuristics;
   no model in the hot path).
2. **Decide** via `routes.toml`: pick the cheapest model the policy allows for
   that class. If confidence is below `min_confidence`, **escalate** one tier up
   (bias errors toward "spent a bit more", never "worse answer").
3. **Call** the chosen model. On an error, or when a cheap-to-compute check
   flags a poor answer (empty output, bare refusal), **fall back** down the
   route's chain.
4. **Account**: `saved_usd = baseline_cost - actual_cost`, using the same token
   counts and pricing table for both. Reported honestly: it can be zero or
   negative if a call was escalated to or above the baseline.
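
The classify and fall-back steps above can be sketched in a few lines. This is a minimal illustration, not `routecut`'s actual implementation: the keyword rules, the refusal strings, and the names `classify`, `looks_bad`, and `call_with_fallback` are all assumptions made for the example.

```python
import re
from typing import Callable

# Hypothetical keyword heuristics; the real classifier's rules are not shown here.
_RULES: list[tuple[str, re.Pattern[str]]] = [
    ("reasoning", re.compile(r"\b(prove|derive|step by step|why)\b", re.I)),
    ("tool_use", re.compile(r"\b(call|invoke|function|json)\b", re.I)),
    ("draft", re.compile(r"\b(translate|rewrite|summarize|draft|format)\b", re.I)),
]


def classify(prompt: str) -> tuple[str, float]:
    """Return (task_class, confidence) based on keyword hits."""
    hits = {name: len(pattern.findall(prompt)) for name, pattern in _RULES}
    total = sum(hits.values())
    if total == 0:
        # Nothing matched: assume the hardest class with zero confidence,
        # so the caller escalates instead of under-serving the prompt.
        return "reasoning", 0.0
    best = max(hits, key=hits.get)
    return best, hits[best] / total


def looks_bad(text: str) -> bool:
    """Cheap quality signal: empty output or a bare refusal (assumed strings)."""
    stripped = text.strip().lower()
    return not stripped or stripped in {
        "i can't help with that.",
        "i cannot help with that.",
    }


def call_with_fallback(chain: list[str], call: Callable[[str], str]) -> tuple[str, str]:
    """Try each model in order; move on when a call errors or the output looks bad."""
    last_error = None
    for model in chain:
        try:
            text = call(model)
        except Exception as exc:  # provider error: try the next model
            last_error = exc
            continue
        if not looks_bad(text):
            return model, text
    raise RuntimeError(f"all models in chain failed: {chain}") from last_error
```

Returning zero confidence when no rule matches biases the error toward "spent a bit more", which is the same trade-off the routing policy makes.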

## Configure (data, not code)

`routes.toml` — the routing policy. Edit freely; no release needed:

```toml
baseline_model = "claude-opus-4.1"   # what savings are measured against

[policy]
min_confidence = 0.6                  # below this, escalate one tier up

[route.draft]
model = "qwen-turbo"
fallback = "gpt-4o-mini"

[route.reasoning]
model = "claude-sonnet"
escalate_to = "claude-opus-4.1"       # used when classifier confidence is low
```

`pricing.toml` holds per-model prices in $ per 1M tokens (the file location can
also be set with `ROUTECUT_PRICING=/path`). Unknown models fall back to a high
default price so savings never look better than reality.
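
The accounting itself is simple arithmetic. The sketch below assumes separate input and output prices per model and uses made-up model names and prices; `PRICING`, `UNKNOWN_PRICE`, `cost_usd`, and `saved_usd` are illustrative names, not the library's API.

```python
# Hypothetical prices in USD per 1M tokens as (input, output). Not real price data.
PRICING: dict[str, tuple[float, float]] = {
    "cheap-model": (0.10, 0.40),
    "premium-model": (10.00, 30.00),
}
# Deliberately high default for models missing from the table (assumed value).
UNKNOWN_PRICE: tuple[float, float] = (100.00, 100.00)


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, priced from the table or the high default."""
    in_price, out_price = PRICING.get(model, UNKNOWN_PRICE)
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def saved_usd(baseline: str, actual: str, input_tokens: int, output_tokens: int) -> float:
    """Savings versus the baseline, using the same token counts for both."""
    return cost_usd(baseline, input_tokens, output_tokens) - cost_usd(
        actual, input_tokens, output_tokens
    )


print(f"{saved_usd('premium-model', 'cheap-model', 1000, 500):.4f}")    # 0.0247
print(f"{saved_usd('premium-model', 'mystery-model', 1000, 500):.4f}")  # -0.1250
```

In the second call the chosen model is unpriced, so it is charged at the high default and the reported savings turn negative rather than flattering.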

## CLI

| Command | What it does |
|---|---|
| `routecut savings` | total spent vs baseline, savings %, by class + provider |
| `routecut calls` | recent routed calls (class, model, cost, $ saved) |
| `routecut policy` | print routes + pricing; validate every routed model is priced |
| `routecut doctor` | check provider keys present + pricing coverage |
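
As a rough idea of what the pricing-coverage check in `routecut policy` involves, the sketch below collects every model named in a parsed `routes.toml` and reports those missing from the pricing table. The function name `unpriced_models` and the exact set of keys inspected are assumptions for illustration.

```python
def unpriced_models(routes: dict, pricing: dict) -> list[str]:
    """Return every routed model that has no entry in the pricing table."""
    routed = {routes["baseline_model"]}
    for route in routes.get("route", {}).values():
        # Assumed keys, taken from the example routes.toml in this README.
        for key in ("model", "fallback", "escalate_to"):
            if key in route:
                routed.add(route[key])
    return sorted(routed - set(pricing))


routes = {
    "baseline_model": "claude-opus-4.1",
    "route": {
        "draft": {"model": "qwen-turbo", "fallback": "gpt-4o-mini"},
        "reasoning": {"model": "claude-sonnet", "escalate_to": "claude-opus-4.1"},
    },
}
# Hypothetical pricing table with one routed model left out on purpose.
pricing = {"claude-opus-4.1": {}, "claude-sonnet": {}, "qwen-turbo": {}}

print(unpriced_models(routes, pricing))  # ['gpt-4o-mini']
```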

## Honest limits (status: MVP)

- **Classification is heuristic v1.** It emits a confidence and escalates when
  unsure, but it can mis-route. The fallback/escalation path is the safety net;
  tune `routes.toml` and `min_confidence` to your workload. A tiny-model
  classifier for ambiguous cases is planned, not shipped.
- **"Sufficient" is not yet a measured quality score.** v1 optimizes cost under
  a policy *you* control and catches gross failures (empty/refusal) for
  fallback. A real quality eval is a separate, later scope — we don't overclaim
  it.
- **Not built yet:** the LangGraph `ConditionalEdge` preset, the coding-agent
  gateway preset, and the HTTP proxy for non-Python stacks. The SDK ships first.
  See `../pain-radar/specs/05-llm-cost-router/spec.md`.

## License

MIT.
