Metadata-Version: 2.4
Name: thinkrouter
Version: 0.3.0
Summary: Cut LLM reasoning-token costs by 60% with one line of code.
License: MIT
Project-URL: Homepage, https://github.com/saikoushiknalubola/thinkrouter
Project-URL: Repository, https://github.com/saikoushiknalubola/thinkrouter
Project-URL: Issues, https://github.com/saikoushiknalubola/thinkrouter/issues
Keywords: llm,openai,anthropic,inference,optimization,routing,reasoning,cost-reduction,o1,deepseek,claude,chain-of-thought,token-optimization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.25.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: tenacity>=8.2.0
Requires-Dist: typing-extensions>=4.5.0
Provides-Extra: classifier
Requires-Dist: transformers>=4.40.0; extra == "classifier"
Requires-Dist: torch>=2.0.0; extra == "classifier"
Provides-Extra: openai
Requires-Dist: openai>=1.30.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.28.0; extra == "anthropic"
Provides-Extra: server
Requires-Dist: fastapi>=0.110.0; extra == "server"
Requires-Dist: uvicorn[standard]>=0.29.0; extra == "server"
Requires-Dist: python-dotenv>=1.0.0; extra == "server"
Requires-Dist: slowapi>=0.1.9; extra == "server"
Provides-Extra: all
Requires-Dist: thinkrouter[anthropic,classifier,openai,server]; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: mypy>=1.9.0; extra == "dev"
Dynamic: license-file

# ThinkRouter

<div align="center">

[![CI](https://github.com/saikoushiknalubola/thinkrouter/actions/workflows/ci.yml/badge.svg)](https://github.com/saikoushiknalubola/thinkrouter/actions/workflows/ci.yml)
[![PyPI version](https://badge.fury.io/py/thinkrouter.svg)](https://badge.fury.io/py/thinkrouter)
[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1D7lZVyRauv3oeQU7QRSilMcwBGqunG79?usp=sharing)

**Pre-inference query routing for LLM reasoning models.**  
Cut thinking-token costs by 60% with one line of code.

</div>

---

## The problem

Reasoning models (o1, DeepSeek-R1, Claude thinking) apply the same 8,000-token compute budget to every query — whether it is simple arithmetic or a complex proof.

```
"What is 2 + 3?"                   →  8,000 thinking tokens   ← 99% wasted
"Prove that sqrt(2) is irrational"  →  8,000 thinking tokens   ← correctly used
```

At 100,000 queries per day, that is **$192,635/month in avoidable spend.**

---

## The solution

```python
from thinkrouter import ThinkRouter

client   = ThinkRouter(provider="openai")
response = client.chat("What is the capital of France?")
# Routed to NO_THINK → 50 tokens used, not 8,000

client.usage.print_dashboard()
```

```
  ThinkRouter — Usage Dashboard
  ──────────────────────────────────────────────
  Total calls          : 13
  Tokens saved         : 55,650
  Compute savings      : 53.5%
  Avg classifier time  : 0.02 ms

  Routing breakdown:
    no_think        :      7  (53.8%)  — Direct answer
    short_think     :      0  ( 0.0%)  — Moderate reasoning
    full_think      :      6  (46.2%)  — Full extended reasoning
```

---

## How it works

ThinkRouter intercepts each query, runs a lightweight classifier in under 1ms, and routes to the minimum compute budget:

| Tier | Budget | Use case |
|------|--------|----------|
| NO_THINK | 50 tokens | Arithmetic, definitions, lookups, translations |
| SHORT | 800 tokens | Multi-step reasoning, moderate chaining |
| FULL | 8,000 tokens | Proofs, system design, algorithm implementation |

---

## Installation

```bash
# Base install — works immediately, zero ML dependencies
pip install thinkrouter

# With fine-tuned DistilBERT classifier (higher accuracy)
pip install thinkrouter[classifier]

# With OpenAI client
pip install thinkrouter[openai]

# With Anthropic client
pip install thinkrouter[anthropic]

# Everything
pip install thinkrouter[all]
```

---

## Quick start

### Try it now — no API key needed

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1D7lZVyRauv3oeQU7QRSilMcwBGqunG79?usp=sharing)

### OpenAI

```python
from thinkrouter import ThinkRouter

client = ThinkRouter(
    provider="openai",
    api_key="sk-...",      # or set OPENAI_API_KEY
    model="gpt-4o",
    verbose=True,
)

response = client.chat("Explain how merge sort works.")
print(response.content)
print(response.routing)
# ClassifierResult(tier=FULL, confidence=0.87, budget=8000 tokens, latency=1.2ms)

client.usage.print_dashboard()
```

### Anthropic

```python
client = ThinkRouter(
    provider="anthropic",
    api_key="sk-ant-...",  # or set ANTHROPIC_API_KEY
    model="claude-haiku-4-5-20251001",
)

response = client.chat("What is 144 divided by 12?")
# Routed to NO_THINK → 50 tokens, not 8,000
```

### Streaming

```python
for chunk in client.stream("Explain quantum entanglement step by step."):
    print(chunk, end="", flush=True)
```

### Classify without an API call

```python
results = client.classify_batch([
    "What is 7 * 8?",
    "Design a distributed caching system.",
    "How many days are in a leap year?",
])

for r in results:
    print(f"{r.tier.name:<12}  budget={r.token_budget:>6} tokens  conf={r.confidence:.2f}")
```

```
NO_THINK      budget=    50 tokens  conf=0.88
FULL          budget=  8000 tokens  conf=0.85
NO_THINK      budget=    50 tokens  conf=0.80
```

---

## Cost savings at scale

| Volume | Savings/day | Savings/month |
|--------|------------|---------------|
| 10,000 queries/day | $642 | $19,263 |
| 100,000 queries/day | $6,421 | $192,635 |
| 1,000,000 queries/day | $64,212 | $1,926,346 |

*Based on 53.5% savings rate, $15/million reasoning tokens (approximate o1 rate).*

---

## Classifier backends

### Heuristic (default)

Zero dependencies. Regex patterns and word-count heuristics. Runs in under 1ms.

```python
client = ThinkRouter(classifier_backend="heuristic")
```

### DistilBERT

Fine-tuned on GSM8K. Achieves 93%+ quality retention at 60% compute savings.  
Requires `pip install thinkrouter[classifier]`.

```python
client = ThinkRouter(
    classifier_backend="distilbert",
    confidence_threshold=0.75,
)
```

---

## Confidence threshold

| Threshold | Savings | Quality retained | Use case |
|-----------|---------|-----------------|----------|
| 0.65 | ~59% | ~91% | High cost sensitivity |
| **0.75** | **~55%** | **~93%** | **Recommended** |
| 0.85 | ~44% | ~96% | Quality-sensitive |

Queries below the threshold fall back to FULL — never degrades output quality.

---

## API reference

### ThinkRouter

```python
ThinkRouter(
    provider             = "openai",      # "openai" | "anthropic" | "generic"
    api_key              = None,          # falls back to OPENAI_API_KEY / ANTHROPIC_API_KEY
    model                = None,          # default model for all calls
    classifier_backend   = "heuristic",   # "heuristic" | "distilbert"
    confidence_threshold = 0.75,
    max_records          = 10_000,
    verbose              = False,
)
```

### RouterResponse

```python
response.content       # str — generated text
response.routing       # ClassifierResult
response.provider      # "openai" | "anthropic"
response.model         # model identifier
response.usage_tokens  # {"prompt_tokens": N, "completion_tokens": M, ...}
```

### ClassifierResult

```python
result.tier          # Tier.NO_THINK | Tier.SHORT | Tier.FULL
result.confidence    # float in [0, 1]
result.token_budget  # int — thinking tokens assigned
result.latency_ms    # classifier wall-clock time in ms
result.backend       # "heuristic" | "distilbert:cuda" | "distilbert:cpu"
```

---

## Running tests

```bash
git clone https://github.com/saikoushiknalubola/thinkrouter.git
cd thinkrouter
pip install -e ".[dev]"
pytest tests/ -v
```

---

## Roadmap

- [x] Heuristic classifier
- [x] OpenAI and Anthropic adapters
- [x] Streaming support
- [x] Thread-safe usage dashboard
- [x] GitHub Actions CI (Python 3.9–3.12)
- [ ] DistilBERT model on HuggingFace Hub
- [ ] Multi-domain training (MMLU, HumanEval, ARC-Challenge)
- [ ] Async support (`achat()`, `astream()`)
- [ ] Continuous budget regression
- [ ] Hosted API proxy (api.thinkrouter.ai)

---

## Research basis

- Zhao et al. (2025). *SelfBudgeter*. arXiv:2505.11274 — 74.47% savings validated
- Wang et al. (2025). *TALE-EP*. ACL Findings 2025 — 67% output token reduction
- Sanh et al. (2019). *DistilBERT*. arXiv:1910.01108
- Cobbe et al. (2021). *GSM8K*. arXiv:2110.14168

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Issues and pull requests welcome.

---

## License

MIT — see [LICENSE](LICENSE).
