Metadata-Version: 2.4
Name: npc-mom-router
Version: 0.1.0
Summary: Plug-and-play Mixture-of-Models router. Route easy requests to cheap models, hard ones to specialists, track real cost savings.
Project-URL: Homepage, https://github.com/ramankrishna/npc-mom-router
Project-URL: Repository, https://github.com/ramankrishna/npc-mom-router
Project-URL: Issues, https://github.com/ramankrishna/npc-mom-router/issues
Author-email: Rama Krishna Bachu <ram@bottensor.xyz>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: anthropic,inference,llm,mixture-of-models,openai,routing,vllm
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: anthropic>=0.40
Requires-Dist: httpx>=0.27
Requires-Dist: openai>=1.40
Requires-Dist: pydantic>=2.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: respx>=0.21; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Description-Content-Type: text/markdown

# npc-mom-router

[![PyPI](https://img.shields.io/pypi/v/npc-mom-router)](https://pypi.org/project/npc-mom-router/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)

**Plug-and-play Mixture-of-Models router.** Route easy requests to cheap models, hard ones to specialists, and track real cost savings.

## Why Mixture-of-Models routing?

Most LLM workloads are not uniformly hard. Simple lookups, format conversions, and short factual questions can be answered accurately by a small, fast model at a fraction of the cost of a frontier model. Routing requests intelligently based on complexity lets you serve the same quality of answers at significantly lower cost—without changing your application's interface.

npc-mom-router sits between your application and your model backends. It classifies each incoming request as `fast` or `heavy`, dispatches to the appropriate backend, and records the token usage and dollar cost of every call. The ledger computes how much you saved compared to always routing to the heavy backend, so you can quantify the benefit in real dollars.

## Install

```bash
pip install npc-mom-router
```

## 30-second quickstart

```python
from npc_mom_router import MoMClient, BackendConfig, ZeroShotRouter

router = ZeroShotRouter(
    base_url="https://api.groq.com/openai/v1",
    api_key="YOUR_GROQ_KEY",
    model="llama-3.1-8b-instant",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="oai_compat",
            base_url="https://api.groq.com/openai/v1",
            api_key="YOUR_GROQ_KEY",
            model="llama-3.3-70b-versatile",
            cost_per_1m_input=0.59,
            cost_per_1m_output=0.79,
        ),
        "heavy": BackendConfig(
            kind="anthropic",
            api_key="YOUR_ANTHROPIC_KEY",
            model="claude-sonnet-4-5",
            cost_per_1m_input=3.0,
            cost_per_1m_output=15.0,
        ),
    },
)

resp = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(f"Route: {resp._mom.route} ({resp._mom.reason})")
print(f"Answer: {resp.choices[0].message.content}")
print(f"Cost: ${resp._mom.cost_usd:.6f}")
```

## NPC Fast router (local vLLM)

Run a tiny routing model locally for sub-10 ms classification with no per-token API cost:

```python
from npc_mom_router import MoMClient, BackendConfig, NPCFastRouter

router = NPCFastRouter(
    base_url="http://localhost:8001/v1",
    model="npc-fast-1.7b",
)

client = MoMClient(
    router=router,
    backends={
        "fast": BackendConfig(
            kind="vllm",
            base_url="http://localhost:8000/v1",
            api_key="placeholder",  # vLLM ignores the key unless the server sets --api-key
            model="Qwen/Qwen2.5-7B-Instruct",
            cost_per_1m_input=0.05,
            cost_per_1m_output=0.10,
        ),
        "heavy": BackendConfig(
            kind="openai",
            api_key="YOUR_OPENAI_KEY",
            model="gpt-4o",
            cost_per_1m_input=2.50,
            cost_per_1m_output=10.00,
        ),
    },
)

result = client.route_and_complete(
    [{"role": "user", "content": "Explain the transformer architecture in depth."}]
)
print(result.decision.route, result.cost_entry.usd)
```

## Async client

```python
import asyncio
from npc_mom_router import AsyncMoMClient, BackendConfig, ZeroShotRouter

# Build `router` and the `backends` dict exactly as in the quickstart,
# then swap in the async client:
client = AsyncMoMClient(router=router, backends=backends)

async def main():
    resp = await client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": "List the G7 countries."}],
    )
    print(resp._mom.route, resp._mom.cost_usd)

asyncio.run(main())
```

## Cost tracking

Every request is logged to an in-memory ledger. The ledger re-prices fast-routed requests as if they had hit the heavy backend to compute counterfactual savings.

```python
s = client.ledger.summary()
# {
#   "total_requests": 100,
#   "fast_requests": 73,
#   "heavy_requests": 27,
#   "total_cost_usd": 0.0412,
#   "savings_vs_always_heavy_usd": 0.3891
# }

client.ledger.dump("ledger.json")  # writes full per-request JSON
```
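
The counterfactual is plain arithmetic: re-price a fast-routed request's token counts at the heavy backend's rates and accumulate the difference. A minimal sketch of the idea (the actual ledger internals may differ; the rates reuse the quickstart's prices):

```python
# Illustrative only: counterfactual savings from per-request token
# counts and the two backends' prices.
FAST_IN, FAST_OUT = 0.59, 0.79      # fast backend, USD per 1M tokens
HEAVY_IN, HEAVY_OUT = 3.00, 15.00   # heavy backend, USD per 1M tokens

def price(tokens_in: int, tokens_out: int, per_1m_in: float, per_1m_out: float) -> float:
    return (tokens_in * per_1m_in + tokens_out * per_1m_out) / 1e6

# A fast-routed request that used 1,200 input / 300 output tokens:
actual = price(1200, 300, FAST_IN, FAST_OUT)            # $0.000945
counterfactual = price(1200, 300, HEAVY_IN, HEAVY_OUT)  # $0.008100
savings = counterfactual - actual                       # $0.007155
```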

## Backend reference

| `kind`        | Description                          | Default base URL                     |
|---------------|--------------------------------------|--------------------------------------|
| `oai_compat`  | Any OpenAI-compatible API            | Required                             |
| `openai`      | OpenAI (api.openai.com)              | `https://api.openai.com/v1`          |
| `anthropic`   | Anthropic (native SDK)               | `https://api.anthropic.com`          |
| `groq`        | Groq (OAI-compat)                    | `https://api.groq.com/openai/v1`     |
| `vllm`        | Local vLLM server (OAI-compat)       | `http://localhost:8000/v1`           |

Each `BackendConfig` takes `cost_per_1m_input` and `cost_per_1m_output` (USD) for cost tracking.
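
For kinds with a default base URL in the table, `base_url` can presumably be omitted, as the `openai` and `anthropic` examples above already do. A sketch for a `groq` backend under that assumption:

```python
from npc_mom_router import BackendConfig

# Assumes the table's default base URL is applied when base_url is
# omitted, as the openai/anthropic examples above suggest.
groq_fast = BackendConfig(
    kind="groq",  # defaults to https://api.groq.com/openai/v1
    api_key="YOUR_GROQ_KEY",
    model="llama-3.3-70b-versatile",
    cost_per_1m_input=0.59,
    cost_per_1m_output=0.79,
)
```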

## Router reference

| Router           | How it works                                         |
|------------------|------------------------------------------------------|
| `ZeroShotRouter` | Prompts any OAI-compat model; parses JSON response  |
| `NPCFastRouter`  | Calls a local vLLM endpoint; sub-10ms routing       |

Both routers return `RoutingDecision(route="fast"|"heavy", reason="...")`. On any failure or malformed response, they fall back to `heavy` to preserve correctness.

**Custom routers:** implement `route(messages) -> RoutingDecision` and `async_route(messages) -> RoutingDecision`.
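
A minimal sketch of a custom router, assuming `RoutingDecision` is importable from the package root; the length heuristic and threshold are purely illustrative:

```python
from npc_mom_router import RoutingDecision

class LengthRouter:
    """Hypothetical heuristic: short prompts go fast, long ones go heavy."""

    def __init__(self, max_chars: int = 200):
        self.max_chars = max_chars

    def route(self, messages) -> RoutingDecision:
        # Sum the character length of all message contents.
        total = sum(len(m.get("content", "")) for m in messages)
        route = "fast" if total <= self.max_chars else "heavy"
        return RoutingDecision(route=route, reason=f"{total} chars vs {self.max_chars} threshold")

    async def async_route(self, messages) -> RoutingDecision:
        # No I/O involved, so the sync path is reused.
        return self.route(messages)
```

Pass it anywhere a built-in router is accepted, e.g. `MoMClient(router=LengthRouter(), backends=...)`.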

## Roadmap

Planned for v0.2: streaming responses, per-model latency tracking, a pluggable cost-model registry, and a simple CLI dashboard. Pull requests welcome.

## License

Apache 2.0 — see [LICENSE](LICENSE).
Copyright 2026 Rama Krishna Bachu.
