Metadata-Version: 2.4
Name: gemma4-adaptive-router
Version: 0.1.0
Summary: Complexity + VRAM-aware routing for local dual-tier LLM deployments
Project-URL: Homepage, https://github.com/angelnicolasc/Stratum
Project-URL: Repository, https://github.com/angelnicolasc/Stratum
Project-URL: Issues, https://github.com/angelnicolasc/Stratum/issues
License: Apache-2.0
Keywords: gemma,inference,llm,on-prem,routing,vllm,vram
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: fastapi>=0.110.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: litellm>=1.40.0
Requires-Dist: nvidia-ml-py>=12.535.108
Requires-Dist: prometheus-client>=0.20.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: uvicorn[standard]>=0.29.0
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == 'dev'
Requires-Dist: mypy>=1.9; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# gemma4-adaptive-router

> Complexity + VRAM-aware routing for local dual-tier LLM deployments.  
> The only public implementation of sub-millisecond complexity scoring coupled to real-time VRAM scheduling for consumer GPU on-prem inference.

[![PyPI](https://img.shields.io/pypi/v/gemma4-adaptive-router)](https://pypi.org/project/gemma4-adaptive-router/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![Apache-2.0](https://img.shields.io/badge/license-Apache--2.0-green)](../LICENSE)

## Install

```bash
pip install gemma4-adaptive-router
```

## Quickstart

```python
from adaptive_router import AdaptiveRouter, RoutingConfig
import time

config = RoutingConfig(
    complexity_threshold=0.65,   # score >= this → tier_high
    vram_headroom_gb=1.5,        # free VRAM below this → force tier_low
    latency_sla_ms=2000.0,       # EMA latency above this → force tier_low
    sla_warmup_seed_ms=800.0,    # seed EMA to avoid cold-start burst on tier_high
)

router = AdaptiveRouter(config)

# Route a query
tier = router.route("Explain the proof of Fermat's Last Theorem step by step")
# → "tier_high" (complex math query, VRAM available)

# After you get the response, report the actual latency so the SLA rule adapts
t0 = time.monotonic()
# ... call your LLM endpoint ...
router.observe(tier, latency_ms=(time.monotonic() - t0) * 1000)

# Clean up background VRAM monitor thread
router.shutdown()
```

## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                      adaptive_router/                       │
│                                                             │
│  Layer 1: ComplexityScorer                                  │
│  ├── Rule-based, 6 dimensions, sub-ms, zero external calls  │
│  ├── math (0.25) · code (0.25) · depth (0.20)              │
│  ├── tokens (0.15) · entities (0.10) · negation (0.05)     │
│  └── Output: score in [0.0, 1.0]                           │
│                                                             │
│  Layer 2: VRAMMonitor (daemon thread)                       │
│  ├── pynvml direct — no subprocess nvidia-smi overhead     │
│  ├── Polling at 10-20ms with atomic shared state           │
│  └── Thread-safe VRAMState (free_gb, used_gb, util_pct)    │
│                                                             │
│  Layer 3: RoutingDecision (chain of rules)                  │
│  ├── complexity_rule: score < threshold → tier_low          │
│  ├── vram_rule: free_gb < headroom → tier_low               │
│  ├── sla_rule: EMA latency > SLA → tier_low                 │
│  └── Default fallthrough: tier_high                         │
└─────────────────────────────────────────────────────────────┘
```

## Configuration reference

| Field | Type | Default | Effect |
|---|---|---|---|
| `complexity_threshold` | `float` | `0.65` | Score ≥ this routes to tier_high |
| `vram_headroom_gb` | `float` | `1.5` | Free VRAM below this forces tier_low |
| `latency_sla_ms` | `float` | `2000.0` | EMA latency above this forces tier_low |
| `vram_poll_interval_ms` | `int` | `15` | VRAM polling frequency |
| `sla_warmup_seed_ms` | `float` | `0.0` | EMA seed at startup — set to p50 of tier_high to avoid cold-start burst |

Load from YAML:

```python
from adaptive_router.config import load_config

config = load_config("router_config.yaml")
```

```yaml
# router_config.yaml
complexity_threshold: 0.65
vram_headroom_gb: 1.5
latency_sla_ms: 2000.0
vram_poll_interval_ms: 15
sla_warmup_seed_ms: 800.0
```

## Deploy as FastAPI proxy

```bash
python -m adaptive_router.middleware \
    --tier-high-url http://llama-cpp:8080/v1 \
    --tier-low-url  http://vllm:8000/v1 \
    --port 9000
```

Exposes `/v1/chat/completions` (proxied), `/health`, and `/metrics` (Prometheus).

## Custom routing rules

```python
from adaptive_router import AdaptiveRouter, RoutingConfig, RouterState

def my_rule(query: str, state: RouterState):
    # Force tier_high for any query mentioning "contract"
    if "contract" in query.lower():
        return "tier_high"
    return None  # abstain, let next rule decide

router = AdaptiveRouter(config, rules=[my_rule])
```

## Known limitations

1. **16GB does not fit 26B in vLLM natively.** The router mitigates this by routing complex queries to llama.cpp. The real fix for 200 concurrent users wanting 26B is dual-GPU.
2. **Blackwell SM120 rough edges.** FP8 KV and MTP require specific workarounds. See [VRAM-REALITY-CHECK.md](../docs/VRAM-REALITY-CHECK.md).
3. **EXL2 is not designed for multi-user production.** TabbyAPI maintainers state this explicitly. It's in benchmarks for completeness only.
4. **SGLang does not always win.** RadixAttention benefits depend on prefix overlap. See [RADIXATTENTION-OPERATIVE-TABLE.md](../docs/RADIXATTENTION-OPERATIVE-TABLE.md).
5. **MTP in SM120 is not plug-and-play.** With BF16 + workarounds it works; with NVFP4 it produces garbage.
6. **Cold-start burst on tier_high.** With `sla_warmup_seed_ms=0.0`, the EMA starts at 0ms — below any SLA. The first ~15-20 complex queries all hit tier_high before the EMA reflects real latency. Set `sla_warmup_seed_ms` to your expected p50.
7. **This is v0.1.** Functional and tested, but not battle-tested at thousands of users. It's the starting point.

## Running tests

```bash
pip install -e ".[dev]"
pytest tests/ -v  # No GPU required — pynvml is mocked in all fixtures
```
