Metadata-Version: 2.4
Name: zen-router
Version: 0.2.0
Summary: Adaptive test-time-compute routing for LLM reasoning: cheap samples first, escalate to native thinking only on disagreement.
Author: Victor Augusto
License: MIT
Project-URL: Homepage, https://github.com/Victor-Alves0/ZEN
Project-URL: Repository, https://github.com/Victor-Alves0/ZEN
Keywords: llm,reasoning,test-time-compute,routing,efficiency,self-consistency
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.28
Provides-Extra: experiments
Requires-Dist: datasets>=2.19; extra == "experiments"
Requires-Dist: numpy; extra == "experiments"
Requires-Dist: torch; extra == "experiments"
Dynamic: license-file

# ZEN — Sample · Agree · Escalate

[![PyPI](https://img.shields.io/pypi/v/zen-router)](https://pypi.org/project/zen-router/)
[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://pypi.org/project/zen-router/)
[![Tests](https://img.shields.io/badge/tests-31%20passing%20offline-brightgreen)](tests/test_offline.py)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)

**Adaptive test-time compute for LLMs: stop paying for thinking your model doesn't need.**

ZEN routes every request through the cheapest path that can solve it. It draws
two cheap (non-thinking) samples first and only escalates to expensive native
thinking when they disagree — using **disagreement as a free difficulty signal**.
On hard benchmarks it matches or beats native-thinking accuracy at lower token
cost; on easy traffic it answers for a fraction of the price.

```bash
pip install zen-router
```

```python
from zen import ZenGateway

gw = ZenGateway()                       # any OpenAI-compatible endpoint
resp = gw.route("What is 15% of 240?")  # classifies, routes, answers

print(resp.text)     # "36"
print(resp.path)     # "consensus"  (solved cheaply — no thinking spent)
print(resp.tokens)   # ~220
```

No training. No GPU. No logprobs required. Provider-agnostic.

## Why

Thinking/reasoning modes are powerful and expensive, and they bill you for
every hidden reasoning token. A trivial question costs about **11×** more with
thinking enabled (we measured 6 vs 66 completion tokens for `24*17`). Chat apps
today either burn that on every message or make the user toggle thinking by
hand. ZEN makes the decision per request, automatically, with an auditable
token account for every call.

## How it works

```
request ──> 2 cheap samples (parallel, thinking off)
                │
       agree? ──┴── yes ──> answer                     (~4k tokens on AIME)
                │
                no   (= this one is actually hard)
                │
                ▼
       3rd cheap sample + 1 native-thinking sample     (parallel)
                │
                ▼
       weighted vote {cheap ×1 each, native ×2} ──> answer   (~19k tokens)
```
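
The diagram can be read as the following minimal sketch. It is not the package's implementation: `cascade`, the injected `cheap` and `native` callables, the `"escalated"` path label, and the tie-break (first-seen answer wins) are all assumptions made for illustration. Only the two-stage shape, the `"consensus"` label, and the ×1 / ×2 weights come from the text above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def cascade(question: str,
            cheap: Callable[[str], str],
            native: Callable[[str], str],
            native_weight: int = 2) -> Tuple[str, str]:
    """Return (answer, path). `cheap` and `native` are hypothetical callables
    that each return one already-parsed answer for the question."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Stage 1: two cheap samples in parallel, thinking off.
        a, b = pool.map(cheap, [question, question])
        if a == b:
            return a, "consensus"
        # Stage 2: disagreement, so add a third cheap sample and one
        # native-thinking sample, again in parallel.
        third = pool.submit(cheap, question)
        deep = pool.submit(native, question)
        votes = Counter([a, b, third.result()])   # cheap samples weigh 1 each
        votes[deep.result()] += native_weight     # native sample weighs more
    # Ties fall to the answer seen first (an assumption of this sketch).
    answer, _ = votes.most_common(1)[0]
    return answer, "escalated"
```

Passing the two model calls in as plain callables keeps the sketch provider-agnostic and lets it run against stubs with no API key.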

## Benchmarks

DeepSeek-V4-Flash via OpenRouter, single-sample protocol, tokens counted across
**all** calls each method makes. N=30 per benchmark, so treat differences within
±9pp as noise.
Full tables and methodology: [docs/RESULTS.md](docs/RESULTS.md).
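
The README does not say how the ±9pp figure was derived. One plausible reading, offered here as an assumption, is a single binomial standard error at N=30, which a few lines of arithmetic reproduce:

```python
import math


def standard_error_pp(accuracy: float, n: int) -> float:
    """One binomial standard error of an accuracy estimate, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Accuracies taken from the tables in this section, all measured at n = 30.
for acc in (0.40, 0.567, 0.767):
    print(f"accuracy={acc:.1%}  n=30  +/- {standard_error_pp(acc, 30):.1f} pp")
```

At N=30 the error bar is widest near 50% accuracy, so a gap of a few points between two methods in these tables should not be read as a ranking.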

**AIME 2025** (hard for the model — native thinking spends ~15k tokens/problem):

| method | accuracy | mean tokens |
|---|---|---|
| single cheap call | 40.0% | 2.0k |
| self-consistency@10 | 43.3% | 18.2k |
| best-of-5 + LLM judge | 50.0% | 13.9k |
| native thinking (1 call) | 56.7% | 14.6k |
| **ZEN (vote)** | **56.7–66.7%** (2 runs) | **12.8k** |

**AIME 2024** (easy for the model — native thinking self-regulates to ~9k):

| method | accuracy | mean tokens |
|---|---|---|
| native thinking (1 call) | 76.7% | 9.2k |
| **ZEN (vote)** | 73.3% | 9.8k |

Honest reading: ZEN wins when the task challenges the model (cuts waste, adds
vote robustness). When the task is easy for the model, it is a statistical tie
with slight overhead, because modern thinking modes already self-regulate. Rule
of thumb from the data: ZEN pays off once roughly **35–55%** or more of your
traffic is cheaply solvable.
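
To estimate where that threshold sits for your own traffic, a two-path cost model is a reasonable starting point. The sketch below is a simplification: the function names are hypothetical, and the path costs are the rounded ~4k and ~19k figures from the diagram, not exact measurements.

```python
def mean_tokens(p_consensus: float, cheap_path: float, escalated_path: float) -> float:
    """Expected tokens per request when a fraction p_consensus stops at consensus."""
    return p_consensus * cheap_path + (1.0 - p_consensus) * escalated_path


def break_even_fraction(native: float, cheap_path: float, escalated_path: float) -> float:
    """Consensus fraction at which the cascade costs the same as always thinking."""
    return (escalated_path - native) / (escalated_path - cheap_path)


CHEAP, ESCALATED = 4_000, 19_000   # rounded path costs from the diagram (AIME)

# A consensus rate near 0.41 lands close to the 12.8k mean in the AIME 2025 table.
print(mean_tokens(0.41, CHEAP, ESCALATED))

# The break-even moves with the native baseline; substitute your own measurement.
print(break_even_fraction(14_600, CHEAP, ESCALATED))
```

The result is sensitive to all three inputs, so measure the two path costs and the native baseline on your own workload before relying on it.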

## The three layers

| layer | what it does | use it for |
|---|---|---|
| `ZenGateway` | classifies each message (chat / question / task) and dispatches the right amount of compute | chat apps, AI workspaces — an automatic thinking mode |
| `ZenRouter` | cheap consensus → thinking escalation, for verifiable answers | math, MCQ, facts, extraction, code-with-tests |
| `ZenPlanner` | plan-and-execute: decompose, run steps with threaded context, synthesize | multi-step tasks, agent pipelines |

### ZenGateway — the automatic thinking mode

```python
from zen import ZenGateway

gw = ZenGateway()
gw.route("hey, what do you think about coffee?")   # chat  -> 1 cheap call
gw.route("Which planet has the most moons?")       # question -> consensus route
gw.route("Compare SQLite and PostgreSQL and recommend one.")  # task -> planner
gw.route("What is 24*17?", kind="question")        # or force the kind yourself
```

**Tool-safety rule (built in):** pass `tools_present=True` on turns where the
model may call tools. The gateway then makes exactly **one** routed call — it
never samples in parallel around side-effectful tools (two samples would run
your tools twice). Your agent loop handles the tool cycle around it.
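
The reasoning behind the rule can be sketched as a guard in front of the sampling step. This is an illustration only: `sample_once_if_tools` and `call_model` are hypothetical names, and the gateway's real dispatch logic is not shown in this README.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def sample_once_if_tools(message: str,
                         call_model: Callable[[str], str],
                         tools_present: bool,
                         n: int = 2) -> List[str]:
    """Return model outputs: one when tools may fire, n parallel samples otherwise."""
    if tools_present:
        # A tool call may have side effects, so it must happen at most once.
        return [call_model(message)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(call_model, [message] * n))
```

With `tools_present=True` the model is invoked exactly once, so a tool such as "send email" cannot be triggered by two competing samples.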

### ZenRouter — verifiable Q→A

```python
from zen import ZenRouter

router = ZenRouter()                    # math-style prompt/parser by default
result = router.solve("If 3x + 7 = 22, what is x?")
result.answer, result.tokens, result.path   # 5, ~4k, "consensus"

ZenRouter(
    variant="vote",          # "vote" (validated best) | "eager" | "hybrid"
    temperature=0.7,         # sampling diversity for consensus
    native_weight=2,         # native sample's weight in the final vote
    think_budget=16000,      # completion budget of the native call
    parser=my_extractor,     # swap the answer parser for your domain
    raw_log="samples.jsonl", # dump every sample for offline analysis
)
```
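
Consensus only works if equivalent answers compare equal, so a custom parser should normalise them. The exact signature `parser=` expects is not documented here. The sketch below assumes a callable that takes the raw completion text and returns a canonical string, or `None` when nothing usable is found.

```python
import re
from typing import Optional

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def my_extractor(completion: str) -> Optional[str]:
    """Return the last number in the completion in canonical form, or None.

    Hypothetical parser: the str -> Optional[str] contract is an assumption.
    """
    matches = _NUMBER.findall(completion)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Map "5", "5.0" and "x = 5" to the same string so samples can agree.
    return str(int(value)) if value.is_integer() else repr(value)
```

Taking the last number suits completions that show their work before stating a final answer. Other domains, such as multiple choice or extraction, would need a different rule.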

### ZenPlanner — plan-and-execute with an agent hook

```python
from zen import ZenPlanner

def my_agent_step(step_description, context):
    # run your tool-calling agent (e.g. SIFT) on this step, return text
    return my_agent.run(step_description, context)

planner = ZenPlanner(executor=my_agent_step)   # omit executor = pure reasoning
result = planner.run("Research X, compare with Y, write a recommendation.")
result.text     # final deliverable
result.steps    # [(step, result), ...]
```

The planner fixes the classic plan-and-execute inefficiency: each step receives
the task + plan + **clipped results** of prior steps — not the full reasoning
history — so cost grows linearly, not quadratically. A step that fails is
retried once with native thinking (per-step escalation). Plans with fewer than
two steps skip orchestration entirely.
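
A rough sketch of that context threading follows, assuming an `execute(step, context)` callable shaped like the executor hook above. `run_plan` and the local `clip` helper are hypothetical stand-ins, not the package's code. Retries and per-step escalation are left out.

```python
from typing import Callable, List, Tuple


def clip(text: str, limit: int = 400) -> str:
    """Keep the head and the tail of a long result (stand-in helper)."""
    if len(text) <= limit:
        return text
    half = limit // 2
    return text[:half] + " ... " + text[-half:]


def run_plan(task: str,
             plan: List[str],
             execute: Callable[[str, str], str]) -> List[Tuple[str, str]]:
    """Run each step with task + plan + clipped prior results as its context."""
    results: List[Tuple[str, str]] = []
    for step in plan:
        context = "\n".join(
            [f"Task: {task}", "Plan: " + " | ".join(plan)]
            + [f"{done} -> {clip(out)}" for done, out in results]
        )
        # The full result is stored; only the clipped form reaches later prompts.
        results.append((step, execute(step, context)))
    return results
```

Each finished step adds a bounded number of characters to later prompts, however long its own output was. That bound is what keeps later steps from inheriting the full reasoning history.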

## Configuration

Point ZEN at any OpenAI-compatible endpoint:

```bash
export ZEN_LLM_BASE_URL=https://openrouter.ai/api/v1
export ZEN_LLM_MODEL=deepseek/deepseek-v4-flash
export ZEN_LLM_API_KEY=sk-or-...
```

or use a built-in profile (reads the key from `OPENROUTER_API_KEY` or a
git-ignored `.secrets.json`):

```python
from zen import config
config.apply_profile("deepseek-v4-flash")
```

Escalation needs a model exposing a thinking toggle (DeepSeek, Qwen, GPT
reasoning-effort, Claude extended thinking...). Models without one still work —
ZEN then behaves as consensus routing.

Everything is injectable — `client_factory=`, `classifier=`, `parser=`,
`executor=` — so ZEN stays provider-agnostic and fully testable offline:

```bash
python tests/test_offline.py    # 31 tests, zero API calls
```

## Negative results we kept (so you don't rediscover them)

- **Confidence gating** (accept a high-logprob first sample): logprob coverage
  through OpenRouter is partial (~47%) and confidence did **not** predict
  correctness on AIME — the non-thinking model is often confidently wrong.
  Available as `gate_tail_logprob=`, off by default.
- **Halving the native think budget** (`think_budget=8000`): saved only ~10%
  of total tokens and cost accuracy exactly where thinking was needed.
- **Truncating candidates** in judge/aggregation prompts silently destroys
  them — keep the head *and* the tail (`zen.parsing.clip`).
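
The truncation point is easy to see in a small example. The helper below is a hypothetical stand-in, not `zen.parsing.clip` itself, whose signature is not shown in this README. It keeps both ends of a candidate, while a plain prefix slice drops the final answer.

```python
def clip_head_tail(text: str, limit: int = 200, marker: str = "\n[...]\n") -> str:
    """Shorten text to at most `limit` characters, keeping its head and its tail."""
    if len(text) <= limit:
        return text
    keep = limit - len(marker)
    if keep < 2:
        raise ValueError("limit must leave room for head, marker and tail")
    head = keep // 2
    tail = keep - head          # at least 1, so text[-tail:] is never the whole text
    return text[:head] + marker + text[-tail:]


candidate = "Let me work through this. " * 40 + "Final answer: 204"

print("Final answer: 204" in candidate[:200])              # head-only truncation
print("Final answer: 204" in clip_head_tail(candidate))    # head and tail kept
```

Reasoning-style outputs usually state the answer last, which is why the tail matters at least as much as the head when candidates go to a judge or aggregator.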

Honest caveats: results come from one model family and N=30 math benchmarks.
Consensus needs short, comparable answers: v0.2 targets verifiable outputs
(swap `parser=` for your domain); open-ended text is future work.

## Repo layout

```
zen/           the package: gateway, router, planner, client, parsing, config
tests/         offline test suite (no API key needed)
experiments/   evaluation harnesses (AIME/MATH/GSM8K) + the RL research line
docs/          full results tables + development log
```

## Related work

ZEN distills ideas from self-consistency (Wang et al.), adaptive-consensus and
fast/slow-routing research (AdaptThink, DART),
[DeepConf](https://arxiv.org/abs/2508.15260),
[RSA](https://arxiv.org/abs/2509.26626) and
[RL of Thoughts](https://arxiv.org/abs/2505.14140) — the tiny-controller idea
that started this project. ZEN's contribution is the disagreement-routed
cheap→thinking cascade with honest, per-call token accounting.

Works well next to [SIFT](https://github.com/Victor-Alves0/SIFT) (tool
retrieval and calling by the same author): SIFT decides **what the model can
do**, ZEN decides **how hard it should think**.

## License

MIT
