Metadata-Version: 2.4
Name: moe-policylang
Version: 1.3.0
Summary: A domain-specific language for Mixture-of-Experts scheduling policies
Author: Jesse Pokora
License: MIT
Project-URL: Homepage, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Repository, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Documentation, https://github.com/jesse-pokora/MoE-PolicyLang/blob/main/docs/MANUAL.md
Keywords: moe,mixture-of-experts,scheduling,dsl,offloading,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lark<2,>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pyyaml; extra == "dev"
Provides-Extra: gpu
Requires-Dist: torch>=2.0; extra == "gpu"
Requires-Dist: transformers>=5.0.0; extra == "gpu"
Requires-Dist: accelerate; extra == "gpu"
Requires-Dist: requests>=2.33.0; extra == "gpu"
Requires-Dist: urllib3>=2.7.0; extra == "gpu"
Requires-Dist: filelock>=3.20.3; extra == "gpu"
Provides-Extra: vllm
Requires-Dist: vllm>=0.21; extra == "vllm"
Provides-Extra: cython
Requires-Dist: cython>=3.0; extra == "cython"
Provides-Extra: eval
Requires-Dist: matplotlib>=3.7; extra == "eval"
Requires-Dist: pyyaml; extra == "eval"
Requires-Dist: pandas; extra == "eval"
Requires-Dist: pillow>=12.2.0; extra == "eval"
Provides-Extra: all
Requires-Dist: moe-policylang[cython,dev,eval,gpu,vllm]; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# MoE-PolicyLang

**A scheduling language for Mixture-of-Experts models.**

> Author: **Jesse Pokora** &middot; License: [MIT](LICENSE)

---

## What Is This?

Large language models like Mixtral, DeepSeek, and Qwen use a **Mixture-of-Experts (MoE)** architecture: instead of one giant network, each MoE layer has many smaller "expert" networks (8 in Mixtral, 64 in DeepSeek-V2-Lite) and a router that picks which ones to use for each token. Only a fraction of the experts are active for any given token, so the rest can be **offloaded** to CPU memory. This is by design, not a limitation.

But managing that offloading is complex. *Which* experts to keep on GPU? *When* to prefetch the next ones? *Where* to run cache misses — wait for the GPU transfer, or fall back to CPU? And *how* to adapt as the workload shifts?

**Every existing system hardcodes these decisions** inside its runtime — modifying any strategy requires understanding and rewriting the system's expert-management module. MoE-PolicyLang lifts the *policy* out of the runtime into a small, declarative language that compiles to the same cache/evict/prefetch hooks these systems consume internally.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/constrained_throughput.png" width="600" alt="Throughput and hit rate comparison across policies on consumer GPU">
</p>

---

## The Language

A MoE-PolicyLang policy is a `.moe` file with four composable blocks:

```
policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}
```

| Block | Controls | Options |
|-------|----------|---------|
| **cache** | Which experts stay on GPU | LRU, LFU, score-based, frequency-threshold |
| **prefetch** | Proactive loading | History, affinity, lookahead |
| **schedule** | Where to run cache misses | GPU-only, CPU-fallback, hybrid |
| **adapt** | Runtime self-tuning | Conditional rules that hot-swap components |

**Switching from LRU to LFU?** Change one word. **Adding prefetching?** Two lines.
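
To make the `eviction` option concrete, here is a minimal, self-contained sketch of how LRU and decayed-LFU victim selection could behave on a single layer. It is an illustration only, not MoE-PolicyLang's runtime code: the class names `LRUCache` and `DecayedLFUCache` are hypothetical, and the way `frequency_decay` is applied (aging every resident count on each access) is an assumption.

```python
from collections import OrderedDict


class LRUCache:
    """Evicts the least recently used expert (hypothetical sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._slots: OrderedDict[int, None] = OrderedDict()

    def access(self, expert: int) -> bool:
        """Return True on a hit, False on a miss (the expert is then loaded)."""
        if expert in self._slots:
            self._slots.move_to_end(expert)  # mark as most recently used
            return True
        if len(self._slots) >= self.capacity:
            self._slots.popitem(last=False)  # drop the least recently used entry
        self._slots[expert] = None
        return False


class DecayedLFUCache:
    """Evicts the expert with the lowest decayed access count (hypothetical sketch)."""

    def __init__(self, capacity: int, frequency_decay: float = 0.9) -> None:
        self.capacity = capacity
        self.decay = frequency_decay
        self._counts: dict[int, float] = {}

    def access(self, expert: int) -> bool:
        # Assumed decay rule: age every resident count on each access.
        for key in self._counts:
            self._counts[key] *= self.decay
        if expert in self._counts:
            self._counts[expert] += 1.0
            return True
        if len(self._counts) >= self.capacity:
            victim = min(self._counts, key=self._counts.get)
            del self._counts[victim]
        self._counts[expert] = 1.0
        return False


def hit_rate(cache, trace: list[int]) -> float:
    hits = sum(cache.access(expert) for expert in trace)
    return hits / len(trace)


# Expert 0 is hot; experts 1..4 are one-off accesses that pollute an LRU cache.
trace = [0, 0, 0, 1, 2, 0, 3, 4, 0]
lru_rate = hit_rate(LRUCache(capacity=2), trace)
lfu_rate = hit_rate(DecayedLFUCache(capacity=2), trace)
print(f"LRU hit rate: {lru_rate:.2f}, decayed-LFU hit rate: {lfu_rate:.2f}")
```

On this toy trace the frequency-based cache keeps the hot expert resident while LRU keeps evicting it, which is the kind of skew where the one-word switch matters.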

---

## Two Lines to Attach

```python
import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from your model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions
```

Or write a policy explicitly:

```python
mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")
```

Or load a `.moe` file:

```python
with open("my_policy.moe") as f:  # context manager closes the file handle
    mgr = moe_policylang.attach(model, f.read())
```

---

## Why a Language, Not YAML?

The `cache`, `prefetch`, and `schedule` blocks are key-value config; a JSON schema with Pydantic could handle them. What pushes this beyond declarative config is the **`adapt` block**: a small embedded rule language that monitors runtime metrics and hot-swaps policy components conditionally.

```
adapt {
    when hit_rate < 0.4 for 100 accesses { eviction = lru }
}
```

This is not key-value config — it's a conditional rule with a metric, a threshold, a window, and a rewrite target. The grammar constrains what you can write (no arbitrary code in a scheduling policy), and 20 semantic rules catch bad policies at parse time, not mid-inference.
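
The four parts of such a rule (metric, threshold, window, rewrite target) can be sketched in plain Python. This is one plausible reading of `when hit_rate < 0.4 for 100 accesses { eviction = lru }`, not the library's implementation: the `AdaptRule` class, its `observe` method, and the choice to wait for a full window before evaluating are all assumptions.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AdaptRule:
    """Hypothetical model of `when hit_rate < T for N accesses { key = value }`."""

    threshold: float          # T: fire when the windowed hit rate drops below this
    window: int               # N: number of most recent accesses to consider
    rewrite: dict[str, str]   # policy fields to overwrite when the rule fires
    _hits: deque = field(default_factory=deque, repr=False)

    def observe(self, hit: bool) -> bool:
        """Record one access; return True if the rule fires on this access."""
        self._hits.append(hit)
        if len(self._hits) > self.window:
            self._hits.popleft()
        if len(self._hits) < self.window:
            return False  # assumed: wait for a full window before judging
        return sum(self._hits) / self.window < self.threshold


policy = {"eviction": "lfu"}
rule = AdaptRule(threshold=0.4, window=100, rewrite={"eviction": "lru"})

# 100 accesses at a 30% hit rate: 3 hits followed by 7 misses, repeated.
outcomes = ([True] * 3 + [False] * 7) * 10
fired_at = None
for i, hit in enumerate(outcomes, start=1):
    if rule.observe(hit):
        policy.update(rule.rewrite)  # hot-swap the eviction strategy
        fired_at = i
        break

print(policy, fired_at)
```

A plain key-value file has no natural place for the window or the condition, which is why the rule gets its own grammar.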

We also ship a **Python eDSL** (`@sched.policy` decorator) and an **auto-attach** API — three surfaces because the use cases differ: `.moe` files for sharing/diffing policies, the eDSL for programmatic policy construction, and `auto_attach` for zero-config deployment. The standalone grammar is load-bearing for the `adapt` semantics; the other two are convenience wrappers.

---

## Results

### Dispatch overhead

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/latency_with_ci.png" width="600" alt="Dispatch overhead with 95% confidence intervals">
</p>

Per-layer dispatch (the Python hook that decides cache/evict/prefetch) adds **at most about 3.2%** to MoE forward-pass time on A100 (6–47 µs/layer vs. a 1,459 µs baseline). This measures the *policy decision* overhead, not the cost of cache misses or weight transfers, which depend on the policy and workload.
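
As a quick check, the overhead range follows directly from the per-layer timings quoted above. The variable names are illustrative; the numbers are the ones in the text.

```python
BASELINE_US = 1459.0                      # MoE forward pass per layer, from the text
DISPATCH_US = {"min": 6.0, "max": 47.0}   # dispatch cost range, from the text

# Overhead as a percentage of the baseline forward-pass time.
overhead_pct = {k: 100.0 * v / BASELINE_US for k, v in DISPATCH_US.items()}
for label, pct in overhead_pct.items():
    print(f"{label}: {pct:.2f}% of baseline")
```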

### Policy authoring effort

To implement a *new policy variant* in each system, a developer must understand and modify the system's expert-management module. MoE-PolicyLang replaces that authoring effort with a short `.moe` file — the 14–40× reduction measures lines *a user writes to express a policy*, not total system code (MoE-PolicyLang's own runtime is ~4,300 LOC).

| System | Expert-mgmt module | DSL equivalent | Authoring reduction |
|--------|-------------------|---------------|-------------------|
| Fiddler | 280 LOC | 7 lines | 40× |
| HybriMoE | ~500 LOC | 14 lines | 36× |
| MoE-Infinity | 520 LOC | 16 lines | 33× |
| vLLM | 300 LOC | 12 lines | 25× |
| ExpertFlow | ~400 LOC | 16 lines | 25× |
| FineMoE | ~350 LOC | 25 lines | 14× |

**Methodology**: non-blank, non-comment lines in the primary expert-management module. Measured sources: Fiddler — `set_expert_loc()` + `execute_fiddler()` in `src/fiddler/mixtral.py` (280 LOC); MoE-Infinity — `expert_prefetcher.py` + `expert_cache.py` (520 LOC); vLLM — `MixtralMoE` expert dispatch in `vllm/model_executor/` (300 LOC). Counts marked ~ are estimated from paper descriptions of closed-source systems. Switching between strategies (e.g., LRU → LFU) requires changing **one word** in the DSL vs. rewriting cache data structures in the hand-coded approach.
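
The reduction column is simply module LOC divided by DSL lines. The short script below recomputes it from the table so the rounding is visible; the `ROWS` list is copied from the table and nothing else is assumed.

```python
# (system, expert-management module LOC, DSL lines) copied from the table above.
ROWS = [
    ("Fiddler", 280, 7),
    ("HybriMoE", 500, 14),
    ("MoE-Infinity", 520, 16),
    ("vLLM", 300, 12),
    ("ExpertFlow", 400, 16),
    ("FineMoE", 350, 25),
]

reductions = {name: loc / dsl for name, loc, dsl in ROWS}
for name, ratio in reductions.items():
    print(f"{name:<13} {ratio:5.1f}x")
```

The unrounded ratios span 14.0 to 40.0, matching the 14–40× range quoted earlier.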

### Policy selection matters when the cache can't hold the working set

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/capacity_sweep.png" width="600" alt="Cache hit rate vs capacity for Mixtral and DeepSeek">
</p>

Capacity sweeps on offline traces show the architecture dependence clearly:
- **Mixtral-8×7B** (8 experts, top-2): saturates at cap=8 (~100% hit rate — all experts fit). Policy choice barely matters here.
- **DeepSeek-V2-Lite** (64 experts, top-6): reaches only 51% hit rate at cap=32 (half the experts). LFU consistently outperforms LRU across all budgets because DeepSeek has significant frequency skew (some experts activated 3–5× more often). This is the regime where policy selection and per-layer budgeting (below) make a real difference.
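
A capacity sweep of this kind can be sketched as an offline replay loop. The example below is a simplified stand-in, not the project's evaluation script: it models a single layer, uses a synthetic trace with an assumed 1/(rank+1) popularity skew instead of recorded routing data, and the helper names `lru_hit_rate` and `synthetic_trace` are hypothetical.

```python
import random
from collections import OrderedDict


def lru_hit_rate(trace: list[int], capacity: int) -> float:
    """Replay an expert-access trace through an LRU cache and return the hit rate."""
    cache: OrderedDict[int, None] = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            cache.move_to_end(expert)
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = None
    return hits / len(trace)


def synthetic_trace(num_experts: int, top_k: int, tokens: int, seed: int = 0) -> list[int]:
    """Synthetic routing trace with skewed popularity; not real model data."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(num_experts)]
    trace: list[int] = []
    for _ in range(tokens):
        chosen: set[int] = set()
        while len(chosen) < top_k:  # sample top_k distinct experts per token
            chosen.add(rng.choices(range(num_experts), weights=weights)[0])
        trace.extend(sorted(chosen))
    return trace


trace = synthetic_trace(num_experts=64, top_k=6, tokens=500)
sweep = {cap: lru_hit_rate(trace, cap) for cap in (4, 8, 16, 32, 64)}
for cap, rate in sweep.items():
    print(f"cap={cap:>2}  hit rate={rate:.3f}")
```

Swapping `lru_hit_rate` for another eviction rule and replaying the same trace is what makes policy comparisons at a fixed capacity meaningful.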

### EPCB: Per-layer cache budgeting (with a negative result)

Not all layers see the same routing pattern: some concentrate on a few experts, others spread across many. **Empirical Per-layer Cache Budgeting (EPCB)** yields three findings: a regime caveat, a positive structural result, and a null result for the allocation signal.

**1. The regime caveat (read this first).** Per-layer caching only helps when the per-layer budget covers each layer's active working set. On **16 GB consumer hardware** — the regime most readers care about — per-layer caching *hurts* throughput by 16% because the per-layer budgets are too small to cover each layer's working set, and the aggregated cache pushes the CUDA allocator to the VRAM ceiling. **Flat shared caching is the default recommendation** for memory-constrained deployments. Per-layer wins when there is VRAM headroom *and* high expert counts (DeepSeek-V2-Lite on A100; see below).

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/per_layer_regime.png" width="600" alt="When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region">
</p>

**2. Per-layer cache structure is the load-bearing lever** (when the regime permits). At matched total budget on DeepSeek-V2-Lite (A100-80GB), replacing a shared cache with per-layer caches yields **+14.7pp hit rate** in offline trace replay and eliminates all CPU↔GPU transfers in steady state. Output is bit-identical to the fully-resident baseline.

The headline throughput gain is large — 1.60 → 10.22 tok/s (+540%) — but this compares shared-32 to per-layer-864 (27× more total slots). The matched-budget +14.7pp hit rate and transfer elimination are the load-bearing findings; the 540% wall-clock number includes the capacity expansion.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/cache_structure.png" width="600" alt="Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer">
</p>

This structural difference maps directly onto MoE-aware baselines. Fiddler's expert placement is a hardcoded global popularity ranking, structurally equivalent to the flat cache on the left. On an A100-80GB, where 85% of Mixtral's experts fit on-device (217/256), the ranking barely matters because almost everything is resident. On a constrained GPU where only a fraction of experts fit, a global ranking starves cold layers (left heatmap), while MoE-PolicyLang's per-layer policy maintains coverage at every layer (right heatmap). This is the DSL's main structural advantage: it can express cache topologies that a flat global ranking cannot.

The physical payoff is PCIe stall elimination. Expert offloading is memory-bandwidth-bound, so every cache miss costs a CPU→GPU transfer. Per-layer caches that cover each layer's working set reduce misses to zero in steady state, which is how a hit-rate improvement translates into 6.4× wall-clock throughput on DeepSeek-V2-Lite (10.22 vs 1.60 tok/s; as noted above, that comparison is per-layer-864 against shared-32, so it also includes the capacity expansion).

#### Fiddler head-to-head (A100-80GB, Mixtral-8x7B)

We ran Fiddler and MoE-PolicyLang on the same hardware, model, prompt, and methodology (n=5, greedy decoding, 64 tokens):

| Config | tok/s (±σ) | 95% CI | GPU Peak | Hit Rate | Transfers |
|--------|-----------|--------|----------|----------|-----------|
| **Fiddler** | **4.17 ± 0.02** | [4.16, 4.18] | 80.6 GB | 88.3% | — |
| MPL fiddler_equiv (cap=2) | 0.18 ± 0.00 | [0.18, 0.18] | 6.4 GB | 19.5% | 4,283 |
| MPL balanced (cap=4) | 0.29 ± 0.00 | [0.29, 0.29] | 39.5 GB | 46.4% | 2,665 |
| MPL generous (cap=6) | 0.45 ± 0.00 | [0.45, 0.45] | 61.7 GB | 71.0% | 1,726 |

All MPL configs produce **bit-identical output**. Fiddler is 9–23× faster.

**The gap is mechanism, not policy.** Fiddler uses an optimized C++/CUDA transfer pipeline with pre-allocated GPU memory slots and direct DMA. MoE-PolicyLang dispatches through Python-level `Tensor.to()` calls in the HuggingFace forward pass. At Fiddler's 85% GPU residency (217/256 experts on-device), the *placement strategy* is not load-bearing — the model mostly fits. MoE-PolicyLang is a **policy specification layer**, not a serving system: it specifies *which* experts to cache, evict, and prefetch, but does not implement the physical mechanism that moves expert tensors between devices. The vLLM integration (below) demonstrates that the policy layer composes cleanly with a production inference engine — the same DSL captures routing decisions from vLLM's optimized MoE kernel path without modification.

**3. The allocation signal does not matter.** We tested six allocation signals (Shannon entropy, inverse top-k mass, inverse variance, inverse KL, inverse Gini, and uniform as the baseline). No non-uniform signal differs from uniform by more than 2.5pp in hit rate, and all of them collapse to within noise of uniform in wall-clock throughput on two models. **Uniform allocation is the default.** Shannon entropy is available as an opt-in for models with high inter-layer entropy spread (ΔH ≳ 1 nat), but we measured it to be within noise of uniform on every model tested end-to-end.
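
For readers who want to see what an allocation signal does, here is a minimal sketch that splits a fixed slot budget across layers, either uniformly or in proportion to each layer's Shannon entropy. It is not the project's EPCB code: the `allocate` helper, the largest-remainder rounding, and the three made-up routing distributions are assumptions for illustration.

```python
import math


def shannon_entropy(probs: list[float]) -> float:
    """Shannon entropy in nats of one layer's expert-activation distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def allocate(total_slots: int, signal: list[float]) -> list[int]:
    """Split total_slots across layers in proportion to signal (largest-remainder rounding)."""
    scale = total_slots / sum(signal)
    ideal = [s * scale for s in signal]
    slots = [math.floor(x) for x in ideal]
    leftover = total_slots - sum(slots)
    # Hand the remaining slots to the layers with the largest fractional parts.
    by_remainder = sorted(range(len(signal)), key=lambda i: ideal[i] - slots[i], reverse=True)
    for i in by_remainder[:leftover]:
        slots[i] += 1
    return slots


# Three made-up layers over 4 experts: concentrated, moderate, and uniform routing.
layers = [
    [0.85, 0.05, 0.05, 0.05],
    [0.40, 0.30, 0.20, 0.10],
    [0.25, 0.25, 0.25, 0.25],
]
entropies = [shannon_entropy(p) for p in layers]

uniform_budget = allocate(12, [1.0] * len(layers))
entropy_budget = allocate(12, entropies)
print("entropies (nats):", [round(h, 3) for h in entropies])
print("uniform:", uniform_budget, "entropy-proportional:", entropy_budget)
```

The entropy signal shifts slots from concentrated layers toward spread-out ones. In the measurements reported here, that shift did not translate into a wall-clock difference.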

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/deepseek_entropy_allocation.png" width="600" alt="Per-layer entropy and capacity allocation">
</p>

| Strategy | Total slots | Hit Rate | Δ vs shared | Wall-clock (A100) |
|----------|------------|----------|---|---|
| Shared cache | 32 | 48.6% | baseline | 1.60 tok/s |
| Per-layer uniform | 864 (27×) | 63.3% | +14.7pp | 10.22 tok/s |
| Per-layer entropy | 864 (27×) | 65.5% | +16.9pp | 10.17 tok/s (≈ uniform) |

### Live inference on consumer GPU

**When the model doesn't fit**: Qwen1.5-MoE-A2.7B (~28.6 GB fp16) on RTX 5080 (16 GB VRAM). Without MoE-PolicyLang, the only option is `device_map="auto"` at 0.57 tok/s. With a 4-line DSL policy:

| Config | Strategy | Cap | VRAM | tok/s | 95% CI |
|--------|----------|-----|------|-------|--------|
| Baseline (`auto`) | — | — | 12.0 GB | 0.57±0.00 | — |
| Skeleton | LRU (cap=1) | 1 | 4.7 GB | 4.23±0.22 | [4.04, 4.42] |
| Aggressive | LRU | 2 | 5.2 GB | 4.17±0.03 | [4.14, 4.20] |
| Balanced | LFU+hist. | 4 | 7.3 GB | 4.35±0.06 | [4.30, 4.40] |
| **Generous** | **LFU+hist.** | **8** | **10.1 GB** | **4.61±0.08** | **[4.54, 4.68]** |

**Cost-performance note**: 4.61 tok/s on a $1k consumer GPU — comparable absolute throughput to Fiddler's 4.17 tok/s on a $15k A100 (different model; not directly comparable).

**Decomposition**: ~92% of the 8.1× speedup comes from expert-aware loading (skeleton on GPU, experts on CPU) — even a capacity-1 "every dispatch is a miss" config reaches 4.23 tok/s (7.4×). Caching adds the remaining +0.38 tok/s. The DSL's contribution is not the loading mechanism (which any system could implement) but making the remaining 8% — the policy layer that chooses *what* to cache, evict, and prefetch — accessible without runtime modification, composable across strategies, and adaptable at runtime via `adapt` rules that no static config can express. On higher-expert-count models, the policy layer's share grows: on DeepSeek (A100), matched-budget per-layer allocation gains +14.7pp hit rate over flat caching with the *same* total slot count — a pure policy-structure effect with no capacity confound.
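
The decomposition can be reproduced from the table. The sketch below uses only the reported throughputs; treating the ~92% share as skeleton throughput divided by final throughput is an assumption about how that figure was derived, chosen because it reproduces the quoted number.

```python
BASELINE = 0.57   # tok/s, device_map="auto"
SKELETON = 4.23   # tok/s, capacity-1 policy (every dispatch is a miss)
GENEROUS = 4.61   # tok/s, LFU + history prefetch, capacity 8

total_speedup = GENEROUS / BASELINE
loading_speedup = SKELETON / BASELINE
caching_gain = GENEROUS - SKELETON
# Assumed definition: fraction of the final throughput reached by loading alone.
loading_share = SKELETON / GENEROUS

print(f"total speedup:   {total_speedup:.1f}x")
print(f"loading only:    {loading_speedup:.1f}x")
print(f"caching adds:    +{caching_gain:.2f} tok/s")
print(f"loading share:   {loading_share:.0%}")
```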

n=5, bootstrap 95% CIs. Output correctness: greedy decoding (`do_sample=False`) produces bit-identical token sequences across all policy configs vs. `device_map="auto"` baseline (4 prompts × 3 policies = 12 comparisons); perplexity on wikitext-2 matches within 0.024%.
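
For reference, a bootstrap confidence interval over a handful of runs can be computed as in the sketch below. It uses the percentile method, which is an assumption (the exact bootstrap variant behind the reported CIs is not stated here), and the five sample values are invented for illustration rather than taken from the benchmark logs.

```python
import random
import statistics


def bootstrap_ci(samples: list[float], n_resamples: int = 10_000,
                 confidence: float = 0.95, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, record each resample's mean, then sort the means.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lower = means[int(alpha * n_resamples)]
    upper = means[int((1.0 - alpha) * n_resamples) - 1]
    return lower, upper


# Five made-up throughput measurements (tok/s); not the README's raw data.
runs = [4.52, 4.58, 4.61, 4.66, 4.70]
low, high = bootstrap_ci(runs)
print(f"mean={statistics.fmean(runs):.2f}  95% CI=[{low:.2f}, {high:.2f}]")
```

With n=5 the interval is necessarily coarse, since each resampled mean can only be built from the five observed values.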

**When the model fits** (overhead measurement): OLMoE-1B-7B (~14 GB) fits entirely on 16 GB VRAM. Here, vanilla (no hooks) is fastest at 39.2 tok/s — the policy hooks add 12–14% overhead. This is not the target scenario for offloading — it measures overhead when there is nothing to offload. MoE-PolicyLang is for models that *don't fit*.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/hitrate_with_ci.png" width="600" alt="Hit rate with bootstrap confidence intervals">
</p>

---

## Installation

From PyPI:
```bash
pip install moe-policylang             # DSL only (no GPU deps)
pip install "moe-policylang[gpu]"      # + torch, transformers, accelerate
pip install "moe-policylang[vllm]"     # + vLLM (GPTQ/AWQ quantized inference)
pip install "moe-policylang[all]"      # everything
```

**Quantized models**: Use `[vllm]` — vLLM handles GPTQ and AWQ quantization with optimized kernels. MoE-PolicyLang observes routing decisions and applies policy logic without managing tensors directly.

For **Blackwell GPUs** (RTX 5080/5090), set these env vars before running vLLM:
```bash
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2
```

From source (development):
```bash
git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"
```

Cython fast path (for complex policies):
```bash
pip install "moe-policylang[cython]"
python setup_cython.py build_ext --inplace
```
Python dispatch ranges from 6 µs/layer (simple LRU) to 47 µs/layer (composed policies with triggers). The Cython path targets the high end — `freq_threshold` and `composed_full` drop from 28–47 µs to < 10 µs/layer. Simple policies like `lru_basic` (6 µs) see no benefit.

---

## Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace model — no model-specific code required. We have evaluated on:

| Model | Experts × Layers | Routing | Hardware | Backend |
|-------|-----------------|---------|----------|----------|
| Mixtral-8×7B-Instruct | 8 × 32 | top-2 | A100-80 GB | HF Transformers |
| DeepSeek-V2-Lite | 64 × 27 | top-6 | A100-80 GB | HF Transformers |
| Qwen1.5-MoE-A2.7B | 60 × 24 | top-4 | RTX 5080 (16 GB) | HF Transformers |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 60 × 24 | top-4 | RTX 5080 (16 GB) | **vLLM** |
| OLMoE-1B-7B | 64 × 16 | top-8 | RTX 5080 (16 GB) | HF Transformers |

---

## vLLM Integration

MoE-PolicyLang integrates with [vLLM](https://github.com/vllm-project/vllm) for production-grade quantized MoE inference. The `VLLMPolicyRunner` instruments vLLM's router layers to capture expert routing decisions and feeds them through the policy system — the same DSL/compiler/hooks work regardless of whether the mechanism layer is HuggingFace's eager execution or vLLM's optimized kernels.

```python
from moe_policylang.integrations.vllm_backend import VLLMPolicyRunner

runner = VLLMPolicyRunner(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    policy_dsl='''
        policy demo {
            cache { capacity = 8  eviction = lru }
            prefetch { strategy = lookahead  lookahead = 1 }
            schedule { mode = gpu_only }
        }
    ''',
    quantization="gptq",
)

results = runner.generate(["What is expert routing?"], max_tokens=30)
print(results["text"])          # generated text
print(results["policy_stats"])  # cache hits, prefetch accuracy, etc.
```

**Verified on**: RTX 5080 (16 GB), vLLM 0.21, GPTQ-Int4 quantization. Captures 744 routing events across 24 layers × 60 experts, achieving 14.7% cache hit rate and 72% prefetch accuracy with a minimal 8-slot LRU policy.

**Key capability**: This run demonstrates that the DSL is backend-agnostic: the policy specification layer is independent of the inference engine, which supports the separation-of-concerns architecture.

---

## Project Structure

```
moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # EPCB — entropy-proportional caching
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── vllm_backend.py     # vLLM integration (routing trace + policy replay)
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers
```

---

## Running Experiments

```bash
# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_epcb_sensitivity.py  # EPCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure
```

---

## Tests

```bash
python -m pytest tests/ -q
```

453+ tests covering parsing, validation, compilation, runtime dispatch,
adaptive policies, per-layer EPCB, and integration hooks.

---

## Documentation

See [`docs/MANUAL.md`](docs/MANUAL.md) for the full language reference,
runtime API, and policy authoring guide.

---

## Citation

```bibtex
@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}
```

---

## License

[MIT License](LICENSE) — Copyright (c) 2026 Jesse Pokora
