Metadata-Version: 2.4
Name: moe-policylang
Version: 1.2.5
Summary: A domain-specific language for Mixture-of-Experts scheduling policies
Author: Jesse Pokora
License: MIT
Project-URL: Homepage, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Repository, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Documentation, https://github.com/jesse-pokora/MoE-PolicyLang/blob/main/docs/MANUAL.md
Keywords: moe,mixture-of-experts,scheduling,dsl,offloading,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lark<2,>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pyyaml; extra == "dev"
Provides-Extra: gpu
Requires-Dist: torch>=2.0; extra == "gpu"
Requires-Dist: transformers>=5.0.0; extra == "gpu"
Requires-Dist: accelerate; extra == "gpu"
Requires-Dist: requests>=2.33.0; extra == "gpu"
Requires-Dist: urllib3>=2.7.0; extra == "gpu"
Requires-Dist: filelock>=3.20.3; extra == "gpu"
Provides-Extra: cython
Requires-Dist: cython>=3.0; extra == "cython"
Provides-Extra: eval
Requires-Dist: matplotlib>=3.7; extra == "eval"
Requires-Dist: pyyaml; extra == "eval"
Requires-Dist: pandas; extra == "eval"
Requires-Dist: pillow>=12.2.0; extra == "eval"
Provides-Extra: all
Requires-Dist: moe-policylang[cython,dev,eval,gpu]; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# MoE-PolicyLang

**A scheduling language for Mixture-of-Experts models.**

> Author: **Jesse Pokora** &middot; License: [MIT](LICENSE)

---

## What Is This?

Large language models like Mixtral, DeepSeek, and Qwen use **Mixture-of-Experts (MoE)**: instead of one giant network, each MoE layer has a set of smaller "expert" networks (8 in Mixtral, 64 in DeepSeek-V2-Lite) and a router that picks which ones to use for each token. Only a fraction of the experts are active for any given token, so when the full model does not fit in GPU memory the inactive experts can be **offloaded** to CPU memory. This is a deliberate deployment strategy, not a limitation.

But managing that offloading is complex. *Which* experts to keep on GPU? *When* to prefetch the next ones? *Where* to run cache misses — wait for the GPU transfer, or fall back to CPU? And *how* to adapt as the workload shifts?

**Every existing system hardcodes these decisions** inside its runtime — modifying any strategy requires understanding and rewriting the system's expert-management module. MoE-PolicyLang lifts the *policy* out of the runtime into a small, declarative language that compiles to the same cache/evict/prefetch hooks these systems consume internally.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/constrained_throughput.png" width="600" alt="Throughput and hit rate comparison across policies on consumer GPU">
</p>

---

## The Language

A MoE-PolicyLang policy is a `.moe` file with four composable blocks:

```
policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}
```

| Block | Controls | Options |
|-------|----------|---------|
| **cache** | Which experts stay on GPU | LRU, LFU, score-based, frequency-threshold |
| **prefetch** | Proactive loading | History, affinity, lookahead |
| **schedule** | Where to run cache misses | GPU-only, CPU-fallback, hybrid |
| **adapt** | Runtime self-tuning | Conditional rules that hot-swap components |

**Switching from LRU to LFU?** Change one word. **Adding prefetching?** Two lines.
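
To make the `cache` options concrete, here is a minimal sketch of what LFU eviction with a `frequency_decay` factor could look like. It is an illustration only: the class `DecayedLFUCache` and its `access` method are hypothetical names, not the library's runtime, and the sketch assumes that decay multiplies every resident count by the factor on each access.

```python
class DecayedLFUCache:
    """Toy LFU cache with exponential frequency decay (hypothetical, not the library's code)."""

    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity = capacity
        self.decay = decay
        self.freq: dict[int, float] = {}  # resident expert id -> decayed access count

    def access(self, expert_id: int) -> bool:
        """Record one access; return True on a hit, False on a miss."""
        # Assumption: every access ages all resident counts by the decay factor.
        for key in self.freq:
            self.freq[key] *= self.decay
        hit = expert_id in self.freq
        if not hit and len(self.freq) >= self.capacity:
            # Evict the resident expert with the smallest decayed count.
            victim = min(self.freq, key=self.freq.get)
            del self.freq[victim]
        self.freq[expert_id] = self.freq.get(expert_id, 0.0) + 1.0
        return hit


cache = DecayedLFUCache(capacity=2, decay=0.9)
hits = [cache.access(e) for e in [0, 0, 1, 2, 0]]
print(hits)  # [False, True, False, False, True]
```

With capacity 2, expert 1 is evicted when expert 2 arrives because its decayed count is the lowest, so the final access to expert 0 is still a hit.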

---

## Two Lines to Attach

```python
import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from your model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions
```

Or write a policy explicitly:

```python
mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")
```

Or load a `.moe` file:

```python
from pathlib import Path
mgr = moe_policylang.attach(model, Path("my_policy.moe").read_text())
```

---

## Why a Language, Not YAML?

The `cache`, `prefetch`, and `schedule` blocks are key-value config — a JSON schema with Pydantic could handle them.  What pushes this beyond declarative config is the **`adapt` block**: a small embedded rule language that monitors runtime metrics and hot-swaps policy components conditionally.

```
adapt {
    when hit_rate < 0.4 for 100 accesses { eviction = lru }
}
```

This is not key-value config — it's a conditional rule with a metric, a threshold, a window, and a rewrite target. The grammar constrains what you can write (no arbitrary code in a scheduling policy), and 20 semantic rules catch bad policies at parse time, not mid-inference.
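
The four parts of such a rule (metric, threshold, window, rewrite target) can be sketched in plain Python. This is a hedged illustration, not the library's evaluator: `AdaptRule` and `observe` are hypothetical names, and the sketch assumes that `for 100 accesses` means "the hit rate over the last 100 accesses, checked once the window is full".

```python
from collections import deque
from typing import Optional


class AdaptRule:
    """Toy evaluator for `when hit_rate < T for N accesses { eviction = X }` (hypothetical)."""

    def __init__(self, threshold: float, window: int, new_eviction: str):
        self.threshold = threshold
        self.new_eviction = new_eviction
        self.recent = deque(maxlen=window)  # 1 for a hit, 0 for a miss

    def observe(self, hit: bool) -> Optional[str]:
        """Record one access; return the new eviction name if the rule fires."""
        self.recent.append(1 if hit else 0)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) < self.threshold:
            return self.new_eviction
        return None


rule = AdaptRule(threshold=0.4, window=100, new_eviction="lru")
eviction = "lfu"
for i in range(100):
    # Synthetic trace: one hit in every four accesses (25% hit rate).
    fired = rule.observe(hit=(i % 4 == 0))
    if fired is not None:
        eviction = fired
print(eviction)  # lru
```

The rule stays silent until the window holds 100 observations, then fires because 25% is below the 0.4 threshold.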

We also ship a **Python eDSL** (`@sched.policy` decorator) and an **auto-attach** API — three surfaces because the use cases differ: `.moe` files for sharing/diffing policies, the eDSL for programmatic policy construction, and `auto_attach` for zero-config deployment. The standalone grammar is load-bearing for the `adapt` semantics; the other two are convenience wrappers.

---

## Results

### Dispatch overhead

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/latency_with_ci.png" width="600" alt="Dispatch overhead with 95% confidence intervals">
</p>

Per-layer dispatch (the Python hook that decides cache/evict/prefetch) adds **at most about 3.2%** to MoE forward-pass time on A100: 6–47 µs/layer against a 1,459 µs baseline, or roughly 0.4% to 3.2%. This measures the *policy decision* overhead, not the cost of cache misses or weight transfers, which depend on the policy and workload.

### Policy authoring effort

To implement a *new policy variant* in each system, a developer must understand and modify the system's expert-management module. MoE-PolicyLang replaces that authoring effort with a short `.moe` file — the 14–40× reduction measures lines *a user writes to express a policy*, not total system code (MoE-PolicyLang's own runtime is ~4,300 LOC).

| System | Expert-mgmt module | DSL equivalent | Authoring reduction |
|--------|-------------------|---------------|-------------------|
| Fiddler | 280 LOC | 7 lines | 40× |
| HybriMoE | ~500 LOC | 14 lines | 36× |
| MoE-Infinity | 520 LOC | 16 lines | 33× |
| vLLM | 300 LOC | 12 lines | 25× |
| ExpertFlow | ~400 LOC | 16 lines | 25× |
| FineMoE | ~350 LOC | 25 lines | 14× |

**Methodology**: non-blank, non-comment lines in the primary expert-management module. Measured sources: Fiddler — `set_expert_loc()` + `execute_fiddler()` in `src/fiddler/mixtral.py` (280 LOC); MoE-Infinity — `expert_prefetcher.py` + `expert_cache.py` (520 LOC); vLLM — `MixtralMoE` expert dispatch in `vllm/model_executor/` (300 LOC). Counts marked ~ are estimated from paper descriptions of closed-source systems. Switching between strategies (e.g., LRU → LFU) requires changing **one word** in the DSL vs. rewriting cache data structures in the hand-coded approach.
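
The reduction column is simply module LOC divided by DSL lines. The short check below recomputes it from the table's own numbers; counts marked `~` in the table are entered as plain integers here, so those ratios inherit the same estimation uncertainty.

```python
# (system, expert-management LOC, DSL lines, reported reduction) from the table above.
rows = [
    ("Fiddler", 280, 7, 40),
    ("HybriMoE", 500, 14, 36),
    ("MoE-Infinity", 520, 16, 33),
    ("vLLM", 300, 12, 25),
    ("ExpertFlow", 400, 16, 25),
    ("FineMoE", 350, 25, 14),
]
ratios = {name: loc / dsl for name, loc, dsl, _ in rows}
for name, loc, dsl, reported in rows:
    print(f"{name:<13} {loc:>3} / {dsl:>2} = {ratios[name]:4.1f}x (reported {reported}x)")
```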

### Policy selection matters when the cache can't hold the working set

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/capacity_sweep.png" width="600" alt="Cache hit rate vs capacity for Mixtral and DeepSeek">
</p>

Capacity sweeps on offline traces show the architecture dependence clearly:
- **Mixtral-8×7B** (8 experts, top-2): saturates at cap=8 (~100% hit rate — all experts fit). Policy choice barely matters here.
- **DeepSeek-V2-Lite** (64 experts, top-6): reaches only 51% hit rate at cap=32 (half the experts). LFU consistently outperforms LRU across all budgets because DeepSeek has significant frequency skew (some experts activated 3–5× more often). This is the regime where policy selection and per-layer budgeting (below) make a real difference.
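
A capacity sweep of this kind can be sketched as a small trace replay. Everything below is illustrative: the `replay` helper is hypothetical, the trace is synthetic (a few "hot" experts drawn 4× as often, an assumed skew rather than measured routing data), and each access touches a single expert instead of a top-k set, so the numbers it prints are not the results reported above.

```python
import random
from collections import Counter, OrderedDict


def replay(trace, capacity, policy):
    """Replay expert ids through a toy cache and return the hit rate."""
    resident = OrderedDict()   # expert id -> None, ordered oldest to newest use
    counts = Counter()         # lifetime access counts, used by LFU
    hits = 0
    for expert in trace:
        counts[expert] += 1
        if expert in resident:
            hits += 1
            resident.move_to_end(expert)
            continue
        if len(resident) >= capacity:
            if policy == "lru":
                resident.popitem(last=False)   # drop the least recently used
            else:
                victim = min(resident, key=counts.__getitem__)  # least frequently used
                del resident[victim]
        resident[expert] = None
    return hits / len(trace)


rng = random.Random(0)
n_experts = 64
# Synthetic skew (an assumption): eight hot experts are drawn 4x as often as the rest.
weights = [4.0 if e < 8 else 1.0 for e in range(n_experts)]
trace = rng.choices(range(n_experts), weights=weights, k=20_000)

for cap in (8, 16, 32, 64):
    lru = replay(trace, cap, "lru")
    lfu = replay(trace, cap, "lfu")
    print(f"cap={cap:>2}  lru={lru:.3f}  lfu={lfu:.3f}")
```

Once the capacity equals the expert count, only cold misses remain and the two policies coincide, which mirrors the Mixtral case where everything fits.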

### EPCB: Per-layer cache budgeting (with a negative result)

Not all layers see the same routing pattern: some concentrate on a few experts, others spread across many. **Empirical Per-layer Cache Budgeting (EPCB)** yields three findings: a regime caveat, a positive structural result, and a negative result on the allocation signal.

**1. The regime caveat (read this first).** Per-layer caching only helps when the per-layer budget covers each layer's active working set. On **16 GB consumer hardware** — the regime most readers care about — per-layer caching *hurts* throughput by 16% because the per-layer budgets are too small to cover each layer's working set, and the aggregated cache pushes the CUDA allocator to the VRAM ceiling. **Flat shared caching is the default recommendation** for memory-constrained deployments. Per-layer wins when there is VRAM headroom *and* high expert counts (DeepSeek-V2-Lite on A100; see below).

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/per_layer_regime.png" width="600" alt="When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region">
</p>

**2. Per-layer cache structure is the load-bearing lever** (when the regime permits). At matched total budget on DeepSeek-V2-Lite (A100-80GB), replacing a shared cache with per-layer caches yields **+14.7pp hit rate** in offline trace replay and eliminates all CPU↔GPU transfers in steady state. Bit-identical output verified against fully-resident baseline.

The headline throughput gain is large — 1.60 → 10.22 tok/s (+540%) — but this compares shared-32 to per-layer-864 (27× more total slots). The matched-budget +14.7pp hit rate and transfer elimination are the load-bearing findings; the 540% wall-clock number includes the capacity expansion.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/cache_structure.png" width="600" alt="Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer">
</p>

This structural difference maps directly to MoE-aware baselines. Fiddler's expert placement is a hardcoded global popularity ranking, structurally equivalent to the flat cache on the left. On an A100-80GB where 85% of Mixtral's experts fit on-device (217/256), the ranking barely matters because almost everything is resident. On a constrained GPU where only a fraction of experts fit, a global ranking starves cold layers (left heatmap) while MoE-PolicyLang's per-layer policy maintains coverage at every layer (right heatmap). This structural lever is the case for the DSL: it enables cache topologies that flat global rankings cannot express. The physical payoff is PCIe stall elimination. Expert offloading is memory-bandwidth-bound, so every cache miss costs a CPU→GPU transfer. Per-layer caches that cover each layer's working set reduce misses to zero in steady state, which is how a hit-rate improvement translates into wall-clock throughput: 10.22 vs 1.60 tok/s (6.4×) on DeepSeek-V2-Lite, a figure that, as noted above, also includes the 27× capacity expansion from shared-32 to per-layer-864.
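
The starvation effect can be reproduced with a toy placement comparison. This is a sketch under stated assumptions, not Fiddler's or MoE-PolicyLang's code: the helper names are hypothetical and the activation counts are invented so that every layer sees the same total traffic, with two layers routing to a few experts and two spreading evenly.

```python
def global_topk(counts, budget):
    """Place the `budget` most popular (layer, expert) pairs, ignoring layer boundaries."""
    pairs = [(c, layer, e) for layer, row in enumerate(counts) for e, c in enumerate(row)]
    pairs.sort(reverse=True)
    return {(layer, e) for _, layer, e in pairs[:budget]}


def per_layer_topk(counts, budget):
    """Split the same budget evenly and keep each layer's own most popular experts."""
    per_layer = budget // len(counts)
    placed = set()
    for layer, row in enumerate(counts):
        best = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:per_layer]
        placed.update((layer, e) for e in best)
    return placed


def slots_per_layer(placed, n_layers):
    """Count how many resident experts each layer ended up with."""
    return [sum(1 for layer, _ in placed if layer == i) for i in range(n_layers)]


# Synthetic activation counts (assumed): every layer sees 120 activations, but
# layers 0-1 concentrate them on two experts while layers 2-3 spread them out.
counts = [
    [50, 50, 5, 5, 5, 5],
    [48, 52, 5, 5, 5, 5],
    [20, 20, 20, 20, 20, 20],
    [21, 19, 20, 20, 20, 20],
]
budget = 4
print(slots_per_layer(global_topk(counts, budget), len(counts)))     # [2, 2, 0, 0]
print(slots_per_layer(per_layer_topk(counts, budget), len(counts)))  # [1, 1, 1, 1]
```

With the same four slots, the global ranking leaves layers 2 and 3 with nothing resident, while the per-layer split keeps one expert on GPU for every layer.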

#### Fiddler head-to-head (A100-80GB, Mixtral-8x7B)

We ran Fiddler and MoE-PolicyLang on the same hardware, model, prompt, and methodology (n=5, greedy decoding, 64 tokens):

| Config | tok/s (±σ) | 95% CI | GPU Peak | Hit Rate | Transfers |
|--------|-----------|--------|----------|----------|-----------|
| **Fiddler** | **4.17 ± 0.02** | [4.16, 4.18] | 80.6 GB | 88.3% | — |
| MPL fiddler_equiv (cap=2) | 0.18 ± 0.00 | [0.18, 0.18] | 6.4 GB | 19.5% | 4,283 |
| MPL balanced (cap=4) | 0.29 ± 0.00 | [0.29, 0.29] | 39.5 GB | 46.4% | 2,665 |
| MPL generous (cap=6) | 0.45 ± 0.00 | [0.45, 0.45] | 61.7 GB | 71.0% | 1,726 |

All MPL configs produce **bit-identical output**. Fiddler is 9–23× faster.

**The gap is mechanism, not policy.** Fiddler uses an optimized C++/CUDA transfer pipeline with pre-allocated GPU memory slots and direct DMA. MoE-PolicyLang dispatches through Python-level `Tensor.to()` calls in the HuggingFace forward pass. At Fiddler's 85% GPU residency (217/256 experts on-device), the *placement strategy* is not load-bearing — the model mostly fits. MoE-PolicyLang is a **policy specification layer**, not a serving system: it specifies *which* experts to cache, evict, and prefetch, but does not implement the physical mechanism that moves expert tensors between devices. Integrating MoE-PolicyLang's policy layer into an optimized transfer pipeline (e.g., via vLLM or a C++/CUDA backend) is future work.

**3. The allocation signal does not matter.** We tested six signals (Shannon entropy, inverse top-k mass, inverse variance, inverse KL, inverse Gini, uniform) and none differentiates from uniform by more than 2.5pp in hit rate, and all six collapse to within noise of uniform in wall-clock on two models. **Uniform allocation is the default.** Shannon entropy is available as an opt-in for models with high inter-layer entropy spread (ΔH ≳ 1 nat), but we measured it to be within noise of uniform on every model tested end-to-end.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/deepseek_entropy_allocation.png" width="600" alt="Per-layer entropy and capacity allocation">
</p>

| Strategy | Total slots | Hit Rate | Δ vs shared | Wall-clock (A100) |
|----------|------------|----------|---|---|
| Shared cache | 32 | 48.6% | baseline | 1.60 tok/s |
| Per-layer uniform | 864 (27×) | 63.3% | +14.7pp | 10.22 tok/s |
| Per-layer entropy | 864 (27×) | 65.5% | +16.9pp | 10.17 tok/s (≈ uniform) |
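
For readers who want to see what an entropy-based allocation signal involves, here is a minimal sketch. The function names, the `min_slots` floor, and the largest-remainder rounding are assumptions made for illustration; the library's own allocation logic may differ, and the routing counts are invented.

```python
import math


def shannon_entropy(counts):
    """Entropy (in nats) of one layer's expert-activation distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def allocate(entropies, total_slots, min_slots=1):
    """Split total_slots across layers in proportion to entropy."""
    n = len(entropies)
    spare = total_slots - min_slots * n
    weight_sum = sum(entropies)
    if weight_sum == 0:
        shares = [spare / n] * n          # no signal: fall back to uniform
    else:
        shares = [spare * h / weight_sum for h in entropies]
    slots = [min_slots + int(s) for s in shares]
    # Hand out slots lost to truncation, largest fractional part first.
    leftovers = total_slots - sum(slots)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftovers]:
        slots[i] += 1
    return slots


# Invented routing counts: layer 0 is concentrated, layer 1 uniform, layer 2 in between.
layer_counts = [[97, 1, 1, 1], [25, 25, 25, 25], [70, 10, 10, 10]]
entropies = [shannon_entropy(c) for c in layer_counts]
slots = allocate(entropies, total_slots=12)
print([round(h, 3) for h in entropies], slots)
```

Uniform allocation corresponds to giving every layer `total_slots / n` slots, which is the default the results above support.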

### Live inference on consumer GPU

**When the model doesn't fit**: Qwen1.5-MoE-A2.7B (~28.6 GB fp16) on RTX 5080 (16 GB VRAM). Without MoE-PolicyLang, the only option is `device_map="auto"` at 0.57 tok/s. With a 4-line DSL policy:

| Config | Strategy | Cap | VRAM | tok/s | 95% CI |
|--------|----------|-----|------|-------|--------|
| Baseline (`auto`) | — | — | 12.0 GB | 0.57±0.00 | — |
| Skeleton | LRU (cap=1) | 1 | 4.7 GB | 4.23±0.22 | [4.04, 4.42] |
| Aggressive | LRU | 2 | 5.2 GB | 4.17±0.03 | [4.14, 4.20] |
| Balanced | LFU+hist. | 4 | 7.3 GB | 4.35±0.06 | [4.30, 4.40] |
| **Generous** | **LFU+hist.** | **8** | **10.1 GB** | **4.61±0.08** | **[4.54, 4.68]** |

**Cost-performance note**: 4.61 tok/s on a $1k consumer GPU — comparable absolute throughput to Fiddler's 4.17 tok/s on a $15k A100 (different model; not directly comparable).

**Decomposition**: ~92% of the 8.1× speedup comes from expert-aware loading (skeleton on GPU, experts on CPU) — even a capacity-1 "every dispatch is a miss" config reaches 4.23 tok/s (7.4×). Caching adds the remaining +0.38 tok/s. The DSL's contribution is not the loading mechanism (which any system could implement) but making the remaining 8% — the policy layer that chooses *what* to cache, evict, and prefetch — accessible without runtime modification, composable across strategies, and adaptable at runtime via `adapt` rules that no static config can express. On higher-expert-count models, the policy layer's share grows: on DeepSeek (A100), matched-budget per-layer allocation gains +14.7pp hit rate over flat caching with the *same* total slot count — a pure policy-structure effect with no capacity confound.
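
The decomposition figures follow from the table's tok/s values. The check below uses one reading of the "~92%" share, namely the ratio of the skeleton speedup to the full speedup; that interpretation is an assumption.

```python
baseline, skeleton, generous = 0.57, 4.23, 4.61   # tok/s from the table above

total_speedup = generous / baseline               # full policy vs. device_map="auto"
loading_speedup = skeleton / baseline             # capacity-1 config vs. baseline
loading_share = loading_speedup / total_speedup   # same as skeleton / generous
caching_gain = generous - skeleton                # tok/s added by caching

print(f"{total_speedup:.1f}x {loading_speedup:.1f}x {loading_share:.0%} +{caching_gain:.2f} tok/s")
# 8.1x 7.4x 92% +0.38 tok/s
```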

n=5, bootstrap 95% CIs. Output correctness: greedy decoding (`do_sample=False`) produces bit-identical token sequences across all policy configs vs. `device_map="auto"` baseline (4 prompts × 3 policies = 12 comparisons); perplexity on wikitext-2 matches within 0.024%.
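
A bootstrap confidence interval over five runs can be computed with the standard library alone. The sketch below uses the percentile method, which is an assumption since the exact bootstrap variant is not stated here, and the `runs` values are placeholders rather than the measured data.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical tok/s readings from five runs (placeholders, not the measured data).
runs = [4.52, 4.58, 4.61, 4.66, 4.70]
low, high = bootstrap_ci(runs)
print(f"mean={statistics.fmean(runs):.2f}  95% CI=[{low:.2f}, {high:.2f}]")
```

With only five samples the interval is coarse, so it is best read alongside the per-run standard deviation.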

**When the model fits** (overhead measurement): OLMoE-1B-7B (~14 GB) fits entirely on 16 GB VRAM. Here, vanilla (no hooks) is fastest at 39.2 tok/s — the policy hooks add 12–14% overhead. This is not the target scenario for offloading — it measures overhead when there is nothing to offload. MoE-PolicyLang is for models that *don't fit*.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/hitrate_with_ci.png" width="600" alt="Hit rate with bootstrap confidence intervals">
</p>

---

## Installation

From PyPI:
```bash
pip install moe-policylang             # DSL only (no GPU deps)
pip install "moe-policylang[gpu]"      # + torch, transformers, accelerate
pip install "moe-policylang[all]"      # everything
```

From source (development):
```bash
git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"
```

Cython fast path (for complex policies):
```bash
pip install "moe-policylang[cython]"
python setup_cython.py build_ext --inplace
```
Python dispatch ranges from 6 µs/layer (simple LRU) to 47 µs/layer (composed policies with triggers). The Cython path targets the high end — `freq_threshold` and `composed_full` drop from 28–47 µs to < 10 µs/layer. Simple policies like `lru_basic` (6 µs) see no benefit.

---

## Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace model — no model-specific code required. We have evaluated on:

| Model | Experts × Layers | Routing | Hardware |
|-------|-----------------|---------|----------|
| Mixtral-8×7B-Instruct | 8 × 32 | top-2 | A100-80 GB |
| DeepSeek-V2-Lite | 64 × 27 | top-6 | A100-80 GB |
| Qwen1.5-MoE-A2.7B | 60 × 24 | top-4 | RTX 5080 (16 GB) |
| OLMoE-1B-7B | 64 × 16 | top-8 | RTX 5080 (16 GB) |

---

## Project Structure

```
moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # EPCB: per-layer cache budgeting
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers
```

---

## Running Experiments

```bash
# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_epcb_sensitivity.py  # EPCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure
```

---

## Tests

```bash
python -m pytest tests/ -q
```

453+ tests covering parsing, validation, compilation, runtime dispatch,
adaptive policies, per-layer EPCB, and integration hooks.

---

## Documentation

See [`docs/MANUAL.md`](docs/MANUAL.md) for the full language reference,
runtime API, and policy authoring guide.

---

## Citation

```bibtex
@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}
```

---

## License

[MIT License](LICENSE) — Copyright (c) 2026 Jesse Pokora
