Metadata-Version: 2.4
Name: moe-policylang
Version: 1.1.2
Summary: A domain-specific language for Mixture-of-Experts scheduling policies
Author: Jesse Pokora
License: MIT
Project-URL: Homepage, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Repository, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Documentation, https://github.com/jesse-pokora/MoE-PolicyLang/blob/main/docs/MANUAL.md
Keywords: moe,mixture-of-experts,scheduling,dsl,offloading,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lark<2,>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pyyaml; extra == "dev"
Provides-Extra: gpu
Requires-Dist: torch>=2.0; extra == "gpu"
Requires-Dist: transformers>=5.0.0; extra == "gpu"
Requires-Dist: accelerate; extra == "gpu"
Requires-Dist: requests>=2.33.0; extra == "gpu"
Requires-Dist: urllib3>=2.7.0; extra == "gpu"
Requires-Dist: filelock>=3.20.3; extra == "gpu"
Provides-Extra: cython
Requires-Dist: cython>=3.0; extra == "cython"
Provides-Extra: eval
Requires-Dist: matplotlib>=3.7; extra == "eval"
Requires-Dist: pyyaml; extra == "eval"
Requires-Dist: pandas; extra == "eval"
Requires-Dist: pillow>=12.2.0; extra == "eval"
Provides-Extra: all
Requires-Dist: moe-policylang[cython,dev,eval,gpu]; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# MoE-PolicyLang

**A scheduling language for Mixture-of-Experts models.**

> Author: **Jesse Pokora** &middot; License: [MIT](LICENSE)

---

## What Is This?

Large language models such as Mixtral, DeepSeek, and Qwen use a **Mixture-of-Experts (MoE)** architecture: instead of one giant network, they have dozens of smaller "expert" networks and a router that picks which ones to use for each token. Only a fraction of the experts are active at any time, so the rest can be **offloaded** to CPU memory. This is by design, not a limitation.

But managing that offloading is complex. *Which* experts to keep on GPU? *When* to prefetch the next ones? *Where* to run cache misses — wait for the GPU transfer, or fall back to CPU? And *how* to adapt as the workload shifts?

**Existing systems hardcode these decisions** in hundreds of lines of Python, C++, or CUDA. MoE-PolicyLang replaces all of that with a small, declarative language.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/constrained_throughput.png" width="600" alt="Throughput and hit rate comparison across policies on consumer GPU">
</p>

---

## The Language

A MoE-PolicyLang policy is a `.moe` file with four composable blocks:

```
policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}
```
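
The README does not spell out what `frequency_decay` does. One common reading, assumed here, is that LFU scores are multiplied by the decay factor on every access, so experts that were popular long ago gradually lose priority. The sketch below illustrates that idea only; `DecayedLFU` is a hypothetical class, not the library's cache, and it ages scores in O(n) per access for clarity rather than speed.

```python
class DecayedLFU:
    """LFU cache whose scores decay on every access (illustrative sketch only)."""

    def __init__(self, capacity: int, decay: float = 0.9) -> None:
        self.capacity = capacity
        self.decay = decay
        self.score: dict[int, float] = {}  # expert id -> decayed frequency

    def access(self, expert: int) -> bool:
        """Return True on a hit; on a miss, evict the lowest-scoring expert if full."""
        hit = expert in self.score
        # Age every resident expert before crediting the one just used.
        for key in self.score:
            self.score[key] *= self.decay
        if not hit and len(self.score) >= self.capacity:
            victim = min(self.score, key=self.score.get)
            del self.score[victim]
        self.score[expert] = self.score.get(expert, 0.0) + 1.0
        return hit


# Same access pattern, with and without decay.
decayed = DecayedLFU(capacity=2, decay=0.5)
plain = DecayedLFU(capacity=2, decay=1.0)  # decay = 1.0 is ordinary LFU
for expert in [1, 1, 2, 3]:
    decayed.access(expert)
    plain.access(expert)
print(sorted(decayed.score), sorted(plain.score))
```

With decay, expert 1's early popularity fades and it is the one evicted; plain LFU keeps expert 1 and evicts the more recent expert 2 instead.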

| Block | Controls | Options |
|-------|----------|---------|
| **cache** | Which experts stay on GPU | LRU, LFU, score-based, frequency-threshold |
| **prefetch** | Proactive loading | History, affinity, lookahead |
| **schedule** | Where to run cache misses | GPU-only, CPU-fallback, hybrid |
| **adapt** | Runtime self-tuning | Conditional rules that hot-swap components |

**Switching from LRU to LFU?** Change one word. **Adding prefetching?** Two lines.
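
To make the `adapt` rule in the example policy concrete, here is a minimal sketch of how a rule like `when hit_rate < 0.4 for 100 accesses` could be evaluated. It assumes the condition means "the hit rate over the most recent 100 accesses is below 0.4"; the README does not define the exact semantics. `WindowTrigger` and the `policy` dict are hypothetical names for illustration, not part of the `moe_policylang` API.

```python
from collections import deque


class WindowTrigger:
    """Fires once when the hit rate over the last `window` accesses is below `threshold`."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.window = window
        self.recent: deque[int] = deque(maxlen=window)  # 1 for a hit, 0 for a miss
        self.fired = False

    def record(self, hit: bool) -> bool:
        """Record one cache access; return True on the access where the rule fires."""
        self.recent.append(1 if hit else 0)
        if self.fired or len(self.recent) < self.window:
            return False  # already fired, or not enough history yet
        if sum(self.recent) / self.window < self.threshold:
            self.fired = True
            return True
        return False


# Mirror `when hit_rate < 0.4 for 100 accesses { eviction = lru }`.
policy = {"eviction": "lfu"}
trigger = WindowTrigger(threshold=0.4, window=100)

# Synthetic access stream: one hit in every four accesses (25% hit rate).
fired_at = None
for i in range(200):
    if trigger.record(hit=(i % 4 == 0)):
        policy["eviction"] = "lru"  # hot-swap the eviction component
        fired_at = i

print(policy, fired_at)
```

The trigger stays silent until the window is full, so the swap happens on the 100th access rather than on the first few cold misses.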

---

## Two Lines to Attach

```python
import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from your model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions
```

Or write a policy explicitly:

```python
mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")
```

Or load a `.moe` file:

```python
with open("my_policy.moe") as f:  # close the file handle once the policy text is read
    mgr = moe_policylang.attach(model, f.read())
```

---

## Why a Language?

Python dicts could configure this. The DSL adds three things they can't:

1. **Static validation** — 20 semantic rules catch bad policies at parse time, not mid-inference
2. **Portability** — `.moe` files are shareable, diffable, and tool-agnostic
3. **Constraint** — you can't write arbitrary code in a scheduling policy; the grammar limits you to what makes sense
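
As a rough illustration of the first point, the sketch below shows what static validation of a `cache` block could look like. The three rules and the names `CacheBlock`, `validate_cache`, and `VALID_EVICTION` are assumptions made for this example; they are not the library's actual 20 rules or its IR types.

```python
from dataclasses import dataclass
from typing import Optional

# Strategy names assumed for illustration, based on the options table.
VALID_EVICTION = {"lru", "lfu", "score", "frequency_threshold"}


@dataclass
class CacheBlock:
    capacity: int
    eviction: str
    frequency_decay: Optional[float] = None


def validate_cache(block: CacheBlock) -> list[str]:
    """Return human-readable errors; an empty list means the block is valid."""
    errors = []
    if block.capacity <= 0:
        errors.append("cache.capacity must be a positive integer")
    if block.eviction not in VALID_EVICTION:
        errors.append(f"cache.eviction: unknown strategy '{block.eviction}'")
    if block.frequency_decay is not None:
        if block.eviction != "lfu":
            errors.append("cache.frequency_decay only applies to eviction = lfu")
        elif not 0.0 < block.frequency_decay <= 1.0:
            errors.append("cache.frequency_decay must be in (0, 1]")
    return errors


print(validate_cache(CacheBlock(capacity=16, eviction="lfu", frequency_decay=0.9)))
print(validate_cache(CacheBlock(capacity=0, eviction="fifo")))
```

Because checks like these run before any model is loaded, a bad policy fails immediately rather than partway through generation.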

---

## Results

### The abstraction is effectively free

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/latency_with_ci.png" width="600" alt="Dispatch overhead with 95% confidence intervals">
</p>

All policies add **at most about 3.2% overhead** on A100: dispatch costs 6–47 µs per layer, against 1,459 µs for the MoE forward pass.

### 14–40× less code than published systems

| System | Their LOC | MoE-PolicyLang | Reduction |
|--------|-----------|-----------|----------|
| Fiddler | **280** | 7 lines | 40× |
| HybriMoE | ~500 | 14 lines | 36× |
| MoE-Infinity | **520** | 16 lines | 33× |
| vLLM | **300** | 12 lines | 25× |
| ExpertFlow | ~400 | 16 lines | 25× |
| FineMoE | ~350 | 25 lines | 14× |

**Bold** LOC counts are measured from open-source repos (primary
expert-management function or module — e.g., Fiddler's `set_expert_loc()`
in `src/fiddler/mixtral.py`); others estimated from paper descriptions.
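
The reduction factors in the table are the ratio of the two line counts, rounded to the nearest integer. A quick check in Python, using only the numbers from the table (the `reduction` helper is written for this example):

```python
# (system, LOC in the original system, lines of MoE-PolicyLang), copied from the table
systems = [
    ("Fiddler", 280, 7),
    ("HybriMoE", 500, 14),
    ("MoE-Infinity", 520, 16),
    ("vLLM", 300, 12),
    ("ExpertFlow", 400, 16),
    ("FineMoE", 350, 25),
]


def reduction(original_loc: int, policy_lines: int) -> int:
    """Ratio rounded half up (Python's round() would turn 32.5 into 32)."""
    return int(original_loc / policy_lines + 0.5)


reductions = {name: reduction(loc, lines) for name, loc, lines in systems}
for name, factor in reductions.items():
    print(f"{name}: {factor}x")
```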

### Policy selection produces measurable differences

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/capacity_sweep.png" width="600" alt="Cache hit rate vs capacity for Mixtral and DeepSeek">
</p>

Policy choice changes the measured hit rate. Mixtral (8 experts per layer) saturates at small cache capacities, while DeepSeek (64 experts per layer) needs smarter strategies.

### EPCB: Per-layer cache budgeting (with an honest negative result)

Not all layers see the same routing pattern — some concentrate on a few experts, others spread across many. **Empirical Per-layer Cache Budgeting (EPCB)** has two parts:

1. **Per-layer cache structure** is the load-bearing lever: replacing a single shared cache with per-layer caches at the same total budget yields +14.7pp hit rate on DeepSeek-V2-Lite in trace replay, and **+540% end-to-end throughput** on A100 (1.60 → 10.22 tok/s), with all CPU↔GPU transfers eliminated in steady state. Output was verified to be bit-identical to a fully resident baseline.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/cache_structure.png" width="600" alt="Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer">
</p>

2. **The allocation signal does not matter.** We tested six allocation signals (Shannon entropy, inverse top-k mass, inverse variance, inverse KL, inverse Gini, and uniform). None of the five non-uniform signals differs from uniform allocation by more than 2.5pp in hit rate at any budget, and all of them collapse to within noise of uniform in wall-clock throughput on the two models tested end-to-end. We retain Shannon entropy as the default allocator for principled reasons, but recommend uniform allocation in practice for its simplicity.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/deepseek_entropy_allocation.png" width="600" alt="Per-layer entropy and capacity allocation">
</p>

| Strategy | Hit Rate | Δ vs shared | Wall-clock (A100) |
|----------|----------|---|---|
| Shared cache (32 slots) | 48.6% | baseline | 1.60 tok/s |
| Per-layer uniform (864 slots) | 63.3% | +14.7pp | 10.22 tok/s (+540%) |
| Per-layer Shannon entropy (864 slots) | 65.5% | +16.9pp | 10.17 tok/s (≈ uniform) |
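
For readers who want to see what a Shannon-entropy allocator might look like, here is a minimal sketch. It assumes slots are split in proportion to each layer's routing entropy, with a per-layer floor and largest-remainder rounding. `shannon_entropy`, `allocate`, and `min_slots` are hypothetical names for this example, not the implementation in `runtime/per_layer.py`.

```python
import math


def shannon_entropy(counts: list[int]) -> float:
    """Entropy in bits of one layer's empirical expert-usage distribution."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def allocate(entropies: list[float], total_slots: int, min_slots: int = 1) -> list[int]:
    """Split `total_slots` across layers in proportion to entropy.

    Every layer first receives `min_slots`; the remainder is shared
    proportionally, and leftover slots go to the largest fractional parts.
    """
    n = len(entropies)
    spare = total_slots - min_slots * n
    if spare < 0:
        raise ValueError("total_slots is too small for min_slots per layer")
    weight = sum(entropies)
    if weight == 0:  # every layer is deterministic: fall back to uniform
        shares = [spare / n] * n
    else:
        shares = [spare * h / weight for h in entropies]
    slots = [min_slots + int(s) for s in shares]
    leftover = total_slots - sum(slots)
    by_fraction = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in by_fraction[:leftover]:
        slots[i] += 1
    return slots


# Three synthetic layers: concentrated, uniform over 4 experts, split over 2 experts.
usage = [[97, 1, 1, 1], [25, 25, 25, 25], [50, 50, 0, 0]]
entropies = [shannon_entropy(counts) for counts in usage]
print(allocate(entropies, total_slots=12))
```

Passing equal entropies to `allocate` reproduces uniform allocation, which is the variant the results above recommend in practice.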

**Regime caveat**: per-layer caching only helps when the per-layer budget covers each layer's active working set. On 16 GB consumer hardware where the budget is tight (Qwen on RTX 5080), per-layer caching *hurts* throughput by 16%; flat shared caching is the better default in that regime.
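
The caveat can be reproduced on a toy trace. The sketch below replays the same synthetic accesses through one shared LRU cache and through per-layer LRU caches. `LRUCache`, `replay_shared`, and `replay_per_layer` are hypothetical helpers written for this example, and the trace is made up, not taken from the evaluation.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: OrderedDict = OrderedDict()

    def access(self, key) -> bool:
        """Return True on a hit; on a miss, insert `key`, evicting the LRU entry if full."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        if self.capacity > 0:
            if len(self.items) >= self.capacity:
                self.items.popitem(last=False)
            self.items[key] = None
        return False


def replay_shared(trace, total_slots: int) -> float:
    """Hit rate of one cache keyed by (layer, expert)."""
    cache = LRUCache(total_slots)
    return sum(cache.access(key) for key in trace) / len(trace)


def replay_per_layer(trace, slots_per_layer: int) -> float:
    """Hit rate when every layer owns its own cache of `slots_per_layer` entries."""
    caches: dict[int, LRUCache] = {}
    hits = 0
    for layer, expert in trace:
        if layer not in caches:
            caches[layer] = LRUCache(slots_per_layer)
        hits += caches[layer].access(expert)
    return hits / len(trace)


# Layer 0 cycles through 3 experts; layer 1 always routes to expert 0.
trace = []
for step in range(10):
    trace.append((0, step % 3))
    trace.append((1, 0))

print(replay_shared(trace, total_slots=4))        # 4 slots in one pool
print(replay_per_layer(trace, slots_per_layer=2))  # same 4 slots, split 2 + 2
print(replay_per_layer(trace, slots_per_layer=3))  # budget covers the working set
```

On this trace the shared pool reaches an 80% hit rate, while splitting the same four slots as two per layer drops to 45% because layer 0's three-expert working set no longer fits. Raising the per-layer budget to three slots restores 80%.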

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/per_layer_regime.png" width="600" alt="When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region">
</p>

### Live inference on consumer GPU

OLMoE-1B-7B on RTX 5080 Laptop (16 GB VRAM):

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/hitrate_with_ci.png" width="600" alt="Hit rate with bootstrap confidence intervals">
</p>

| Policy | Capacity | Hit Rate | tok/s |
|--------|-----|----------|-------|
| Vanilla (no hooks) | — | — | 39.2 |
| Naive LRU | 4 | 2.4% | 34.6 |
| LRU | 16 | 26.3% | 34.7 |
| LFU+History | 16 | 27.1% | 33.8 |
| **EPCB** | **16** | **47.3%** | 33.6 |

---

## Installation

From PyPI:
```bash
pip install moe-policylang             # DSL only (no GPU deps)
pip install "moe-policylang[gpu]"      # + torch, transformers, accelerate
pip install "moe-policylang[all]"      # everything
```

From source (development):
```bash
git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"
```

Cython fast path (< 10 µs/layer):
```bash
pip install "moe-policylang[cython]"
python setup_cython.py build_ext --inplace
```

---

## Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace model — no model-specific code required. We have evaluated on:

| Model | Experts × Layers | Routing | Hardware |
|-------|-----------------|---------|----------|
| Mixtral-8×7B-Instruct | 8 × 32 | top-2 | A100-80 GB |
| DeepSeek-V2-Lite | 64 × 27 | top-6 | A100-80 GB |
| Qwen1.5-MoE-A2.7B | 60 × 24 | top-4 | RTX 5080 (16 GB) |
| OLMoE-1B-7B | 64 × 16 | top-8 | RTX 5080 (16 GB) |

---

## Project Structure

```
moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # EPCB: per-layer cache budgeting
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers
```

---

## Running Experiments

```bash
# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_epcb_sensitivity.py  # EPCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure
```

---

## Tests

```bash
python -m pytest tests/ -q
```

398+ tests covering parsing, validation, compilation, runtime dispatch,
adaptive policies, per-layer EPCB, and integration hooks.

---

## Documentation

See [`docs/MANUAL.md`](docs/MANUAL.md) for the full language reference,
runtime API, and policy authoring guide.

---

## Citation

```bibtex
@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}
```

---

## License

[MIT License](LICENSE) — Copyright (c) 2026 Jesse Pokora
