Metadata-Version: 2.4
Name: moe-policylang
Version: 1.3.2
Summary: A domain-specific language for Mixture-of-Experts scheduling policies
Author: Jesse Pokora
License: MIT
Project-URL: Homepage, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Repository, https://github.com/jesse-pokora/MoE-PolicyLang
Project-URL: Documentation, https://github.com/jesse-pokora/MoE-PolicyLang/blob/main/docs/MANUAL.md
Keywords: moe,mixture-of-experts,scheduling,dsl,offloading,llm
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: lark<2,>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pyyaml; extra == "dev"
Provides-Extra: gpu
Requires-Dist: torch>=2.0; extra == "gpu"
Requires-Dist: transformers>=5.0.0; extra == "gpu"
Requires-Dist: accelerate; extra == "gpu"
Requires-Dist: requests>=2.33.0; extra == "gpu"
Requires-Dist: urllib3>=2.7.0; extra == "gpu"
Requires-Dist: filelock>=3.20.3; extra == "gpu"
Provides-Extra: vllm
Requires-Dist: vllm>=0.21; extra == "vllm"
Provides-Extra: cython
Requires-Dist: cython>=3.0; extra == "cython"
Provides-Extra: eval
Requires-Dist: matplotlib>=3.7; extra == "eval"
Requires-Dist: pyyaml; extra == "eval"
Requires-Dist: pandas; extra == "eval"
Requires-Dist: pillow>=12.2.0; extra == "eval"
Provides-Extra: all
Requires-Dist: moe-policylang[cython,dev,eval,gpu,vllm]; extra == "all"
Dynamic: license-file
Dynamic: requires-python

# MoE-PolicyLang

**A scheduling language for Mixture-of-Experts models.**

> Author: **Jesse Pokora** &middot; License: [MIT](LICENSE)

---

## TL;DR

- Mixture-of-Experts (MoE) models pack a lot of weights into
  "experts", but only a few experts fire per token. The rest can sit
  in CPU RAM and be pulled across PCIe to the GPU on demand.
- *Which* experts to keep on GPU, *when* to prefetch the next ones,
  and *what to do on a miss* is a policy question. Every existing
  MoE serving system hardcodes its policy inside the runtime.
- MoE-PolicyLang is a small declarative language for that policy.
  Swapping LRU for LFU is a one-word change; the cache, prefetch,
  scheduler, and eviction-trigger components stay the same.
- Headline number: Qwen1.5-MoE (28.6 GB fp16) runs on a 16 GB
  RTX 5080 at 4.61 tok/s, 8.1x faster than HuggingFace's default
  `device_map="auto"`, with bit-identical output.

---

## The problem

A Mixture-of-Experts layer replaces a single feed-forward block with
N "experts" plus a small router. For each token the router picks the
top-k experts (typically k=2 to k=8) and only those fire. Mixtral
uses 8 experts per layer (top-2), Qwen1.5-MoE uses 60 (top-4),
DeepSeek-V2-Lite uses 64 (top-6). Most experts sit idle on any given
token, but the weights still have to be reachable in case the router
picks them.
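
To make the routing step concrete, here is a minimal pure-Python sketch of top-k selection for one token. The function name and the logits are made up for illustration, and whether the selected gate weights are renormalized varies between models, so treat that line as an assumption.

```python
import math

def top_k_route(logits, k):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    # Softmax over all expert logits (subtract the max for numerical stability).
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize the kept weights so they sum to 1.
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]

# 8 experts, top-2 (the Mixtral shape); the logits are invented numbers.
routes = top_k_route([0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2], k=2)
print(routes)
```

Only the experts returned here need their weights on the GPU for this token; the other six are the ones an offloading policy can leave in CPU RAM.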

For a 28 GB MoE model on a 16 GB GPU, that means **offloading**:
keep some expert weights on GPU, the rest in CPU RAM, and move
weights across the PCIe bus when the router asks for one that isn't
resident. PCIe transfer is roughly two orders of magnitude slower
than reading from GPU memory, so the worst case (every expert a
miss) is painful: HuggingFace's `device_map="auto"` runs Qwen1.5-MoE
on a 5080 at 0.57 tok/s.

Doing better requires four coordinated decisions:

| Decision | Question | Example strategies |
|---|---|---|
| **Cache**     | Which experts stay on GPU? | LRU (drop least-recently-used), LFU (drop least-frequently-used), score-based |
| **Prefetch**  | Which to load before they're requested? | Affinity (layer L → L+1 patterns), history, lookahead |
| **Schedule** | What to do on a cache miss?  | Wait for the GPU transfer, run on CPU, decide per-call |
| **Adapt**     | When to change strategy? | Conditional rules on runtime metrics |

Every existing MoE serving system (ExpertFlow, Fiddler, MoE-Infinity,
HybriMoE, ProMoE, FineMoE) hardcodes these four decisions inside its
runtime. Changing any one strategy means reading and modifying the
system's expert-management module: roughly 200 to 2,000 LOC depending
on the system.

MoE-PolicyLang lets you write the policy as a short `.moe` file and
attach it to a HuggingFace or vLLM model. The runtime hooks that
consume the policy stay the same; only the policy text changes.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/constrained_throughput.png" width="600" alt="Throughput and hit rate comparison across policies on consumer GPU">
</p>

---

## The Language

A MoE-PolicyLang policy is a `.moe` file with four composable blocks:

```
policy balanced {
    cache {
        capacity = 16
        eviction = lfu
        frequency_decay = 0.9
    }
    prefetch {
        strategy = history
        budget = 4
    }
    schedule { mode = hybrid }
    adapt {
        when hit_rate < 0.4 for 100 accesses
            { eviction = lru }
    }
}
```

| Block | Controls | Strategies |
|-------|----------|------------|
| **cache** | Which experts stay on GPU | **LRU** drops the least-recently-used; **LFU** drops the least-frequently-used with decay; **score** ranks by router gate value; **freq-threshold** keeps anything above a frequency cutoff |
| **prefetch** | Experts loaded before they're requested | **History** uses a running co-occurrence matrix; **affinity** uses layer L → L+1 patterns; **lookahead** peeks ahead in the router output |
| **schedule** | What happens on a cache miss | **gpu-only** waits for the transfer; **cpu-fallback** runs the missed expert on CPU; **hybrid** decides per-call based on estimated latency |
| **adapt** | Self-tuning at runtime | Conditional rules of the form `when <metric> <op> <value> for <window> { <override> }` |

Switching from LRU to LFU is a one-word change. Adding prefetching
is two lines.
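
What the one-word LRU/LFU switch changes underneath can be sketched with two toy caches. This is not the project's `runtime/cache.py`; the class names, the per-access decay scheme, and the trace are assumptions chosen to show the behavioral difference.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the expert that was touched least recently."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # expert id -> None, oldest first

    def access(self, expert):
        hit = expert in self.slots
        if hit:
            self.slots.move_to_end(expert)
        else:
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # drop the oldest entry
            self.slots[expert] = None
        return hit

class LFUCache:
    """Evicts the expert with the lowest decayed activation count."""
    def __init__(self, capacity, decay=0.9):
        self.capacity = capacity
        self.decay = decay
        self.counts = {}  # expert id -> decayed frequency

    def access(self, expert):
        hit = expert in self.counts
        # Age every resident count so old hot experts fade over time.
        for key in self.counts:
            self.counts[key] *= self.decay
        if not hit and len(self.counts) >= self.capacity:
            victim = min(self.counts, key=self.counts.get)
            del self.counts[victim]
        self.counts[expert] = self.counts.get(expert, 0.0) + 1.0
        return hit

def hit_rate(cache, trace):
    """Fraction of accesses in the trace that were cache hits."""
    return sum(cache.access(e) for e in trace) / len(trace)

# Toy trace: experts 0 and 1 are hot, experts 2..5 each appear once.
trace = [0, 1, 0, 1, 2, 0, 1, 3, 0, 1, 4, 0, 1, 5, 0, 1]
lru_rate = hit_rate(LRUCache(2), trace)
lfu_rate = hit_rate(LFUCache(2), trace)
print(f"LRU {lru_rate:.3f}  LFU {lfu_rate:.3f}")
```

On this trace the one-off experts keep pushing a hot expert out of the LRU cache, while the frequency counts let LFU hold on to one of them. Real routing traces are less tidy, so the gap will differ.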

---

## Attaching to a model

```python
import moe_policylang
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("allenai/OLMoE-1B-7B-0924")

# Auto-generate a tuned policy from the model + GPU, attach it
mgr = moe_policylang.auto_attach(model)
output = model.generate(...)
print(mgr.get_stats())  # hit rate, transfers, evictions
```

Or write a policy explicitly:

```python
mgr = moe_policylang.attach(model, """
    policy aggressive {
        cache { capacity = 8  eviction = lru }
    }
""")
```

Or load a `.moe` file:

```python
with open("my_policy.moe", encoding="utf-8") as f:  # close the file after reading
    mgr = moe_policylang.attach(model, f.read())
```

`attach()` parses the policy, runs the validator, compiles it to a
`PolicyHook`, and registers forward hooks on every MoE layer of the
model. From that point on, normal `model.generate()` calls trigger
the hooks.

---

## How a policy runs at runtime

On each MoE layer, after the router picks its top-k experts, the
hook runs five steps:

1. **Cache lookup.** For each selected expert, is its weight matrix
   already on GPU? Each lookup records a hit or a miss.
2. **Schedule.** For each miss, the scheduler decides whether to
   wait for a CPU→GPU transfer or to run that expert's compute on
   CPU (the `schedule` block picks the policy).
3. **Cache update.** Newly loaded experts go into the cache. If the
   cache is at capacity, the eviction rule (LRU, LFU, score, …)
   picks something to drop.
4. **Prefetch.** The prefetcher predicts which experts the next few
   layers will want and starts moving them across PCIe.
5. **Trigger check.** Memory-pressure and TTL eviction triggers run,
   evicting beyond the cache replacement rule if needed.
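
The five steps above can be sketched as a toy dispatch function that tracks residency only, with no tensors involved. All names here (`dispatch_layer`, `insert_lru`, `same_experts_next_layer`, the `stats` keys) are hypothetical and are not the `runtime/hooks.py` API; the scheduler always waits for the transfer and step 5 is left unmodeled.

```python
from collections import OrderedDict

def insert_lru(cache, key, capacity, stats):
    """Add key to the cache, evicting the least-recently-used entry if full."""
    if len(cache) >= capacity:
        cache.popitem(last=False)
        stats["evictions"] += 1
    cache[key] = None

def dispatch_layer(layer, selected, cache, capacity, predict_next, stats):
    """Run one toy pass of the dispatch steps for a single MoE layer."""
    for expert in selected:
        key = (layer, expert)
        # Step 1, cache lookup: record a hit or a miss.
        if key in cache:
            cache.move_to_end(key)
            stats["hits"] += 1
            continue
        stats["misses"] += 1
        # Step 2, schedule: this sketch always waits for the transfer.
        stats["transfers"] += 1
        # Step 3, cache update: the loaded expert becomes resident.
        insert_lru(cache, key, capacity, stats)
    # Step 4, prefetch: load predicted experts before they are requested.
    for key in predict_next(layer, selected):
        if key not in cache:
            insert_lru(cache, key, capacity, stats)
            stats["prefetches"] += 1
    # Step 5, trigger check: memory-pressure and TTL triggers are not modeled.

def same_experts_next_layer(layer, selected):
    """Hypothetical predictor: assume the next layer reuses the same expert ids."""
    return [(layer + 1, expert) for expert in selected]

stats = dict.fromkeys(["hits", "misses", "transfers", "evictions", "prefetches"], 0)
cache = OrderedDict()
dispatch_layer(0, [0, 1], cache, 4, same_experts_next_layer, stats)
dispatch_layer(1, [0, 1], cache, 4, same_experts_next_layer, stats)
print(stats)
```

In this run the prefetch from layer 0 turns both layer-1 lookups into hits, and the layer-1 prefetch then evicts the layer-0 entries because the toy cache holds only four experts.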

The hook is plain Python and adds 6 µs/layer (simple LRU) to
47 µs/layer (composed policy with triggers), against an MoE
forward-pass baseline of ~1,500 µs on A100 — under 3.2% of layer
time.

---

## Why a Language, Not YAML?

The `cache`, `prefetch`, and `schedule` blocks are key-value config
and could be handled by a JSON schema with Pydantic. Two things
push it past that:

**1. EPCB expresses a cache topology, not a parameter set.**
A flat schema like `Cache(eviction="lfu", capacity=32)` picks
algorithms and scalars. EPCB (per-layer cache budgeting, see
Results) picks a *structure*: N independent caches, one per layer,
governed by an allocator, a total budget, a rebalance interval, and
min/max bounds.

```
per_layer {
    allocation         = uniform
    total_budget       = 864
    rebalance_interval = 500
    min_capacity       = 4
    max_capacity       = 48
}
```

You can't reasonably encode "maintain 27 independent caches at
total budget 864 with optional entropy-proportional allocation" in
a flat schema: the specification *is* structural rather than
scalar. EPCB is the existence proof that the grammar earns its
keep.
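
As a rough illustration of what the `per_layer` block above asks the runtime to compute, the following sketch splits a total budget evenly and clamps each layer to the min/max bounds. The function is hypothetical, it ignores `rebalance_interval`, and the real allocator in `runtime/per_layer.py` may work differently.

```python
def uniform_allocation(num_layers, total_budget, min_capacity, max_capacity):
    """Split total_budget evenly across layers, then clamp each share."""
    base, extra = divmod(total_budget, num_layers)
    # Hand any remainder to the first `extra` layers, one slot each.
    shares = [base + (1 if i < extra else 0) for i in range(num_layers)]
    # Clamping can make the sum differ from total_budget; this sketch accepts that.
    return [max(min_capacity, min(max_capacity, s)) for s in shares]

# The values from the per_layer block above, with 27 MoE layers.
caps = uniform_allocation(num_layers=27, total_budget=864,
                          min_capacity=4, max_capacity=48)
print(caps[0], len(caps), sum(caps))
```

With these numbers the budget divides evenly, so every layer gets 32 slots and neither bound is active.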

**2. `adapt` blocks express conditional runtime behavior.**
Hot-swap rules that monitor metrics and rewrite the policy aren't
config — they're a small embedded rule language with metric,
threshold, window, and rewrite target:

```
adapt {
    when hit_rate < 0.4 for 100 accesses { eviction = lru }
}
```

The grammar restricts what you can write (no arbitrary code in a
scheduling policy), and 20 semantic rules catch bad policies at
parse time rather than mid-inference.
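
One plausible reading of `when hit_rate < 0.4 for 100 accesses { eviction = lru }` is a sliding-window check that applies an override once the window is full and the metric is below the threshold. The sketch below implements that reading only; the class, the fire-once behavior, and the dictionary-based policy are assumptions, not the semantics implemented in `adaptive.py`.

```python
from collections import deque

class AdaptRule:
    """Applies `override` once the hit rate over the last `window` accesses drops below `threshold`."""
    def __init__(self, threshold, window, override):
        self.threshold = threshold
        self.window = window
        self.override = override
        self.recent = deque(maxlen=window)  # 1 for a hit, 0 for a miss
        self.fired = False

    def observe(self, hit, policy):
        """Record one access; return True if the rule fired on this access."""
        self.recent.append(1 if hit else 0)
        # Wait until a full window has been seen, and fire at most once.
        if self.fired or len(self.recent) < self.window:
            return False
        if sum(self.recent) / self.window < self.threshold:
            policy.update(self.override)
            self.fired = True
            return True
        return False

policy = {"eviction": "lfu"}
rule = AdaptRule(threshold=0.4, window=100, override={"eviction": "lru"})
# Synthetic stream with a 30% hit rate: 3 hits in every 10 accesses.
for i in range(100):
    rule.observe(hit=(i % 10 < 3), policy=policy)
print(policy)
```

The rule stays silent for the first 99 accesses because the window is not full, then rewrites the eviction setting on the 100th.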

There are also two other surfaces: a Python eDSL (`@sched.policy`
decorator) for programmatic construction, and `auto_attach` for
zero-config deployment. The `.moe` grammar is what makes the
EPCB and `adapt` semantics work; the others are convenience
wrappers.

---

## Results

### Dispatch overhead

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/latency_with_ci.png" width="600" alt="Dispatch overhead with 95% confidence intervals">
</p>

Per-layer dispatch (the Python hook that decides cache/evict/prefetch)
adds roughly 0.4% to 3.2% of MoE forward-pass time on A100: 6–47 µs/layer
against a 1,459 µs baseline. This is the policy decision overhead;
cache misses and weight transfers are accounted for separately and
depend on the policy and workload.

### Policy authoring effort

To add a new policy variant to one of these systems, a developer
needs to read and modify the system's expert-management module.
MoE-PolicyLang replaces that with a short `.moe` file. The 14–40x
numbers below count lines a user writes to express a policy; they
do not include MoE-PolicyLang's own runtime, which is around 4,300
LOC.

| System | Expert-mgmt module | DSL equivalent | Authoring reduction |
|--------|-------------------|---------------|-------------------|
| Fiddler | 280 LOC | 7 lines | 40x |
| HybriMoE | ~500 LOC | 14 lines | 36x |
| MoE-Infinity | 520 LOC | 16 lines | 33x |
| vLLM | 300 LOC | 12 lines | 25x |
| ExpertFlow | ~400 LOC | 16 lines | 25x |
| FineMoE | ~350 LOC | 25 lines | 14x |

Methodology: non-blank, non-comment lines in the primary
expert-management module. Measured sources: Fiddler from
`set_expert_loc()` + `execute_fiddler()` in `src/fiddler/mixtral.py`
(280 LOC); MoE-Infinity from `expert_prefetcher.py` +
`expert_cache.py` (520 LOC); vLLM from `MixtralMoE` expert dispatch
in `vllm/model_executor/` (300 LOC). Counts marked `~` are estimated
from paper descriptions of closed-source systems. Switching between
strategies (LRU to LFU) is a one-word change in the DSL versus
rewriting cache data structures in the hand-coded version.
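
The counting rule in the methodology (non-blank, non-comment lines) and the reduction factors in the table can be reproduced with a few lines of Python. The helper is hypothetical, and treating `#` as the comment prefix is an assumption; the LOC and line counts are copied from the table.

```python
def count_loc(source, comment_prefix="#"):
    """Count non-blank lines that are not whole-line comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith(comment_prefix)
    )

policy_text = """
# toy policy
policy aggressive {
    cache { capacity = 8  eviction = lru }
}
"""
print(count_loc(policy_text))

# (expert-management LOC, DSL lines) per system, as listed in the table.
rows = {
    "Fiddler": (280, 7), "HybriMoE": (500, 14), "MoE-Infinity": (520, 16),
    "vLLM": (300, 12), "ExpertFlow": (400, 16), "FineMoE": (350, 25),
}
for name, (loc, dsl_lines) in rows.items():
    print(f"{name}: {loc / dsl_lines:.1f}x")
```

The unrounded ratios run from 14.0x to 40.0x, which is where the 14–40x range comes from.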

### Policy selection matters when the cache can't hold the working set

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/capacity_sweep.png" width="600" alt="Cache hit rate vs capacity for Mixtral and DeepSeek">
</p>

Capacity sweeps on offline traces:
- Mixtral-8x7B (8 experts, top-2) saturates at cap=8 with around
  100% hit rate, since all experts fit. Policy choice barely
  matters here.
- DeepSeek-V2-Lite (64 experts, top-6) reaches only 51% hit rate at
  cap=32 (half the experts). LFU consistently beats LRU across
  budgets because DeepSeek has significant frequency skew (some
  experts activated 3-5x more often). This is the regime where
  policy selection and per-layer budgeting make a measurable
  difference.

### EPCB: Per-layer cache budgeting (with a negative result)

A **flat cache** holds, say, 32 expert weights total, shared
across all 27 MoE layers of DeepSeek-V2-Lite. If layer 0 is hot,
LFU keeps its experts. Layers that came in early and haven't
appeared recently get evicted. In steady state, a flat cap=32
cache on DeepSeek covers only ~11 of the 27 layers — the other 16
have zero experts cached and miss every dispatch.

A **per-layer cache** gives each layer its own cache. At 32 slots
per layer that is 864 = 27×32 slots in total. Each layer keeps its own hot
experts.
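
The coverage problem with a single shared cache can be seen in a toy replay. This sketch uses LRU and uniformly random routing rather than LFU on a real trace, so its count will not match the ~11-layer figure above; the function name and parameters are made up for illustration.

```python
from collections import OrderedDict
import random

LAYERS, EXPERTS, TOP_K = 27, 64, 6  # DeepSeek-V2-Lite shape from the text

def layers_covered_by_flat_lru(capacity, tokens, seed=0):
    """Replay random routing through one shared LRU cache and count
    how many layers still have at least one resident expert."""
    rng = random.Random(seed)
    cache = OrderedDict()  # (layer, expert) -> None, oldest first
    for _ in range(tokens):
        for layer in range(LAYERS):
            for expert in rng.sample(range(EXPERTS), TOP_K):
                key = (layer, expert)
                if key in cache:
                    cache.move_to_end(key)
                else:
                    if len(cache) >= capacity:
                        cache.popitem(last=False)
                    cache[key] = None
    return len({layer for layer, _ in cache})

covered = layers_covered_by_flat_lru(capacity=32, tokens=50)
print(f"flat LRU, 32 slots: {covered} of {LAYERS} layers have a resident expert")
```

Because each token walks the layers in order, a 32-slot shared LRU cache ends up holding only the experts of the last few layers visited, and the early layers start the next token with nothing resident.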

Empirical Per-layer Cache Budgeting (EPCB) has three findings,
numbered 1 to 3 below: one positive and two negative.

1. The regime caveat (read this first). Per-layer caching only
helps when the per-layer budget covers each layer's active working
set. On 16 GB consumer hardware, which is the case most readers
will hit, per-layer caching hurts throughput by 16%: the per-layer
budgets are too small to cover each layer's working set, and the
aggregated cache pushes the CUDA allocator to the VRAM ceiling.
Flat shared caching is the default recommendation for
memory-constrained deployments. Per-layer wins when there is VRAM
headroom and high expert counts (DeepSeek-V2-Lite on A100, below).

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/per_layer_regime.png" width="600" alt="When per-layer caching wins vs hurts: DeepSeek/A100 lies in the wins region; Qwen/RTX 5080 in the hurts region">
</p>

2. When the regime permits, per-layer cache structure is what
matters. At matched total budget on DeepSeek-V2-Lite (A100-80GB),
replacing a shared cache with per-layer caches gives +14.7pp hit
rate in offline trace replay and eliminates all CPU/GPU transfers
in steady state. Output is bit-identical to the fully-resident
baseline.

The headline throughput gain (1.60 to 10.22 tok/s, +540%) compares
shared-32 to per-layer-864, which is 27x more total slots. The
matched-budget +14.7pp hit rate and transfer elimination are the
core findings; the 540% wall-clock number folds in the capacity
expansion.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/cache_structure.png" width="600" alt="Flat shared cache leaves layers uncovered; per-layer caches at matched total budget cover every layer">
</p>

This structural difference maps directly onto MoE-aware baselines.
Fiddler's expert placement is a hardcoded global popularity ranking,
which is structurally equivalent to the flat cache on the left. On
an A100-80GB where 85% of Mixtral's experts fit on-device (217/256),
the ranking barely matters because almost everything is resident.
On a constrained GPU where only a fraction of experts fit, a global
ranking starves cold layers (left heatmap), while a per-layer
policy maintains coverage at every layer (right heatmap). Per-layer
caching enables topologies that a flat global ranking cannot
express.

The mechanical payoff is PCIe stall elimination: expert
offloading is memory-bandwidth-bound, so every cache miss costs a
CPU-to-GPU transfer. When per-layer caches cover each layer's
working set, steady-state misses drop to zero, which is why a
hit-rate improvement turns into a 6.4x wall-clock gain (10.22 vs
1.60 tok/s on DeepSeek-V2-Lite). That wall-clock figure compares
shared-32 with per-layer-864, so it also includes the capacity
expansion.

#### Fiddler head-to-head (A100-80GB, Mixtral-8x7B)

Fiddler and MoE-PolicyLang on the same hardware, model, prompt,
and methodology (n=5, greedy decoding, 64 tokens):

| Config | tok/s (±σ) | 95% CI | GPU Peak | Hit Rate | Transfers |
|--------|-----------|--------|----------|----------|-----------|
| Fiddler | 4.17 ± 0.02 | [4.16, 4.18] | 80.6 GB | 88.3% | — |
| MPL fiddler_equiv (cap=2) | 0.18 ± 0.00 | [0.18, 0.18] | 6.4 GB | 19.5% | 4,283 |
| MPL balanced (cap=4) | 0.29 ± 0.00 | [0.29, 0.29] | 39.5 GB | 46.4% | 2,665 |
| MPL generous (cap=6) | 0.45 ± 0.00 | [0.45, 0.45] | 61.7 GB | 71.0% | 1,726 |

All MPL configs produce bit-identical output. Fiddler is 9-23x
faster.

The gap is mechanism, not policy. Fiddler uses an optimized
C++/CUDA transfer pipeline with pre-allocated GPU memory slots and
direct DMA. MoE-PolicyLang dispatches through Python-level
`Tensor.to()` calls in the HuggingFace forward pass. At Fiddler's
85% GPU residency (217/256 experts on-device), placement strategy
isn't doing the work; the model mostly fits.

MoE-PolicyLang is a policy specification layer, not a serving
system. It specifies which experts to cache, evict, and prefetch,
but does not implement the physical mechanism that moves expert
tensors between devices. The vLLM integration below shows the
policy layer composing cleanly with a production inference engine:
the same DSL captures routing decisions from vLLM's optimized MoE
kernel path without modification.

3. The allocation signal does not matter (this is the third EPCB
finding). We tested six signals
(Shannon entropy, inverse top-k mass, inverse variance, inverse KL,
inverse Gini, uniform). None differentiates from uniform by more
than 2.5pp in hit rate, and all six collapse to within noise of
uniform in wall-clock on two models. Uniform is the default.
Shannon entropy is opt-in for models with high inter-layer entropy
spread (ΔH around 1 nat or more), but it was within noise of
uniform on every model tested end-to-end.
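
For readers who want to see what an entropy-proportional signal looks like, here is a minimal sketch: compute each layer's Shannon entropy (in nats) from its expert-activation counts, then split the budget in proportion. The function names, the rounding, and the toy histograms are assumptions; this is not the allocator in `runtime/per_layer.py`.

```python
import math

def shannon_entropy(counts):
    """Entropy in nats of a layer's expert-activation histogram."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def proportional_allocation(entropies, total_budget, min_capacity, max_capacity):
    """Split the budget in proportion to entropy, then clamp each share."""
    scale = total_budget / sum(entropies)
    # Rounding and clamping mean the shares may not sum exactly to total_budget.
    return [max(min_capacity, min(max_capacity, round(h * scale)))
            for h in entropies]

layer_counts = [
    [25, 25, 25, 25],  # flat routing: maximum entropy for 4 experts
    [97, 1, 1, 1],     # one dominant expert: low entropy
]
entropies = [shannon_entropy(c) for c in layer_counts]
caps = proportional_allocation(entropies, total_budget=8,
                               min_capacity=1, max_capacity=6)
print([round(h, 3) for h in entropies], caps)
```

The layer with flat routing gets the larger share because its working set is wider. The result above says that, on the models tested, this refinement did not beat a uniform split.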

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/deepseek_entropy_allocation.png" width="600" alt="Per-layer entropy and capacity allocation">
</p>

| Strategy | Total slots | Hit Rate | Δ vs shared | Wall-clock (A100) |
|----------|------------|----------|---|---|
| Shared cache | 32 | 48.6% | baseline | 1.60 tok/s |
| Per-layer uniform | 864 (27x) | 63.3% | +14.7pp | 10.22 tok/s |
| Per-layer entropy | 864 (27x) | 65.5% | +16.9pp | 10.17 tok/s (~ uniform) |

### Live inference on consumer GPU

When the model doesn't fit: Qwen1.5-MoE-A2.7B (~28.6 GB fp16) on
RTX 5080 (16 GB VRAM). Without MoE-PolicyLang, the only option is
`device_map="auto"` at 0.57 tok/s. With a 4-line DSL policy:

| Config | Strategy | Cap | VRAM | tok/s | 95% CI |
|--------|----------|-----|------|-------|--------|
| Baseline (`auto`) | — | — | 12.0 GB | 0.57±0.00 | — |
| Skeleton | LRU | 1 | 4.7 GB | 4.23±0.22 | [4.04, 4.42] |
| Aggressive | LRU | 2 | 5.2 GB | 4.17±0.03 | [4.14, 4.20] |
| Balanced | LFU+hist. | 4 | 7.3 GB | 4.35±0.06 | [4.30, 4.40] |
| Generous | LFU+hist. | 8 | 10.1 GB | 4.61±0.08 | [4.54, 4.68] |

Cost-performance: 4.61 tok/s on a $1k consumer GPU is in the same
range as Fiddler's 4.17 tok/s on a $15k A100, though the models
differ, so this is not a like-for-like comparison.

Decomposition: roughly 92% of the 8.1x speedup comes from
expert-aware loading (skeleton on GPU, experts on CPU). Even a
capacity-1 "every dispatch is a miss" config reaches 4.23 tok/s
(7.4x). Caching adds the remaining +0.38 tok/s. The DSL's
contribution is not the loading mechanism (any system could
implement that) but the remaining 8%: the policy layer that
decides what to cache, evict, and prefetch, accessible without
runtime modification and adaptable at runtime via `adapt` rules
that no static config expresses. On models with more experts, the
policy layer's share grows. On DeepSeek (A100), matched-budget
per-layer allocation gains +14.7pp hit rate over flat caching at
the same total slot count, a pure policy-structure effect with no
capacity confound.
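
The decomposition can be checked directly from the throughput table. The sketch below uses only numbers quoted above; reading "92%" as the capacity-1 throughput divided by the final throughput is an assumption about how the share was computed.

```python
baseline = 0.57  # tok/s, device_map="auto"
skeleton = 4.23  # tok/s, capacity-1 cache (every dispatch is a miss)
generous = 4.61  # tok/s, capacity-8 LFU + history

total_speedup = generous / baseline    # full policy vs baseline
loading_speedup = skeleton / baseline  # expert-aware loading alone
loading_share = skeleton / generous    # assumed definition of the "92%"
caching_gain = generous - skeleton     # what caching adds on top

print(f"total speedup: {total_speedup:.1f}x")
print(f"loading only:  {loading_speedup:.1f}x")
print(f"loading share: {loading_share:.0%}")
print(f"caching adds:  {caching_gain:.2f} tok/s")
```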

n=5, bootstrap 95% CIs. For output correctness: greedy decoding
(`do_sample=False`) produces bit-identical token sequences across
all policy configs vs `device_map="auto"` baseline (4 prompts x 3
policies = 12 comparisons); perplexity on wikitext-2 matches within
0.024%.
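
For reference, a percentile bootstrap over n=5 runs can be computed with the standard library alone. This is a generic sketch, not the project's evaluation script: the function name, the resample count, and the five throughput values are made up for illustration.

```python
import random
import statistics

def bootstrap_ci(samples, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Five invented throughput runs (tok/s); not measurements from this project.
runs = [4.52, 4.58, 4.61, 4.66, 4.70]
low, high = bootstrap_ci(runs)
print(f"mean {statistics.fmean(runs):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

With only five samples the interval is coarse, since each resampled mean can take a limited set of values. That is worth keeping in mind when reading the CI columns.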

When the model fits (overhead measurement): OLMoE-1B-7B (~14 GB)
fits entirely on 16 GB VRAM. There vanilla (no hooks) is fastest at
39.2 tok/s, with the policy hooks adding 12-14% overhead. This is
not the target scenario; it measures overhead when there is nothing
to offload. MoE-PolicyLang is for models that don't fit.

<p align="center">
  <img src="https://raw.githubusercontent.com/jesse-pokora/MoE-PolicyLang/master/docs/images/hitrate_with_ci.png" width="600" alt="Hit rate with bootstrap confidence intervals">
</p>

---

## Installation

From PyPI:
```bash
pip install moe-policylang             # DSL only (no GPU deps)
pip install "moe-policylang[gpu]"      # + torch, transformers, accelerate
pip install "moe-policylang[vllm]"     # + vLLM (GPTQ/AWQ quantized inference)
pip install "moe-policylang[all]"      # everything
```

For quantized models, use the `[vllm]` extra. vLLM handles GPTQ
and AWQ quantization with optimized kernels; MoE-PolicyLang
observes routing decisions and applies policy logic without
managing the tensors directly.

For Blackwell GPUs (RTX 5080/5090), set these env vars before
running vLLM:
```bash
export VLLM_USE_FLASHINFER_SAMPLER=0
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
export VLLM_FLASH_ATTN_VERSION=2
```

From source (development):
```bash
git clone https://github.com/jesse-pokora/MoE-PolicyLang.git
cd MoE-PolicyLang
pip install -e ".[dev,gpu]"
```

Cython fast path (for complex policies):
```bash
pip install "moe-policylang[cython]"
python setup_cython.py build_ext --inplace
```
Python dispatch ranges from 6 µs/layer (simple LRU) to 47 µs/layer
(composed policies with triggers). The Cython path targets the
high end: `freq_threshold` and `composed_full` drop from 28-47 µs
to under 10 µs/layer. Simple policies like `lru_basic` (6 µs) see
no benefit.

---

## Tested Models

MoE-PolicyLang auto-detects MoE structure from any HuggingFace
model with no model-specific code required. Evaluated on:

| Model | Experts x Layers | Routing | Hardware | Backend |
|-------|-----------------|---------|----------|----------|
| Mixtral-8x7B-Instruct | 8 x 32 | top-2 | A100-80 GB | HF Transformers |
| DeepSeek-V2-Lite | 64 x 27 | top-6 | A100-80 GB | HF Transformers |
| Qwen1.5-MoE-A2.7B | 60 x 24 | top-4 | RTX 5080 (16 GB) | HF Transformers |
| Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4 | 60 x 24 | top-4 | RTX 5080 (16 GB) | vLLM |
| OLMoE-1B-7B | 64 x 16 | top-8 | RTX 5080 (16 GB) | HF Transformers |

---

## vLLM Integration

MoE-PolicyLang integrates with [vLLM](https://github.com/vllm-project/vllm)
for production-grade quantized MoE inference. The
`VLLMPolicyRunner` instruments vLLM's router layers to capture
expert routing decisions and feeds them through the policy system.
The same DSL, compiler, and hooks work whether the mechanism layer
is HuggingFace's eager execution or vLLM's optimized kernels.

```python
from moe_policylang.integrations.vllm_backend import VLLMPolicyRunner

runner = VLLMPolicyRunner(
    model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4",
    policy_dsl='''
        policy demo {
            cache { capacity = 8  eviction = lru }
            prefetch { strategy = lookahead  lookahead = 1 }
            schedule { mode = gpu_only }
        }
    ''',
    quantization="gptq",
)

results = runner.generate(["What is expert routing?"], max_tokens=30)
print(results["text"])          # generated text
print(results["policy_stats"])  # cache hits, prefetch accuracy, etc.
```

Verified on RTX 5080 (16 GB), vLLM 0.21, GPTQ-Int4 quantization.
Captures 744 routing events across 24 layers x 60 experts, with a
14.7% cache hit rate and 72% prefetch accuracy from a minimal
8-slot LRU policy.

This shows the DSL is backend-agnostic: the policy specification
layer is independent of the inference engine.

---

## Project Structure

```
moe_policylang/
├── grammar.lark           # Lark LALR grammar (62 productions)
├── parser.py              # Grammar → PolicyIR
├── ir.py                  # Intermediate representation
├── validator.py           # 20 semantic validation rules
├── compiler.py            # IR → CompiledPolicy
├── auto.py                # Auto-generate policies from model + GPU
├── dsl.py                 # Python eDSL (@sched.policy decorator)
├── adaptive.py            # Adaptive policies (adapt blocks)
├── autotuner.py           # Grid-search policy optimizer
├── cli.py                 # CLI: validate, compile, run
├── runtime/
│   ├── hooks.py           # 5-step per-layer dispatch protocol
│   ├── cache.py           # LRU / LFU / Score / FreqThreshold
│   ├── prefetch.py        # Affinity / History / Lookahead
│   ├── scheduler.py       # GPU-only / CPU-fallback / Hybrid
│   ├── per_layer.py       # EPCB — entropy-proportional caching
│   ├── triggers.py        # Memory-pressure & TTL eviction
│   └── _fast/             # Cython-accelerated paths
└── integrations/
    ├── __init__.py         # attach() — main user API
    ├── huggingface.py      # HuggingFace Transformers hooks
    ├── vllm_backend.py     # vLLM integration (routing trace + policy replay)
    ├── weight_placement.py # Expert offloading manager
    └── async_transfer.py   # CUDA stream async transfers
```

---

## Running Experiments

```bash
# Offline trace replay (no GPU needed)
python scripts/run_eval.py
python scripts/run_sweep.py

# Live inference on consumer GPU
python scripts/run_dsl_demo.py
python scripts/run_constrained_e2e.py

# Generate all paper figures
python scripts/generate_figures.py

# Benchmarks & evaluations (requires CUDA GPU + model weights)
python scripts/bench_qwen_multirun.py   # Qwen throughput (Table 4)
python scripts/bench_coldstart.py       # Cold-start throughput analysis
python scripts/bench_power.py           # Power/energy measurement
python scripts/eval_quality.py          # Perplexity evaluation (wikitext-2)
python scripts/ablation_epcb_sensitivity.py  # EPCB hyperparameter sweep
python scripts/plot_coldstart.py        # Generate cold-start figure
```

---

## Tests

```bash
python -m pytest tests/ -q
```

453+ tests covering parsing, validation, compilation, runtime dispatch,
adaptive policies, per-layer EPCB, and integration hooks.

---

## Documentation

See [`docs/MANUAL.md`](docs/MANUAL.md) for the full language reference,
runtime API, and policy authoring guide.

---

## Glossary

- **MoE (Mixture-of-Experts).** A Transformer layer that replaces a
  single feed-forward block with N expert networks plus a small
  router that picks the top-k experts per token.
- **Expert.** One feed-forward sub-network inside an MoE layer.
  Mixtral has 8 per layer, Qwen1.5-MoE has 60, DeepSeek-V2-Lite has 64.
- **Router / top-k routing.** The small classifier inside each MoE
  layer that scores experts and picks the k highest per token.
- **Offloading.** Keeping some weights in CPU RAM and moving them
  to GPU on demand. The cost is the PCIe transfer.
- **PCIe.** The bus between CPU memory and the GPU. Roughly two
  orders of magnitude slower than reading from GPU HBM/GDDR, so
  cache misses are expensive.
- **Skeleton.** Everything in the model that isn't an expert:
  embeddings, attention, layer norms, LM head. MoE-PolicyLang pins
  the skeleton on GPU (≈3.7 GB for Qwen1.5-MoE) and only the
  expert weights move.
- **Cache hit / miss.** A hit means the expert the router picked
  is already on GPU. A miss means we have to fetch it (or run it on
  CPU, if the policy sets `schedule { mode = cpu_fallback }`).
- **LRU / LFU.** Least-Recently-Used and Least-Frequently-Used
  cache eviction. LRU drops whatever hasn't been touched lately;
  LFU drops whatever has the lowest activation count (with a decay
  factor so old hot experts age out).
- **fp16 / GPTQ / AWQ.** Weight formats. fp16 is the standard
  half-precision format used in the experiments reported here. GPTQ
  and AWQ are 4-bit quantization formats that vLLM consumes; they trade
  a small amount of perplexity for a large VRAM reduction.
- **KV-cache.** The cache of attention keys/values from previous
  tokens during generation. It grows with sequence length and
  competes with expert weights for VRAM.
- **pp (percentage points).** Used for hit-rate deltas:
  +14.7 pp means 48.6% → 63.3%, not a 14.7% relative change.
- **EPCB.** Empirical Per-layer Cache Budgeting. See the section
  above; the load-bearing part is the per-layer cache structure,
  not the entropy allocator that gave the technique its name.

---

## Citation

```bibtex
@misc{pokora2026moepolicylang,
  title={MoE-PolicyLang: A Domain-Specific Language for Mixture-of-Experts Scheduling Policies},
  author={Pokora, Jesse},
  year={2026},
  url={https://github.com/jesse-pokora/MoE-PolicyLang}
}
```

---

## License

[MIT License](LICENSE) — Copyright (c) 2026 Jesse Pokora
