Metadata-Version: 2.4
Name: celmoe-vp
Version: 1.1.1
Summary: Composable Expert Layer MoE primitives with hierarchical routing, fusion, and multi-loss orchestration.
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: torch-directml>=0.2.5

# celmoe-vp

`celmoe-vp` is a domain-agnostic library for building hierarchical expert systems with explicit routing, learned fusion, and multi-scope loss composition.

PyPI package name:

```bash
pip install celmoe-vp
```

Import name:

```python
import celmoe
```

This package is not morphology-specific. It is the reusable architectural layer underneath higher-level applications such as `morphoformer`.

## Core idea

CELMoE is short for Composable Expert Layer MoE: the framework gives you mechanisms rather than a task-specific network:

- hierarchical expert registration
- per-sample routing across levels
- optional gradient isolation between parent and child experts
- learned fusion of level outputs
- an API for global, per-level, and bridge losses

You bring the actual expert implementation.

## Main concepts

### Configuration objects

The public config layer is built from dataclasses:

- `ExpertConfig`
- `HierarchyLevelConfig`
- `CELMoEConfig`

`ExpertConfig` describes a single expert and carries:

- `name`
- `loss_weight`
- `stop_gradient_from_parent`
- `metadata`

`HierarchyLevelConfig` describes a level such as `universal`, `family`, `language`, `vision`, `region`, or any other level you define.

`CELMoEConfig` defines:

- `hidden_size`
- `levels`
- `use_learned_fusion`
- `fusion_dropout`
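
For orientation, here is a minimal configuration sketch built only from the fields listed above. The level names, expert names, and numeric values are illustrative, and defaults are assumed for every field that is omitted.

```python
from celmoe import CELMoEConfig, ExpertConfig, HierarchyLevelConfig

# Illustrative two-level hierarchy; all names and values are placeholders.
config = CELMoEConfig(
    hidden_size=512,
    use_learned_fusion=True,
    fusion_dropout=0.1,
    levels=[
        HierarchyLevelConfig(
            name="family",
            experts={"slavic": ExpertConfig(name="slavic", loss_weight=0.5)},
            fallback_expert="slavic",
        ),
        HierarchyLevelConfig(
            name="language",
            experts={
                "rus": ExpertConfig(name="rus", stop_gradient_from_parent=True),
                "bul": ExpertConfig(name="bul", stop_gradient_from_parent=True),
            },
        ),
    ],
)
```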

### Expert contract

Experts implement the `ExpertModule` interface:

```python
from celmoe import ExpertBatch, ExpertModule, ExpertResult

class MyExpert(ExpertModule):
    def forward(self, batch: ExpertBatch) -> ExpertResult:
        # Minimal pass-through: return the incoming hidden state unchanged.
        return ExpertResult(hidden=batch.hidden)
```

`ExpertBatch` contains:

- `hidden`
- `context`
- `mask`
- `state`

`ExpertResult` contains:

- `hidden`
- `losses`
- `aux`

That keeps the framework generic enough for sequence models, classifiers, multimodal systems, or other expert topologies.
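
As a sketch of how an expert might surface level-local losses and diagnostics through those fields; treating `losses` and `aux` as plain name-to-value mappings is an assumption here, and the loss name is illustrative.

```python
from celmoe import ExpertBatch, ExpertModule, ExpertResult


class RegularizedExpert(ExpertModule):
    def forward(self, batch: ExpertBatch) -> ExpertResult:
        hidden = batch.hidden
        return ExpertResult(
            hidden=hidden,
            # Level-local loss terms collected by the framework (mapping assumed).
            losses={"l2_penalty": hidden.pow(2).mean()},
            # Free-form diagnostics for logging (mapping assumed).
            aux={"hidden_norm": float(hidden.norm())},
        )
```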

### Hierarchical execution

`HierarchicalCELMoE` runs enabled levels in order, applies the routed expert for each sample, collects level-local losses, and fuses the resulting hidden states.

The default fusion module is `LearnedFusion`.

Routing format:

```python
{
    "universal": ["core", "core", "core"],
    "family": ["slavic", "romance", "slavic"],
    "language": ["rus", "spa", "bul"],
}
```

Each list is batch-aligned. If a level has only one expert, CELMoE can auto-broadcast it. If a level defines `fallback_expert`, that fallback can also be broadcast automatically.
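
A small sanity check of that alignment invariant in plain Python; how auto-broadcast is triggered (omitting the level versus naming its single expert) is left to the library and not shown here.

```python
batch_size = 3
routing = {
    "universal": ["core", "core", "core"],
    "family": ["slavic", "romance", "slavic"],
    "language": ["rus", "spa", "bul"],
}

# Sample i is processed by routing[level][i] at every enabled level.
assert all(len(names) == batch_size for names in routing.values())
```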

## Multi-loss API

One of the main reasons this package exists is the loss interface.

`CELMoELossAPI` lets you register three different loss scopes:

- global losses over the whole `CELMoEOutput`
- level losses for a specific `LevelOutput`
- bridge losses between parent and child levels

Public methods:

- `add_global`
- `add_level`
- `add_bridge`
- `compute`

That means you can express setups like:

- a parent regularization loss
- a child specialization loss
- a shared final consistency loss
- a bridge loss between parent and child representations

without hardcoding any task-specific assumptions inside the framework.
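
Here is a sketch of registering one loss per scope. The `add_global` call mirrors the quick example below; the `add_level` and `add_bridge` signatures, and the field names on `LevelOutput`, are assumptions made only for illustration.

```python
from celmoe import CELMoELossAPI

loss_api = CELMoELossAPI()

# Global scope: callback over the whole CELMoEOutput (as in the quick example).
loss_api.add_global(
    "consistency",
    lambda out, targets, context: out.fused_hidden.pow(2).mean(),
)

# Level scope: assumed to take a level name plus a callback over its LevelOutput
# (the `hidden` field name is illustrative).
loss_api.add_level(
    "child",
    "specialization",
    lambda level_out, targets, context: level_out.hidden.pow(2).mean(),
)

# Bridge scope: assumed to pair a parent level with a child level.
loss_api.add_bridge(
    "parent",
    "child",
    "alignment",
    lambda parent_out, child_out, targets, context: (
        parent_out.hidden - child_out.hidden
    ).pow(2).mean(),
)
```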

## Quick example

```python
from typing import Mapping

import torch
import torch.nn as nn

from celmoe import (
    CELMoEConfig,
    CELMoELossAPI,
    ExpertBatch,
    ExpertConfig,
    ExpertModule,
    ExpertResult,
    HierarchicalCELMoE,
    HierarchyLevelConfig,
)


class LinearExpert(ExpertModule):
    def __init__(self, hidden_size: int) -> None:
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)

    def forward(self, batch: ExpertBatch) -> ExpertResult:
        hidden = self.proj(batch.hidden)
        return ExpertResult(hidden=hidden)


# Factory used by HierarchicalCELMoE to build each registered expert.
def build_expert(level_name: str, expert_name: str, metadata: Mapping[str, object]) -> ExpertModule:
    del level_name, expert_name, metadata
    return LinearExpert(hidden_size=256)


# Two levels: a single shared parent expert and two gradient-isolated children.
config = CELMoEConfig(
    hidden_size=256,
    levels=[
        HierarchyLevelConfig(
            name="parent",
            experts={"core": ExpertConfig(name="core")},
            fallback_expert="core",
        ),
        HierarchyLevelConfig(
            name="child",
            experts={
                "a": ExpertConfig(name="a", stop_gradient_from_parent=True),
                "b": ExpertConfig(name="b", stop_gradient_from_parent=True),
            },
        ),
    ],
)

model = HierarchicalCELMoE(config, expert_factory=build_expert)

hidden = torch.randn(4, 32, 256)  # (batch, seq_len, hidden_size)
# Routing lists are batch-aligned: one expert name per sample at every level.
output = model(
    hidden,
    routing={
        "parent": ["core"] * 4,
        "child": ["a", "b", "a", "b"],
    },
)

# Register a global loss over the fused hidden state and compute the bundle.
loss_api = CELMoELossAPI()
loss_api.add_global("l2", lambda out, targets, context: out.fused_hidden.pow(2).mean())
bundle = loss_api.compute(output)
print(bundle.total)
```

## Data structures

Important public objects:

- `LevelOutput`
- `CELMoEOutput`
- `LossTerm`
- `LossBundle`

`LossBundle` is especially useful when you want both:

- a differentiable total loss for backpropagation
- a flat scalar dictionary for logs and dashboards
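
A minimal training-step sketch around those two uses, assuming `model`, `loss_api`, and `hidden` are set up as in the quick example above and that `HierarchicalCELMoE` behaves as a standard `nn.Module`; only `bundle.total` is used directly, and the scalar dictionary's attribute name is not reproduced here.

```python
import torch

# Assumes `model`, `loss_api`, and `hidden` from the quick example above.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

output = model(
    hidden,
    routing={"parent": ["core"] * 4, "child": ["a", "b", "a", "b"]},
)
bundle = loss_api.compute(output)

# Differentiable total loss for backpropagation.
optimizer.zero_grad()
bundle.total.backward()
optimizer.step()

# The flat scalar view of each LossTerm can then feed logs and dashboards.
```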

## Gradient isolation

Per-expert gradient isolation is controlled through `stop_gradient_from_parent`.

When enabled, the hidden state passed to that expert is detached before the expert runs. This is useful when you want:

- hierarchical specialization
- reduced interference across levels
- staged training or freeze/unfreeze workflows
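
Conceptually, the isolation is the same as detaching the parent-produced hidden state before a child expert consumes it. A plain-PyTorch sketch of that effect (not the library's internal code):

```python
import torch
import torch.nn as nn

parent = nn.Linear(8, 8)
child = nn.Linear(8, 8)

x = torch.randn(2, 8)
parent_hidden = parent(x)

# With stop_gradient_from_parent=True the child sees a detached tensor, so a
# child-side loss updates only the child's parameters.
child_hidden = child(parent_hidden.detach())
child_hidden.pow(2).mean().backward()

assert parent.weight.grad is None      # parent untouched by the child loss
assert child.weight.grad is not None   # child still trains
```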

## Good use cases

`celmoe-vp` fits well when you have:

- multilingual systems with language or family experts
- domain experts on top of a shared trunk
- regional or product-specific experts
- curriculum setups with parent/child objectives
- research code that needs explicit loss composition

## What it intentionally does not include

This package does not prescribe:

- embeddings
- tokenization
- Transformer block internals
- data loading
- optimizer logic
- checkpointing

Those belong in adjacent packages. `celmoe-vp` is the orchestration layer, not the full stack.

## Typing and packaging

The package is strictly typed and ships `py.typed`.

That matters because CELMoE is meant to be published and consumed as an independent package, not only as an internal module of a local monorepo. Downstream packages can rely on its public dataclasses and expert contracts without importing project-specific code.
