Metadata-Version: 2.4
Name: morphoformer
Version: 5.0.0
Summary: Morphoformer with CELMoE-based multilingual morphology, typed training pipeline, and publishable CLI.
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy
Requires-Dist: celmoe-vp==2.0.1
Requires-Dist: chartoken-vp==2.1.5
Requires-Dist: sigmorphon-vp==2.1.5
Requires-Dist: torchblocks-vp==2.1.5
Requires-Dist: trainkit-vp==2.3.5
Requires-Dist: morphlog-vp==2.1.1
Requires-Dist: vpterm-vp==2.0.1
Provides-Extra: directml
Requires-Dist: morph-directml-vp>=1.0.2; python_version < "3.14" and extra == "directml"
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: pyright>=1.1.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"

# morphoformer

`morphoformer` is the application package of the `Morph_v4` stack. It combines character-level vocabularies, dataset tooling, typed training utilities, reusable Transformer blocks, and the generic CELMoE hierarchy into a trainable multilingual morphology system.

PyPI package name:

```bash
pip install morphoformer
```

Import name:

```python
import morphoformer
```

## What this package is

Unlike the libraries under `libs/`, `morphoformer` is not just a toolkit piece. It is the runnable application layer:

- configuration loading
- CLI commands
- model wiring
- trainer
- inference entry points

It depends on these independently publishable packages:

- `chartoken-vp`
- `celmoe-vp`
- `sigmorphon-vp`
- `torchblocks-vp`
- `trainkit-vp`

## Architecture summary

The current model builds a three-level expert hierarchy:

- `universal`
- `family`
- `language`

The actual orchestration is handled by `HierarchicalCELMoE`. `morphoformer` supplies the morphology-specific expert blocks, embeddings, routing, and output heads.

Input side:

- character embeddings
- feature embeddings
- language embeddings
- feature-to-token broadcast fusion

Expert side:

- `MorphExpertStack` built from `torchblocks-vp`
- configurable attention, norm, feedforward, adapter, convolution, and position modules
- routing by language family and language code

Output side:

- `logits`
- `universal_logits`
- `family_logits`
- `language_logits`

Those outputs are consumed by the multi-loss training setup in `trainkit-vp`.

## Installation

Requirements:

- Python `>=3.11`
- PyTorch `>=2.0`

Install from PyPI:

```bash
pip install morphoformer
```

For local development from this repository, publish or install the dependent libraries first, because they are versioned as separate packages.

## CLI

The package exposes the `morphoformer` console command.

Available subcommands:

- `download`
- `inspect-config`
- `train`
- `infer`

### Download data

List languages:

```bash
morphoformer download --list-languages
```

Download specific languages and merge them:

```bash
morphoformer download --lang rus,krl,afb --out-dir data --merge
```

Download everything known by the downloader:

```bash
morphoformer download --lang all --out-dir data
```

### Inspect config

```bash
morphoformer inspect-config --config dev/config.toml
```

### Train

```bash
morphoformer train --config dev/config.toml
```

The trainer writes the best checkpoint into the configured output directory.

### Infer

```bash
morphoformer infer \
  --config dev/config.toml \
  --checkpoint artifacts/v4_omni/best.pt \
  --lemma write \
  --tags "V;PST" \
  --lang eng
```

## Configuration

The TOML config is loaded into typed dataclasses:

- `DataConfig`
- `LanguageConfig`
- `ModelConfig`
- `OptimizerConfig`
- `TrainConfig`
- `DecodeConfig`
- `MorphoformerConfig`

Main config sections:

- `[data]`
- `[model]`
- `[optimizer]`
- `[train]`
- `[decode]`
- `[languages.<code>]`

Example:

```toml
[data]
train_path = "data/merged_train.tsv"
dev_path = "data/merged_dev.tsv"
max_len = 96
max_features = 12

[model]
d_model = 768
dim_ff = 2304
num_heads = 12
num_kv_heads = 4
dropout = 0.12
max_positions = 256
feature_dim = 128
attention = "gqa"
feedforward = "swiglu"
norm = "rmsnorm"
adapter = "language_conditioned"
universal_layers = 8
family_layers = 2
language_layers = 2

[train]
stage = "joint"
epochs = 10
batch_size = 64
warmup_steps = 500
total_steps = 12000
output_dir = "artifacts/v4_omni"

[languages.rus]
family = "slavic"
```

## Training flow

The trainer does the following:

1. load train and dev TSV data
2. build character and feature vocabularies
3. build the language-to-id map from config
4. pre-encode datasets into `MorphDataset`
5. instantiate `Morphoformer`
6. freeze or unfreeze stages according to `train.stage`
7. optimize with `AdamW`, a warmup-cosine schedule (sketched below), and AMP when enabled
8. evaluate on the dev set each epoch
9. save the best checkpoint
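
As a minimal sketch of step 7's schedule: a plain warmup-then-cosine learning-rate factor, using the `warmup_steps` / `total_steps` values from the example config (the exact schedule implemented in `trainkit-vp` may differ):

```python
import math
import torch

def warmup_cosine(warmup_steps: int, total_steps: int):
    """LR factor: linear warmup, then cosine decay to zero."""
    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return lr_lambda

model = torch.nn.Linear(8, 8)  # stand-in for the real Morphoformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_cosine(warmup_steps=500, total_steps=12000)
)
```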

The loss is a weighted combination of:

- final output loss
- universal expert loss
- family expert loss
- language expert loss
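
As a sketch of that combination, assuming cross-entropy per head and hypothetical weights (the real weighting comes from the `train` config and `trainkit-vp`):

```python
import torch.nn.functional as F

# Hypothetical weights; in practice they come from the train config
# (final_loss_weight, universal_loss_weight, ... or [train.loss]).
WEIGHTS = {
    "logits": 1.0,
    "universal_logits": 0.3,
    "family_logits": 0.3,
    "language_logits": 0.3,
}

def combined_loss(outputs: dict, targets, pad_id: int):
    """Weighted sum of per-head cross-entropy over the model outputs."""
    total = 0.0
    for key, weight in WEIGHTS.items():
        logits = outputs[key]  # (batch, seq, vocab)
        total = total + weight * F.cross_entropy(
            logits.transpose(1, 2),  # CE expects (batch, vocab, seq)
            targets,
            ignore_index=pad_id,
        )
    return total
```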

## Checkpoint contents

Saved checkpoints include:

- `model_state`
- `optimizer_state`
- `char_vocab`
- `feature_vocab`
- `language_to_id`
- `epoch`

That is enough to restore the model together with the exact vocabularies used during training.
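
A minimal sketch of reading such a checkpoint back (the key names are from the list above; model construction is omitted because it needs the same config used in training):

```python
import torch

# A checkpoint is a plain dict with the keys listed above.
ckpt = torch.load("artifacts/v4_omni/best.pt", map_location="cpu")

char_vocab = ckpt["char_vocab"]          # CharVocab (chartoken-vp)
feature_vocab = ckpt["feature_vocab"]    # FeatureVocab (chartoken-vp)
language_to_id = ckpt["language_to_id"]  # language code -> int id
print(f"best checkpoint from epoch {ckpt['epoch']}")

# With a model built from the training config:
# model.load_state_dict(ckpt["model_state"])
# optimizer.load_state_dict(ckpt["optimizer_state"])
```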

## Inference path

`predict_form(...)`:

- encodes the lemma with `CharVocab`
- encodes tags with `FeatureVocab`
- maps the language string to `language_id`
- runs greedy decoding through the model
- decodes predicted ids back into a surface string
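
A usage sketch, assuming `predict_form` accepts roughly what the CLI exposes (`--lemma`, `--tags`, `--lang`); the import path and keyword names below are assumptions, not the verified signature:

```python
# Import path and keyword arguments are assumptions; check the
# morphoformer inference module for the actual signature.
from morphoformer import predict_form

# Mirrors: morphoformer infer --lemma write --tags "V;PST" --lang eng
surface = predict_form(
    model=model,  # restored from the checkpoint
    char_vocab=char_vocab,
    feature_vocab=feature_vocab,
    language_to_id=language_to_id,
    lemma="write",
    tags="V;PST",
    lang="eng",
)
print(surface)
```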

## Hierarchical model and loss (advanced)

By default the model uses the classic three levels (`universal` → `family` → `language`) with one global stack depth per segment (`universal_layers`, `family_layers`, `language_layers`). You can go further:

### Per-layer block overrides (torchblocks)

Each expert can carry a list of `LayerBlockPartial` entries (attention, feedforward, norm, adapter, conv, dropout, head dims, etc.). They are stored on `ExpertDefinition.layer_overrides` and serialized through celmoe `metadata` into `MorphExpertStack`, which builds one `MorphExpertBlock` per layer via `resolve_block_config` (`morphoformer.model.block_config`).
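
A hedged TOML sketch of such overrides, using the per-expert table form described in the next subsection (the expert id is hypothetical, and the full `LayerBlockPartial` field set lives in `morphoformer.model.block_config`):

```toml
# Hypothetical expert id "rus"; only override fields that also
# appear as [model] keys are shown here.
[model.hierarchy.levels.experts.rus]
num_layers = 2
layer_overrides = [
  { attention = "gqa", feedforward = "swiglu", dropout = 0.1 },
  { norm = "rmsnorm", adapter = "language_conditioned" },
]
```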

### Declarative hierarchy

- **`MorphoformerHierarchySpec`** (`morphoformer.config.schema`) lists **`HierarchyLevelDef`** entries: level name, `experts` map (`ExpertDefinition`: `num_layers`, `layer_overrides`, `stop_gradient_from_parent`), optional `fallback_expert`, and **`routing`**: `auto` | `constant` | `family` | `language` (`auto` infers from the level name when possible).

**TOML (recommended):** define the tree under **`[model]`** with an array of tables **`[[model.hierarchy.levels]]`**. After each such block, **`[model.hierarchy.levels.experts.<expert_id>]`** tables belong to the *last* declared level (that is how TOML works). Alternatively, define **`experts`** as a list of nested tables **`[[model.hierarchy.levels.experts]]`** with the fields `name`, `num_layers`, and optionally `layer_overrides` (an array of inline tables with fields from `LayerBlockPartial`). See **`dev/config.toml`**.

- **`[model.hierarchy.expert_pools]`** — named lists of expert ids (arbitrary keys).
- **`expert_pool`** on a level — the name of a pool from `expert_pools`; **`expert_ids`** — the same kind of list, but inline on the level.
- **`experts_from`**: only **`"languages"`** or **`"families"`** — substitute ids from **`[languages.*]`** (language names or unique `family` values). Any other string value in `experts_from` (when `expert_pool` is not set) is treated as a **pool name** (without a separate `expert_pool` key).
- **`default_num_layers`** / **`default_stop_gradient_from_parent`** on a level — the template base before merging with **`expert_template`**; otherwise a heuristic based on the level name and `ModelConfig.universal_layers` / `family_layers` / `language_layers` applies.
- **`expert_template`** (a partial `ExpertDefinition`; `num_layers = 0` inherits the level default); point overrides go in **`[model.hierarchy.levels.experts.<id>]`**.
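
Putting those pieces together, a hedged sketch of an embedded hierarchy (level names, expert ids, and layer counts are illustrative; see `dev/config.toml` for the real one):

```toml
[model.hierarchy.expert_pools]
slavic = ["rus", "ukr"]

[[model.hierarchy.levels]]
name = "universal"
routing = "constant"

# Belongs to the last declared level ("universal"), per TOML rules.
[[model.hierarchy.levels.experts]]
name = "shared"
num_layers = 8

[[model.hierarchy.levels]]
name = "language"
routing = "language"
expert_pool = "slavic"
default_num_layers = 2
fallback_expert = "rus"
```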

**JSON:** **`model.hierarchy_path`** — a path relative to the config file; the JSON root is an object with a **`levels`** key (as in `hierarchy_spec_from_dict`).

**Defaults:** if **`model.hierarchy`** is not set in the TOML and **`hierarchy_path`** is empty, the hierarchy is built from **`[languages.*]`** and **`universal_layers` / `family_layers` / `language_layers`**.

If both **`hierarchy_path`** and an embedded **`model.hierarchy`** are set, the **embedded** TOML takes precedence.

### Loss composition

Training uses **`FlexibleHierarchyLoss`** (`morphoformer.training.hierarchy_loss`) with **`LossCompositionConfig`**:

- **`[[train.loss.components]]`** — if at least one entry is present, the **total** is the sum of only the listed **`key`** × **`weight`** pairs (intermediate keys: **`final`**, **`level/<level_name>`**, **`bridge/<bridge_name>`**). In that case **`final_weight`**, **`level_weights`**, and the **`weight`** on **`[[train.loss.bridges]]`** entries **do not contribute** to the sum (bridges are still computed from **`[[train.loss.bridges]]`**; their contribution to the total is set by a `key = "bridge/..."` entry in **`components`**).
- **`final_weight`** — fused head CE (when **`components`** is empty).
- **`level_weights`** — a map of **level name** → weight (when **`components`** is empty).
- **`aliases`** — short **display** name → **internal** key in the logs (`universal`, `level_family`, `level_language`, …).
- **`groups`** — extra terms added to the total: **`members`** — keys of intermediate losses (`final`, `level/<level_name>`, `bridge/<name>`), **`combine`**: `mean` | `sum`, **`weight`**.
- **`bridges`** — consistency between levels: **`name`**, **`parent_level`**, **`child_level`**, **`weight`** (ignored when **`components`** is non-empty), **`metric`**: `mse_logits` | `kl_logits` | `l2_hidden` (for `l2_hidden` the model must return **`level_hiddens`** in its output).

**TOML:** the **`[train.loss]`** section, with nested **`[train.loss.level_weights]`** and **`[train.loss.aliases]`**, and optionally **`[[train.loss.components]]`**, **`[[train.loss.groups]]`**, **`[[train.loss.bridges]]`**. See **`dev/config.toml`** for an example.
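
A hedged sketch of such a section, exercising `components`, an alias, and a bridge (names and weights are illustrative):

```toml
[train.loss.aliases]
universal = "level/universal"  # display name -> internal key

# components is non-empty, so only these key x weight pairs sum to total.
[[train.loss.components]]
key = "final"
weight = 1.0

[[train.loss.components]]
key = "level/language"
weight = 0.3

[[train.loss.components]]
key = "bridge/uni_to_family"
weight = 0.1

[[train.loss.bridges]]
name = "uni_to_family"
parent_level = "universal"
child_level = "family"
metric = "kl_logits"
weight = 0.0  # ignored here; the components entry above sets the share
```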

Legacy **`final_loss_weight`**, **`universal_loss_weight`**, … remain supported; if **`train.loss`** is omitted, they are folded into a **`LossCompositionConfig`** via **`from_legacy_train_weights`**.

### Freezing stages

For non-classic hierarchies, **`trainkit.freeze_stages`** trains only the named level when **`stage`** matches a **`model.level_names`** entry; **`joint`** enables all levels. The original three-level behavior is unchanged when levels are exactly `universal` / `family` / `language`.
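
For example, a hedged config sketch (the stage string must match a level name from the hierarchy):

```toml
[train]
stage = "family"  # train only the "family" level; others stay frozen
# stage = "joint" # or enable all levels
```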

## Relationship to `celmoe-vp`

This package is where the task-specific part begins.

`celmoe-vp` itself stays generic and knows nothing about morphology. `morphoformer` is responsible for:

- choosing hierarchy levels
- defining expert block structure
- mapping languages to families
- attaching morphology-specific heads
- converting expert outputs into token logits

That split is important because the architecture package and the application package are published separately.

## Publishing and versioning

In `Morph_v4` the libraries are not bundled into one mega-package. Each package is published independently and `morphoformer` depends on versioned releases of the lower-level libs.

That means before publishing `morphoformer`, you should publish compatible versions of:

- `chartoken-vp`
- `celmoe-vp`
- `sigmorphon-vp`
- `torchblocks-vp`
- `trainkit-vp`

The repository includes `publish.ps1` to build, version, and publish the stack in dependency order.
