Metadata-Version: 2.4
Name: morphoformer
Version: 4.7.6
Summary: Morphoformer with CELMoE-based multilingual morphology, typed training pipeline, and publishable CLI.
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy
Requires-Dist: celmoe-vp>=1.1.4
Requires-Dist: chartoken-vp>=2.1.4
Requires-Dist: sigmorphon-vp>=2.1.4
Requires-Dist: torchblocks-vp>=2.1.4
Requires-Dist: trainkit-vp>=2.3.4
Requires-Dist: morphlog-vp==2.0.1
Requires-Dist: vpterm-vp==2.0.0
Provides-Extra: directml
Requires-Dist: morph-directml-vp>=1.0.1; python_version < "3.14" and extra == "directml"
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: pyright>=1.1.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"

# morphoformer

`morphoformer` is the application package of the `Morph_v4` stack. It combines character-level vocabularies, dataset tooling, typed training utilities, reusable Transformer blocks, and the generic CELMoE hierarchy into a trainable multilingual morphology system.

PyPI package name:

```bash
pip install morphoformer
```

Import name:

```python
import morphoformer
```

## What this package is

Unlike the libraries under `libs/`, `morphoformer` is not just a toolkit piece. It is the runnable application layer:

- configuration loading
- CLI commands
- model wiring
- trainer
- inference entry points

It depends on these independently publishable packages:

- `chartoken-vp`
- `celmoe-vp`
- `sigmorphon-vp`
- `torchblocks-vp`
- `trainkit-vp`

## Architecture summary

The current model builds a three-level expert hierarchy:

- `universal`
- `family`
- `language`

The actual orchestration is handled by `HierarchicalCELMoE`. `morphoformer` supplies the morphology-specific expert blocks, embeddings, routing, and output heads.

Input side:

- character embeddings
- feature embeddings
- language embeddings
- feature-to-token broadcast fusion
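
The fusion step above can be sketched with plain NumPy broadcasting. This is an illustrative toy, not `morphoformer`'s actual fusion code; the function name and shapes are assumptions:

```python
import numpy as np

def broadcast_fuse(char_emb: np.ndarray, feat_emb: np.ndarray) -> np.ndarray:
    """Broadcast a pooled feature vector across every token position.

    char_emb: (seq_len, d_model) character embeddings
    feat_emb: (d_model,) pooled feature-bundle embedding
    """
    # NumPy broadcasting adds the (d_model,) vector to each row of
    # the (seq_len, d_model) matrix.
    return char_emb + feat_emb[None, :]

seq = np.zeros((5, 8))   # toy sequence: 5 tokens, d_model = 8
feats = np.ones(8)       # toy pooled feature embedding
fused = broadcast_fuse(seq, feats)
print(fused.shape)       # (5, 8)
```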

Expert side:

- `MorphExpertStack` built from `torchblocks-vp`
- configurable attention, norm, feedforward, adapter, convolution, and position modules
- routing by language family and language code
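
The routing idea can be illustrated with a toy lookup. The real routing lives in `HierarchicalCELMoE` inside `celmoe-vp`; the dictionary and key format here are purely hypothetical:

```python
# Hypothetical language-to-family map; real mappings come from the
# [languages.<code>] config sections.
FAMILY_OF = {"rus": "slavic", "krl": "uralic", "afb": "semitic"}

def route(lang_code: str) -> tuple[str, str, str]:
    """Return the expert keys a batch for `lang_code` would visit."""
    family = FAMILY_OF.get(lang_code, "unknown")
    # Every input passes the shared universal experts, then its
    # family-level experts, then its language-specific experts.
    return ("universal", f"family:{family}", f"language:{lang_code}")

print(route("rus"))  # ('universal', 'family:slavic', 'language:rus')
```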

Output side:

- `logits`
- `universal_logits`
- `family_logits`
- `language_logits`

Those outputs are consumed by the multi-loss training setup in `trainkit-vp`.

## Installation

Requirements:

- Python `>=3.11`
- PyTorch `>=2.0`

Install from PyPI:

```bash
pip install morphoformer
```

For local development from this repository, publish or install the dependent libraries first, because they are versioned as separate packages.

## CLI

The package exposes the `morphoformer` console command.

Available subcommands:

- `download`
- `inspect-config`
- `train`
- `infer`
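
A subcommand layout like this is typically wired with `argparse` subparsers. The sketch below is an assumption about the CLI's structure, not its actual source; option names are taken from the examples in this README:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="morphoformer")
    sub = parser.add_subparsers(dest="command", required=True)

    # `download` takes its own flags (see examples below).
    dl = sub.add_parser("download")
    dl.add_argument("--lang")
    dl.add_argument("--out-dir", default="data")
    dl.add_argument("--list-languages", action="store_true")
    dl.add_argument("--merge", action="store_true")

    # The remaining subcommands all take a --config path.
    for name in ("inspect-config", "train", "infer"):
        sub.add_parser(name).add_argument("--config", required=True)

    return parser

args = build_parser().parse_args(["download", "--lang", "rus,krl", "--merge"])
print(args.command, args.lang, args.merge)
```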

### Download data

List languages:

```bash
morphoformer download --list-languages
```

Download specific languages and merge them:

```bash
morphoformer download --lang rus,krl,afb --out-dir data --merge
```

Download everything known by the downloader:

```bash
morphoformer download --lang all --out-dir data
```

### Inspect config

```bash
morphoformer inspect-config --config dev/config.toml
```

### Train

```bash
morphoformer train --config dev/config.toml
```

The trainer writes the best checkpoint into the configured output directory.

### Infer

```bash
morphoformer infer \
  --config dev/config.toml \
  --checkpoint artifacts/v4_omni/best.pt \
  --lemma write \
  --tags "V;PST" \
  --lang eng
```

## Configuration

The TOML config is loaded into typed dataclasses:

- `DataConfig`
- `LanguageConfig`
- `ModelConfig`
- `OptimizerConfig`
- `TrainConfig`
- `DecodeConfig`
- `MorphoformerConfig`

Main config sections:

- `[data]`
- `[model]`
- `[optimizer]`
- `[train]`
- `[decode]`
- `[languages.<code>]`

Example:

```toml
[data]
train_path = "data/merged_train.tsv"
dev_path = "data/merged_dev.tsv"
max_len = 96
max_features = 12

[model]
d_model = 768
dim_ff = 2304
num_heads = 12
num_kv_heads = 4
dropout = 0.12
max_positions = 256
feature_dim = 128
attention = "gqa"
feedforward = "swiglu"
norm = "rmsnorm"
adapter = "language_conditioned"
universal_layers = 8
family_layers = 2
language_layers = 2

[train]
stage = "joint"
epochs = 10
batch_size = 64
warmup_steps = 500
total_steps = 12000
output_dir = "artifacts/v4_omni"

[languages.rus]
family = "slavic"
```

## Training flow

The trainer does the following:

1. load train and dev TSV data
2. build character and feature vocabularies
3. build the language-to-id map from config
4. pre-encode datasets into `MorphDataset`
5. instantiate `Morphoformer`
6. freeze or unfreeze stages according to `train.stage`
7. optimize with `AdamW`, warmup cosine schedule, and AMP when enabled
8. evaluate on the dev set each epoch
9. save the best checkpoint
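
Step 7's schedule follows the common warmup-then-cosine shape. The function below is a sketch of that shape under the example config's `warmup_steps = 500` and `total_steps = 12000`, not `trainkit-vp`'s exact implementation:

```python
import math

def lr_at(step: int, base_lr: float, warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        # Linear warmup from ~0 up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(249, 3e-4, 500, 12000))    # mid-warmup: half of base_lr
print(lr_at(12000, 3e-4, 500, 12000))  # end of schedule: 0.0
```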

The loss is a weighted combination of:

- final output loss
- universal expert loss
- family expert loss
- language expert loss
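
The combination is a plain weighted sum. The weights below are illustrative placeholders, not the trainer's actual defaults:

```python
def combined_loss(final: float, universal: float, family: float, language: float,
                  weights: tuple[float, float, float, float] = (1.0, 0.3, 0.3, 0.3)) -> float:
    """Weighted sum of the four loss terms; weights are hypothetical."""
    w_final, w_uni, w_fam, w_lang = weights
    return (w_final * final + w_uni * universal
            + w_fam * family + w_lang * language)

print(combined_loss(2.0, 3.0, 2.5, 2.2))  # 2.0 + 0.3 * (3.0 + 2.5 + 2.2) ≈ 4.31
```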

## Checkpoint contents

Saved checkpoints include:

- `model_state`
- `optimizer_state`
- `char_vocab`
- `feature_vocab`
- `language_to_id`
- `epoch`

That is enough to restore the model together with the exact vocabularies used during training.

## Inference path

`predict_form(...)`:

- encodes the lemma with `CharVocab`
- encodes tags with `FeatureVocab`
- maps the language string to `language_id`
- runs greedy decoding through the model
- decodes predicted ids back into a surface string
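
The greedy loop at the core of this path can be sketched independently of the model. The toy scorer below stands in for a real forward pass; the function name and id conventions are assumptions:

```python
from typing import Callable

def greedy_decode(step_logits: Callable[[list[int]], list[float]],
                  bos: int, eos: int, max_len: int = 10) -> list[int]:
    """Repeatedly pick the argmax next id until EOS or max_len."""
    ids = [bos]
    for _ in range(max_len):
        logits = step_logits(ids)  # scores for every candidate next id
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos:
            break
        ids.append(next_id)
    return ids[1:]  # drop the BOS marker

# Toy scorer: predicts id = current length, then EOS (id 5) at length 4.
toy = lambda ids: [1.0 if i == (len(ids) if len(ids) < 4 else 5) else 0.0
                   for i in range(6)]
print(greedy_decode(toy, bos=0, eos=5))  # [1, 2, 3]
```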

## Relationship to `celmoe-vp`

This package is where the task-specific part begins.

`celmoe-vp` itself stays generic and knows nothing about morphology. `morphoformer` is responsible for:

- choosing hierarchy levels
- defining expert block structure
- mapping languages to families
- attaching morphology-specific heads
- converting expert outputs into token logits

That split is important because the architecture package and the application package are published separately.

## Publishing and versioning

In `Morph_v4` the libraries are not bundled into one mega-package. Each package is published independently and `morphoformer` depends on versioned releases of the lower-level libs.

That means before publishing `morphoformer`, you should publish compatible versions of:

- `chartoken-vp`
- `celmoe-vp`
- `sigmorphon-vp`
- `torchblocks-vp`
- `trainkit-vp`

The repository includes `publish.ps1` to build, version, and publish the stack in dependency order.
