Metadata-Version: 2.4
Name: morphoformer
Version: 2.2.1
Summary: MorphFormer: multilingual morphological reinflection via character-level Transformer
Author: voluntasprogressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Dynamic: license-file

# MorphFormer v2

[![PyPI version](https://badge.fury.io/py/morphoformer.svg)](https://pypi.org/project/morphoformer/)
[![Python 3.14+](https://img.shields.io/badge/python-3.14+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![OS Independent](https://img.shields.io/badge/OS-Windows%20%7C%20Linux%20%7C%20macOS-lightgrey.svg)]()

Character-level Transformer for multilingual morphological reinflection. Supports 35+ languages, language-conditioned adapters, GQA, RoPE, and SwiGLU.

## Installation

```bash
pip install morphoformer
```

Or from source:

```bash
cd Morph_v2
pip install -e .
```

**Requirements:** Python >= 3.14, PyTorch >= 2.0

## Quick start

```bash
# Download data (35+ SIGMORPHON 2021 languages)
morphoformer download --lang rus,deu,fra --out-dir data --merge

# Train (batch size / checkpointing auto-tuned to available VRAM)
morphoformer train --preset medium --data "data/*_train.tsv" --device cuda

# Inference
morphoformer infer \
  --checkpoint checkpoints/morphformer_epoch50.pt \
  --word "бежать" --morph "V;IND;PRS;3;SG" --lang rus

# Interactive REPL
morphoformer serve --checkpoint checkpoints/morphformer_epoch50.pt
```

## Presets

| Preset | d_model | Encoder | Decoder | dim_ff | ~Params | VRAM |
|--------|---------|---------|---------|--------|---------|------|
| `small` | 384 | 4 layers | 3 layers | 1024 | ~7M | < 4 GB |
| `medium` | 512 | 8 layers | 6 layers | 1376 | ~45M | 4–8 GB |
| `large` | 768 | 10 layers | 8 layers | 2048 | ~120M | >= 8 GB |

```bash
morphoformer init-config --preset medium --out config.toml
```

## Architecture

- **Encoder-decoder Transformer** with pre-norm (RMSNorm)
- **GQA** (Grouped Query Attention) — compact KV cache
- **RoPE** — Rotary Position Embeddings
- **SwiGLU** FFN
- **Conformer-style conv** — depthwise conv1d between self-attention and FFN in the encoder
- **Language-conditioned adapters** — gated bottleneck per language
- **Structured morph encoder** — embeds + pools (or cross-attends over) morphological tags instead of encoding them as a character string
- **Weight tying** — output projection shares weights with the char embedding
- **CUDA stream prefetch** — async batch transfer on a separate stream
- **Auto memory planning** — automatic gradient checkpointing + batch resizing based on available VRAM
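The language-conditioned adapter can be pictured as a bottleneck MLP whose residual branch is scaled by a per-language sigmoid gate. A minimal sketch of that idea — module name, layer shapes, and gating details here are illustrative, not the package's actual implementation:

```python
import torch
import torch.nn as nn

class LangConditionedAdapter(nn.Module):
    """Gated bottleneck: down-project, ReLU, up-project, then a per-language
    sigmoid gate scales the residual branch."""

    def __init__(self, d_model: int, bottleneck: int, num_languages: int):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.gate = nn.Embedding(num_languages, d_model)  # one gate vector per language

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model), lang_id: (batch,)
        h = self.up(torch.relu(self.down(x)))               # bottleneck branch
        g = torch.sigmoid(self.gate(lang_id)).unsqueeze(1)  # (batch, 1, d_model)
        return x + g * h                                    # gated residual

adapter = LangConditionedAdapter(d_model=64, bottleneck=16, num_languages=5)
y = adapter(torch.randn(2, 7, 64), torch.tensor([0, 3]))
# y has the same (batch, seq, d_model) shape as the input, as a residual block must
```

A zero-initialized gate would make the adapter start as an identity mapping, which is a common trick for stable fine-tuning.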

## CLI

| Command | Description |
|---------|-------------|
| `train` | Train a model |
| `infer` | Single-word inference |
| `serve` | Interactive REPL |
| `download` | Download SIGMORPHON/UniMorph data |
| `modules` | List registered modules |
| `init-config` | Generate a TOML template |

```bash
morphoformer --help
morphoformer train --help
```

## Data format

TSV: `lemma\tfeatures\tword form\tlanguage`

```
бежать	V;IND;PRS;3;SG	бежит	rus
gehen	V;IND;PRS;3;SG	geht	deu
aller	V;IND;PRS;3;SG	va	fra
```
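Reading this format needs nothing beyond the standard library; a minimal parsing sketch (the dict keys are illustrative, not the package's internal schema):

```python
import csv
import io

# Same four-column TSV as above: lemma, feature bundle, word form, language code.
SAMPLE = "бежать\tV;IND;PRS;3;SG\tбежит\trus\ngehen\tV;IND;PRS;3;SG\tgeht\tdeu\n"

def read_tsv(text: str) -> list[dict]:
    rows = []
    for lemma, feats, form, lang in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "lemma": lemma,
            "features": feats.split(";"),  # UniMorph tags are ';'-separated
            "form": form,
            "lang": lang,
        })
    return rows

rows = read_tsv(SAMPLE)
# rows[0] == {"lemma": "бежать", "features": ["V", "IND", "PRS", "3", "SG"],
#             "form": "бежит", "lang": "rus"}
```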

## Configuration

```toml
[model]
d_model = 512
num_heads = 8
num_kv_heads = 2
dim_ff = 1376
dropout = 0.15
num_languages = 50

[model.encoder]
num_layers = 8
conv = "local"
adapter = "language_conditioned"
adapter_bottleneck = 128

[model.decoder]
num_layers = 6

[model.morph_encoder]
type = "pooled"          # "attention" for large

[training]
epochs = 50
batch_size = 64
lr = 5e-4
warmup_steps = 4000
device = "auto"          # cuda / rocm / xpu / mps / cpu
```

Examples: [`config_examples/`](config_examples/)

## Python API

```python
import torch
from morphoformer import modules  # registers all modules
from morphoformer.modules import MorphFormer
from morphoformer.data import CharVocab, FeatureVocab
from morphoformer.inference import greedy_decode

ckpt = torch.load("model.pt", map_location="cpu", weights_only=False)
char_vocab = CharVocab.from_dict(ckpt["char_vocab"])
feature_vocab = FeatureVocab.from_dict(ckpt["feature_vocab"])
# ... build model, load state_dict ...

result = greedy_decode(
    model, char_vocab, feature_vocab,
    "бежать", ["V", "IND", "PRS", "3", "SG"], lang_id=0,
    device=torch.device("cuda"), max_len=96, max_out=128,
)
```

## Modules (pluggable registry)

```bash
$ morphoformer modules
[attention]  gqa, mha, cross
[feedforward]  swiglu, gelu
[norm]  rmsnorm, layernorm
[conv]  local, none
[adapter]  language_conditioned, bottleneck, none
[morph_encoder]  pooled, attention
[position]  rope
```

Registering your own module:

```python
import torch.nn as nn

from morphoformer.modules.registry import register

@register("feedforward", "my_ffn")
class MyFFN(nn.Module):
    ...
```

## Devices

```bash
morphoformer train --device auto    # CUDA → XPU → MPS → CPU
morphoformer train --device cuda    # NVIDIA
morphoformer train --device rocm    # AMD
morphoformer train --device xpu     # Intel Arc
morphoformer train --device mps     # Apple Silicon
```
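The `auto` fallback order boils down to a priority chain. A sketch of that selection logic with backend availability passed in explicitly (the real CLI presumably probes torch at runtime, e.g. via `torch.cuda.is_available()`):

```python
def pick_device(cuda: bool = False, xpu: bool = False, mps: bool = False) -> str:
    """Return the first available backend in priority order, else 'cpu'."""
    for name, available in [("cuda", cuda), ("xpu", xpu), ("mps", mps)]:
        if available:
            return name
    return "cpu"

print(pick_device(cuda=True, mps=True))  # cuda — highest priority wins
print(pick_device(mps=True))             # mps
print(pick_device())                     # cpu — the always-available fallback
```

Note that in PyTorch, ROCm builds expose AMD GPUs through the `cuda` device type, which is why `--device rocm` and `--device cuda` can share a code path.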

## Project layout

```
morphoformer/
├── cli.py                 # CLI entry point
├── config/
│   ├── schema.py          # Dataclass → TOML mapping
│   ├── loader.py          # TOML load/save
│   └── presets.py         # small / medium / large
├── data/
│   ├── vocab.py           # CharVocab (NFKC, SOS/EOS/PAD)
│   ├── feature_vocab.py   # FeatureVocab (UniMorph tags)
│   ├── dataset.py         # MorphDataset + CUDA stream prefetch
│   └── download.py        # SigMorphon downloader
├── modules/
│   ├── registry.py        # @register / get / list_modules
│   ├── transformer.py     # MorphFormer (main model)
│   ├── encoder.py         # Encoder + EncoderLayer
│   ├── decoder.py         # Decoder + DecoderLayer + KV cache
│   ├── attention.py       # GQA, MHA, CrossAttention
│   ├── feedforward.py     # SwiGLU, GeLU
│   ├── position.py        # RoPE
│   ├── conv.py            # Depthwise conv (Conformer)
│   ├── adapter.py         # Language-conditioned adapters
│   └── morph_encoder.py   # Pooled / Attention morph encoder
├── inference/
│   ├── decode.py          # Greedy decode + KV cache
│   └── cache.py           # KVCache dataclass
└── training/
    ├── trainer.py         # Training loop (AMP, grad accum, checkpoints)
    ├── scheduler.py       # Cosine LR + warmup
    └── memory.py          # VRAM profiling + auto batch planning
```

## Publishing

```bash
pip install build twine
python -m build
python -m twine upload dist/*
```

See [USAGE.md](USAGE.md) for details.

## License

[MIT](LICENSE)
