Metadata-Version: 2.4
Name: unfairseq
Version: 0.0.1
Summary: Un-fairseq: UnFormers (Universal Transformers) — config-driven enc-dec chassis covering NLLB/mBART/Marian/mT5/UL2/t5gemma/TranslateGemma/Qwen/Gemma, plus Matryoshka encoder, Garg 2019 supervised attention, PyTorch IBM Models 1/2/HMM/4, Brown+k-means clustering, and portable char/byte alignment.
Author: alvations
License: MIT License
        
        Copyright (c) 2026 alvations
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/alvations/unfairseq
Project-URL: Issues, https://github.com/alvations/unfairseq/issues
Keywords: transformer,nmt,translation,alignment,matryoshka,ibm-models,pytorch,huggingface
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.2
Requires-Dist: transformers<5,>=4.45
Requires-Dist: datasets>=2.20
Requires-Dist: numpy<2,>=1.24
Requires-Dist: sentencepiece
Requires-Dist: tokenizers>=0.15
Requires-Dist: accelerate>=0.26
Requires-Dist: PyICU>=2.11
Provides-Extra: align
Requires-Dist: eflomal; extra == "align"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Dynamic: license-file

# unfairseq / UnFormers

**UnFormers** (Universal Transformers) is a single configurable encoder-decoder
Transformer implementation that covers the architectural choices of modern
NMT / seq2seq model families through presets. One codebase, one set of
modules, one HF-compatible `PreTrainedModel` — and a preset picks the knobs
(attention kind, positional encoding, norm, FFN, bias policy, …) to
reconstruct NLLB / mBART / Marian / mT5 / UL2 / t5gemma / TranslateGemma /
Qwen / Gemma.

On top of the core it ships:

- **Matryoshka encoder** (MatFormer-style depth pruning): train once, serve at
  multiple depths, prune permanently after training.
- **Supervised attention word alignment** (Garg 2019) on a configurable
  decoder-layer / head, applied at every Matryoshka granularity.
- **Neural IBM alignment** (IBM Model 1 / 2 / HMM) in pure PyTorch:
  GPU-batched and subword-native, an eflomal replacement that aligns directly
  on your tokenizer's ids.
- **Portable alignment format**: char-span (or UTF-8 byte-span) records plus
  word-level aggregation via ICU, so alignments are usable by any downstream
  tokenizer.
- **UL2 mixture-of-denoisers** corpus preprocessing (R / X / S denoisers).
- **Expert-parallel MoE**, **KV cache** for generation, **gradient
  checkpointing**, and **warm_start** (Net2Net + bert2bert) to seed UnFormer
  weights from any HF checkpoint.


## Installation

UnFormers has one native dependency chain you need to handle before
`pip install`: **PyICU**, which wraps ICU4C (the Unicode library used by
Chrome and Firefox, among others). ICU ships the word-break dictionaries for
CJK / Thai / Khmer / Lao / Myanmar that make word-level alignment work for
those languages.

### 1. Install ICU4C (system library)

**macOS (Homebrew):**

```bash
brew install icu4c
# Homebrew doesn't symlink icu4c by default; tell pkg-config where to find it:
echo 'export PATH="/usr/local/opt/icu4c/bin:/usr/local/opt/icu4c/sbin:$PATH"' >> ~/.zshrc
echo 'export PKG_CONFIG_PATH="/usr/local/opt/icu4c/lib/pkgconfig"' >> ~/.zshrc
```

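On Apple Silicon, Homebrew installs under `/opt/homebrew`, so use
`/opt/homebrew/opt/icu4c/...` in place of `/usr/local/opt/icu4c/...`.
`brew --prefix icu4c` prints the correct prefix on either architecture.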

**Debian / Ubuntu:**

```bash
sudo apt install pkg-config libicu-dev
```

**Fedora / RHEL:**

```bash
sudo dnf install libicu-devel
```

**Alpine:**

```bash
apk add icu-dev pkgconfig
```

**Windows:** grab the ICU binaries from
<https://icu.unicode.org/download> and ensure `icu-config` is on PATH, or use
a pre-built PyICU wheel from the Python wheels index (2.16+ has Windows
wheels).

### 2. Install PyICU (Python binding)

```bash
# after the system icu4c is in place:
pip install 'PyICU>=2.11'   # quoted so the shell does not treat > as a redirect
```

If PyICU fails to build or import with "`u_init_74` not found" or similar, you
have a version mismatch: the ICU that PyICU was compiled against differs from
the one installed locally (check with `icu-config --version`). Rebuild against
your local ICU with:

```bash
PYICU_INCLUDES="$(icu-config --cppflags)" \
PYICU_LFLAGS="$(icu-config --ldflags)" \
pip install --no-binary=:all: PyICU
```

### 3. ICU data / dictionaries

ICU's word-break dictionaries for **zh / ja / th / km / lo / my** ship with
the ICU4C install — you do not need to download anything separately. To
verify the bundled dictionaries are available:

```python
import icu

print(icu.ICU_VERSION)  # ICU version that PyICU is linked against
bi = icu.BreakIterator.createWordInstance(icu.Locale("zh"))
bi.setText("我爱北京天安门")
# Iterating the BreakIterator yields successive word-boundary offsets.
print(list(bi))
```

If `icu.ICU_VERSION` prints and `BreakIterator` segments Chinese correctly,
you have the dictionaries. They live inside `icudt{VERSION}l.dat` in the ICU
data directory (`icu-config --icudatadir`). On a minimal ICU install ("lite")
the dict files are stripped; install the full ICU package (default on every
major distro).

If you ever need a newer or language-specific ICU data bundle, download
`icu4c-*-data-bin-l.zip` from <https://icu.unicode.org/download> and drop the
`.dat` file into `icu-config --icudatadir`.

### 4. UnFormers itself

```bash
pip install -e .                 # editable (dev) install from a checkout
# or, from the repo root:
pip install .                    # regular install
pip install '.[align]'           # + eflomal (optional, we ship our own)
pip install '.[dev]'             # + pytest, pytest-cov, ruff
```

Once installed, sanity-check ICU integration:

```bash
python -c "import icu; print('ICU', icu.ICU_VERSION, 'PyICU', icu.__version__)"
python -c "from unformers.align import get_segmenter; print(get_segmenter('zh')('机器翻译系统'))"
```


## Quick start

### Build a model from a preset with any HF tokenizer

```python
from transformers import AutoTokenizer
from unformers import UnFormerForConditionalGeneration
from unformers.presets import from_preset

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
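# len(tok) counts added special tokens; tok.vocab_size does not, so ids such
# as Qwen's <|endoftext|> could otherwise fall outside the embedding table.
bos_id = tok.bos_token_id if tok.bos_token_id is not None else tok.eos_token_id
cfg = from_preset("ul2-mini-6-3", vocab_size=len(tok),
                  pad_token_id=tok.pad_token_id,
                  bos_token_id=bos_id,
                  eos_token_id=tok.eos_token_id)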
model = UnFormerForConditionalGeneration(cfg)
```

### Train IBM-2 alignments and emit portable JSONL

```bash
python -m unformers.align.cli \
    --input parallel.tsv --src-col 0 --tgt-col 1 \
    --src-lang eng_Latn --tgt-lang zho_Hans \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --aligner-epochs 5 \
    --output aligned.jsonl
```

Each output line is tokenizer-agnostic:

```json
{
  "src_text": "hello world",
  "tgt_text": "你好 世界",
  "src_lang": "eng_Latn",
  "tgt_lang": "zho_Hans",
  "char_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
  "word_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
  "byte_offsets": false,
  "segmenter_src": "icu:eng_Latn",
  "segmenter_tgt": "icu:zho_Hans"
}
```

Use `--byte` for UTF-8 byte offsets instead of char offsets.
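
To consume these records downstream, something like the following minimal sketch may be enough. It uses only the field names shown in the record above. The helper name `aligned_pairs` is hypothetical, and it assumes spans are half-open `[start, end)` offsets (Python `str` indices for char spans, UTF-8 byte indices when `byte_offsets` is true), which matches the example record but is not confirmed elsewhere.

```python
import json


def aligned_pairs(record, key="word_alignments"):
    """Return (source_piece, target_piece) strings for each alignment link."""
    src, tgt = record["src_text"], record["tgt_text"]
    if record.get("byte_offsets"):
        # Byte spans index the UTF-8 encoding, so slice bytes and decode.
        src_b, tgt_b = src.encode("utf-8"), tgt.encode("utf-8")
        return [
            (src_b[a["src"][0]:a["src"][1]].decode("utf-8"),
             tgt_b[a["tgt"][0]:a["tgt"][1]].decode("utf-8"))
            for a in record[key]
        ]
    # Assumption: char spans are half-open [start, end) str indices.
    return [
        (src[a["src"][0]:a["src"][1]], tgt[a["tgt"][0]:a["tgt"][1]])
        for a in record[key]
    ]


# One line of aligned.jsonl, trimmed to the fields used here.
line = """{"src_text": "hello world", "tgt_text": "你好 世界",
 "word_alignments": [{"src": [0, 5], "tgt": [0, 2]}, {"src": [6, 11], "tgt": [3, 5]}],
 "byte_offsets": false}"""
record = json.loads(line)
pairs = aligned_pairs(record)
print(pairs)  # [('hello', '你好'), ('world', '世界')]
```

Passing `key="char_alignments"` would slice the finer-grained spans the same way.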

### Train UL2-mini with Matryoshka + supervised attention

```bash
python examples/train_pure_pytorch.py \
    --tokenizer Qwen/Qwen2.5-0.5B \
    --n-pairs 5000 --max-steps 1000 --batch-size 16 \
    --d-model 256 --num-heads 8 --ffn-size 512
```

### Warm-start from an HF checkpoint

```python
from transformers import AutoModelForSeq2SeqLM
from unformers import UnFormerForConditionalGeneration
from unformers.interop import warm_start
from unformers.presets import from_preset

source = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")
cfg = from_preset("mt5", vocab_size=source.config.vocab_size, size="mt5-base")
target = UnFormerForConditionalGeneration(cfg)
manifest = warm_start(target, source, strategy="auto")
print(manifest.summary())   # copied=..., padded=..., randomised=...
```


## Preset capability matrix

| Preset           | Positional      | Norm             | FFN              | Attention                        | Bias | Align default | Notes                                 |
|------------------|-----------------|------------------|------------------|----------------------------------|------|---------------|---------------------------------------|
| `marian`         | sinusoidal      | LayerNorm post   | ReLU             | MHA                              | ✓    | opt-in        | classic vanilla Transformer           |
| `nllb`           | sinusoidal      | LayerNorm pre    | ReLU (MoE opt)   | MHA                              | ✓    | opt-in        | lang codes, MoE via `moe_num_experts` |
| `mbart`          | learned abs     | LayerNorm pre    | GELU             | MHA                              | ✓    | opt-in        |                                       |
| `mt5`            | T5 rel-bias     | RMSNorm pre      | GeGLU            | MHA                              | ✗    | opt-in        | untied lm_head                        |
| `ul2`            | T5 rel-bias     | RMSNorm pre      | SwiGLU (MoE opt) | MHA                              | ✗    | opt-in        | prefix-LM, `[R]`/`[X]`/`[S]` tags     |
| `t5gemma`        | RoPE            | RMSNorm pre      | GeGLU            | GQA                              | ✗    | opt-in        | tied embed, √d scale                  |
| `translategemma` | RoPE 1M base    | RMSNorm+preresid | GeGLU            | GQA, QK-norm, logit-cap, sliding | ✗    | opt-in        | 5:1 local/global interleave           |
| `qwen3.5`        | RoPE 1M base    | RMSNorm pre      | SwiGLU           | GQA, QK-norm                     | ✗    | opt-in        | decoder-only family → enc-dec adapted |
| `gemma4`         | RoPE multi-freq | RMSNorm+preresid | GeGLU            | GQA, QK-norm, logit-cap, sliding | ✗    | opt-in        | local 10k / global 1M RoPE bases      |
| `ul2-mini-6-3`   | RoPE            | RMSNorm pre      | SwiGLU           | MHA                              | ✗    | **on**        | Matryoshka [2,4,6], Garg 2019 demo    |

All presets accept `**kwargs` to override `d_model` / `encoder_layers` /
`decoder_layers` / `num_heads` / `intermediate_size` etc. so you can shrink a
2B preset into a test-sized version:

```python
cfg = from_preset("gemma4", vocab_size=32000, d_model=64,
                  encoder_layers=2, decoder_layers=2,
                  num_heads=4, num_kv_heads=2, head_dim=16, intermediate_size=128)
```

### Alignment supervision is available on every preset

Every preset exposes the same set of `alignment_*` kwargs to `from_preset`.
Garg 2019 supervised cross-attention is off by default for all presets except
`ul2-mini-6-3` (the demo preset), where it's on. Enable and tune on any
preset:

```python
cfg = from_preset(
    "nllb",
    vocab_size=tok.vocab_size, size="nllb-600m-distilled",
    alignment_enabled=True,                     # turn Garg loss on
    alignment_loss_weight=0.05,                 # λ in total = ce + λ * align
    alignment_decoder_layer=-1,                 # which decoder layer to supervise (-1 = top)
    alignment_num_heads=1,                      # first N cross-attn heads, averaged
    alignment_full_context=False,               # second decoder pass w/o causal mask
    alignment_apply_to_all_granularities=True,  # Matryoshka × alignment
)
```

To disable on `ul2-mini-6-3`: pass `alignment_enabled=False`. For full
control pass `alignment=AlignmentConfig(...)` as a kwarg — the explicit
config overrides any individual `alignment_*` kwargs.
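
As a rough illustration of what `alignment_num_heads` and `alignment_loss_weight` control, the sketch below averages the first N cross-attention heads, takes the negative log-likelihood of the gold links, and adds it to the cross-entropy as `ce + λ * align`. It is plain Python on nested lists, not the library's implementation. The function names are hypothetical, and normalising per aligned target position is an assumption.

```python
import math


def alignment_nll(attn_heads, gold_links, num_heads=1):
    """NLL of gold links under the average of the first `num_heads` heads.

    attn_heads[h][t][s] is the cross-attention probability that target
    position t puts on source position s; gold_links is a set of (s, t).
    """
    heads = attn_heads[:num_heads]
    tgt_len, src_len = len(heads[0]), len(heads[0][0])
    avg = [
        [sum(h[t][s] for h in heads) / len(heads) for s in range(src_len)]
        for t in range(tgt_len)
    ]
    by_target = {}
    for s, t in gold_links:
        by_target.setdefault(t, []).append(s)
    if not by_target:
        return 0.0
    loss = 0.0
    for t, sources in by_target.items():
        # Gold distribution for row t is uniform over its linked sources.
        loss -= sum(math.log(avg[t][s]) for s in sources) / len(sources)
    # Assumption: average over target positions that have at least one link.
    return loss / len(by_target)


def total_loss(ce, align, weight=0.05):
    """total = ce + λ * align, with λ = alignment_loss_weight."""
    return ce + weight * align


attn = [[[0.5, 0.5], [0.25, 0.75]]]  # one head, two target rows, two sources
nll = alignment_nll(attn, {(0, 0), (1, 1)})
print(total_loss(2.0, nll))
```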


## What's in the box

### Architecture (config-driven)

- Attention: MHA / MQA / GQA, QK-norm, attention logit soft-cap, sliding
  window, per-layer local/global interleave.
- Positional: sinusoidal, learned abs, T5 bucketed rel-bias, ALiBi, RoPE
  (single-freq + per-layer multi-freq with NTK / linear scaling).
- Norm: LayerNorm (bias / no-bias), RMSNorm; pre- / post-norm; pre-residual
  norm (Gemma-style).
- FFN: Dense (GELU/ReLU/SiLU), GLU (SwiGLU/GeGLU/ReGLU), MoE (single-GPU +
  expert-parallel).
- Embedding: tied / untied; √d_model scale; final-logit soft-cap & scale.
- Decoder: causal or prefix-LM; every-N cross-attention layers.
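
Two of these knobs are easy to picture in a few lines. The sketch below shows the common `cap * tanh(x / cap)` form of logit soft-capping and one way to lay out a 5:1 local/global layer interleave. Both helpers are hypothetical illustrations. Whether UnFormers uses exactly this formula, or puts the global layer last in each group, is an assumption.

```python
import math


def soft_cap(logit, cap):
    """Squash a logit smoothly into [-cap, cap] via cap * tanh(logit / cap)."""
    return cap * math.tanh(logit / cap)


def layer_pattern(num_layers, local_per_global=5):
    """Repeat `local_per_global` sliding-window layers, then one global layer."""
    period = local_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "local"
        for i in range(num_layers)
    ]


print(soft_cap(1.0, 50.0))     # small logits pass through almost unchanged
print(soft_cap(1000.0, 50.0))  # large logits saturate at the cap
print(layer_pattern(12))
```

Small logits are nearly untouched because `tanh(x) ≈ x` near zero, while large ones are bounded by `cap`.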

### Training

- `UnFormerTrainer` subclasses HF `Seq2SeqTrainer`; use it or fall back to
  `examples/train_pure_pytorch.py` when you don't want `accelerate`.
- Losses: label-smoothed CE, Garg 2019 alignment NLL, Switch-style MoE aux.
- Matryoshka depth sampling: `joint` / `stochastic` / `sandwich`.
- Gradient checkpointing via `model.gradient_checkpointing_enable()` — skips
  the alignment-supervised layer so the Garg loss still backprops.

### Alignment

- `unformers.align.NeuralIBMAligner` — IBM Model 1 / 2 / HMM, factored
  lexical table, GPU-batched, pharaoh output, fwd/rev + grow-diag-final-and
  symmetrisation.
- `unformers.align.PortableAlignment` — char (default) or byte spans + word
  aggregation; `python -m unformers.align.cli` end-to-end runner.
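
For readers unfamiliar with the IBM models, here is a minimal sketch of classical IBM Model 1 trained with EM on a toy corpus. It is the textbook tabular version in pure Python with no NULL word, not the factored, GPU-batched `NeuralIBMAligner`. The function names are hypothetical, and the output uses the usual pharaoh `src-tgt` index convention.

```python
from collections import defaultdict


def train_ibm1(pairs, epochs=5):
    """EM for IBM Model 1. Returns prob[(tgt_tok, src_tok)] = p(tgt | src)."""
    tgt_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(tgt_vocab)
    prob = defaultdict(lambda: uniform)
    for _ in range(epochs):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target token's mass among the source tokens.
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    frac = prob[(t, s)] / norm
                    count[(t, s)] += frac
                    total[s] += frac
        # M-step: renormalise expected counts per source token.
        prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return prob


def align(prob, src, tgt):
    """Pharaoh 'i-j' links: each target position j picks its best source i."""
    links = []
    for j, t in enumerate(tgt):
        i = max(range(len(src)), key=lambda k: prob[(t, src[k])])
        links.append(f"{i}-{j}")
    return " ".join(links)


corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
table = train_ibm1(corpus)
print(align(table, ["das", "Haus"], ["the", "house"]))  # 0-0 1-1
```

Model 2 and the HMM model add position-dependent terms on top of this lexical table. Running the aligner in both directions and symmetrising is what the fwd/rev + grow-diag-final-and step refers to.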

### Data

- `TokenizerWrapper` — any HF tokenizer, handles UL2 denoiser tags and lang
  codes.
- UL2 mixture-of-denoisers (R / X / S) preprocessing.
- `Seq2SeqWithAlignmentCollator` — pads src/tgt, shifts decoder input, turns
  pharaoh alignments into flat loss-index tensors with inverse-frequency
  weights.
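
To make the collator's last step concrete, the sketch below turns a pharaoh string into flat indices over a `(tgt_len, src_len)` attention matrix, with one weight per link. It is an illustration only: the helper name is hypothetical, the `t * src_len + s` layout is an assumption, and "inverse-frequency" is read here as one over the number of links sharing a target position, which may not be what `Seq2SeqWithAlignmentCollator` does.

```python
from collections import Counter


def pharaoh_to_flat(pharaoh, src_len):
    """Convert 'i-j' links into flat attention indices and per-link weights."""
    links = [tuple(map(int, link.split("-"))) for link in pharaoh.split()]
    # Assumption: weight = 1 / (number of links on the same target position),
    # so every aligned target position contributes equally to the loss.
    per_target = Counter(t for _, t in links)
    flat = [t * src_len + s for s, t in links]
    weights = [1.0 / per_target[t] for _, t in links]
    return flat, weights


flat, weights = pharaoh_to_flat("0-0 1-2 2-2", src_len=3)
print(flat)     # [0, 7, 8]
print(weights)  # [1.0, 0.5, 0.5]
```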

### Interop

- `warm_start(target, source, strategy="auto")` — Net2Net (wider / deeper
  identity insertion) + bert2bert (cross-attn init from self-attn when source
  lacks cross-attn) + key-normalisation aliases for T5 / BART / NLLB / Marian
  / mBART / Llama / Qwen / Gemma naming. Returns a `CopyManifest` listing
  copied / padded / identity-inserted / randomised tensors.

### Generation

- `model.generate(...)` via HF `GenerationMixin`. KV-cached decoding is checked
  against a full forward pass and agrees to within `2e-5`. Greedy and beam
  search both work.


## Development

### Run the tests

```bash
pip install -e '.[dev]'
pytest                           # fast tests
pytest -v -m slow -k 0.5B        # large-scale param-tier tests
pytest tests/test_portable_alignment.py -v  # alignment + ICU tests
```

### Layout

```
unformers/
  config.py          # UnFormerConfig + all nested dataclasses
  modules/           # attention, positional, norm, ffn, moe, embedding
  blocks/            # encoder_layer, decoder_layer
  model/             # encoder, decoder, seq2seq (PreTrainedModel)
  presets/           # one file per family + _helpers.py
  align/             # NeuralIBMAligner, portable alignment, segmenters, CLI
  data/              # tokenizer wrapper, collator, UL2 denoisers
  train/             # trainer, losses, Matryoshka policy
  interop/           # warm_start
examples/
  smoke_test.py             # HF Trainer path
  train_pure_pytorch.py     # plain torch loop
tests/
  test_presets.py
  test_preset_sizes.py      # 0.5B / 1B / 2B / 3B tiers (slow)
  test_warm_start.py
  test_gradient_checkpointing.py
  test_moe.py
  test_portable_alignment.py
```


## License

MIT License. See LICENSE in the repo root.
