Metadata-Version: 2.4
Name: sigmorphon-vp
Version: 2.2.0
Summary: SigMorphon dataset utilities with typed TSV loading, downloads, and MorphDataset generation.
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: chartoken-vp==2.2.0

# sigmorphon-vp

`sigmorphon-vp` is a typed utility package for downloading, converting, merging, and pre-encoding SigMorphon-style morphological reinflection datasets.

PyPI package name:

```bash
pip install sigmorphon-vp
```

Import name:

```python
import sigmorphon
```

The package is designed to work well with `chartoken-vp`, but it remains a standalone, independently publishable package.

## What it provides

`sigmorphon-vp` covers the data layer around morphological reinflection:

- download helpers for SigMorphon 2021 Task 0 style data
- conversion from upstream raw files into a consistent internal TSV format
- merge helpers for multi-language training corpora
- typed TSV loading
- an in-memory `MorphDataset` that pre-encodes examples into tensors

## Internal TSV format

The package converts data into a normalized 4-column TSV format:

```text
lemma<TAB>features<TAB>surface<TAB>lang
```

Example:

```text
walk	V;PST	walked	eng
```

Comment lines beginning with `#` are allowed and ignored by the loaders.

## Installation

Requirements:

- Python `>=3.11`
- PyTorch `>=2.0`
- `chartoken-vp==2.2.0`

Install from PyPI:

```bash
pip install sigmorphon-vp
```

## Downloading datasets

The downloader exposes:

- `DATASETS`
- `download_language`
- `download_all`
- `get_available_languages`
- `merge_tsv`

Quick example:

```python
from pathlib import Path

from sigmorphon import download_all, merge_tsv

out_dir = Path("data")
download_all(["rus", "bul", "spa"], out_dir)

train_files = sorted(out_dir.glob("*_train.tsv"))
merge_tsv(train_files, out_dir / "merged_train.tsv")
```

### What download does

For each requested language, the package:

1. downloads upstream raw files
2. stores them under `raw/`
3. converts them to the internal TSV layout
4. deduplicates rows with an MD5 hash
5. writes `*_train.tsv` and `*_dev.tsv`

If converted files already exist, the downloader skips them.
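Step 4 can be sketched as follows. The `dedupe_rows` helper is illustrative, not the downloader's actual code; it only shows the idea of keeping the first occurrence of each row by hashing it:

```python
import hashlib

def dedupe_rows(rows: list[str]) -> list[str]:
    """Drop duplicate TSV rows, keeping first occurrences, via MD5 digests."""
    seen: set[str] = set()
    unique: list[str] = []
    for row in rows:
        digest = hashlib.md5(row.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(row)
    return unique
```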

### Discovering languages

`get_available_languages()` tries to read the list of languages from the SigMorphon GitHub repository. If that request fails, it falls back to the built-in `DATASETS` mapping.
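The fallback behavior can be sketched like this; `DATASETS` and the `fetch` callable below are stand-ins for the real objects, not the package's implementation:

```python
# Stand-in for the package's built-in language mapping.
DATASETS = {"bul": "2021Task0", "rus": "2021Task0", "spa": "2021Task0"}

def get_available_languages(fetch) -> list[str]:
    """Return languages from a remote listing, falling back to DATASETS."""
    try:
        return sorted(fetch())
    except OSError:
        # Network request failed: fall back to the built-in mapping.
        return sorted(DATASETS)
```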

## Loading data

Simple TSV loading:

```python
from sigmorphon import load_tsv

rows = load_tsv("data/rus_train.tsv")
```

Glob pattern loading:

```python
from sigmorphon import load_tsv_pattern

rows = load_tsv_pattern("data/*_train.tsv")
```

Returned row type:

```python
MorphRow = tuple[str, list[str], str, str]
```

That is:

- lemma
- feature list
- surface form
- language code
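For reference, the TSV example above corresponds to a row like this (a plain illustration of the tuple layout, not actual loader output):

```python
MorphRow = tuple[str, list[str], str, str]

# lemma, feature list, surface form, language code
row: MorphRow = ("walk", ["V", "PST"], "walked", "eng")
lemma, features, surface, lang = row
```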

## `MorphDataset`

`MorphDataset` is the package's main runtime component. It converts rows into ready-to-train tensors using:

- `CharVocab`
- `FeatureVocab`
- a `lang_to_id` mapping

Constructor inputs:

- `rows`
- `char_vocab`
- `feature_vocab`
- `lang_to_id`
- `max_len`
- `max_features`
- `pin_memory`

Produced tensors:

- source character ids
- target character ids
- feature ids
- feature masks
- language ids
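To show what pre-encoding the character tensors might involve, here is a sketch with a hypothetical `encode` helper and a toy vocabulary that reserves id 0 for padding (the package's real `CharVocab` handles this internally):

```python
# Toy character vocabulary: 'a' -> 1, 'b' -> 2, ..., 0 reserved for padding.
char_to_id = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}

def encode(text: str, max_len: int) -> list[int]:
    """Map characters to ids, truncate to max_len, and pad with zeros."""
    ids = [char_to_id[c] for c in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

encode("walk", 8)  # [23, 1, 12, 11, 0, 0, 0, 0]
```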

## Batch serving

`MorphDataset.epoch_batches(...)` yields batches in the following typed layout:

```python
MorphBatch = tuple[
    torch.Tensor,  # source
    torch.Tensor,  # target
    torch.Tensor,  # language ids
    torch.Tensor,  # feature ids
    torch.Tensor,  # feature mask
]
```

Behavior:

- shuffles sample order each epoch
- supports CUDA-aware non-blocking transfers
- uses a secondary CUDA stream for prefetch when available
- keeps all pre-encoded samples in RAM for fast iteration

Example:

```python
import torch

from chartoken import CharVocab, FeatureVocab
from sigmorphon import MorphDataset, load_tsv

rows = load_tsv("data/rus_train.tsv")
char_vocab = CharVocab.from_texts(
    [lemma for lemma, _, _, _ in rows]
    + [surface for _, _, surface, _ in rows]
)
feature_vocab = FeatureVocab.from_tags([tags for _, tags, _, _ in rows])
lang_to_id = {"rus": 0}

dataset = MorphDataset(rows, char_vocab, feature_vocab, lang_to_id)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

for source, target, lang_ids, feature_ids, feature_mask in dataset.epoch_batches(32, device):
    print(source.shape, target.shape)
    break
```

## Utility methods

`MorphDataset.memory_bytes()` estimates how much RAM the encoded tensors occupy.

That is useful when you want to:

- compare eager dataset sizes across languages
- plan batch sizes
- decide whether to split or merge corpora
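A back-of-envelope version of that estimate, assuming int64 tensors and the shapes implied by `max_len` and `max_features` (the helper and shape layout are illustrative, not the package's actual accounting):

```python
def estimate_bytes(n_rows: int, max_len: int, max_features: int,
                   itemsize: int = 8) -> int:
    """Rough RAM estimate for pre-encoded tensors (int64 by default)."""
    # source ids + target ids: (n_rows, max_len) each
    chars = 2 * n_rows * max_len * itemsize
    # feature ids + feature mask: (n_rows, max_features) each
    feats = 2 * n_rows * max_features * itemsize
    # language ids: (n_rows,)
    langs = n_rows * itemsize
    return chars + feats + langs

estimate_bytes(10_000, 32, 8)  # 6_480_000 bytes, about 6.5 MB
```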

## Intended scope

This package intentionally focuses on morphological dataset handling. It does not contain:

- neural network layers
- training loops
- checkpointing
- model definitions

That separation keeps the package reusable across multiple applications, not just `morphoformer`.
