Metadata-Version: 2.4
Name: sigmorphon-vp
Version: 1.1.0
Summary: SigMorphon 2021 dataset downloader, TSV parser, deduplication, and GPU-ready MorphDataset with CUDA stream prefetch
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: chartoken-vp>=1.1.0
Dynamic: license-file

# sigmorphon-vp

SigMorphon 2021 dataset downloader, TSV parser, and GPU-ready dataset with CUDA stream prefetch.

Part of the [MorphFormer](https://pypi.org/project/morphoformer/) project by Voluntas Progressus.

## Installation

```bash
pip install sigmorphon-vp
```

Requires Python >= 3.14, PyTorch >= 2.0, and `chartoken-vp >= 1.1.0`.

## Features

- **Download** SigMorphon 2021 Task 0 data for 11+ languages directly from GitHub
- **TSV parsing** with MD5-based deduplication and automatic column reordering
- **MorphDataset** — pre-encoded character/feature tensors in RAM with pinned memory and CUDA stream prefetch
- **Merge** multiple TSV files into a single training set
- **Language listing** — `get_available_languages()` returns all supported ISO 639-3 codes
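
The MD5-based deduplication mentioned above can be illustrated with a minimal sketch. This is not the package's actual implementation: the helper name `dedupe_rows`, the choice to hash the tab-joined fields, and the sample rows are all assumptions made for illustration.

```python
import hashlib


def dedupe_rows(rows):
    """Keep the first occurrence of each row, keyed by the MD5 of its joined fields.

    Hypothetical helper: how sigmorphon-vp builds its hash key is an assumption.
    """
    seen = set()
    unique = []
    for row in rows:
        key = hashlib.md5("\t".join(row).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample rows in (word, tags, word) order.
rows = [
    ("кот", "N;NOM;PL", "коты"),
    ("кот", "N;NOM;PL", "коты"),  # exact duplicate, dropped
    ("кот", "N;GEN;SG", "кота"),
]
unique_rows = dedupe_rows(rows)
print(len(unique_rows))
```

Storing fixed-size digests rather than whole rows keeps the `seen` set small when the corpus is large.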

## Quick Start

```python
from sigmorphon import download_all, load_tsv, MorphDataset, get_available_languages
from chartoken import CharVocab, FeatureVocab
from pathlib import Path

# See available languages
print(get_available_languages())

# Download Russian and German data
download_all(["rus", "deu"], Path("data/collections"))

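# Load the Russian training split and build vocabularies.
# Columns 0 and 2 of each row hold word strings; column 1 holds the feature tags.
rows = load_tsv("data/collections/rus_train.tsv")
char_vocab = CharVocab.from_texts([r[0] for r in rows] + [r[2] for r in rows])
feat_vocab = FeatureVocab.from_tags([r[1] for r in rows])
lang_to_id = {"rus": 0}  # language code -> integer ID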

dataset = MorphDataset(rows, char_vocab, feat_vocab, lang_to_id)
```

## API

| Function / Class | Description |
|---|---|
| `download_all(langs, path)` | Download train/dev TSV files for given languages |
| `merge_tsv(pattern, output)` | Merge multiple TSV files into one |
| `load_tsv(path)` | Parse a single TSV file into rows |
| `load_tsv_pattern(pattern)` | Glob-load multiple TSV files |
| `get_available_languages()` | List all supported language codes |
| `MorphDataset` | PyTorch Dataset with CUDA prefetch |
| `DATASETS` | Dict of available SigMorphon 2021 datasets |
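
To make the idea of merging by glob pattern concrete, here is a rough standard-library sketch of what such a merge could look like. It is not the code behind `merge_tsv`: the function name `merge_tsv_sketch`, the sorted file order, the skipping of blank lines, and the sample file contents are all assumptions.

```python
import glob
import tempfile
from pathlib import Path


def merge_tsv_sketch(pattern, output):
    """Concatenate non-empty lines from every file matching `pattern` into `output`.

    Hypothetical stand-in for illustration; returns the number of lines written.
    """
    count = 0
    with open(output, "w", encoding="utf-8") as out:
        for path in sorted(glob.glob(pattern)):  # sorted for a deterministic order
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line:  # skip blank lines
                        out.write(line + "\n")
                        count += 1
    return count


# Self-contained demo using made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "rus_train.tsv").write_text("кот\tкоты\tN;NOM;PL\n", encoding="utf-8")
    (tmp / "deu_train.tsv").write_text("Hund\tHunde\tN;NOM;PL\n\n", encoding="utf-8")
    merged = tmp / "merged.out"  # name chosen so it does not match the pattern
    n_lines = merge_tsv_sketch(str(tmp / "*_train.tsv"), merged)
    merged_text = merged.read_text(encoding="utf-8")
```

Writing the output under a name that the pattern cannot match avoids reading the merged file back into itself.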

## License

MIT
