Metadata-Version: 2.4
Name: chartoken-vp
Version: 1.1.0
Summary: Character-level tokenizer and morphological feature encoder for NLP pipelines (UniMorph, NFKC, padding, serialization)
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Dynamic: license-file

# chartoken-vp

Character-level tokenizer and morphological feature encoder for NLP pipelines.

Part of the [MorphFormer](https://pypi.org/project/morphoformer/) project by Voluntas Progressus.

## Installation

```bash
pip install chartoken-vp
```

Requires Python >= 3.14 and PyTorch >= 2.0.

## Features

- **CharVocab** — character-level tokenizer with NFKC normalization, SOS/EOS/PAD special tokens
- **FeatureVocab** — morphological feature encoder for UniMorph tag sets with padding masks
- Encode text to padded tensors or plain ID lists
- Serialize/deserialize vocabularies via `to_dict()` / `from_dict()` for checkpoint compatibility
- Dynamic vocabulary expansion from new texts
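
To make the character-level encoding concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not the package's implementation: the `ToyCharVocab` class, the special-token strings, and the ID layout (special tokens first, then sorted characters) are all assumptions, and NFKC normalization is left out for brevity.

```python
# Hypothetical special-token strings; the package's actual PAD/SOS/EOS values may differ.
PAD, SOS, EOS = "<pad>", "<sos>", "<eos>"


class ToyCharVocab:
    """Illustrative stand-in for a character vocabulary (not the chartoken code)."""

    def __init__(self, chars):
        # Assumed layout: special tokens take the lowest IDs, then sorted characters.
        self.itos = [PAD, SOS, EOS] + sorted(set(chars))
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    @classmethod
    def from_texts(cls, texts):
        # Collect every character that appears in any of the texts.
        return cls(ch for text in texts for ch in text)

    def encode(self, text, max_len):
        # Wrap the characters in SOS/EOS, truncate, then right-pad with PAD.
        ids = [self.stoi[SOS]] + [self.stoi[ch] for ch in text] + [self.stoi[EOS]]
        ids = ids[:max_len]  # truncation may drop EOS for long inputs
        return ids + [self.stoi[PAD]] * (max_len - len(ids))

    def decode(self, ids):
        # Skip special tokens and join the remaining characters.
        specials = {self.stoi[PAD], self.stoi[SOS], self.stoi[EOS]}
        return "".join(self.itos[i] for i in ids if i not in specials)


vocab = ToyCharVocab.from_texts(["hello", "world"])
ids = vocab.encode("hello", max_len=8)
print(ids)                # [1, 5, 4, 6, 6, 7, 2, 0]
print(vocab.decode(ids))  # hello
```

In this sketch an unseen character raises `KeyError`; how the real `CharVocab` treats unknown characters is not covered here.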

## Quick Start

```python
from chartoken import CharVocab, FeatureVocab

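# Build a character vocabulary from a list of texts
vocab = CharVocab.from_texts(["hello", "world"])

# Encode to a padded ID tensor of length max_len, then decode it back
ids = vocab.encode("hello", max_len=32)
print(vocab.decode(ids.tolist()))  # "hello"

# Build a feature vocabulary from lists of morphological tags
feat_vocab = FeatureVocab.from_tags([["V", "IND", "PRS"], ["N", "SG"]])

# Encode one tag set; returns padded feature IDs and the matching padding mask
feat_ids, feat_mask = feat_vocab.encode(["V", "IND"], max_features=12)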
```
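
The pairing of padded feature IDs with a mask can be sketched in plain Python as follows. This is a hedged illustration rather than the package's code: the helper names `build_feature_index` and `encode_features`, the padding ID `0`, and the mask convention (`True` marks a real feature) are assumptions, and the real `FeatureVocab` may differ on any of them.

```python
FEATURE_PAD = 0  # assumed padding ID; the package's FEATURE_PAD may have another value


def build_feature_index(tag_sets):
    """Map each distinct tag to an ID, reserving 0 for padding (hypothetical helper)."""
    tags = sorted({tag for tag_set in tag_sets for tag in tag_set})
    return {tag: i + 1 for i, tag in enumerate(tags)}


def encode_features(tags, index, max_features):
    """Return (ids, mask), both of length max_features (hypothetical helper)."""
    ids = [index[tag] for tag in tags][:max_features]
    n_pad = max_features - len(ids)
    mask = [True] * len(ids) + [False] * n_pad  # True = real feature, False = padding
    return ids + [FEATURE_PAD] * n_pad, mask


index = build_feature_index([["V", "IND", "PRS"], ["N", "SG"]])
feat_ids, feat_mask = encode_features(["V", "IND"], index, max_features=4)
print(feat_ids)   # [5, 1, 0, 0]
print(feat_mask)  # [True, True, False, False]
```

A mask like this lets downstream layers ignore padded positions, for example when pooling feature embeddings.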

## API

| Name | Description |
|---|---|
| `CharVocab` | Character vocabulary with `encode`, `decode`, `from_texts`, and `to_dict` |
| `FeatureVocab` | UniMorph feature vocabulary with `encode` and `to_dict` |
| `PAD`, `SOS`, `EOS` | Special token strings |
| `FEATURE_PAD` | Padding ID for feature sequences |
| `normalize_text` | NFKC text normalization |
| `SPECIAL_TOKENS` | Set of all special tokens |
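
For readers unfamiliar with NFKC, the snippet below shows its effect using only Python's standard `unicodedata` module. The `nfkc` wrapper is a hypothetical helper for illustration; whether `normalize_text` performs any steps beyond NFKC is not shown here.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply NFKC: compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)


samples = {
    "\ufb01le": "file",    # the "fi" ligature expands to "f" + "i"
    "\uff21\uff22": "AB",  # full-width letters fold to their ASCII forms
    "e\u0301": "\u00e9",   # "e" + combining acute composes into one code point
    "x\u00b2": "x2",       # superscript two becomes a plain digit
}

for raw, expected in samples.items():
    assert nfkc(raw) == expected
```

Normalizing before building a vocabulary keeps visually identical strings from receiving different character IDs.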

## License

MIT
