Metadata-Version: 2.4
Name: chartoken-vp
Version: 2.1.0
Summary: Character-level tokenizer and typed morphological feature vocabulary for multilingual NLP pipelines.
Author: F000NK, Voluntas Progressus
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.14
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.14
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0

# chartoken-vp

`chartoken-vp` is a small, typed package for character-level text vocabularies and morphological feature vocabularies.

PyPI package name:

```bash
pip install chartoken-vp
```

Import name:

```python
import chartoken
```

The library is intentionally narrow in scope: it does not try to be a full tokenizer framework. Instead, it gives you a stable, strictly typed foundation for:

- character vocabularies for sequence models
- UniMorph-style feature vocabularies
- deterministic serialization for checkpoints
- simple tensor conversion helpers for PyTorch code

## Why this package exists

Morphological reinflection and other low-level text tasks often work better with character-level tokenization than with subword tokenization. In those pipelines you usually need two parallel vocabularies:

- one vocabulary for characters in source and target strings
- one vocabulary for morphological tags such as `PST`, `SG`, `NOM`, `V`, and so on

`chartoken-vp` keeps those concerns separate and explicit.

## Main components

### `CharVocab`

`CharVocab` builds a character inventory from raw texts and exposes:

- `from_texts`
- `encode`
- `encode_ids`
- `decode`
- `to_dict`
- `from_dict`

The vocabulary uses three built-in special tokens:

- `PAD = 0`
- `SOS = 1`
- `EOS = 2`

All text is normalized with Unicode NFKC via `normalize_text`.
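A minimal sketch of that surface, assuming `decode` accepts the plain id list produced by `encode_ids` and strips the special tokens on the way back out:

```python
from chartoken import CharVocab, PAD, SOS, EOS

vocab = CharVocab.from_texts(["walk", "walked"])

# Ids 0-2 are reserved for the built-in special tokens.
print(PAD, SOS, EOS)  # 0 1 2

# Encode to a fixed-length id sequence, then decode back to text.
ids = vocab.encode_ids("walk", max_len=8)
print(ids)                # starts with SOS, ends with EOS followed by PAD padding
print(vocab.decode(ids))  # "walk", assuming decode drops the special tokens
```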

### `FeatureVocab`

`FeatureVocab` builds a vocabulary over feature tags and exposes:

- `from_tags`
- `encode`
- `encode_tensor`
- `to_dict`
- `from_dict`

Feature sequences are padded with `FEATURE_PAD = 0` and returned together with a float mask.
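A small sketch of that contract, assuming `encode` returns plain Python sequences and that the mask marks real tags with `1.0` and padding with `0.0` (the exact mask convention is an assumption):

```python
from chartoken import FeatureVocab, FEATURE_PAD

vocab = FeatureVocab.from_tags([["V", "PST"], ["N", "SG"]])

ids, mask = vocab.encode(["V", "PST"], max_features=4)
print(FEATURE_PAD)  # 0 -- the id used for padding positions
print(ids, mask)    # ids padded to max_features; mask is float, aligned with ids
```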

## Installation

Requirements:

- Python `>=3.14`
- PyTorch `>=2.0`

Install from PyPI:

```bash
pip install chartoken-vp
```

## Quick start

```python
from chartoken import CharVocab, FeatureVocab

texts = ["walk", "walked", "go", "went"]
tag_sets = [
    ["V", "PRS"],
    ["V", "PST"],
    ["V", "PRS"],
    ["V", "PST"],
]

char_vocab = CharVocab.from_texts(texts)
feature_vocab = FeatureVocab.from_tags(tag_sets)

token_ids = char_vocab.encode_ids("walk", max_len=12)
feature_ids, feature_mask = feature_vocab.encode(["V", "PST"], max_features=8)

print(token_ids)
print(feature_ids, feature_mask)
print(char_vocab.decode(token_ids))
```
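Given the encoding rules described below, `token_ids` should come back as a list of exactly `max_len` ids that starts with the `SOS` id and is right-padded with `PAD`, and `feature_ids` should be padded to `max_features` with a float mask of the same length.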

## Character vocabulary behavior

Encoding proceeds in five steps:

1. normalize input text with NFKC
2. prepend `<sos>`
3. append `<eos>`
4. truncate to `max_len`
5. right-pad with `<pad>`

This makes the output predictable and checkpoint-friendly.

Example:

```python
from chartoken import CharVocab

vocab = CharVocab.from_texts(["lemma", "form"])
tensor = vocab.encode("lemma", max_len=10)
print(tensor.shape)
```

`encode` returns a `torch.Tensor`, while `encode_ids` returns `list[int]`. That split is useful when you want preprocessing logic without eagerly creating tensors.
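For example, a preprocessing pass can stay tensor-free and batch later, assuming every row is padded to the same `max_len` as described above:

```python
import torch
from chartoken import CharVocab

vocab = CharVocab.from_texts(["walk", "went"])

# Pre-encode to plain ints during preprocessing...
rows = [vocab.encode_ids(text, max_len=8) for text in ["walk", "went"]]

# ...and build the batch tensor only when the model needs it.
batch = torch.tensor(rows, dtype=torch.long)
print(batch.shape)  # torch.Size([2, 8]), since every row is padded to max_len
```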

## Feature vocabulary behavior

Feature tags are supplied by the caller as a plain list and treated as unordered. The package:

- maps known tags to integer ids
- truncates to `max_features`
- pads the remainder with `FEATURE_PAD`
- returns a float mask aligned with the ids

Example:

```python
from chartoken import FeatureVocab

vocab = FeatureVocab.from_tags([["N", "SG"], ["N", "PL"], ["V", "PST"]])
ids, mask = vocab.encode(["N", "SG"], max_features=6)
```

If you want tensors directly:

```python
ids_tensor, mask_tensor = vocab.encode_tensor(["N", "SG"], max_features=6)
```
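A batching sketch on top of that, assuming `encode_tensor` returns a pair of 1-D tensors of length `max_features`:

```python
import torch
from chartoken import FeatureVocab

vocab = FeatureVocab.from_tags([["N", "SG"], ["N", "PL"], ["V", "PST"]])
examples = [["N", "SG"], ["V", "PST"]]

# Stack per-example feature ids and masks into (batch, max_features) tensors.
pairs = [vocab.encode_tensor(tags, max_features=6) for tags in examples]
ids_batch = torch.stack([ids for ids, _ in pairs])
mask_batch = torch.stack([mask for _, mask in pairs])
print(ids_batch.shape, mask_batch.shape)  # both torch.Size([2, 6])
```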

## Serialization

Both vocabularies are serializable to plain dictionaries and back:

```python
state = char_vocab.to_dict()
restored = CharVocab.from_dict(state)
```

This is useful for:

- checkpoint payloads
- experiment reproducibility
- packaging trained models
- keeping training and inference vocabularies aligned
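A checkpoint-payload sketch along those lines, assuming the state dictionaries pickle cleanly (the file name and payload keys are illustrative):

```python
import torch
from chartoken import CharVocab, FeatureVocab

char_vocab = CharVocab.from_texts(["walk", "walked"])
feature_vocab = FeatureVocab.from_tags([["V", "PST"]])

# Store both vocab states next to whatever else goes into the checkpoint.
torch.save(
    {"char_vocab": char_vocab.to_dict(), "feature_vocab": feature_vocab.to_dict()},
    "vocab_state.pt",
)

# At inference time, rebuild exactly the same vocabularies.
payload = torch.load("vocab_state.pt")
char_vocab = CharVocab.from_dict(payload["char_vocab"])
feature_vocab = FeatureVocab.from_dict(payload["feature_vocab"])
```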

## Typing

This package ships `py.typed` and is meant to be consumed by `pyright`/Pylance-aware codebases.

Typed state objects:

- `CharVocabState`
- `FeatureVocabState`

Exported constants:

- `PAD`
- `SOS`
- `EOS`
- `FEATURE_PAD`
- `SPECIAL_TOKENS`
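One way these types show up in practice; the assumption here (not stated above) is that `to_dict` returns a `CharVocabState` and `from_dict` accepts one:

```python
from chartoken import CharVocab, CharVocabState

# Annotate checkpoint helpers with the exported state type so pyright/Pylance
# can check the payload end to end. The return/parameter types are assumptions.
def pack_vocab(vocab: CharVocab) -> CharVocabState:
    return vocab.to_dict()

def unpack_vocab(state: CharVocabState) -> CharVocab:
    return CharVocab.from_dict(state)
```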

## Typical integration pattern

`chartoken-vp` is designed to sit underneath dataset and model packages.

A common stack looks like:

1. read raw TSV rows
2. build `CharVocab` from lemmas and surfaces
3. build `FeatureVocab` from tag lists
4. pre-encode examples into tensors
5. save vocab state in checkpoints
6. reuse the same states at inference time
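A compressed sketch of that stack, with made-up rows of the form `(lemma, form, "TAG;TAG;...")` standing in for the TSV input:

```python
from chartoken import CharVocab, FeatureVocab

# Made-up UniMorph-style rows; in practice these come from your TSV reader.
rows = [
    ("walk", "walked", "V;PST"),
    ("walk", "walks", "V;PRS;SG"),
]
lemmas = [lemma for lemma, _, _ in rows]
forms = [form for _, form, _ in rows]
tag_lists = [tags.split(";") for _, _, tags in rows]

# Steps 2-3: build both vocabularies.
char_vocab = CharVocab.from_texts(lemmas + forms)
feature_vocab = FeatureVocab.from_tags(tag_lists)

# Step 4: pre-encode every example into tensors.
examples = [
    (
        char_vocab.encode(lemma, max_len=16),
        char_vocab.encode(form, max_len=16),
        *feature_vocab.encode_tensor(tags, max_features=8),
    )
    for lemma, form, tags in zip(lemmas, forms, tag_lists)
]

# Steps 5-6: persist char_vocab.to_dict() / feature_vocab.to_dict() in the
# checkpoint, as shown in the Serialization section, and reuse them at inference.
```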

## What this package deliberately does not do

It does not include:

- BPE or sentencepiece tokenization
- dataset downloading
- batching or dataloaders
- model architectures
- training loops

That separation is intentional. `chartoken-vp` should stay easy to publish, easy to test, and easy to embed into larger systems.
