Metadata-Version: 2.4
Name: villm-tok-fast
Version: 0.2.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Dist: transformers>=4.30.0
Summary: ViLLM Fast Tokenizer — Rust backend with embedded SentencePiece for Vietnamese-English code-switching
Keywords: vietnamese,tokenizer,nlp,code-switching,sentencepiece
Author-email: vlinhd11 <vlinhd11@users.noreply.github.com>
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://huggingface.co/vlinhd11/villm-tokenizer
Project-URL: Repository, https://github.com/vlinhd11/villm-tok-fast

# villm-tok-fast

Rust-backed fast tokenizer for Vietnamese-English code-switching, used in the [viLLM](https://github.com/vlinhd11/viLLM) project.

## Installation

```bash
pip install villm-tok-fast
```

## Usage

```python
from villm_tok_fast import create_fast_tokenizer
from transformers import PreTrainedTokenizerFast

# Load from HuggingFace Hub (auto-downloads vocab files)
hf_tokenizer = create_fast_tokenizer("vlinhd11/villm-tokenizer")

# Or load from local directory
hf_tokenizer = create_fast_tokenizer("./path/to/villm-tokenizer")

# Tokenize
enc = hf_tokenizer("Học sinh giỏi tiếng Việt và học lập trình Python")
# tokens: ['Học_sinh', 'giỏi', 'tiếng_Việt', 'và_học', 'lập_trình', '[VI→EN]', '▁Python']

# Batch encode
batch = hf_tokenizer(["Câu thứ nhất", "Câu thứ hai"], padding=True, truncation=True)
```

## Features

- **Language detection**: classifies each word as VI, EN, Num, Code, or Punct
- **Vietnamese Viterbi**: merges frequent syllable bigrams into compound tokens (`học_sinh`, `Việt_Nam`)
- **English subword**: embedded SentencePiece trie for English OOV words (`▁Python`, `▁programming`)
- **Code-switch markers**: optional `[VI→EN]` / `[EN→VI]` tokens inserted at language boundaries
- **Byte fallback**: emits `<0xNN>` tokens for unknown characters
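
To make the language-detection and code-switch-marker features concrete, here is a minimal pure-Python sketch. It is not the package's Rust implementation: `classify` and `insert_markers` are hypothetical names, and the diacritic-based heuristic is an assumption chosen for brevity (an unaccented Vietnamese syllable such as `sinh` would be labeled EN by it).

```python
import re
import unicodedata
from typing import List


def classify(word: str) -> str:
    """Hypothetical per-word classifier returning VI, EN, NUM, CODE, or PUNCT."""
    if re.fullmatch(r"\d+([.,]\d+)*", word):
        return "NUM"
    if not any(ch.isalnum() for ch in word):
        return "PUNCT"
    if any(ch in "_(){}[]=<>/\\" for ch in word):
        return "CODE"
    # Assumption: a word carrying tone/vowel marks (or the letter đ) is Vietnamese.
    decomposed = unicodedata.normalize("NFD", word.lower())
    if "đ" in decomposed or any(unicodedata.combining(ch) for ch in decomposed):
        return "VI"
    return "EN"


def insert_markers(words: List[str]) -> List[str]:
    """Insert [VI→EN] / [EN→VI] wherever the language of consecutive words flips."""
    out: List[str] = []
    prev_lang = None
    for word in words:
        lang = classify(word)
        if lang in ("VI", "EN"):  # numbers, code, and punctuation do not switch language
            if prev_lang is not None and lang != prev_lang:
                out.append(f"[{prev_lang}→{lang}]")
            prev_lang = lang
        out.append(word)
    return out


print(insert_markers("Tôi học Python mỗi ngày".split()))
# ['Tôi', 'học', '[VI→EN]', 'Python', '[EN→VI]', 'mỗi', 'ngày']
```

The real tokenizer presumably relies on its vocabulary rather than on diacritics alone; the sketch only illustrates where markers land relative to the words.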

## Performance

Roughly 4x faster than the pure-Python equivalent: about 55k texts/sec versus about 14k texts/sec.
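
The figures above do not come with a measurement script, so the following is only a sketch of how throughput in texts/sec could be measured on your own data. `texts_per_second` is a hypothetical helper, not part of this package; it accepts any callable, for example the `hf_tokenizer` object from the Usage section.

```python
import time
from typing import Callable, Sequence


def texts_per_second(
    tokenize: Callable[[str], object],
    texts: Sequence[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput (texts/sec) over `repeats` passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for text in texts:
            tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on very coarse clocks.
    return len(texts) / max(best, 1e-12)


# Stand-in tokenizer so the sketch runs without the package installed.
sample = ["Câu thứ nhất", "Câu thứ hai"] * 1000
rate = texts_per_second(str.split, sample)
print(f"{rate:,.0f} texts/sec")
```

Taking the best of several passes reduces noise from warm-up and background load; results will still vary with hardware and text length.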

## Requirements

- Python 3.8+
- transformers >= 4.30.0

## How it works

This package provides a Rust implementation of the ViLLM tokenizer as a PyO3 native extension.
It exposes a Python class compatible with HuggingFace `PreTrainedTokenizerFast` via the
`tokenizer_object=` parameter.

The Rust core handles:
1. Word-level language detection
2. Viterbi dynamic programming for Vietnamese compound formation
3. SentencePiece trie lookup for English subword segmentation, preferring whole-word matches
4. Code-switch marker insertion
5. Byte fallback for OOV characters
6. Decoding with smart punctuation join

## License

MIT

