Metadata-Version: 2.4
Name: yakit
Version: 0.1.1
Summary: Yakut language text normalizer using Word2Vec embeddings
Project-URL: Homepage, https://github.com/Michiluser/yakit
Project-URL: Documentation, https://github.com/Michiluser/yakit#readme
Project-URL: Repository, https://github.com/Michiluser/yakit
Project-URL: Issues, https://github.com/Michiluser/yakit/issues
Author: Michil
License: MIT
License-File: LICENSE
Keywords: nlp,sakha,text-normalization,word2vec,yakut
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Russian
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.14,>=3.10
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: gensim>=4.3.0
Requires-Dist: llama-index-llms-ollama>=0.1.0
Requires-Dist: llama-index>=0.10.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: requests>=2.31.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: huggingface-hub>=0.16.0; extra == 'all'
Requires-Dist: pre-commit>=3.0.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: huggingface-hub>=0.16.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: download
Requires-Dist: huggingface-hub>=0.16.0; extra == 'download'
Description-Content-Type: text/markdown

# Yakit - Yakut Language Text Normalizer

A Python library for normalizing Yakut (Sakha) language text using Word2Vec embeddings.

## Installation

```bash
pip install yakit
```

For automatic model downloading from Hugging Face Hub:

```bash
pip install "yakit[download]"
```

## Quick Start

```python
from yakit.normalizers import Word2VecNormalizer

# Initialize normalizer (auto-downloads the model on first use; needs the `download` extra)
normalizer = Word2VecNormalizer()

# Normalize text
text = "Мин сахалыы билэбин"
normalized = normalizer.normalize(text)
print(normalized)
```

## Custom Model Path

If you have your own Word2Vec model:

```python
from yakit.normalizers import Word2VecNormalizer

normalizer = Word2VecNormalizer(
    word2vec_path="/path/to/your/model.bin",
    training_data_path="/path/to/train_pairs.txt"  # optional
)
```

## Command Line Interface

```bash
# Normalize text directly
yakit normalize "Мин сахалыы билэбин"

# Normalize a file
yakit normalize -i input.txt -o output.txt

# Download models manually
yakit download

# Show cache info
yakit info
```
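
The file workflow can also be scripted from Python. The following is a minimal sketch, not part of yakit: `normalize_file` is a hypothetical helper that applies any string-to-string callable line by line, and it assumes that normalizing each line separately is acceptable for your data. Whether the `yakit normalize -i ... -o ...` command works the same way is not stated here.

```python
from pathlib import Path
from typing import Callable


def normalize_file(src: str, dst: str, normalize: Callable[[str], str]) -> int:
    """Normalize `src` line by line into `dst`; return the number of lines written.

    Hypothetical helper for illustration; `normalize` can be any callable
    that maps one line of text to its normalized form.
    """
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    # Write each normalized line followed by a newline.
    output = "".join(normalize(line) + "\n" for line in lines)
    Path(dst).write_text(output, encoding="utf-8")
    return len(lines)
```

With the normalizer from Quick Start, a call would look like `normalize_file("input.txt", "output.txt", Word2VecNormalizer().normalize)`.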

## What is Normalization?

Normalization restores the special Yakut letters in text that was typed without them:

| Input | Output |
|-------|--------|
| h | һ |
| г | ҕ (in certain positions) |
| н | ҥ (in certain positions) |
| о | ө (in certain positions) |
| у | ү (in certain positions) |
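
Because most replacements apply only "in certain positions", a single plain word can correspond to several possible Yakut spellings. The sketch below only enumerates those candidates from the table above; it is an illustration, not yakit's implementation, and the names `ALWAYS`, `CONTEXTUAL`, and `candidate_spellings` are hypothetical. It handles lowercase letters only and does not choose among the candidates, which is the part a trained model has to do.

```python
from itertools import product

# Letter pairs taken from the table above.
# "h" is the Latin letter and is always replaced; the Cyrillic letters
# are replaced only in some positions, so both variants are kept.
ALWAYS = {"h": "һ"}
CONTEXTUAL = {"г": "ҕ", "н": "ҥ", "о": "ө", "у": "ү"}


def candidate_spellings(word: str) -> list[str]:
    """Return every spelling reachable through the table's substitutions."""
    options = []
    for ch in word:
        if ch in ALWAYS:
            options.append((ALWAYS[ch],))
        elif ch in CONTEXTUAL:
            options.append((ch, CONTEXTUAL[ch]))
        else:
            options.append((ch,))
    # Cartesian product over the per-letter options.
    return ["".join(chars) for chars in product(*options)]


print(candidate_spellings("уол"))
```

The number of candidates doubles with every ambiguous letter, which is why picking the right spelling needs context rather than a fixed rule.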

## Performance

With optimized hyperparameters:
- Character Accuracy: **97.15%**
- Word Accuracy: **92.09%**
- Exact Match: **61.77%**
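
This README does not define the three metrics. The sketch below shows one common way to compute them, under the assumptions that predictions and references have the same length (normalization substitutes letters one for one) and that exact match is counted per whole text. The function names are hypothetical, and the figures above may have been computed differently.

```python
def char_accuracy(predicted: str, reference: str) -> float:
    """Share of character positions where the two strings agree."""
    if len(predicted) != len(reference):
        raise ValueError("strings must have the same length")
    if not reference:
        return 1.0
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def word_accuracy(predicted: str, reference: str) -> float:
    """Share of whitespace-separated words that match exactly."""
    pred_words, ref_words = predicted.split(), reference.split()
    if len(pred_words) != len(ref_words):
        raise ValueError("texts must have the same number of words")
    if not ref_words:
        return 1.0
    return sum(p == r for p, r in zip(pred_words, ref_words)) / len(ref_words)


def exact_match(predictions: list[str], references: list[str]) -> float:
    """Share of texts reproduced without any error."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty lists of equal length")
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
```

A single wrong letter lowers character accuracy only slightly but fails the whole word and the whole text, which is consistent with exact match being the lowest of the three figures.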

## Requirements

- Python 3.10–3.13 (3.14 not yet supported: gensim has no compatible build)
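- gensim
- numpy
- tqdm
- beautifulsoup4
- requests
- llama-index
- llama-index-llms-ollama
- huggingface-hub (optional, installed by the `download` extra)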

## License

MIT
