Metadata-Version: 2.4
Name: uzbek-mwe-tokenizer
Version: 0.1.0
Summary: Transformer-based intelligent model for identifying multi-word lexical units in Uzbek.
Home-page: https://huggingface.co/MaksudSharipov/Uzbek-MWE-Tokenizer-uzBERT
Author: Maksud Sharipov
Author-email: maqsbek72@gmail.com
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: torch
Requires-Dist: transformers
Requires-Dist: UzTransliterator
Requires-Dist: sentencepiece
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Uzbek MWE Tokenizer

This package identifies multi-word expressions (MWEs) in Uzbek text using a transformer-based model (UzBERT).

## Citation Requirement
If you use this model in your research or project, please cite the following paper:
> Sharipov, M. S. (2026). Transformer-based intelligent model for identifying multi-word lexical units and reducing syntactic ambiguities in Uzbek language texts. Bulletin of TUIT: Management and Communication Technologies, 2(14), 103-107. DOI: 10.61663/262tuitmct14  
> **Link:** [https://uzjurnal.uz/2/2026/2/index?issue=14](https://uzjurnal.uz/2/2026/2/index?issue=14)

## Installation

```bash
pip install uzbek-mwe-tokenizer
```

## Usage

The tokenizer accepts both Latin (`mode="lot"`) and Cyrillic (`mode="cyr"`) text.

### Example in Cyrillic

```python
from uzbek_mwe_tokenizer import UzbekMWETokenizer

# Initialize tokenizer in Cyrillic mode
tokenizer = UzbekMWETokenizer(mode="cyr")

sentences = [
    "Натижани кўриб, ҳамма ўқувчиларнинг бирданига тарвузи қўлтиғидан тушди.",
    "Қоронғи хонада унинг қаттиқ капалаги учди, гўё ерга кирса, қулоғидан тортиб чиқарадиган ҳолат эди.",
    "Рақибларимизни кўриб бизнинг асло тепа сочимиз тик бўлмади."
]

# Extract MWEs from each sentence and print them with their confidence scores
for sentence in sentences:
    print("Sentence:", sentence)
    mwes = tokenizer.extract_mwe(sentence)
    print("MWEs found:", mwes)
    print("-" * 50)
```

**Expected output:**
```text
Sentence: Натижани кўриб, ҳамма ўқувчиларнинг бирданига тарвузи қўлтиғидан тушди.
MWEs found: [{'mwe': 'тарвузи қўлтиғидан тушди', 'confidence': 100.0}]
--------------------------------------------------
Sentence: Қоронғи хонада унинг қаттиқ капалаги учди, гўё ерга кирса, қулоғидан тортиб чиқарадиган ҳолат эди.
MWEs found: [{'mwe': 'капалаги учди', 'confidence': 100.0}, {'mwe': 'ерга кирса', 'confidence': 76.0}, {'mwe': 'қулоғидан тортиб чиқарадиган', 'confidence': 99.0}]
--------------------------------------------------
Sentence: Рақибларимизни кўриб бизнинг асло тепа сочимиз тик бўлмади.
MWEs found: [{'mwe': 'тепа сочимиз тик бўлмади', 'confidence': 100.0}]
--------------------------------------------------
```
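
Each result is a dictionary with an `mwe` string and a `confidence` value. The sample output suggests the confidence is a percentage on a 0 to 100 scale, though that is an assumption drawn from the example rather than a documented guarantee. Under that assumption, a minimal sketch for dropping low-confidence matches could look like the following. The helper name `filter_mwes` and the threshold of 90 are hypothetical and not part of the package.

```python
# Sample result copied from the expected output above, so this runs
# without loading the model.
mwes = [
    {"mwe": "капалаги учди", "confidence": 100.0},
    {"mwe": "ерга кирса", "confidence": 76.0},
    {"mwe": "қулоғидан тортиб чиқарадиган", "confidence": 99.0},
]


def filter_mwes(mwes, min_confidence=90.0):
    """Hypothetical helper: keep entries at or above the threshold.

    Assumes `confidence` is a percentage in the range 0 to 100.
    """
    return [m for m in mwes if m["confidence"] >= min_confidence]


confident = filter_mwes(mwes)
print([m["mwe"] for m in confident])
# ['капалаги учди', 'қулоғидан тортиб чиқарадиган']
```

The right threshold depends on the downstream task, so it is worth tuning on your own data.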

### Example in Latin

In Latin mode, the package automatically transliterates the input to Cyrillic for the model and transliterates the output back to Latin.

```python
from uzbek_mwe_tokenizer import UzbekMWETokenizer

# Initialize tokenizer in Latin mode
tokenizer = UzbekMWETokenizer(mode="lot")

text = "Natijani ko'rib, hamma o'quvchilarning birdaniga tarvuzi qo'ltig'idan tushdi."
mwes = tokenizer.extract_mwe(text)
print(mwes)
```

**Expected output:**
```text
[{'mwe': "tarvuzi qo'ltig'idan tushdi", 'confidence': 100.0}]
```
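
Because the mode is chosen when the tokenizer is created, mixed-script input needs the script decided up front. The sketch below shows one possible way to pick between `"lot"` and `"cyr"` by counting letters. The function `detect_mode` is a hypothetical helper, not part of the package, and the simple majority rule is an assumption that may not suit every text.

```python
import string


def detect_mode(text):
    """Hypothetical helper: guess the tokenizer mode from the script.

    Returns "cyr" when Cyrillic letters outnumber basic Latin letters,
    otherwise "lot".
    """
    # U+0400..U+04FF is the Unicode Cyrillic block; it includes the
    # Uzbek-specific letters such as ў, қ, ғ and ҳ.
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")
    latin = sum(1 for ch in text if ch in string.ascii_letters)
    return "cyr" if cyrillic > latin else "lot"


print(detect_mode("Натижани кўриб"))   # cyr
print(detect_mode("Natijani ko'rib"))  # lot
```

The returned string can then be passed as the `mode` argument, for example `UzbekMWETokenizer(mode=detect_mode(text))`.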
