Metadata-Version: 2.4
Name: ugtext_processor
Version: 0.1.8
Summary: Text processing for Uyghur script
Author-email: uyplayer <uyplayer@outlook.com>
License: MIT License
        
        Copyright (c) 2025 uyplayer
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in
        all copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
        THE SOFTWARE.
        
Project-URL: Homepage, https://github.com/uyplayer/ugtext_processor
Project-URL: Repository, https://github.com/uyplayer/ugtext_processor
Project-URL: Issues, https://github.com/uyplayer/ugtext_processor/issues
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pytest
Requires-Dist: epitran
Requires-Dist: sentencepiece
Requires-Dist: tokenizers
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Dynamic: license-file

# Text Processing for Uyghur Script

`ugtext_processor` is a Python library for processing Uyghur text. It provides tools for normalization, phonemization, and tokenization.

## Features

*   **Normalizer**: Cleans and normalizes Uyghur text by handling punctuation, abbreviations, currency, dates, and numbers.
*   **Phonemizer**: Converts Uyghur text into IPA or ULY Latin script representations.
*   **Tokenizer**: Supports word, character, BPE, WordPiece, and SentencePiece tokenization.

## Installation
 
```bash
pip install ugtext-processor
```

## Usage

### Normalizer

The `normalizer` module provides a simple interface to clean and normalize Uyghur text.

```python
from ugtext_processor.normalizer import normalize

text = "بۈگۈن 2024/07/26 سائەت 14:30، باھاسى ¥120.5، ئېغىرلىقى 2kg"
# Runs the punctuation, abbreviation, currency, date, and number steps in order
normalized_text = normalize(text)
print(normalized_text)
```

### Phonemizer

The `phonemizer` module can convert Uyghur text to IPA or ULY Latin script.

```python
from ugtext_processor.phonemizer import UgPhonemizer

# To ULY Latin script
phonemizer_uly = UgPhonemizer(mod=UgPhonemizer.Mod.ULY)
text = "ياخشىمۇسىز؟"
uly_phonemes = phonemizer_uly.phonemizer(text)
print(f"ULY: {''.join(uly_phonemes)}")

# To IPA
phonemizer_ipa = UgPhonemizer(mod=UgPhonemizer.Mod.IPA)
ipa_phonemes = phonemizer_ipa.phonemizer(text)
print(f"IPA: {''.join(ipa_phonemes)}")
```

### Tokenizer

The `tokenizer` module provides a factory to create different types of tokenizers.

```python
from ugtext_processor.tokenizer import TokenizerFactory, TokenizerType

# Word Tokenizer
word_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.WORD)
text = "بۇ بىر ئاددىي جۈملە."
tokens = word_tokenizer.tokenize(text)
print(f"Word Tokens: {tokens}")

# Character Tokenizer
char_tokenizer = TokenizerFactory.create_tokenizer(TokenizerType.CHARACTER)
tokens = char_tokenizer.tokenize(text)
print(f"Character Tokens: {tokens}")
```

## Modules

### `ugtext_processor.normalizer`

This module contains functions to normalize Uyghur text. The main function is `normalize`, which applies the following steps in order:

1.  `UyghurPunctuationNormalizer`: Normalizes and cleans punctuation.
2.  `UyghurAbbreviation`: Expands common abbreviations.
3.  `UyghurCurrency`: Converts currency symbols to text.
4.  `UyghurDateNormalizer`: Normalizes dates and times into spoken form.
5.  `UyghurNumberNormalizer`: Converts numbers into spoken form.
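
The order of these steps matters because a later step can destroy the patterns an earlier one depends on. The sketch below illustrates this with two hypothetical stand-in steps (`tag_dates` and `tag_numbers`), which are not part of `ugtext_processor`; it only assumes that each step is a function from string to string.

```python
import re
from typing import Callable, List

# A step is any function that takes text and returns transformed text.
Step = Callable[[str], str]


def tag_dates(text: str) -> str:
    """Hypothetical stand-in for a date step: mark YYYY/MM/DD dates."""
    return re.sub(r"\d{4}/\d{2}/\d{2}", "<date>", text)


def tag_numbers(text: str) -> str:
    """Hypothetical stand-in for a number step: mark every run of digits."""
    return re.sub(r"\d+", "<num>", text)


def run_pipeline(text: str, steps: List[Step]) -> str:
    """Apply each step to the output of the previous one."""
    for step in steps:
        text = step(text)
    return text


sample = "2024/07/26 120"

# Dates first, then numbers: the date survives as a single unit.
in_order = run_pipeline(sample, [tag_dates, tag_numbers])
print(in_order)  # <date> <num>

# Numbers first: the date is broken into pieces before the date step sees it.
reversed_order = run_pipeline(sample, [tag_numbers, tag_dates])
print(reversed_order)  # <num>/<num>/<num> <num>
```

This is why the date step is listed before the number step: once the digits of a date have been rewritten, the date pattern no longer matches.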

### `ugtext_processor.phonemizer`

This module provides the `UgPhonemizer` class for converting Uyghur text into phonetic representations.

*   `UgPhonemizer(mod: Mod)`: The constructor takes a `mod` argument that selects the output representation: `UgPhonemizer.Mod.IPA` or `UgPhonemizer.Mod.ULY`.
*   `phonemizer(text: str)`: The main method that performs the conversion.
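
To give a rough idea of what a ULY conversion involves, the sketch below transliterates text one character at a time using a small lookup table. It is an illustration only: the table covers just the letters needed for the example word, the function name `to_uly` is hypothetical, and the library's own implementation may work quite differently.

```python
from typing import Dict, List

# Partial, illustrative Uyghur Arabic -> ULY table (assumption: not the
# library's table). Keys are written as escapes to avoid rendering issues.
ULY_TABLE: Dict[str, str] = {
    "\u064A": "y",   # ARABIC LETTER YEH
    "\u0627": "a",   # ARABIC LETTER ALEF
    "\u062E": "x",   # ARABIC LETTER KHAH
    "\u0634": "sh",  # ARABIC LETTER SHEEN
    "\u0649": "i",   # ARABIC LETTER ALEF MAKSURA
    "\u0645": "m",   # ARABIC LETTER MEEM
    "\u06C7": "u",   # ARABIC LETTER U
    "\u0633": "s",   # ARABIC LETTER SEEN
    "\u0632": "z",   # ARABIC LETTER ZAIN
}


def to_uly(text: str) -> List[str]:
    """Map each character through the table; unknown characters pass through."""
    return [ULY_TABLE.get(ch, ch) for ch in text]


# The greeting from the usage example, without the trailing question mark.
word = "\u064A\u0627\u062E\u0634\u0649\u0645\u06C7\u0633\u0649\u0632"
print(f"ULY: {''.join(to_uly(word))}")  # ULY: yaxshimusiz
```

A plain per-character lookup like this ignores context-dependent rules, which is one reason a dedicated phonemizer is useful.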

### `ugtext_processor.tokenizer`

This module provides a `TokenizerFactory` for creating various tokenizers.

*   `TokenizerFactory.create_tokenizer(tokenizer_type: TokenizerType, **kwargs)`: Creates a tokenizer instance.
*   `TokenizerType`: An enum with the following values:
    *   `WORD`
    *   `CHARACTER`
    *   `BPE`
    *   `WORDPIECE`
    *   `SENTENCEPIECE`
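
The factory-plus-enum arrangement described above can be sketched as follows. All names here (`SketchTokenizerType`, `SketchTokenizerFactory`, and the two tokenizer classes) are hypothetical and exist only for illustration; the whitespace and per-character splitting rules are assumptions, not the library's actual behavior.

```python
from enum import Enum, auto
from typing import Dict, List


class SketchTokenizerType(Enum):
    """Hypothetical enum mirroring part of the documented TokenizerType."""
    WORD = auto()
    CHARACTER = auto()
    BPE = auto()  # declared but deliberately not registered below


class WordTokenizer:
    def tokenize(self, text: str) -> List[str]:
        # Assumption: words are separated by whitespace.
        return text.split()


class CharacterTokenizer:
    def tokenize(self, text: str) -> List[str]:
        # Assumption: whitespace is dropped, every other character is a token.
        return [ch for ch in text if not ch.isspace()]


class SketchTokenizerFactory:
    # Registry mapping each supported enum member to its tokenizer class.
    _registry: Dict[SketchTokenizerType, type] = {
        SketchTokenizerType.WORD: WordTokenizer,
        SketchTokenizerType.CHARACTER: CharacterTokenizer,
    }

    @classmethod
    def create_tokenizer(cls, tokenizer_type: SketchTokenizerType, **kwargs):
        """Look up the class for the requested type and instantiate it."""
        try:
            tokenizer_cls = cls._registry[tokenizer_type]
        except KeyError:
            raise ValueError(f"Unsupported tokenizer type: {tokenizer_type}") from None
        return tokenizer_cls(**kwargs)


word_tokens = SketchTokenizerFactory.create_tokenizer(SketchTokenizerType.WORD).tokenize("bu bir")
char_tokens = SketchTokenizerFactory.create_tokenizer(SketchTokenizerType.CHARACTER).tokenize("bu bir")
print(word_tokens)  # ['bu', 'bir']
print(char_tokens)  # ['b', 'u', 'b', 'i', 'r']
```

Keeping the mapping in a registry means a new strategy can be added by registering one more class, without changing the calling code.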
