Metadata-Version: 2.4
Name: dv_normalizer
Version: 0.1.8
Summary: Dhivehi text normalization for TTS frontends
Project-URL: Homepage, https://github.com/alakxender/dv-text
Author-email: Alakxender <alakxender@gmail.com>
License: MIT License
        
        Copyright (c) 2024 
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: dhivehi,nlp,text-normalization,thaana,tts
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: python-dateutil>=2.8
Requires-Dist: pyyaml>=6.0
Requires-Dist: regex>=2023.6.3
Provides-Extra: dev
Requires-Dist: huggingface-hub>=0.20; extra == 'dev'
Requires-Dist: hypothesis>=6; extra == 'dev'
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest-cov>=4; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: webui
Requires-Dist: fastapi>=0.110; extra == 'webui'
Requires-Dist: jinja2>=3.1; extra == 'webui'
Requires-Dist: python-multipart>=0.0.9; extra == 'webui'
Requires-Dist: uvicorn[standard]>=0.27; extra == 'webui'
Description-Content-Type: text/markdown

# dv-normalize

A Dhivehi text normalizer for TTS frontends. Converts numbers, dates, times,
fractions, scores, abbreviations, money, percentages, and other non-Thaana
input into spoken-form Dhivehi.

> **Status:** `0.1.8`. The v1 rewrite ships within the existing 0.1.x
> version line. The new public API (`normalize`, `Normalizer`,
> `NormalizerConfig`) is the supported entry point. The pre-rewrite classes
> are still exported, but they emit a `DeprecationWarning` and will be
> removed in a future major release.

## Installation

```bash
pip install dv-normalizer
```

## Quick start

```python
from dv_normalize import normalize

normalize("ވަކި ލާރިން ވެސް 232.23 ލާރި ހޯދައެވެ")
# 'ވަކި ލާރިން ވެސް ދުއިސައްތަ ތިރީސް ދޭއް ޕޮއިންޓް ދޭއް ތިނެއް ލާރި ހޯދައޭ'

normalize("ޑރ. އިބްރާހިމް 14:30 ގައި އައި")
normalize("ކ.އަތޮޅު ވިލިނގިލިން 120 ކިލޯ މީޓަރު")
normalize("ފޭސް2ގެ")
```

For repeated use, hold onto a `Normalizer` instance:

```python
from dv_normalize import Normalizer, NormalizerConfig

n = Normalizer(NormalizerConfig(keep_punctuation=False))
n("ހެލޯ، ދުނިޔެ")  # → 'ހެލޯ ދުނިޔެ'
```

## What it handles

| Class           | Example input          | Example output                                        |
| --------------- | ---------------------- | ----------------------------------------------------- |
| Cardinal        | `232`                  | `ދުއިސައްތަ ތިރީސް ދޭއް`                                |
| Comma-grouped   | `104,880`              | (single cardinal, not per-digit)                      |
| Per-digit       | `9982711`              | spelled digit-by-digit (7+ digit identifier)          |
| Decimal         | `232.23`               | `ދުއިސައްތަ ތިރީސް ދޭއް ޕޮއިންޓް ދޭއް ތިނެއް`              |
| Year            | `2024`                 | `ދެހާސް ސައުވީސް`                                       |
| Year range      | `1982 - 2024`          | `… ން … އަށް`                                          |
| Time            | `14:30`                | `ސާދަ ގަޑި ތިރީސް`                                      |
| Ordinal         | `11ވަނަ`               | adnominal head form                                   |
| Fraction        | `1/2`                  | `ދެބައިކުޅަ އެއްބައި`                                    |
| Mixed fraction  | `1 1/2`                | `… އަދި …`                                            |
| Percent         | `25%`                  | `ފަންސަވީސް ޕަސެންޓް`                                    |
| Oblique ref     | `2024/3`               | `… ޚާއްސަ <denom-ordinal>`                            |
| Score           | `3-2`, `0-0`, `5-0`    | compact draw / shutout forms                          |
| Money           | `52 ރ.`                | Rufiyaa context-sensitive                             |
| Abbreviation    | `ޑރ.`, `ހއ.`           | `ޑޮކްޓަރު`, `ހާ އަލިފު`                                |
| Compound abbrev | `ސ.ޢ.ވ.`               | `ޞައްލަﷲ ޢަލައިހި ވަސައްލަމް`                            |
| Calendar marker | `2026 މ.`, `1447 ހ.`   | `… މީލާދީ`, `… ހިޖުރީ`                                  |
| Sentence ending | `ހޯދައެވެ`              | `ހޯދައޭ` (113 rules, context-sensitive)                |

The classifier is priority-ranked, so more specific patterns (calendar
markers, multi-letter compound abbreviations, year ranges) shadow the
generic ones. Tokens that don't match any rule pass through unchanged.
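The priority ranking described above can be illustrated with a minimal, self-contained sketch. The rule names, patterns, and the `classify` helper below are hypothetical and are not the library's actual rule table; they only show how an ordered list lets a specific rule shadow a generic one, and how unmatched tokens fall through.

```python
import re
from typing import Optional

# Hypothetical rule table, most specific first. These patterns are
# illustrative only and are NOT the rules dv_normalize actually uses.
RULES = [
    ("year_range", re.compile(r"\d{4}\s*-\s*\d{4}")),
    ("time", re.compile(r"(?:[01]?\d|2[0-3]):[0-5]\d")),
    ("percent", re.compile(r"\d+(?:\.\d+)?%")),
    ("score", re.compile(r"\d{1,2}-\d{1,2}")),
    ("comma_grouped", re.compile(r"\d{1,3}(?:,\d{3})+")),
    ("decimal", re.compile(r"\d+\.\d+")),
    ("per_digit", re.compile(r"\d{7,}")),        # long identifiers
    ("year", re.compile(r"(?:1\d|20)\d{2}")),    # assumed year window
    ("cardinal", re.compile(r"\d+")),            # generic fallback
]


def classify(token: str) -> Optional[str]:
    """Return the first (highest-priority) class matching the whole token."""
    for name, pattern in RULES:
        if pattern.fullmatch(token):
            return name
    # No rule matched: a caller would pass the token through unchanged.
    return None


for sample in ["1982 - 2024", "2024", "9982711", "104,880", "232", "abc"]:
    print(sample, "->", classify(sample))
```

In this sketch `2024` and `9982711` would both match the generic `cardinal` pattern, but the `year` and `per_digit` rules come earlier in the list and win.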

## Configuration

```python
from dv_normalize import NormalizerConfig

config = NormalizerConfig(
    dialect="spoken",            # only option for now
    unknown_latin="passthrough", # "passthrough" | "drop" | "spell"
    decimal_separator="auto",    # "auto" | "dot" | "comma"
    time_system="auto",          # "auto" | "12" | "24"
    currency_default="MVR",
    keep_punctuation=True,
    diagnostic=False,
    strict=False,
)
```

## Diagnostic mode

`Normalizer.trace(text)` returns the classified token list instead of the
joined text, which is useful for checking which rule fired:

```python
from dv_normalize import Normalizer

for tok in Normalizer().trace("ޑރ. އިބްރާހިމް 2024ގައި"):
    print(tok.cls, tok.text, tok.spoken, tok.fields)
```

## Legacy API

The original 0.1.x classes (`DhivehiNumberConverter`, `DhivehiTimeConverter`,
`DhivehiYearConverter`, `DhivehiTextProcessor`) are still importable from
`dv_normalize` but emit a `DeprecationWarning`. They are scheduled for
removal in a future major release, so new code should use `normalize()` or
`Normalizer` instead.
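
When migrating, it can help to find the call sites that still trigger the warning. The sketch below uses only the standard-library `warnings` module. `LegacyConverter` and `find_deprecated_calls` are hypothetical stand-ins written for this example, not part of `dv_normalize`; in real code the callable you pass in would be whatever constructs or uses the legacy classes.

```python
import warnings
from typing import Callable, List


class LegacyConverter:
    """Hypothetical stand-in for a deprecated class (not from dv_normalize)."""

    def __init__(self) -> None:
        warnings.warn(
            "LegacyConverter is deprecated; use normalize() instead",
            DeprecationWarning,
            stacklevel=2,
        )


def find_deprecated_calls(fn: Callable[[], object]) -> List[str]:
    """Run fn and return the messages of any DeprecationWarning it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # The default filters often hide DeprecationWarning, so record all.
        warnings.simplefilter("always")
        fn()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


messages = find_deprecated_calls(LegacyConverter)
print(messages)
```

Running a test suite with `python -W error::DeprecationWarning` is another option: it turns each such warning into an exception, so leftover legacy usage fails loudly.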

## License

MIT. See `LICENSE`.
