Metadata-Version: 2.4
Name: indic-tts-preprocess
Version: 0.1.1
Summary: Convert numbers and dates in Indic text to spoken words for TTS models
Author-email: Dhruv Dornal <dhruvdornal2003@gmail.com>
License: MIT License
        
        Copyright (c) 2025 Dhruv
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Keywords: tts,indic,hindi,marathi,nlp,text-to-speech,preprocessing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Natural Language :: Hindi
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Dynamic: license-file

# indic-tts-preprocess

A small Python library that fixes one specific but very annoying problem with open-source Indic TTS models: **they cannot read numbers**.

Models like [ai4bharat/indic-parler-tts](https://huggingface.co/ai4bharat/indic-parler-tts) hallucinate badly when the input text contains raw digits. Feed them `"1997 में"` and you get garbled audio. Feed them `"उन्नीस सौ सत्तानवे में"` and it works perfectly.

This library does that conversion for you — numbers, years, dates — before the text ever reaches the tokenizer.

---

## Install

```bash
pip install indic-tts-preprocess
```

No dependencies. Pure Python. Works on Python 3.8 and above.

---

## Quick start

```python
from indic_tts_preprocess import preprocess

# Hindi
preprocess("उनका जन्म 5 अगस्त 1997 को हुआ", "hi")
# -> "उनका जन्म पाँच अगस्त उन्नीस सौ सत्तानवे को हुआ"

# Marathi
preprocess("त्यांचा जन्म 15 ऑगस्ट 1947 रोजी झाला", "mr")
# -> "त्यांचा जन्म पंधरा ऑगस्ट एकोणीस शे सत्तेचाळीस रोजी झाला"

# English
preprocess("He was born on 15 August 1947", "en")
# -> "He was born on fifteen August nineteen forty seven"
```

---

## Supported languages

| Code | Language |
|------|----------|
| `hi` | Hindi    |
| `mr` | Marathi  |
| `en` | English  |

If you pass an unsupported language code, the text comes back unchanged — nothing crashes.

---

## What it handles

**Date formats** (numeric dates are read as day/month/year)

| Input | Language | Output |
|-------|----------|--------|
| `5 अगस्त 2004` | hi | `पाँच अगस्त दो हज़ार चार` |
| `05/08/2004` | hi | `पाँच अगस्त दो हज़ार चार` |
| `05-08-2004` | hi | `पाँच अगस्त दो हज़ार चार` |
| `15 August 1947` | en | `fifteen August nineteen forty seven` |
| `15/08/1947` | en | `fifteen August nineteen forty seven` |
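As a rough illustration of the numeric rows above, here is a minimal sketch of turning `DD/MM/YYYY` and `DD-MM-YYYY` into a day, month name, and year that a number-to-words step could then spell out. The names `NUMERIC_DATE`, `MONTHS_EN`, and `expand_numeric_dates` are hypothetical and the logic is an assumption, not the library's actual implementation.

```python
import re
from datetime import date

# Hypothetical month table for the sketch (English only).
MONTHS_EN = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Day, separator, month, the same separator again, then a four-digit year.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})([/-])(\d{1,2})\2(\d{4})\b")


def expand_numeric_dates(text: str) -> str:
    """Rewrite DD/MM/YYYY or DD-MM-YYYY as 'D Month YYYY'."""

    def repl(match: "re.Match[str]") -> str:
        day = int(match.group(1))
        month = int(match.group(3))
        year = int(match.group(4))
        try:
            date(year, month, day)  # rejects impossible dates such as 31/02
        except ValueError:
            return match.group(0)  # leave anything invalid untouched
        return f"{day} {MONTHS_EN[month - 1]} {year}"

    return NUMERIC_DATE.sub(repl, text)


print(expand_numeric_dates("He was born on 15/08/1947"))
print(expand_numeric_dates("05-08-2004"))
```

Validating with `datetime.date` keeps the sketch from rewriting strings that only look like dates.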

**Standalone numbers**

| Input | Language | Output |
|-------|----------|--------|
| `73` | hi | `तिहत्तर` |
| `1997` | hi | `उन्नीस सौ सत्तानवे` |
| `2024` | en | `two thousand twenty four` |
| `1905` | en | `nineteen oh five` |

**Year handling** (the tricky part)

Hindi and Marathi speakers read years in the 1900s as a count of hundreds ("nineteen hundred ninety-seven") rather than as a full thousands-based number:
- `1997` → `उन्नीस सौ सत्तानवे` (not `एक हज़ार नौ सौ सत्तानवे`)

English speakers shorten these years too, reading them as two pairs of digits:
- `1997` → `nineteen ninety seven` (not `one thousand nine hundred ninety seven`)
- `1905` → `nineteen oh five`

`preprocess` produces all of these spoken forms for you.
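The English year readings shown in the tables can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's own code: `year_to_words_en` and `two_digits` are hypothetical names, and the handling of years such as `1900` or `2000` is a guess, since the README does not list them.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99 with spaces instead of hyphens, e.g. 47 -> 'forty seven'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def year_to_words_en(year: int) -> str:
    """Read a four-digit year aloud (hypothetical helper)."""
    if not 1000 <= year <= 9999:
        raise ValueError("this sketch only covers four-digit years")
    century, rest = divmod(year, 100)
    if century % 10 == 0:
        # 20xx style: 'two thousand twenty four' (assumed for 1000s, 3000s, ... too)
        head = f"{ONES[century // 10]} thousand"
        return head if rest == 0 else f"{head} {two_digits(rest)}"
    if rest == 0:
        return f"{two_digits(century)} hundred"  # assumption: 1900 -> 'nineteen hundred'
    if rest < 10:
        return f"{two_digits(century)} oh {ONES[rest]}"  # 1905 -> 'nineteen oh five'
    return f"{two_digits(century)} {two_digits(rest)}"  # 1997 -> 'nineteen ninety seven'


for y in (1997, 1905, 1947, 2024):
    print(y, "->", year_to_words_en(y))
```

The Hindi and Marathi readings follow the same split into a hundreds count and a two-digit remainder, but need a full table of number words from 0 to 99, which is omitted here.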

---

## API reference

### `preprocess(text, lang)`

| Parameter | Type | Description |
|-----------|------|-------------|
| `text` | `str` | The raw input text containing digits/dates |
| `lang` | `str` | Language code: `"hi"`, `"mr"`, or `"en"` |

Returns `str`: the same text with all digits replaced by spoken words. If `lang` is not a supported code, `text` is returned unchanged.

Raises `TypeError` if `text` is not a string.
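The contract described above (type check, pass-through for unknown language codes, per-language conversion) could look roughly like this. It is a sketch only: `preprocess_sketch` and `HANDLERS` are hypothetical names, the handler is a toy stand-in, and the order of the checks is an assumption.

```python
from typing import Callable, Dict

# Hypothetical registry of per-language converters.
# The single entry is a toy stand-in, not a real number-to-words converter.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "en": lambda text: text.replace("5", "five"),
}


def preprocess_sketch(text, lang):
    """Mimic the documented behaviour of preprocess(text, lang)."""
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    handler = HANDLERS.get(lang)
    if handler is None:
        return text  # unsupported language code: return the input unchanged
    return handler(text)


print(preprocess_sketch("room 5", "en"))
print(preprocess_sketch("room 5", "xx"))
```

Returning the input for an unknown code lets callers run the same preprocessing step over mixed-language data without extra guards.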

---

## Contributing

Adding a new language is straightforward:

1. Create `indic_tts_preprocess/languages/yourlang.py` — look at `hindi.py` as a template
2. Add a `preprocess(text)` function and a `num_to_words(n)` function
3. Add the new language code in `core.py`
4. Add tests in `tests/test_yourlang.py`
5. Open a pull request
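For step 2, a new language module might start from a skeleton like the one below. It is only a sketch: the placeholder vocabulary is English, numbers are spelled digit by digit to keep the example short, and the real `hindi.py` may be organised differently.

```python
import re

# Placeholder vocabulary: replace with the target language's number words.
_DIGITS = ["zero", "one", "two", "three", "four",
           "five", "six", "seven", "eight", "nine"]


def num_to_words(n: int) -> str:
    """Toy version: spell a non-negative integer one digit at a time."""
    if n < 0:
        raise ValueError("negative numbers are not handled in this sketch")
    return " ".join(_DIGITS[int(d)] for d in str(n))


def preprocess(text: str) -> str:
    """Replace every run of digits in the text with its spoken form."""
    return re.sub(r"\d+", lambda m: num_to_words(int(m.group())), text)


print(preprocess("Platform 42"))
```

A real module would replace the digit-by-digit spelling with proper number, year, and date rules for the language, and the tests in step 4 can assert on exact output strings.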

---

## License

MIT
