Metadata-Version: 2.4
Name: thairom
Version: 0.1.1
Summary: Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.
Project-URL: Homepage, https://github.com/alexsears/thairom
Project-URL: Repository, https://github.com/alexsears/thairom
Project-URL: Issues, https://github.com/alexsears/thairom/issues
Author-email: Alex Sears <asears2@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: isan,lao,nlp,romanization,thai,transliteration
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Requires-Dist: pythainlp>=5.0
Description-Content-Type: text/markdown

# thairom

[![PyPI](https://img.shields.io/pypi/v/thairom)](https://pypi.org/project/thairom/)
[![Python](https://img.shields.io/pypi/pyversions/thairom)](https://pypi.org/project/thairom/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

Accurate Thai and Lao/Isan romanization for real-world text -- song lyrics, colloquial speech, and dialects.

## Installation

```bash
pip install thairom
```

## Quick Start

```python
from thairom import romanize

# Thai
print(romanize('สวัสดีครับ'))       # sawatdee krap
print(romanize('ขอบคุณมาก'))       # khop khun mak
print(romanize('หัวใจ'))           # hua jai
print(romanize('หก'))              # hok

# Lao/Isan
print(romanize('ฮักเจ้าหลาย', lang='lo'))  # hak jao laai
print(romanize('ม่วนคัก', lang='lo'))       # muan khak
```

## Features

- **Thai romanization** using pythainlp's royin engine with word-level corrections
- **Lao/Isan dialect support** for Isan text written in Thai script, applying Lao pronunciation rules such as initial r-to-l substitution
- **Word correction maps** that fix common pythainlp errors on colloquial vocabulary, song lyrics, and everyday phrases
- **Handles real-world text** -- tested against song lyrics, spoken Thai, and Isan dialect ground truth data
- **Clean output** -- strips leaked Thai/Lao characters and normalizes whitespace
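
The "clean output" step can be pictured with a small regex pass. This is a minimal sketch, not thairom's actual code: `clean_output` is a hypothetical helper name, and it assumes that "leaked" characters are any code points in the Unicode Thai (U+0E00 to U+0E7F) or Lao (U+0E80 to U+0EFF) blocks.

```python
import re

# The Thai and Lao Unicode blocks are adjacent, so one range covers both.
_THAI_LAO = re.compile(r"[\u0E00-\u0EFF]+")
_WHITESPACE = re.compile(r"\s+")


def clean_output(romanized: str) -> str:
    """Hypothetical helper: drop leftover Thai/Lao characters, tidy spaces."""
    # Replace each run of Thai/Lao characters with a space so words stay apart.
    without_script = _THAI_LAO.sub(" ", romanized)
    # Collapse any whitespace run to one space and trim both ends.
    return _WHITESPACE.sub(" ", without_script).strip()


print(clean_output("hua  ใจ jai\n"))  # hua jai
```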

## Why thairom instead of pythainlp alone?

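pythainlp's royin romanization engine is solid for formal Thai, but it struggles with colloquial speech, song lyrics, and regional dialects. thairom builds on pythainlp and corrects those cases. In the comparison below, rows where the pythainlp and thairom columns match are words that pythainlp already romanizes as expected: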

| Thai Text | pythainlp (royin) | thairom | Correct |
|-----------|-------------------|---------|---------|
| หัวใจ | hua chai | hua jai | hua jai |
| น้ำตา | nam ta | nam ta | nam ta |
| เข้าใจ | khao chai | khao jai | khao jai |
| หก | hok | hok | hok |
| ก็ | ko | kaw | kaw |
| เวลา | wela | welaa | welaa |
| ตลอดเวลา | talot wela | talod welaa | talod welaa |
| ขอบคุณ | khop khun | khop khun | khop khun |
| ฮักเจ้าหลาย | (no Isan support) | hak jao laai | hak jao laai |

thairom also handles Isan/Lao dialect written in Thai script, which pythainlp does not support at all.

## API Reference

### `romanize(text, lang='th')`

Top-level convenience function. Dispatches to `romanize_thai` or `romanize_lao` based on `lang`.

**Parameters:**
- `text` (str): Text to romanize.
- `lang` (str): `'th'` for Thai (default), `'lo'` for Lao/Isan.

**Returns:** Lowercase romanized string.
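
For illustration, dispatching on `lang` could look like the sketch below. It does not import thairom: the two `_stub` functions stand in for `romanize_thai` and `romanize_lao`, `romanize_sketch` is a hypothetical name, and raising `ValueError` for an unknown code is an assumption, since this README does not document that case.

```python
from typing import Callable, Dict


def _romanize_thai_stub(text: str) -> str:
    # Stand-in for thairom.romanize_thai; returns a tagged placeholder.
    return f"th:{text}"


def _romanize_lao_stub(text: str) -> str:
    # Stand-in for thairom.romanize_lao; returns a tagged placeholder.
    return f"lo:{text}"


_DISPATCH: Dict[str, Callable[[str], str]] = {
    "th": _romanize_thai_stub,
    "lo": _romanize_lao_stub,
}


def romanize_sketch(text: str, lang: str = "th") -> str:
    """Pick a romanizer by language code (illustrative only)."""
    try:
        handler = _DISPATCH[lang]
    except KeyError:
        # Assumption: how thairom treats unknown codes is not documented here.
        raise ValueError(f"unsupported lang: {lang!r}") from None
    return handler(text)


print(romanize_sketch("abc"))             # th:abc
print(romanize_sketch("abc", lang="lo"))  # lo:abc
```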

### `romanize_thai(text)`

Romanize Thai text using pythainlp with word-level corrections from `THAI_WORD_MAP`.

**Parameters:**
- `text` (str): Thai text to romanize.

**Returns:** Lowercase romanized string.

### `romanize_lao(text)`

Romanize Isan/Lao text written in Thai script. Applies Lao pronunciation rules (e.g., initial r becomes l) and word corrections from `LAO_WORD_MAP`.

**Parameters:**
- `text` (str): Isan/Lao text in Thai script.

**Returns:** Lowercase romanized string.
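
The r-to-l rule mentioned above can be sketched on already-romanized text. This is a simplification under stated assumptions: `initial_r_to_l` is a hypothetical helper, it only looks at the first letter of each space-separated word, and the real library may apply the rule at a different stage or at the syllable level.

```python
def initial_r_to_l(romanized: str) -> str:
    """Replace a word-initial 'r' with 'l' in space-separated romanized text."""
    words = romanized.split()
    return " ".join("l" + w[1:] if w.startswith("r") else w for w in words)


print(initial_r_to_l("rot fai"))  # lot fai
print(initial_r_to_l("khrap"))    # khrap (an 'r' inside a word is left alone here)
```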

### Word Maps

The correction maps are available as importable dictionaries for inspection or extension:

```python
from thairom.maps import THAI_WORD_MAP, LAO_WORD_MAP
```
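
One way to experiment with extra corrections without editing the package is to look words up in a merged dictionary and fall back to a romanizer for everything else. The sketch below is self-contained and makes several assumptions: `romanize_words` is a hypothetical helper, the input is already split into words, `base_map` is a stand-in for `THAI_WORD_MAP` (its real contents are not shown in this README), and the `lambda` fallback is a placeholder where a real call such as `thairom.romanize` could go.

```python
from typing import Callable, Iterable, Mapping


def romanize_words(
    words: Iterable[str],
    corrections: Mapping[str, str],
    fallback: Callable[[str], str],
) -> str:
    """Use a correction-map entry when one exists, else call the fallback."""
    return " ".join(
        corrections[word] if word in corrections else fallback(word)
        for word in words
    )


base_map = {"หัวใจ": "hua jai"}         # stand-in for THAI_WORD_MAP
custom_map = {**base_map, "ก็": "kaw"}  # extend without mutating the original
print(romanize_words(["ก็", "หัวใจ", "หก"], custom_map, lambda w: "?"))
# kaw hua jai ?
```

Building a new dictionary with `{**base_map, ...}` leaves the imported map untouched, which keeps local experiments from affecting other callers.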

## Contributing

Contributions are welcome, especially additions to the word correction maps. The maps were developed using an autoresearch pipeline that scores romanization output against ground truth data. If you find a word that romanizes incorrectly:

1. Add the word and its correct romanization to `THAI_WORD_MAP` or `LAO_WORD_MAP` in `src/thairom/maps.py`.
2. Add a test case to `tests/test_romanize.py`.
3. Run `pytest` to verify.
4. Submit a pull request.

## License

Released under the MIT License. See the `LICENSE` file for details.
