Metadata-Version: 2.4
Name: olaph
Version: 0.2.19
Summary: A multilingual phonemizer combining lexica, NLP, and probabilistic scoring for improved phonemization accuracy.
Author-email: Johannes Wirth <johannes.wirth.3@iisys.de>
License: MIT
Project-URL: Homepage, https://github.com/iisys-hof/olaph
Project-URL: Documentation, https://github.com/iisys-hof/olaph#readme
Project-URL: Issues, https://github.com/iisys-hof/olaph/issues
Keywords: phonemizer,text-to-speech,linguistics,NLP,multilingual
Classifier: Intended Audience :: Developers
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: inflect==7.5.0
Requires-Dist: spacy
Requires-Dist: lingua-language-detector==2.1.1
Requires-Dist: num2words==0.5.14
Requires-Dist: requests==2.32.5
Requires-Dist: annotated-types==0.7.0
Requires-Dist: blis==1.3.3
Requires-Dist: catalogue==2.0.10
Requires-Dist: certifi==2025.10.5
Requires-Dist: charset-normalizer==3.4.3
Requires-Dist: click==8.3.0
Requires-Dist: cloudpathlib==0.22.0
Requires-Dist: colorama==0.4.6
Requires-Dist: confection==0.1.5
Requires-Dist: cymem==2.0.11
Requires-Dist: docopt==0.6.2
Requires-Dist: idna==3.10
Requires-Dist: jinja2==3.1.6
Requires-Dist: langcodes==3.5.0
Requires-Dist: language-data==1.3.0
Requires-Dist: marisa-trie==1.3.1
Requires-Dist: markdown-it-py==4.0.0
Requires-Dist: markupsafe==3.0.3
Requires-Dist: mdurl==0.1.2
Requires-Dist: murmurhash==1.0.15
Requires-Dist: numpy==2.2.0
Requires-Dist: packaging==25.0
Requires-Dist: preshed==3.0.13
Requires-Dist: pygments==2.19.2
Requires-Dist: regex==2025.9.18
Requires-Dist: rich==14.1.0
Requires-Dist: setuptools==80.9.0
Requires-Dist: shellingham==1.5.4
Requires-Dist: smart-open==7.3.1
Requires-Dist: spacy-legacy==3.0.12
Requires-Dist: spacy-loggers==1.0.5
Requires-Dist: srsly==2.5.3
Requires-Dist: tqdm==4.67.1
Requires-Dist: typer==0.19.2
Requires-Dist: typing-inspection==0.4.2
Requires-Dist: typing-extensions==4.15.0
Requires-Dist: urllib3==2.7.0
Requires-Dist: wasabi==1.1.3
Requires-Dist: weasel==0.4.1
Requires-Dist: wrapt==1.17.3
Requires-Dist: pytest
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Dynamic: license-file

# OLaPh — Optimal Language Phonemizer

[![PyPI version](https://img.shields.io/pypi/v/olaph.svg?logo=pypi)](https://pypi.org/project/olaph/)
[![Python versions](https://img.shields.io/pypi/pyversions/olaph.svg)](https://pypi.org/project/olaph/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)

**OLaPh (Optimal Language Phonemizer)** is a multilingual phonemization framework that converts text into phonemes and, in evaluations on the WikiPron benchmark, is more accurate than comparable frameworks.

---

## NEWS
- 05/2026: The instruction fine-tuning dataset for OLaPhLLM is available [here](https://huggingface.co/datasets/iisys-hof/olaph-data-v2).
- 05/2026: A new version of OLaPhLLM is available [here](https://huggingface.co/iisys-hof/OLaPhLLM_v2).

---

## Overview

Traditional phonemizers rely on simple rule-based mappings or lexicon lookups.
Neural and hybrid approaches improve generalization but still struggle with:

- Names and foreign words
- Abbreviations and acronyms
- Loanwords and compounds
- Ambiguous homographs

**OLaPh** tackles these challenges by combining:

- Extensive **language-specific dictionaries**
- **Abbreviation, number, and letter normalization**
- **Compound resolution with probabilistic scoring**
- **Cross-language handling**
- **NLP-based preprocessing** via [spaCy](https://spacy.io) and [Lingua](https://github.com/pemistahl/lingua-py)

Evaluations on the WikiPron dataset show improved accuracy and robustness over existing phonemizers, including on out-of-vocabulary (OOV) words.

---

## Features

- Multilingual phonemization (DE, EN-US, EN-UK, FR, ES, NL, SV, DA, PL, IT, FI)
- Abbreviation and letter pronunciation dictionaries
- Number normalization
- Cross-language acronym detection
- Compound splitting with statistical scoring
- Freely available lexica for research and development derived from wiktionary.org.
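
To give a rough idea of what compound splitting with statistical scoring can look like, here is a minimal sketch. It is not OLaPh's actual implementation: the toy lexicon, its frequencies, the function name `split_compound`, and the `split_penalty` parameter are all hypothetical and chosen only for illustration. The sketch uses dynamic programming to find the segmentation into known lexicon entries with the highest summed log-frequency, charging a fixed penalty for every additional split.

```python
import math

# Hypothetical toy lexicon: word -> relative frequency (made-up numbers).
TOY_LEXICON = {
    "apfel": 0.02,
    "baum": 0.03,
    "garten": 0.02,
    "apfelbaum": 0.004,
}


def split_compound(word, lexicon=TOY_LEXICON, split_penalty=1.0):
    """Return the best-scoring list of lexicon entries covering `word`,
    or None if the word cannot be fully segmented."""
    word = word.lower()
    n = len(word)
    # best[i] holds (score, parts) for the best segmentation of word[:i].
    best = [None] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue  # the prefix word[:start] has no valid segmentation
            freq = lexicon.get(word[start:end])
            if freq is None:
                continue  # this piece is not a known word
            # Log-frequency of the piece, minus a penalty for each extra split.
            score = best[start][0] + math.log(freq)
            if start > 0:
                score -= split_penalty
            if best[end] is None or score > best[end][0]:
                best[end] = (score, best[start][1] + [word[start:end]])
    return None if best[n] is None else best[n][1]


print(split_compound("Apfelbaumgarten"))  # ['apfelbaum', 'garten']
```

In a pipeline of this kind, each resulting part would then be looked up in the pronunciation lexicon separately. The split penalty keeps the sketch from over-segmenting words that already exist as whole lexicon entries.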

## Large Language Model
An LLM based on OLaPh output is also available. It is a GemmaX 2B model trained on ~10M sentences derived from the FineWeb corpus and phonemized with the OLaPh framework.

Find it on [Hugging Face](https://huggingface.co/iisys-hof/OLaPhLLM_v2) (DE, EN, FR, US; training for additional languages is planned).

---

## Installation

### From PyPI

```bash
pip install olaph
```

spaCy models are downloaded on demand.

### From source

```bash
git clone https://github.com/iisys-hof/olaph.git
cd olaph
pip install -e .
```

## Example Usage

```python
from olaph import Olaph

phonemizer = Olaph()

output = phonemizer.phonemize_text("He ordered a Brezel and a beer in a tavern near München.", lang="en-us")

print(output)
```
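
For processing several inputs, a small helper along the following lines may be convenient. This is only a sketch: `phonemize_many` is a hypothetical name and not part of the OLaPh API. It assumes nothing beyond the `phonemize_text(text, lang=...)` call shown above. The only language code confirmed by that example is `"en-us"`, so codes for other languages should be checked against the project documentation.

```python
def phonemize_many(phonemizer, items):
    """Phonemize (text, lang) pairs one by one and return the results in order.

    `phonemizer` is any object exposing `phonemize_text(text, lang=...)`,
    for example an `Olaph()` instance as created in the example above.
    """
    results = []
    for text, lang in items:
        results.append(phonemizer.phonemize_text(text, lang=lang))
    return results
```

With the `phonemizer` object from the example above, `phonemize_many(phonemizer, [("He ordered a beer.", "en-us")])` would return a one-element list holding that sentence's phonemization.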

---

## Dependencies

- [spaCy](https://spacy.io)
- [Lingua](https://github.com/pemistahl/lingua-py)
- [num2words](https://github.com/savoirfairelinux/num2words)
- [inflect](https://github.com/jaraco/inflect)

---

## Dictionary Sources

- [Wiktionary Dumps](https://dumps.wikimedia.org/backup-index.html)
- [Neurlang Dataset](https://github.com/neurlang/dataset)


## Research Summary

Phonemization is a critical component in text-to-speech synthesis. Traditional approaches rely on deterministic transformations and lexica, while neural methods offer potential for higher generalization on out-of-vocabulary (OOV) terms. This work introduces OLaPh (Optimal Language Phonemizer), a hybrid framework that integrates extensive multilingual lexica with advanced NLP techniques and a statistical subword segmentation function. Evaluations on the WikiPron benchmark show that the OLaPh framework significantly outperforms established baselines in overall accuracy and maintains robustness on OOV data through advanced fallback mechanisms. To further explore neural generalization, we utilize the framework to synthesize a high-consistency training corpus for an instruction-tuned Large Language Model (LLM). While the deterministic framework remains more accurate overall, the LLM demonstrates strong generalization, matching or partly exceeding the framework’s performance. This suggests that the LLM successfully internalized phonetic intuitions from the synthetic data that transcend the framework’s capabilities. Together, these tools provide a comprehensive, open-source resource for multilingual G2P research.

---

## Citation

If you use OLaPh in academic work, please cite:

```bibtex
@misc{wirth2026olaphoptimallanguagephonemizer,
      title={OLaPh: Optimal Language Phonemizer}, 
      author={Johannes Wirth},
      year={2026},
      eprint={2509.20086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.20086}, 
}
```
