Metadata-Version: 2.4
Name: vocx
Version: 0.1.0
Summary: Transcribe Esperanto text into phonetic Polish for use in professional TTS engines.
Project-URL: Homepage, https://github.com/eugenzor/vocx
Project-URL: Repository, https://github.com/eugenzor/vocx
Author-email: eugenzor <3356454+eugenzor@users.noreply.github.com>
License-Expression: MIT
License-File: LICENSE
Keywords: esperanto,phonetic,polish,text-to-speech,transcription,tts
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.11
Description-Content-Type: text/markdown

# vocx

`vocx` transcribes Esperanto text into phonetic Polish for use in professional
text-to-speech (TTS) engines.

## Background

Commercial TTS engines tend not to support minority languages, particularly
constructed languages such as Esperanto. Esperanto, however, shares many of
its sounds with Polish. By transcribing Esperanto into Polish spelling, we can
make commercial TTS engines produce a good approximation of spoken Esperanto.

This is a Python port of the original
[Go library](https://github.com/martinrue/vocx).

## Installation

```bash
pip install vocx
```

## Usage

### Library

```python
from vocx import Transcriber

t = Transcriber()
t.transcribe("Ĉu vi ŝatas Esperanton? Esperanto estas facila lingvo.")
# "czu wij szatas esperanton? esperanto estas fatssila lijngwo."
```

### Custom rules

To override the default rules used during transcription, call `load_rules`,
passing a custom JSON rules document. See
[`src/vocx/default_rules.py`](src/vocx/default_rules.py) for the correct
structure.

```python
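from pathlib import Path

from vocx import Transcriber

# Assumption: the rules document is passed as a JSON string; here it is read
# from an example file named my_rules.json.
my_rules_json = Path("my_rules.json").read_text(encoding="utf-8")

t = Transcriber()
t.load_rules(my_rules_json)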
```

A rules document has four sections:

- `letters` — single-character substitutions (applied lowercased).
- `fragments` — ordered regular-expression replacements applied to each word.
- `overrides` — whole-word replacements (surrounding punctuation is preserved).
- `numbers` — the words used when transcribing numeric tokens.
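
To make the roles of the four sections more concrete, here is a minimal,
self-contained sketch of how such rules could be applied to a sentence. It does
not use `vocx` and is not its implementation: the `RULES` shape, the order in
which the sections are applied, the digit-by-digit number handling, and every
individual rule below are assumptions chosen for illustration, so the output
will differ from what `vocx` produces. See `default_rules.py` for the real
structure.

```python
import re

# Hypothetical rule set. The shapes and entries are illustrative assumptions,
# not the format or content of vocx's default rules.
RULES = {
    "letters": {"ĉ": "cz", "ŝ": "sz", "v": "w"},
    "fragments": [(r"aŭ", "ał")],  # ordered (pattern, replacement) pairs
    "overrides": {"ktp": "ko to po"},
    "numbers": {"0": "nul", "1": "unu", "2": "du", "3": "tri"},
}


def transcribe_word(token: str, rules: dict) -> str:
    # Split off leading and trailing punctuation so it can be preserved.
    lead, core, trail = re.fullmatch(r"(\W*)(.*?)(\W*)", token).groups()
    core = core.lower()
    if core in rules["overrides"]:
        # Whole-word replacement takes priority over everything else.
        out = rules["overrides"][core]
    elif core.isdigit():
        # Simplification: read numeric tokens digit by digit.
        out = " ".join(rules["numbers"].get(d, d) for d in core)
    else:
        # Single-character substitutions, then ordered regex replacements.
        out = "".join(rules["letters"].get(ch, ch) for ch in core)
        for pattern, replacement in rules["fragments"]:
            out = re.sub(pattern, replacement, out)
    return lead + out + trail


def transcribe(text: str, rules: dict) -> str:
    # Process whitespace-separated tokens independently.
    return " ".join(transcribe_word(tok, rules) for tok in text.split())


print(transcribe("Ankaŭ la ĉevalo havas 2 okulojn, ktp.", RULES))
# ankał la czewalo hawas du okulojn, ko to po.
```

Checking `overrides` before the other sections keeps a whole-word replacement
from being altered by letter or fragment rules; this ordering is a design
choice of the sketch, not a documented property of `vocx`.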

### Command line

```bash
# Transcribe arguments
vocx "Saluton, kiel vi fartas?"
# saluton, kijel wij fartas?

# Transcribe stdin
echo "Saluton" | vocx

# Use a custom rules file
vocx --rules my_rules.json "Saluton"
```

## Development

```bash
uv sync
uv run pytest
```

