Metadata-Version: 2.4
Name: md-llm-lang-reader
Version: 0.1.0
Summary: Generate language-learning HTML readers (with sentence-level LLM translations and TTS) from Markdown.
Author: Katsutoshi Seki
License: MIT
Project-URL: Homepage, https://github.com/sekika/langreader
Project-URL: Documentation, https://github.com/sekika/langreader
Project-URL: Source, https://github.com/sekika/langreader
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: multiai
Dynamic: license-file

# md-llm-lang-reader

Generate language-learning HTML readers from Markdown using an LLM:
- sentence-by-sentence splitting + translation
- one-click TTS playback for the source text (browser Web Speech API)
- fenced code blocks are preserved as code (not sent to the LLM)

The package is published on PyPI as `md-llm-lang-reader` and installs the `langreader` CLI command.

[![en](https://img.shields.io/badge/lang-en-red.svg)](https://github.com/sekika/langreader/blob/main/README.md)
[![ja](https://img.shields.io/badge/lang-ja-blue.svg)](https://github.com/sekika/langreader/blob/main/README-ja.md)
[![fr](https://img.shields.io/badge/lang-fr-green.svg)](https://github.com/sekika/langreader/blob/main/README-fr.md)

## Features

- Markdown → HTML (simple headings + paragraphs)
- LLM-assisted sentence splitting (natural sentence boundaries)
- Sentence-level translations (each source sentence paired with its translation)
- TTS button per source sentence
- Fenced code blocks (``` or ~~~) are emitted as `<pre><code>` and are not sent to the LLM
- Bullet lists are translated (no special handling; they are passed to the LLM as plain text)

## Installation

```bash
pip install md-llm-lang-reader
```

## Quick start

Create `input.md`:

````md
# Example

Bonjour ! Ceci est un court paragraphe.

```python
# Code blocks are not translated.
print("Hello")
```

- Premier point
- Deuxième point
````

Generate `output.html`:

```bash
langreader \
  -i input.md \
  -o output.html \
  --src fr \
  --tgt en \
  --provider YOUR_PROVIDER \
  --model YOUR_MODEL
```

Open the generated HTML in your browser and click the speaker buttons.

## CLI usage

```bash
langreader -i INPUT.md -o OUTPUT.html --src SRC --tgt TGT --provider PROVIDER --model MODEL [-v 0|1|2|3]
```

### Options

- `-i, --input` (required)  
  Input Markdown file path.

- `-o, --output` (required)  
  Output HTML file path.

- `--src` (default: `fr`)  
  Source language code (e.g. `fr`, `de`, `es`, `ja`).

- `--tgt` (default: `en`)  
  Target language code.

- `--provider` (required)  
  Provider name passed to `multiai` (depends on your `multiai` configuration).

- `--model` (required)  
  Model name passed to `multiai`.

- `-v, --verbose` (default: `1`)  
  Controls terminal output:
  - `0`: silent
  - `1`: headings only
  - `2`: paragraph preview (first ~5 words)
  - `3`: full original paragraph text
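
The verbosity levels above can be pictured with a small helper. This is only an illustrative sketch, not the package's code: the function name `progress_line`, the `kind` labels, and the trailing `...` on truncated previews are assumptions made for the example.

```python
def progress_line(kind, text, verbose=1, preview_words=5):
    """Return the line to print for a block, or None to print nothing.

    kind is "heading" or "paragraph" (hypothetical labels for this sketch).
    """
    if verbose <= 0:
        return None  # level 0: silent
    if kind == "heading":
        return text  # headings are shown at levels 1, 2 and 3
    if verbose == 2:
        # Level 2: preview made of the first few words of the paragraph.
        words = text.split()
        preview = " ".join(words[:preview_words])
        return preview + (" ..." if len(words) > preview_words else "")
    if verbose >= 3:
        return text  # level 3: full original paragraph
    return None  # level 1: paragraphs are not shown
```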

### Examples

French → English:

```bash
langreader -i alsace.md -o alsace.html --src fr --tgt en --provider ... --model ...
```

German → English:

```bash
langreader -i berlin.md -o berlin.html --src de --tgt en --provider ... --model ...
```

Japanese → English:

```bash
langreader -i news.md -o news.html --src ja --tgt en --provider ... --model ...
```
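
To convert several files in one go, the CLI can be driven from a short Python script. The sketch below uses only the options documented above; the helper names `build_command` and `convert_all` are made up for this example, and writing each output next to its input with an `.html` suffix is an assumption.

```python
import subprocess
from pathlib import Path


def build_command(md_path, src, tgt, provider, model, verbose=1):
    """Return the langreader argument list for one Markdown file."""
    md_path = Path(md_path)
    return [
        "langreader",
        "-i", str(md_path),
        "-o", str(md_path.with_suffix(".html")),  # assumed output location
        "--src", src,
        "--tgt", tgt,
        "--provider", provider,
        "--model", model,
        "-v", str(verbose),
    ]


def convert_all(folder, src, tgt, provider, model):
    """Run langreader once per *.md file in folder, stopping on the first failure."""
    for md_path in sorted(Path(folder).glob("*.md")):
        subprocess.run(build_command(md_path, src, tgt, provider, model), check=True)
```

For example, `convert_all("texts", "fr", "en", "YOUR_PROVIDER", "YOUR_MODEL")` would process every Markdown file in a `texts` directory.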

## How it works

For each paragraph, the tool asks the LLM to:

1. Split the paragraph into natural sentences, without breaking at abbreviations.
2. Translate each sentence into the target language.
3. Return only valid JSON in this schema:

```json
[
  { "src": "…", "tgt": "…" }
]
```

The tool parses and validates the JSON, then generates HTML in which each pair appears as:

- source sentence + TTS button
- translated sentence below it
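
The validation step might look roughly like the following. It is a minimal sketch under the schema shown above, not the tool's actual implementation; the function name `parse_pairs` and the error messages are assumptions.

```python
import json


def parse_pairs(raw):
    """Parse an LLM reply and return a list of (src, tgt) tuples.

    Raises ValueError if the reply is not a JSON array of objects
    that each carry string "src" and "tgt" fields.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    pairs = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            raise ValueError(f"item {i} is not an object")
        src, tgt = item.get("src"), item.get("tgt")
        if not isinstance(src, str) or not isinstance(tgt, str):
            raise ValueError(f"item {i} needs string 'src' and 'tgt'")
        pairs.append((src, tgt))
    return pairs


reply = '[{"src": "Bonjour !", "tgt": "Hello!"}]'
print(parse_pairs(reply))  # → [('Bonjour !', 'Hello!')]
```

Rejecting malformed replies early keeps a single bad response from producing broken HTML; a caller could catch the `ValueError` and retry the paragraph.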

## Notes on Text-to-Speech (TTS)

- TTS uses the browser’s Web Speech API (`speechSynthesis`).
- Voice availability depends on the OS/browser. Some environments may have limited voices for certain languages.
- The tool sets the utterance language to the `--src` value (e.g. `fr`). To use a specific locale (e.g. `fr-FR`), edit the generated HTML for now; a future CLI option could expose this.

## Markdown support (current)

Supported:
- Headings: `#`, `##`, `###`, `####`
- Paragraphs: consecutive non-empty lines are joined with spaces
- Fenced code blocks: ``` or ~~~ (any info string is allowed)

Not yet supported (passed through as plain text, with no special parsing):
- Blockquotes, tables, images
- Inline formatting (links, emphasis)
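
The supported subset is small enough to sketch as a line-based block splitter. The code below is an illustration under the rules listed above, not the package's parser; `parse_blocks` and its tuple layout are hypothetical, and the fence-closing rule (same character, at least as long as the opening fence) is an assumption borrowed from CommonMark.

```python
import re

HEADING = re.compile(r"^(#{1,4})\s+(.*)$")
FENCE = re.compile(r"^(`{3,}|~{3,})")


def parse_blocks(text):
    """Split Markdown into ("heading", level, text), ("para", text), ("code", text)."""
    blocks, para, code, fence = [], [], [], None

    def flush_para():
        # Consecutive non-empty lines are joined with spaces.
        if para:
            blocks.append(("para", " ".join(para)))
            para.clear()

    for line in text.splitlines():
        if fence is not None:
            # Inside a code block: only a matching fence closes it.
            stripped = line.strip()
            if stripped and set(stripped) == {fence[0]} and len(stripped) >= len(fence):
                blocks.append(("code", "\n".join(code)))
                code, fence = [], None
            else:
                code.append(line)
            continue
        m = FENCE.match(line)
        if m:
            flush_para()
            fence = m.group(1)  # any info string after the fence is ignored
            continue
        h = HEADING.match(line)
        if h:
            flush_para()
            blocks.append(("heading", len(h.group(1)), h.group(2).strip()))
        elif line.strip():
            para.append(line.strip())
        else:
            flush_para()
    flush_para()
    if fence is not None:
        # Unclosed fence: keep what was collected as code.
        blocks.append(("code", "\n".join(code)))
    return blocks
```

In this sketch only `para` blocks would go to the LLM; `code` blocks would be emitted as `<pre><code>` unchanged, and bullet lists fall through as ordinary paragraph text.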

If you need richer Markdown rendering, consider adding a Markdown parser and preserving a mapping between original text and rendered HTML.

## Security

This tool escapes text embedded into HTML and does not inline arbitrary text into `onclick` handlers.
TTS buttons store text in `data-speak="..."` attributes and use JS event listeners, which is safer and avoids quoting issues.

Still, treat generated HTML as untrusted if your input Markdown is untrusted.
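
The escaping approach described above can be sketched with the standard library's `html.escape`. The markup, class names, and the `render_pair` function are hypothetical and only illustrate the idea of keeping sentence text in a `data-speak` attribute instead of an inline handler.

```python
import html


def render_pair(src, tgt, lang="fr"):
    """Render one sentence pair; all text is escaped before it reaches HTML."""
    src_attr = html.escape(src, quote=True)  # safe inside a double-quoted attribute
    return (
        '<div class="pair">'
        f'<p class="src" lang="{html.escape(lang, quote=True)}">{html.escape(src)}'
        f' <button type="button" class="speak" data-speak="{src_attr}">Speak</button></p>'
        f'<p class="tgt">{html.escape(tgt)}</p>'
        "</div>"
    )
```

Because the sentence never appears inside JavaScript source, quotes or angle brackets in the input cannot break out of a handler; a script would read `data-speak` from the clicked button at run time.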

## Development

Clone the repository, then install it in editable mode from the repository root:

```bash
pip install -e .
```

Run tests:

```bash
pytest
```

Build the package (requires the `build` package):

```bash
python -m build
```

## License

MIT
