Metadata-Version: 2.3
Name: words-to-readlang
Version: 1.1.0
Summary: Convert vocabulary exports from Pod101, Language Reactor, and more to Readlang CSV format
Keywords: vocabulary,language-learning,readlang,csv,converter
Author: Psy-Q
Author-email: Psy-Q <rca@psy-q.ch>
License: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education
Classifier: Topic :: Text Processing :: Filters
Classifier: Topic :: Utilities
Requires-Dist: typer>=0.9.0
Requires-Dist: typing-extensions>=4.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: pytest>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0 ; extra == 'dev'
Requires-Dist: pytest-mock>=3.10.0 ; extra == 'dev'
Requires-Dist: pytest-flask>=1.3.0 ; extra == 'dev'
Requires-Dist: black>=23.0.0 ; extra == 'dev'
Requires-Dist: ruff>=0.1.0 ; extra == 'dev'
Requires-Dist: flask>=2.3.0 ; extra == 'web'
Requires-Dist: flask-sqlalchemy>=3.0.0 ; extra == 'web'
Requires-Dist: flask-migrate>=4.0.0 ; extra == 'web'
Requires-Dist: sqlalchemy>=2.0.0 ; extra == 'web'
Requires-Python: >=3.10
Project-URL: Homepage, https://codeberg.org/psy-q/words-to-readlang
Project-URL: Repository, https://codeberg.org/psy-q/words-to-readlang
Project-URL: Issues, https://codeberg.org/psy-q/words-to-readlang/issues
Provides-Extra: dev
Provides-Extra: web
Description-Content-Type: text/markdown

# words-to-readlang

A Python library and CLI tool for converting vocabulary exports from language learning platforms (**Pod101**, **Language Reactor**, etc.) into the CSV format accepted by [Readlang's word import](https://readlang.com/importWords).

It also comes with a **web interface** for uploading, previewing, editing, and exporting vocabulary. The web interface can fetch example sentences automatically from [Tatoeba](https://tatoeba.org) and the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de).

## Background

[Readlang](https://readlang.com) is a reading-focused vocabulary tool with spaced-repetition flashcards. It supports [importing words via CSV](https://forum.readlang.com/t/secret-csv-word-import-feature-available/138), but the format it expects is specific, and the exports from other learning platforms don't match it out of the box. This library bridges that gap.

## Installation

```bash
pip install words-to-readlang
```

Or install from source:

```bash
git clone https://codeberg.org/psy-q/words-to-readlang
cd words-to-readlang
pip install -e .
```

To use the web interface, install with the `web` extras:

```bash
pip install -e ".[web]"
```

## Supported Input Formats

### Pod101 (FinnishPod101, SpanishPod101, etc.)

[Pod101 language learning sites](https://www.finnishpod101.com) let Premium subscribers save words to a Word Bank, which can be exported as a CSV file. The export is a simple two-column file (`Word`, `English`) encoded in UTF-16. The converter detects this encoding automatically.
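For readers curious what reading such a file involves, here is a minimal sketch that picks the codec from the byte-order mark and returns `(word, translation)` pairs. It is not this library's actual parser: the function name `read_two_column_export`, the comma delimiter, and the unconditional skipping of the header row are all assumptions made for illustration.

```python
import csv
import io


def read_two_column_export(raw: bytes, delimiter: str = ",") -> list[tuple[str, str]]:
    """Decode a two-column export, choosing the codec from the byte-order mark.

    Hypothetical helper for illustration; the delimiter is an assumption.
    """
    if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = raw.decode("utf-16")     # the utf-16 codec consumes the BOM
    else:
        text = raw.decode("utf-8-sig")  # fallback: UTF-8 with or without a BOM
    rows = list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))
    # Skip the header row ("Word", "English") and any rows with fewer than two cells.
    return [(r[0].strip(), r[1].strip()) for r in rows[1:] if len(r) >= 2]


sample = "Word,English\nkissa,cat\nkoira,dog\n".encode("utf-16")
print(read_two_column_export(sample))  # [('kissa', 'cat'), ('koira', 'dog')]
```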

### Language Reactor

[Language Reactor](https://www.languagereactor.com) is a browser extension for learning languages through Netflix and YouTube. Pro subscribers can save words and [export them](https://www.languagereactor.com/help/export) as a tab-separated file from the Saved Items panel. This format includes the base/dictionary form, translations, and subtitle sentences.

## CLI Usage

### Commands

```
words-to-readlang convert      Convert a vocabulary file to Readlang CSV
words-to-readlang serve        Start the web interface
words-to-readlang list-formats List all available input formats
```

### Basic Conversion

```bash
# From Pod101
words-to-readlang convert input.csv output.csv --format pod101

# From Language Reactor
words-to-readlang convert saved.csv output.csv --format languagereactor

# Short alias
words-to-readlang convert saved.csv output.csv --format lr
```

If your input has more than 200 words (Readlang's import limit), the output is automatically split into multiple files: `output (part 1).csv`, `output (part 2).csv`, etc.
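The splitting and naming scheme can be sketched as follows. This is an illustration of the behaviour described above, not the library's own code; `chunk_entries` and `split_output_paths` are hypothetical names.

```python
from pathlib import Path

MAX_WORDS = 200  # Readlang's per-file import limit, as stated above


def chunk_entries(entries: list, limit: int = MAX_WORDS) -> list[list]:
    """Split a list into consecutive chunks of at most `limit` items."""
    return [entries[i:i + limit] for i in range(0, len(entries), limit)]


def split_output_paths(output_path: Path, n_entries: int, limit: int = MAX_WORDS) -> list[Path]:
    """Return one path if everything fits, otherwise 'name (part N).ext' paths."""
    n_parts = max(1, -(-n_entries // limit))  # ceiling division
    if n_parts == 1:
        return [output_path]
    return [
        output_path.with_name(f"{output_path.stem} (part {i}){output_path.suffix}")
        for i in range(1, n_parts + 1)
    ]


print([p.name for p in split_output_paths(Path("output.csv"), 450)])
# ['output (part 1).csv', 'output (part 2).csv', 'output (part 3).csv']
```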

### Example Sentence Lookup

Language Reactor saves words in their inflected form as found in subtitles. Readlang requires the example sentence to contain the exact word, so when a sentence does not contain the word it is paired with, the converter detects the mismatch and clears the context.
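A mismatch check of this kind might look like the sketch below. The function name `clear_mismatched_context` is hypothetical, and both the whole-word matching and the case-insensitive comparison are assumptions; the library's actual rule may differ.

```python
import re


def clear_mismatched_context(word: str, context: str) -> str:
    """Return the context if it contains `word` as a whole word, else an empty string."""
    # Lookarounds instead of \b so multi-word phrases and punctuation behave sensibly.
    pattern = rf"(?<!\w){re.escape(word)}(?!\w)"
    return context if re.search(pattern, context, flags=re.IGNORECASE) else ""


print(repr(clear_mismatched_context("kissa", "Kissa nukkuu.")))        # 'Kissa nukkuu.'
print(repr(clear_mismatched_context("juosta", "Hän juoksi kotiin.")))  # ''
```

Matching on whole words matters for inflecting languages: a plain substring test would accept `talo` inside `talossa`, although the sentence does not contain the exact word.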

The `--fetch` flag fetches replacement example sentences automatically. It tries [Tatoeba](https://tatoeba.org) first, then falls back to the [Leipzig Corpora Collection](https://wortschatz.uni-leipzig.de) if Tatoeba has no match.

```bash
words-to-readlang convert input.csv output.csv --format languagereactor --fetch
```

This is also useful for Pod101 exports, which never include example sentences.
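The Tatoeba-then-Leipzig fallback amounts to trying each source in order and keeping the first hit. The sketch below shows that pattern with stand-in functions instead of real network calls. `first_example`, `fake_tatoeba`, and `fake_leipzig` are hypothetical names and do not reflect the library's internals.

```python
from typing import Callable, Optional, Sequence

# A fetcher takes (word, language code) and returns a sentence or None.
Fetcher = Callable[[str, str], Optional[str]]


def first_example(word: str, lang: str, sources: Sequence[Fetcher]) -> Optional[str]:
    """Try each source in order and return the first non-empty sentence."""
    for fetch in sources:
        sentence = fetch(word, lang)
        if sentence:
            return sentence
    return None


def fake_tatoeba(word: str, lang: str) -> Optional[str]:
    return None  # stand-in: pretend the first source has no match


def fake_leipzig(word: str, lang: str) -> Optional[str]:
    return f"Esimerkki: {word}."  # stand-in: the fallback source finds something


print(first_example("kissa", "fin", [fake_tatoeba, fake_leipzig]))  # Esimerkki: kissa.
```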

By default the source language is Finnish (`--lang fin`). Pass the [ISO 639-3 code](https://tatoeba.org/en/languages/index) for other languages:

```bash
words-to-readlang convert words.csv output.csv --format pod101 --fetch --lang swe
```

### `convert` options

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--format` | `-f` | *(required)* | Input format: `pod101`, `languagereactor`, or `lr` |
| `--fetch` | | off | Fetch missing examples from Tatoeba / Leipzig |
| `--lang` | | `fin` | ISO 639-3 source language code |
| `--version` | | | Print version and exit |

### List Available Formats

```bash
words-to-readlang list-formats
```

### Web Interface

```bash
# Development server (localhost only)
words-to-readlang serve

# Custom port
words-to-readlang serve --port 8080

# Bind to all interfaces with auto-reload
words-to-readlang serve --host 0.0.0.0 --port 8080 --debug
```

`serve` options:

| Option | Short | Default | Description |
|--------|-------|---------|-------------|
| `--host` | `-h` | `127.0.0.1` | Address to bind to |
| `--port` | `-p` | `5000` | Port to listen on |
| `--debug` | | off | Enable Flask debug mode with auto-reload |

See [README_WEB.md](README_WEB.md) for full web interface and Docker deployment documentation.

## Programmatic API

```python
from pathlib import Path
from words_to_readlang import Converter

# Basic usage
converter = Converter()
output_files = converter.convert(
    input_path=Path("words.csv"),
    output_path=Path("readlang.csv"),
    parser_name="pod101"
)

# With example sentence lookup
converter = Converter(use_tatoeba=True, src_lang="fin")
output_files = converter.convert(
    input_path=Path("saved_items.csv"),
    output_path=Path("readlang.csv"),
    parser_name="languagereactor"
)

print(f"Created {len(output_files)} file(s)")
```

### Advanced: Custom Processing

```python
from pathlib import Path
from words_to_readlang import get_parser, ReadlangWriter, ExampleFetcher

# Parse input
parser = get_parser("languagereactor")
entries = parser.parse(Path("saved.csv"))

# Filter or modify entries
entries = [e for e in entries if len(e.word) > 3]

# Write output with example lookup
fetcher = ExampleFetcher(delay=1.0, verbose=True)
writer = ReadlangWriter(example_fetcher=fetcher, src_lang="fin")
output_files = writer.write(entries, Path("output.csv"), use_tatoeba=True)
```

### Adding Custom Parsers

```python
from pathlib import Path
from typing import List
from words_to_readlang import Entry, register_parser

@register_parser("myformat")
class MyFormatParser:
    @property
    def name(self) -> str:
        return "myformat"

    @property
    def description(self) -> str:
        return "My custom vocabulary format"

    def parse(self, file_path: Path) -> List[Entry]:
        entries = []
        # ... parse file ...
        return entries
```

```bash
words-to-readlang convert input.txt output.csv --format myformat
```

## Readlang CSV Format Reference

| Column | Content | Notes |
|--------|---------|-------|
| 1 | Word or phrase | ` / ` separates synonyms |
| 2 | Translation | ` / ` separates alternatives |
| 3 | Context sentence *(optional)* | Must contain the exact word from column 1 |
| 4 | Practice interval in days *(optional)* | |
| 5 | Next practice date *(optional)* | `YYYY-MM-DD` |

- No header row
- No newlines within cells
- Maximum 200 words per file
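As an illustration of these rules, the following sketch serializes rows with Python's standard `csv` module: no header, whitespace (including newlines) collapsed inside each cell, and an error if the 200-word limit is exceeded. `to_readlang_csv` is a hypothetical helper, not part of this package's API, and the quoting of cells that contain commas is simply the `csv` module's default behaviour.

```python
import csv
import io

MAX_WORDS = 200  # per-file limit from the list above


def to_readlang_csv(rows: list[tuple]) -> str:
    """Serialize rows (word, translation[, context, interval, date]) without a header."""
    if len(rows) > MAX_WORDS:
        raise ValueError(f"at most {MAX_WORDS} words per file, got {len(rows)}")
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        # Collapse all runs of whitespace so no cell contains a newline.
        writer.writerow([" ".join(str(cell).split()) for cell in row])
    return buf.getvalue()


print(to_readlang_csv([("kissa", "cat", "Kissa\nnukkuu.")]), end="")
# kissa,cat,Kissa nukkuu.
```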

## Development

```bash
git clone https://codeberg.org/psy-q/words-to-readlang
cd words-to-readlang
pip install -e ".[dev]"

# Run tests
pytest

# Type check
ty check

# Format / lint
black src tests
ruff check src tests
```

## License

MIT
