Metadata-Version: 2.4
Name: scrape-forvo
Version: 1.1.1
Summary: Download pronunciation MP3s from Forvo search pages.
Requires-Python: >=3.13
Description-Content-Type: text/markdown
Requires-Dist: playwright>=1.58.0
Requires-Dist: requests>=2.32.5
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: responses>=0.25.0; extra == "dev"

# scrape-forvo

Download pronunciation MP3s from Forvo search pages.

## Installation

```bash
uv run python -m pip install -e .
```

## Usage

Only this command is confirmed to work reliably:

```bash
scrape-forvo egg --use-playwright --headed
```

By default, the scraper uses Forvo language code `no` and downloads files using
`no` as the filename prefix. You can change the language code with `--lang`:

```bash
scrape-forvo egg --lang en --use-playwright --headed
```

If you want a custom filename prefix, pass `--prefix` (this overrides the default
language-based prefix):

```bash
scrape-forvo egg --lang en --prefix myset --use-playwright --headed
```
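
The prefix rule described above can be sketched in a few lines. This is an illustration only, not the package's actual code: `resolve_prefix` is a hypothetical helper, and treating an empty prefix the same as an omitted one is an assumption.

```python
from typing import Optional


def resolve_prefix(lang: str = "no", prefix: Optional[str] = None) -> str:
    """Return the filename prefix: an explicit prefix wins, else the language code."""
    # Hypothetical helper; the real package may resolve the prefix differently.
    return prefix if prefix else lang
```

With this sketch, `resolve_prefix("en")` falls back to the language code, while `resolve_prefix("en", "myset")` uses the custom prefix.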

## Forvo Language Codes (YAML)

The pairs below were collected from `https://forvo.com/` language links (plus `no` from the homepage language menu: `Norsk`).

```yaml
forvo_language_codes:
  ar: Arabic
  ca: Catalan
  chm: Mari
  cs: Czech
  de: German
  el: Greek
  en: English
  eo: Esperanto
  es: Spanish
  fa: Persian
  fi: Finnish
  fr: French
  grc: Ancient Greek
  ha: Hausa
  he: Hebrew
  hu: Hungarian
  it: Italian
  ja: Japanese
  ko: Korean
  lb: Luxembourgish
  nl: Dutch
  no: Norwegian
  pl: Polish
  pt: Portuguese
  ru: Russian
  sk: Slovak
  sv: Swedish
  tr: Turkish
  tt: Tatar
  uk: Ukrainian
  yue: Cantonese
  zh: Mandarin Chinese
```
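
If you want to check a code before passing it to `--lang`, a lookup along these lines may help. It is a sketch under assumptions: `FORVO_LANGUAGE_CODES` and `language_name` are hypothetical names, the dictionary is only an excerpt of the table above, and the package itself may not validate codes at all.

```python
# Excerpt of the table above; extend it with the remaining pairs as needed.
FORVO_LANGUAGE_CODES = {
    "de": "German",
    "en": "English",
    "ja": "Japanese",
    "no": "Norwegian",
    "zh": "Mandarin Chinese",
}


def language_name(code: str) -> str:
    """Return the language name for a code, or raise ValueError if it is not listed."""
    try:
        return FORVO_LANGUAGE_CODES[code]
    except KeyError:
        # Hypothetical error message; not taken from the package.
        raise ValueError(f"unknown Forvo language code: {code!r}") from None
```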

## Scriptable Usage

You can also import `scrape_forvo` and use it from Python:

```python
from scrape_forvo import scrape

result = scrape(
    "egg",
    outdir="forvo_mp3",
    lang="no",
    use_playwright=True,
    headed=True,
)

print(result.downloaded_count)
for candidate in result.candidates:
    print(candidate.url, "->", candidate.out_path)
```

The `scrape()` arguments map directly to CLI flags, so both interfaces share the same behavior without duplicated logic.
Internally, the search URL is built as `https://forvo.com/search/<word>/<lang>/` (default `lang="no"`).
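
The URL pattern above could be reproduced roughly as follows. `build_search_url` is a hypothetical helper, not part of the package's documented API, and percent-encoding the word is an assumption about how non-ASCII or multi-word queries should be handled.

```python
from urllib.parse import quote


def build_search_url(word: str, lang: str = "no") -> str:
    """Build a Forvo search URL of the form https://forvo.com/search/<word>/<lang>/."""
    # Assumption: the word is percent-encoded so spaces and non-ASCII
    # characters are safe to place in the URL path.
    return f"https://forvo.com/search/{quote(word, safe='')}/{lang}/"
```

For example, `build_search_url("egg")` yields `https://forvo.com/search/egg/no/`.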

## Development

Set up the project virtual environment with uv:

```bash
uv sync
```

Then activate the environment so that subsequent commands run inside it:

```bash
source .venv/bin/activate
```

Install dev dependencies:

```bash
python -m pip install -e ".[dev]"
```

Run tests:

```bash
pytest
```

### Optional live test

Set `FORVO_LIVE_TEST=1` to enable the live integration test.
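
An environment-variable gate like this is commonly implemented with a small check. The sketch below shows one way it might look; `live_test_enabled` is a hypothetical helper, and requiring the exact value `"1"` is an assumption, since the project's actual check is not shown here.

```python
import os
from typing import Mapping, Optional


def live_test_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when FORVO_LIVE_TEST is set to "1"."""
    # Accept an explicit mapping so the check can be exercised without
    # modifying the real process environment.
    env = os.environ if environ is None else environ
    return env.get("FORVO_LIVE_TEST") == "1"
```

A live test could then be decorated with `@pytest.mark.skipif(not live_test_enabled(), reason="set FORVO_LIVE_TEST=1 to run")` so that it is skipped by default.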

## TODO

edge cases
- [ ] When multiple pronunciation files are found, decide which one to pick.
- [ ] Handle the case where no pronunciation is available.

integration
- [ ] integration with the vocab repo
