Metadata-Version: 2.4
Name: harnais-web-extractor
Version: 0.1.1
Summary: Two-stage web article extractor (trafilatura + Playwright) with YouTube transcript fetching.
Author-email: John Linotte <contact@harnais.be>
License: MIT
Project-URL: Homepage, https://harnais.be
Project-URL: Repository, https://github.com/JohnLinotte/web-article-extractor
Keywords: scraping,article,extraction,trafilatura,playwright,youtube,transcript
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: trafilatura
Requires-Dist: playwright
Provides-Extra: youtube
Requires-Dist: yt-dlp; extra == "youtube"
Provides-Extra: whisper
Requires-Dist: faster-whisper; extra == "whisper"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Dynamic: license-file

# web-article-extractor

A small, dependency-light toolkit for pulling readable content off the web. It
extracts article text from any URL using a two-stage strategy — `trafilatura`
first (fast, no browser), then a headless Playwright Chromium fallback for
JavaScript-heavy pages — and fetches transcripts from YouTube videos through a
manual-then-automatic subtitle cascade.
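
Both the article strategy and the subtitle cascade follow the same "first usable result wins" pattern. The snippet below is a minimal sketch of that pattern, not the package's actual implementation: `first_successful`, `fast_stage`, and `browser_stage` are hypothetical names, and the two stages are stand-in functions rather than real `trafilatura` or Playwright calls.

```python
from typing import Callable, Optional

# A stage takes a URL and returns extracted text, or None if it found nothing.
Stage = Callable[[str], Optional[str]]


def first_successful(url: str, stages: list[tuple[str, Stage]]) -> Optional[dict]:
    """Try each (name, stage) pair in order; return the first non-empty result."""
    for name, stage in stages:
        try:
            text = stage(url)
        except Exception:
            # A failing stage should not abort the chain; fall through to the next.
            text = None
        if text and text.strip():
            return {"url": url, "content": text, "source_method": name}
    return None  # every stage failed


# Stand-in stages, for illustration only (hypothetical).
def fast_stage(url: str) -> Optional[str]:
    return None  # pretend the page only renders with JavaScript


def browser_stage(url: str) -> Optional[str]:
    return "Rendered article text"


result = first_successful(
    "https://example.com/some-article",
    [("fast", fast_stage), ("browser", browser_stage)],
)
print(result["source_method"])  # prints "browser"
```

Keeping the cheap stage first means the browser is only launched when the fast path returns nothing usable.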

> **Naming.** The PyPI distribution is `harnais-web-extractor` (this repository
> keeps the name `web-article-extractor`). The Python import package is
> `web_article_extractor`.

## Install

From PyPI:

```bash
pip install harnais-web-extractor
# Playwright also needs a browser binary the first time:
playwright install chromium
# YouTube transcript extraction requires yt-dlp (optional extra):
pip install "harnais-web-extractor[youtube]"
```

Or from source (GitHub):

```bash
pip install git+https://github.com/JohnLinotte/web-article-extractor.git
```

## Usage

### Command line

```bash
# Extract an article as Markdown (default):
python -m web_article_extractor https://example.com/some-article

# As JSON:
python -m web_article_extractor https://example.com/some-article --format json

# A YouTube URL fetches the transcript instead:
python -m web_article_extractor "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
```

> **YouTube transcripts require `yt-dlp`.** Install it via the `youtube` extra
> (`pip install "harnais-web-extractor[youtube]"`) or provide any `yt-dlp`
> binary on your `PATH`. Without it, `fetch_transcript()` cannot run.
>
> **Optional: Whisper fallback** for videos without subtitles — install
> `faster-whisper` via the `whisper` extra:
> `pip install "harnais-web-extractor[whisper]"`.
>
> **yt-dlp tuning** (optional env vars, all empty by default so the package
> works on any machine):
> - `YT_DLP_COOKIES_FROM_BROWSER=firefox` — pass `--cookies-from-browser firefox`
>   to yt-dlp (needed for age-restricted or members-only videos).
> - `YT_DLP_JS_RUNTIME=node` — pass `--js-runtime node`.
> - `YT_DLP_BIN=/path/to/yt-dlp` — override the binary location.
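
As a rough illustration of how these variables could translate into a `yt-dlp` command line, here is a sketch that assumes each variable maps one-to-one onto the flag named above. `build_yt_dlp_command` is a hypothetical helper, not part of the package, and the real command likely carries additional subtitle options that are omitted here.

```python
import os
from collections.abc import Mapping
from typing import Optional


def build_yt_dlp_command(url: str, env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Assemble a yt-dlp argv list from the optional tuning variables."""
    env = os.environ if env is None else env

    # Empty or unset variables fall back to plain "yt-dlp" on PATH.
    cmd = [env.get("YT_DLP_BIN") or "yt-dlp"]

    browser = env.get("YT_DLP_COOKIES_FROM_BROWSER")
    if browser:
        cmd += ["--cookies-from-browser", browser]

    runtime = env.get("YT_DLP_JS_RUNTIME")
    if runtime:
        # Flag spelled as in the list above.
        cmd += ["--js-runtime", runtime]

    cmd.append(url)
    return cmd


# Passing an explicit mapping keeps the example independent of the real environment.
print(build_yt_dlp_command(
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    {"YT_DLP_COOKIES_FROM_BROWSER": "firefox"},
))
```

With every variable empty, the sketch produces just `yt-dlp <url>`, which matches the "works on any machine" default.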

### Python API

```python
from web_article_extractor import extract_article, fetch_transcript, is_youtube_url

result = extract_article("https://example.com/some-article")
if result:
    print(result["title"])
    print(result["content"])      # Markdown
    print(result["word_count"])

url = "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
if is_youtube_url(url):
    transcript = fetch_transcript(url)
    if transcript:
        print(transcript["text"])
```

`extract_article` returns a dict with `url`, `title`, `content`,
`source_method`, `extracted_at` and `word_count`, or `None` when extraction
fails entirely.
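
Because the return value may be `None`, callers need to handle both cases. The sketch below shows one way to turn a result into a JSON record while checking for the documented keys. `to_json_record` is a hypothetical helper, and the sample values (including the `source_method` string and the timestamp format) are invented for illustration rather than taken from real output.

```python
import json
from typing import Optional

# Keys documented for the extract_article() result.
EXPECTED_KEYS = ("url", "title", "content", "source_method", "extracted_at", "word_count")


def to_json_record(result: Optional[dict]) -> Optional[str]:
    """Serialize an extraction result, or return None if extraction failed."""
    if result is None:
        return None
    missing = [key for key in EXPECTED_KEYS if key not in result]
    if missing:
        raise KeyError(f"result is missing keys: {missing}")
    # default=str covers values such as datetime objects, should they appear.
    return json.dumps({key: result[key] for key in EXPECTED_KEYS},
                      ensure_ascii=False, default=str)


# Invented sample standing in for a real extract_article() result.
sample = {
    "url": "https://example.com/some-article",
    "title": "Some article",
    "content": "Hello from Markdown",
    "source_method": "trafilatura",          # assumed value
    "extracted_at": "2024-01-01T00:00:00Z",  # assumed format
    "word_count": 3,
}

print(to_json_record(sample))
print(to_json_record(None))  # prints "None"
```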

## License

MIT — see [LICENSE](LICENSE).
