Metadata-Version: 2.4
Name: webtomd
Version: 0.1.1
Summary: Web to Markdown. No garbage.
Project-URL: Homepage, https://github.com/MrRaccooon/WebToMD
Project-URL: Repository, https://github.com/MrRaccooon/WebToMD
Project-URL: Issues, https://github.com/MrRaccooon/WebToMD/issues
Author: Prabhat
License: GPL-3.0-or-later
License-File: LICENSE
Keywords: cli,converter,markdown,terminal,web-scraping
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: GNU General Public License v3 or later (GPLv3+)
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Text Processing :: Markup :: Markdown
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4>=4.14.3
Requires-Dist: httpx>=0.27
Requires-Dist: markdownify>=0.13
Requires-Dist: pynput>=1.7
Requires-Dist: pyperclip>=1.9
Requires-Dist: readability-lxml>=0.8
Requires-Dist: rich>=13
Requires-Dist: trafilatura>=1.12
Requires-Dist: typer>=0.12
Provides-Extra: ai-all
Requires-Dist: anthropic>=0.40; extra == 'ai-all'
Requires-Dist: google-generativeai>=0.8; extra == 'ai-all'
Requires-Dist: groq>=0.9; extra == 'ai-all'
Requires-Dist: openai>=1.0; extra == 'ai-all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.8; extra == 'gemini'
Provides-Extra: groq
Requires-Dist: groq>=0.9; extra == 'groq'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: playwright
Requires-Dist: playwright>=1.44; extra == 'playwright'
Description-Content-Type: text/markdown

# webtomd

> Web to Markdown. No garbage.

A fast, terminal-native CLI that converts any URL into clean, structured Markdown. Supports multi-provider AI post-processing, batch conversion, CSS selectors, and YAML frontmatter — all from one command.

Works on **Windows**, **macOS**, and **Linux**. Requires Python 3.11 or later.

## Quick Start

```bash
pip install webtomd
webtomd https://example.com/article
```

That's it. In an interactive terminal, the Markdown file is saved in your current directory; when the output is piped, the Markdown is printed to stdout instead.

## Install

### pip (all platforms)

```bash
pip install webtomd
```

### uv (recommended — faster)

```bash
uv pip install webtomd
```

### pipx (isolated global install)

```bash
pipx install webtomd
```

### Optional extras

```bash
# AI provider support
pip install "webtomd[openai]"
pip install "webtomd[anthropic]"
pip install "webtomd[gemini]"
pip install "webtomd[groq]"
pip install "webtomd[ai-all]"

# JS-rendered page support (SPAs, React/Vue/Next.js sites)
pip install "webtomd[playwright]"
playwright install chromium
```

### Verify installation

```bash
webtomd --help
```

If `webtomd` isn't found in your PATH, you can always run it as a module:

```bash
python -m webtomd --help
```

## Features

- **Smart extraction** — trafilatura + readability fallback chain with quality scoring
- **JS-rendered pages** — optional Playwright fallback for SPAs
- **AI modes** — summarize, translate, extract, Q&A via Anthropic / OpenAI / Gemini / Groq / Ollama
- **Batch processing** — convert a file of URLs in one command with progress bar
- **CSS selectors** — target specific page sections
- **YAML frontmatter** — title, URL, date metadata
- **Auto-save** — interactive terminals save files; piped runs output to stdout
- **Smart filenames** — deterministic or AI-assisted naming
- **Clipboard** — copy output with `--copy`
- **stdin support** — pipe HTML directly
- **Recursive crawl** — `--depth N` discovers and converts same-domain linked pages
- **Clean output** — strips nav, sidebars, cookie banners, CSS noise, duplicate content
- **Cross-platform** — Windows, macOS, Linux with encoding-safe output

## Usage

### Basic conversion

```bash
# Auto-saves .md file in interactive terminals
webtomd https://example.com/article

# Save to a specific file
webtomd https://example.com/article -o article.md

# Force output to terminal
webtomd https://example.com/article --stdout
```

### Selectors and metadata

```bash
# Extract only content inside a CSS selector
webtomd https://example.com --selector "main"
webtomd https://example.com --selector "article .content"

# Add YAML frontmatter (title, url, date)
webtomd https://example.com --metadata
```
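As a rough illustration of what `--metadata` produces, the sketch below prepends a frontmatter block with the three fields named above (title, url, date). The helper name `with_frontmatter` and the exact quoting and date format are assumptions for illustration, not webtomd's actual implementation.

```python
import json
from datetime import date


def with_frontmatter(markdown: str, title: str, url: str, day: date) -> str:
    """Prepend a YAML frontmatter block to a Markdown body (hypothetical helper)."""
    # json.dumps yields double-quoted strings, which are also valid YAML
    # scalars, so titles containing ':' or '#' stay intact.
    header = "\n".join(
        [
            "---",
            f"title: {json.dumps(title, ensure_ascii=False)}",
            f"url: {json.dumps(url)}",
            f"date: {day.isoformat()}",
            "---",
        ]
    )
    return f"{header}\n\n{markdown}"


print(with_frontmatter("# Hello\n", "Hello: a post", "https://example.com/a", date(2025, 1, 2)))
```

Quoting the title matters because an unquoted colon would otherwise be read by YAML parsers as a key separator.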

### AI post-processing

```bash
webtomd https://example.com --ai summarize
webtomd https://example.com --ai "tl;dr"
webtomd https://example.com --ai translate
webtomd https://example.com --ai extract
webtomd https://example.com --ai qa
```

### Batch and crawl

```bash
# Batch: convert a list of URLs
webtomd --batch urls.txt

# Crawl: recursively discover and convert same-domain links
webtomd https://example.com --depth 2
```
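To make the idea of a same-domain crawl concrete, here is a minimal sketch using only the standard library. It assumes `--depth N` counts link hops from the start page, which this README does not spell out. The names `same_domain_links`, `crawl`, and the injected `fetch` callable are hypothetical and are not webtomd's internals.

```python
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_domain_links(page_url: str, html: str) -> list[str]:
    """Return absolute, de-duplicated links that share page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    links: list[str] = []
    for href in collector.hrefs:
        # Resolve relative links and drop '#fragment' parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url: str, depth: int, fetch: Callable[[str], str]) -> list[str]:
    """Breadth-first walk; depth 0 visits only the start page (assumed semantics)."""
    visited = [start_url]
    frontier = [start_url]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for link in same_domain_links(url, fetch(url)):
                if link not in visited:
                    visited.append(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return visited
```

Passing `fetch` in as a parameter keeps the traversal logic testable without network access; a real run would plug in an HTTP client there.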

### Stdin (pipe HTML directly)

**macOS / Linux:**

```bash
# Explicit: "-" reads HTML from stdin
curl -s https://example.com | webtomd - --stdout
# With piped input, the "-" can be omitted
curl -s https://example.com | webtomd --stdout
```

**Windows (PowerShell):**

```powershell
(Invoke-WebRequest https://example.com).Content | python -m webtomd - --stdout
```

### Other options

```bash
# Copy result to clipboard
webtomd https://example.com --copy

# Open in default editor after saving
webtomd https://example.com --open

# Silent mode (no spinners, no preview — pipe-safe)
webtomd https://example.com --silent -o out.md

# Filename strategy
webtomd https://example.com --name-strategy deterministic
webtomd https://example.com --name-strategy ai
```

## AI Setup

Set your provider's API key as an environment variable. Ollama runs locally and takes a host URL (`OLLAMA_HOST`) instead of a key.

**macOS / Linux (bash/zsh):**

```bash
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GEMINI_API_KEY=...
export GROQ_API_KEY=gsk_...
export OLLAMA_HOST=http://localhost:11434
```

**Windows (PowerShell):**

```powershell
$env:OPENAI_API_KEY = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GEMINI_API_KEY = "..."
$env:GROQ_API_KEY = "gsk_..."
$env:OLLAMA_HOST = "http://localhost:11434"
```

**Windows (Command Prompt):**

```cmd
set OPENAI_API_KEY=sk-...
set ANTHROPIC_API_KEY=sk-ant-...
```

To persist across sessions, add these to your shell profile (`~/.bashrc`, `~/.zshrc`) or set them via Windows System Environment Variables.

Or use the interactive setup wizard (writes to `~/.webtomdrc`):

```bash
webtomd --configure
```

The first available key is auto-detected in priority order: Anthropic > OpenAI > Gemini > Groq > Ollama.

If no key is configured, `--ai` modes fall back to plain Markdown output and print a notice instead of failing.
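The priority order above can be pictured as a first-match scan over environment variables. This is a minimal sketch, not webtomd's actual code: the function name `detect_provider` and the lowercase provider labels are assumptions, while the variable names and their order come from this README.

```python
import os
from typing import Mapping, Optional

# Priority order and variable names as listed in this README.
PROVIDER_ENV_VARS = [
    ("anthropic", "ANTHROPIC_API_KEY"),
    ("openai", "OPENAI_API_KEY"),
    ("gemini", "GEMINI_API_KEY"),
    ("groq", "GROQ_API_KEY"),
    ("ollama", "OLLAMA_HOST"),
]


def detect_provider(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first provider whose variable is set and non-empty, else None."""
    for name, var in PROVIDER_ENV_VARS:
        if env.get(var, "").strip():
            return name
    # None signals the caller to skip AI post-processing and emit plain Markdown.
    return None


print(detect_provider({"OPENAI_API_KEY": "sk-x", "GROQ_API_KEY": "gsk_x"}))
```

Returning `None` rather than raising matches the fallback behaviour described above: the caller can print a notice and continue with the unprocessed Markdown.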

## Configuration

Create `~/.webtomdrc` (TOML format) for persistent defaults:

```toml
output_dir = "~/Documents/webtomd"
copy = false
metadata = false
silent = false
name_strategy = "deterministic"
ai_provider = "openai"
```

CLI flags always override config file values.

**Location:** `~/.webtomdrc` resolves to:
- Linux: `/home/yourname/.webtomdrc`
- macOS: `/Users/yourname/.webtomdrc`
- Windows: `C:\Users\YourName\.webtomdrc`

## Batch Mode

Create a text file with one URL per line (`#` comments supported):

```text
# My reading list
https://example.com/article-1
https://example.com/article-2
https://example.com/article-3
```

```bash
webtomd --batch urls.txt
```

Each URL is processed independently with a live progress bar — failures don't abort the batch. A summary is printed at the end.
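The batch behaviour described above (skip comments, isolate failures, summarise at the end) can be sketched as follows. It assumes comments occupy whole lines; `read_urls`, `run_batch`, and the injected `convert` callable are hypothetical stand-ins for webtomd's internals.

```python
from typing import Callable, Iterable


def read_urls(lines: Iterable[str]) -> list[str]:
    """Keep non-empty lines that are not '#' comments."""
    urls = []
    for raw in lines:
        line = raw.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def run_batch(urls: list[str], convert: Callable[[str], object]):
    """Convert each URL independently; one failure never stops the rest."""
    succeeded: list[str] = []
    failed: list[tuple[str, str]] = []
    for url in urls:
        try:
            convert(url)
        except Exception as exc:  # record the error and move on
            failed.append((url, str(exc)))
        else:
            succeeded.append(url)
    return succeeded, failed


def fake_convert(url: str) -> None:
    """Stand-in converter that fails for one URL."""
    if url.endswith("article-2"):
        raise RuntimeError("timeout")


urls = read_urls(["# My reading list", "https://example.com/article-1", "", "https://example.com/article-2"])
ok, bad = run_batch(urls, fake_convert)
print(f"{len(ok)} converted, {len(bad)} failed")
```

Collecting failures in a list instead of re-raising is what makes an end-of-run summary possible.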

## Output Defaults

| Context | Behavior |
|---|---|
| Interactive terminal | Auto-saves `.md` file with generated name |
| Piped / non-interactive | Prints Markdown to stdout |
| `-o file.md` | Saves to the specified file |
| `--stdout` | Forces stdout in any context |
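The table can be read as a small decision function. This sketch assumes `--stdout` takes precedence over `-o` when both are given (the table says it applies "in any context") and uses `sys.stdout.isatty()` as the test for an interactive terminal; `choose_destination` is a hypothetical name.

```python
import sys
from typing import Optional


def choose_destination(
    output: Optional[str], force_stdout: bool, is_tty: bool
) -> tuple[str, Optional[str]]:
    """Return (kind, path), where kind is 'stdout', 'file', or 'auto-save'."""
    if force_stdout:
        return ("stdout", None)
    if output:
        return ("file", output)
    # No explicit choice: save a file for humans, stream for pipes.
    return ("auto-save", None) if is_tty else ("stdout", None)


# Typical call site: detect whether stdout is attached to a terminal.
print(choose_destination(None, False, sys.stdout.isatty()))
```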

## Troubleshooting

**`webtomd` command not found:**
- Ensure your Python `Scripts` (Windows) or `bin` (macOS/Linux) directory is in your PATH
- Alternative: `python -m webtomd`

**Encoding errors on Windows:**
- webtomd handles UTF-8 output automatically. If your terminal still shows garbled characters, run `chcp 65001` first, or use Windows Terminal (recommended over cmd.exe)

**Playwright not installing:**
- Run `playwright install chromium` after installing the playwright extra
- On Linux, you may need system deps: `playwright install-deps chromium`

**Clipboard not working:**
- macOS: works out of the box (`pbcopy`)
- Linux: install `xclip` or `xsel` (`sudo apt install xclip`)
- Windows: works out of the box

**Slow conversion on certain sites:**
- Some sites throttle or block automated requests — this is network-bound, not a tool issue
- Try `--selector "main"` to skip heavy page processing

## Contributing

```bash
git clone https://github.com/MrRaccooon/WebToMD.git
cd WebToMD
```

**Setup (all platforms):**

```bash
pip install uv         # if you don't have uv
uv sync --extra dev
uv run pytest
```

**Run lints:**

```bash
uv run ruff check .
```

**Run type checks:**

```bash
uv run mypy webtomd/
```

## License

GPL-3.0-or-later
