Metadata-Version: 2.4
Name: sitesavvy
Version: 0.5.0
Summary: Capture the web, your way. A modern, async, cross-platform web scraper.
Project-URL: Homepage, https://github.com/Bloody-Crow/SiteSavvy
Project-URL: Documentation, https://github.com/Bloody-Crow/SiteSavvy#readme
Project-URL: Repository, https://github.com/Bloody-Crow/SiteSavvy
Project-URL: Issues, https://github.com/Bloody-Crow/SiteSavvy/issues
Project-URL: Changelog, https://github.com/Bloody-Crow/SiteSavvy/blob/main/CHANGELOG.md
Author: SiteSavvy Contributors
License-Expression: MIT
License-File: LICENSE
Keywords: async,crawler,epub,markdown,offline-reader,pdf,web-scraper
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: aiohttp>=3.9
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: ebooklib>=0.18
Requires-Dist: html2text>=2024.2.26
Requires-Dist: httpx>=0.27
Requires-Dist: lxml>=5.0
Requires-Dist: markdownify>=0.13
Requires-Dist: numpy>=1.26
Requires-Dist: playwright>=1.40
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: tomlkit>=0.12
Requires-Dist: typer>=0.12
Requires-Dist: weasyprint>=60.0
Provides-Extra: ai
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-httpserver>=1.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Provides-Extra: reppy
Requires-Dist: reppy>=2.0; extra == 'reppy'
Provides-Extra: textual
Requires-Dist: textual>=0.50; extra == 'textual'
Description-Content-Type: text/markdown

# SiteSavvy

[![PyPI version](https://img.shields.io/pypi/v/sitesavvy.svg?logo=pypi&logoColor=white)](https://pypi.org/project/sitesavvy/)
[![PyPI downloads](https://img.shields.io/pypi/dm/sitesavvy.svg?logo=pypi&logoColor=white)](https://pypi.org/project/sitesavvy/)
[![Python versions](https://img.shields.io/pypi/pyversions/sitesavvy.svg?logo=python&logoColor=white)](https://pypi.org/project/sitesavvy/)
[![CI](https://github.com/Bloody-Crow/SiteSavvy/actions/workflows/ci.yml/badge.svg)](https://github.com/Bloody-Crow/SiteSavvy/actions/workflows/ci.yml)
[![Coverage](https://img.shields.io/badge/coverage-90%25-brightgreen.svg)](https://github.com/Bloody-Crow/SiteSavvy)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Release](https://img.shields.io/github/v/release/Bloody-Crow/SiteSavvy?logo=github)](https://github.com/Bloody-Crow/SiteSavvy/releases)

> **Capture the web, your way.**

SiteSavvy is a modern, async, cross-platform web scraper that mirrors entire
sites or extracts their readable text — and now also an **AI-powered research
tool**. Beyond the original HTML / Markdown / text / PDF / EPUB / ZIP exports,
v0.5.0 adds **LLM content extraction**, **per-page summaries and
auto-categorization**, **RAG question-answering** (`sitesavvy ask "..."`), an
**MCP server** that exposes crawl / search / ask to Claude, Cursor and VS Code
Copilot, and **nine output formats** including SQLite, WARC and Obsidian
vaults. Point it at a docs site, a blog, a shop or a whole wiki, and SiteSavvy
will quietly fetch, parse, summarize, index and answer questions about it —
politely, resumably, and on your laptop.

```bash
pip install sitesavvy
```

A basic crawl:

```bash
sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out
```

An AI-powered research crawl that extracts clean text, summarizes every page
and builds a RAG index you can ask questions of:

```bash
sitesavvy crawl https://example.com --mode text --format md sqlite \
    --summarize --categorize --index --out-dir ./research
```

Then ask a question in natural language:

```bash
sitesavvy ask "what does this site say about pricing?"
```

---

## What's new in v0.5.0

> **Note:** v0.5.0 is a major release that turns SiteSavvy from a scraper into
> an AI-powered research assistant. The headline additions:

- 🤖 **AI:** LLM content extraction, per-page summaries, auto-categorization
- 💬 **RAG:** `sitesavvy ask "..."` — ask questions about crawled sites
- 🔌 **MCP server:** expose crawl / search / ask to Claude, Cursor, VS Code Copilot
- 📋 **9 output formats:** `html`, `md`, `txt`, `pdf`, `epub`, `zip`, **`sqlite`**, **`warc`**, **`obsidian`**
- 🎯 **URL patterns** (`--include` / `--exclude`), CSS scope, sitemap seeding
- 📊 **Budgets** (`--max-pages` / `--max-bytes` / `--max-time`), HTML reports
- 🔄 **Content diff** between crawls, Wayback Machine archiving
- ⚙️ **Config files** + 6 presets (`docs` / `blog` / `wiki` / `shop` / `archive` / `research`)

See the [CHANGELOG](CHANGELOG.md) for the full list, and the
[Architecture](#architecture) section below for the new module layout.

---

## Features

### Crawl modes

- **`full`** — recursively download every reachable resource (HTML, CSS, JS,
  images, PDFs, fonts, …) preserving the original directory hierarchy.
- **`text`** — extract the readable text from each HTML page (strips scripts,
  navigation, ads) and store it in your chosen format.

### Output formats

Nine formats, repeatable via `--format`:

| Format | Mode `full` | Mode `text` | Backend |
| --- | --- | --- | --- |
| `html` | original bytes, hierarchy preserved | — | built-in |
| `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
| `txt` | — | `html2text` (no hard wrap) | `html2text` |
| `pdf` | — | WeasyPrint | `weasyprint` |
| `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
| `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |
| `sqlite` | one row per page (URL, title, text, meta) | one row per page + embeddings | `sqlite3` |
| `warc` | ISO 28500:2017 archive (replayweb.page-compatible) | — | built-in |
| `obsidian` | — | Markdown vault with `[[wikilinks]]` + frontmatter | built-in |

### AI & intelligence

- **LLM content extraction** (`--ai-extract`) — let an LLM pull the main
  article out of cluttered pages in `text` mode.
- **Per-page summaries + site digest** (`--summarize`) — every page gets a
  one-paragraph summary, plus a top-level `digest.md` overview.
- **Auto-categorization** (`--categorize`) — each page is tagged with an
  AI-derived category (e.g. *tutorial*, *pricing*, *API reference*).
- **RAG question-answering** (`sitesavvy ask "..."`) — semantic search over a
  SQLite vector store (cosine similarity) plus an LLM that synthesizes an
  answer from the top-k retrieved chunks, with source URLs.

### MCP server

`sitesavvy mcp` starts a Model Context Protocol server (stdio transport) that
exposes SiteSavvy to AI assistants. Six tools are available:

| Tool | What it does |
| --- | --- |
| `crawl` | Run a crawl with the same options as the CLI. |
| `list_pages` | List pages in a finished crawl's manifest. |
| `search` | Full-text search over a crawled mirror. |
| `get_page` | Fetch the body of a single page by URL. |
| `ask` | RAG question-answering over a crawled mirror. |
| `info` | Report installed backends and AI configuration. |

Configure it once in your client (see [MCP server](#mcp-server) below) and
your assistant can crawl, read and reason about the web without leaving the
chat.

### Scraping power

- **Sitemap seeding** (`--sitemap`) — discover and parse `sitemap.xml`,
  including sitemap indexes.
- **RSS / Atom feed discovery** (`--feeds`) — seed URLs from feed entries.
- **URL pattern filtering** — `--include` / `--exclude` accept globs
  (`*` / `**`) or `re:<regex>` patterns; the start URL is always allowed.
- **CSS scope** (`--scope "main"`) — restricts both link discovery and content
  extraction to a subtree.
- **Budgets** — `--max-pages`, `--max-bytes`, `--max-time` stop crawls cleanly
  and leave a resumable manifest behind.
- **Proxy support** — `--proxy http://host:port` or `socks5://host:port`.
- **Screenshots** (`--screenshots`) — capture full-page PNGs in headless mode.
- **Headless rendering** (`--headless`) via Playwright (falls back to `aiohttp`
  automatically when no browser binary is installed).

### Politeness

- **Robots.txt compliance by default**, with `--force` override.
- **Per-host delay** (`--delay`) and **auto-throttle** on `429` / `5xx`
  (`--rate-limit auto`, the default) with exponential back-off plus jitter.
- **Resume** (`--resume`) — skip URLs already completed in the manifest.
- **Incremental** (`--incremental`) — re-download only changed resources via
  conditional GETs (`ETag` / `Last-Modified`, `304 Not Modified`).
- **External-link gating** — stays on the start host unless you pass
  `--external`.

### Config & UX

- **Config files** — `sitesavvy.toml` with `[default]` and `[profiles.<name>]`
  sections; `--config` and `--profile` flags load them.
- **6 built-in presets** — `--preset docs|blog|wiki|shop|archive|research`
  (see [Presets](#presets)).
- **Interactive wizard** — `sitesavvy new` walks you through a few prompts and
  prints a ready-to-run `crawl` command (or writes a `sitesavvy.toml`).
- **`sitesavvy init-config`** — writes an example `sitesavvy.toml`.
- **`sitesavvy list-presets`** — lists available presets.
- **HTML report** (`--report`) — writes a self-contained `crawl-report.html`
  summarizing URLs fetched, failures, formats produced and AI summaries.
- **Content diff** (`sitesavvy diff <old> <new> <old-dir> <new-dir>`) —
  compares two crawls and reports added / removed / changed pages as Markdown.
- **Wayback Machine** (`--archive`) — submits every fetched page to
  `web.archive.org` (fire-and-forget).
- **Rich CLI** with progress tables, coloured output and `--verbose` / `-v`
  debug logging.

---

## Installation

### Option 1 — From PyPI (recommended)

```bash
pip install sitesavvy
```

For the MCP server, install the optional extra:

```bash
pip install 'sitesavvy[mcp]'
```

Verify the install:

```bash
sitesavvy --version
sitesavvy info   # show which optional backends are installed + AI config
```

### Option 2 — Stand-alone binary (no Python required)

Download the right archive for your OS from the
[latest release](https://github.com/Bloody-Crow/SiteSavvy/releases/latest),
extract it, and run:

| OS | Asset | How to run |
| --- | --- | --- |
| **Linux** (x86_64) | `sitesavvy-0.5.0-linux-x86_64.tar.gz` | `tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help` |
| **macOS** (x86_64) | `sitesavvy-0.5.0-macos-x86_64.tar.gz` | `tar -xzf sitesavvy-*.tar.gz && ./sitesavvy --help` |
| **Windows** (x86_64) | `sitesavvy-0.5.0-windows-x86_64.exe` | Double-click, or run `sitesavvy.exe --help` in PowerShell |

These are single-file [PyInstaller](https://pyinstaller.org/) executables — no
Python installation needed.

### Option 3 — From source (development)

```bash
git clone https://github.com/Bloody-Crow/SiteSavvy.git
cd SiteSavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless / --screenshots
```

A plain `pip install -r requirements.txt` is also supported if you prefer to
skip the PEP 517 build.

---

## AI configuration

AI features (extraction, summaries, categorization, RAG `ask`) talk to any
**OpenAI-compatible** endpoint: OpenAI itself, Ollama, vLLM, LM Studio, Groq,
Together, OpenRouter, and any other server that implements the
`/chat/completions` and `/embeddings` routes.

Configure via environment variables:

| Variable | Default | Purpose |
| --- | --- | --- |
| `SITESAVVY_LLM_BASE_URL` | `https://api.openai.com/v1` | OpenAI-compatible API base URL. |
| `SITESAVVY_LLM_API_KEY` | _(empty)_ | API key. **Required** for AI features. |
| `SITESAVVY_LLM_MODEL` | `gpt-4o-mini` | Chat model for extraction / summaries / `ask`. |
| `SITESAVVY_EMBED_MODEL` | `text-embedding-3-small` | Embedding model for the RAG index. |

Example — OpenAI:

```bash
export SITESAVVY_LLM_API_KEY="sk-..."
sitesavvy crawl https://example.com --mode text --format md --summarize --index
```

Example — local Ollama (no API key needed):

```bash
export SITESAVVY_LLM_BASE_URL="http://localhost:11434/v1"
export SITESAVVY_LLM_API_KEY="ollama"   # any non-empty string
export SITESAVVY_LLM_MODEL="llama3.1"
export SITESAVVY_EMBED_MODEL="nomic-embed-text"
sitesavvy crawl https://example.com --mode text --summarize --index
```

> **Note:** AI features are strictly opt-in. SiteSavvy never makes network
> calls to an LLM provider unless you explicitly pass `--ai-extract`,
> `--summarize`, `--categorize` or `--index`, or invoke `sitesavvy ask`. Without
> an API key the AI flags are silently skipped and the crawl completes
> normally — see [Troubleshooting](#troubleshooting).

Run `sitesavvy info` at any time to see the configured base URL, model, and
whether an API key is set.

---

## MCP server

`sitesavvy mcp` runs SiteSavvy as a Model Context Protocol server over stdio,
exposing the six tools listed in [Features](#mcp-server) to any MCP-compatible
client.

### Claude Desktop

Add an entry to your Claude Desktop config
(macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`,
Windows: `%APPDATA%\Claude\claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "sitesavvy": {
      "command": "sitesavvy",
      "args": ["mcp"]
    }
  }
}
```

Restart Claude Desktop. You can now ask things like *"crawl
https://docs.example.com and tell me how their auth flow works"* and Claude
will call the `crawl` and `ask` tools for you.

### Cursor, VS Code Copilot, and other MCP clients

The same `command: sitesavvy, args: ["mcp"]` snippet works in any client that
speaks MCP. See your client's docs for where to register MCP servers.

> **Note:** The MCP server requires the optional `mcp` package. Install it with
> `pip install 'sitesavvy[mcp]'`. If it's missing, `sitesavvy mcp` prints a
> helpful error pointing at the install command.

---

## Quick start

### Full-site mirror → ZIP

```bash
sitesavvy crawl https://example.com --depth 2 --format html zip --out-dir ./out
```

### Text-only crawl → Markdown + EPUB

```bash
sitesavvy crawl https://example.com --mode text --format md epub --out-dir ./reader
```

### AI research crawl → Markdown + SQLite + summaries + RAG index

```bash
sitesavvy crawl https://example.com \
    --mode text --format md sqlite \
    --summarize --categorize --index \
    --out-dir ./research
```

### Ask a question about a crawled site

```bash
sitesavvy ask "what does this site say about pricing?"
```

`ask` uses the RAG index built by `--index` (default location
`./sitesavvy.index.db`). Pass `--index /path/to/index.db` and `--top-k 10` to
customise retrieval.

### Preset crawl (one-shot sensible defaults)

```bash
sitesavvy crawl https://docusaurus.io --preset docs
sitesavvy crawl https://shop.example.com --preset shop
sitesavvy crawl https://wiki.example.com --preset wiki
```

See [Presets](#presets) for what each one does.

### Dry-run, resume, incremental, headless

```bash
# List URLs that would be fetched, without writing anything
sitesavvy crawl https://example.com --dry-run --depth 1

# Resume an interrupted crawl
sitesavvy crawl https://example.com --depth 3 --resume \
    --manifest ./out/manifest.json --out-dir ./out

# Only re-download changed resources
sitesavvy crawl https://example.com --incremental \
    --manifest ./out/manifest.json --out-dir ./out

# Render JavaScript-heavy pages with Playwright
sitesavvy crawl https://spa.example.com --headless --format html
```

---

## Command reference

### Global options

| Flag | Description |
| --- | --- |
| `--version` | Print the SiteSavvy version and exit. |
| `--verbose` / `-v` | Enable debug logging. |

### `sitesavvy crawl` — the main crawler

| Flag | Default | Description |
| --- | --- | --- |
| `url` *(positional)* | — | Starting URL to crawl (`http` / `https`). |
| `--depth INT` | `0` | Max link depth (`0` = unlimited). |
| `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
| `--format …` | `html` | Output format, repeatable: `html md txt pdf epub zip sqlite warc obsidian`. |
| `--out-dir PATH` | CWD | Destination folder. |
| `--concurrency N` | `4` | Simultaneous HTTP requests. |
| `--user-agent STR` | browser-like | Custom `User-Agent` header. |
| `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
| `--delay SECS` | `0.5` | Polite delay between same-host requests. |
| `--resume` | off | Skip URLs already completed in the manifest. |
| `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
| `--dry-run` | off | List URLs that would be fetched. |
| `--headless` | off | Render JS pages with Playwright. |
| `--rate-limit {auto,fixed}` | `auto` | Back off on 429/5xx, or use fixed delay. |
| `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
| `--incremental` | off | Re-download only changed resources (conditional GET). |
| `--external` | off | Follow cross-domain links. |
| `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
| `--timeout SECS` | `30` | Per-request timeout. |
| `--include PATTERN` | — | URL pattern to include (repeatable; glob with `*` / `**` or `re:<regex>`). |
| `--exclude PATTERN` | — | URL pattern to exclude (repeatable; glob or `re:<regex>`). |
| `--scope SELECTOR` | — | CSS selector restricting link discovery + content extraction. |
| `--max-pages N` | — | Stop after this many pages (budget). |
| `--max-bytes N` | — | Stop after downloading this many bytes (budget). |
| `--max-time SECS` | — | Stop after this many seconds (budget). |
| `--proxy URL` | — | Proxy URL (`http://`, `https://`, `socks5://`). |
| `--screenshots` | off | Capture full-page PNGs (headless). |
| `--archive` | off | Submit every page to the Wayback Machine. |
| `--ai-extract` | off | Use an LLM to extract main content (text mode). |
| `--summarize` | off | Generate per-page summaries + a site digest. |
| `--categorize` | off | Tag each page with an AI category. |
| `--structured` | off | Emit JSON-LD / Open Graph / table sidecars. |
| `--sitemap` | off | Seed URLs from `sitemap.xml` (incl. sitemap indexes). |
| `--index` | off | Build a RAG index for `sitesavvy ask`. |
| `--report` | off | Write a self-contained `crawl-report.html`. |
| `--config FILE` | — | Path to a `sitesavvy.toml` config file. |
| `--profile NAME` | — | Named profile from the config file. |
| `--preset NAME` | — | Built-in preset: `docs` / `blog` / `wiki` / `shop` / `archive` / `research`. |

### `sitesavvy ask` — RAG question-answering

```bash
sitesavvy ask "what does this site say about pricing?" [--index PATH] [--top-k N]
```

| Flag | Default | Description |
| --- | --- | --- |
| `question` *(positional)* | — | Question to answer from the crawled mirror. |
| `--index PATH` | `./sitesavvy.index.db` | Path to the RAG index built by `crawl --index`. |
| `--top-k N` | `5` | Number of pages to retrieve and feed to the LLM. |

### `sitesavvy mcp` — MCP server

```bash
sitesavvy mcp
```

Starts the Model Context Protocol server over stdio. No flags. Requires the
optional `mcp` package (`pip install 'sitesavvy[mcp]'`). See
[MCP server](#mcp-server) above.

### `sitesavvy new` — interactive wizard

```bash
sitesavvy new [--config PATH]
```

| Flag | Default | Description |
| --- | --- | --- |
| `--config PATH` | — | Write a `sitesavvy.toml` config file instead of printing a command. |

Walks you through a few prompts (URL, mode, formats, scope, budgets, AI flags)
and either prints a ready-to-run `crawl` command or writes a `sitesavvy.toml`.

### `sitesavvy diff` — compare two crawls

```bash
sitesavvy diff <old-manifest.json> <new-manifest.json> <old-dir> <new-dir> [--output PATH]
```

| Flag | Default | Description |
| --- | --- | --- |
| `old` *(positional)* | — | Path to the old crawl's `manifest.json`. |
| `new` *(positional)* | — | Path to the new crawl's `manifest.json`. |
| `old_dir` *(positional)* | — | Old crawl's output directory. |
| `new_dir` *(positional)* | — | New crawl's output directory. |
| `--output PATH` | — | Write the Markdown diff report to this path (default: stdout). |

Reports added, removed and changed pages between two crawls, including a
diff of the page bodies for changed pages.

### `sitesavvy list-presets` — list built-in presets

```bash
sitesavvy list-presets
```

No flags. Prints a table of the six built-in presets and their use cases.

### `sitesavvy init-config` — write an example config

```bash
sitesavvy init-config [--output PATH]
```

| Flag | Default | Description |
| --- | --- | --- |
| `--output PATH` | `sitesavvy.toml` | Where to write the example config. |

### `sitesavvy legal` and `sitesavvy info`

```bash
sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show installed backends + AI configuration status
```

---

## Presets

Six built-in presets cover the common crawl shapes. Pass `--preset <name>` and
SiteSavvy fills in the flags for you. A preset can be combined with `--profile`
from your `sitesavvy.toml` (the profile overrides the preset).

| Preset | Mode | Formats | Highlights | Use case |
| --- | --- | --- | --- | --- |
| `docs` | `text` | `md`, `pdf` | `scope=article`, `depth=3`, `delay=0.2` | Documentation sites (Docusaurus, MkDocs, Sphinx). |
| `blog` | `text` | `md` | `include=/blog/*`, excludes pagination / tags / categories | A blog archive for offline reading. |
| `wiki` | `text` | `md`, `epub` | `depth=0` (unlimited), `delay=0.3` | Mirror a wiki into an EPUB for an e-reader. |
| `shop` | `full` | `html`, `zip` | `include=/product/*`, `/products/*`, excludes cart / checkout / account | Archive an e-commerce catalogue. |
| `archive` | `full` | `html`, `warc`, `zip` | `depth=0`, `respect_robots=true` | Long-term archival — WARC for replay, ZIP for sharing. |
| `research` | `text` | `md`, `sqlite` | `summarize=true`, `depth=0`, `delay=0` | Research crawl with AI summaries + a queryable SQLite store. |

Example: combine a preset with the `--index` flag to also build a RAG index
for `ask`:

```bash
sitesavvy crawl https://docs.example.com --preset docs --index
sitesavvy ask "how do I configure authentication?"
```

---

## Output formats

The format matrix in [Features](#output-formats) lists all nine formats and
their backends. A few notes:

- **`html`** preserves the original byte stream and the site's directory
  hierarchy, so a `full` crawl can be served verbatim from disk.
- **`zip`** packages the entire crawl (any combination of other formats) into
  a single archive for easy sharing.
- **`sqlite`** stores one row per page with URL, title, extracted text,
  metadata and (when `--index` is set) embedding vectors — perfect for
  downstream analysis or for `sitesavvy ask`.
- **`warc`** writes an ISO 28500:2017 archive that opens in
  [replayweb.page](https://replayweb.page/) and any Web Archive player.
- **`obsidian`** exports a Markdown vault with YAML frontmatter and
  `[[wikilinks]]` between pages, ready to drop into an Obsidian vault.

Sample Markdown output:

```markdown
---
url: https://example.com/page
title: Page Title
category: tutorial
summary: A short paragraph summarizing the page.
---

# Page Title

## A heading

Some paragraph text with a [link](https://example.com/other).
```

---

## Architecture

```
sitesavvy/
├── __init__.py            # package metadata
├── __main__.py            # python -m sitesavvy
├── __about__.py           # version
├── config.py              # CrawlConfig + enums (CrawlMode, OutputFormat, ...)
├── models.py              # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py           # normalisation, link extraction, path mapping
├── robots.py              # async robots.txt (reppy or stdlib fallback)
├── conversions.py         # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py            # resume / incremental state
├── headless.py            # Playwright fetcher
├── crawler.py             # the Crawler engine (orchestrates everything)
├── legal.py               # disclaimer text
├── cli.py                 # Typer + Rich CLI (crawl, ask, mcp, new, diff, ...)
├── main.py                # console-script entry point
├── ai.py                  # LLM client + LLMConfig (OpenAI-compatible)
├── rag.py                 # SQLite vector store + cosine similarity search
├── mcp_server.py          # MCP server exposing 6 tools over stdio
├── feeds.py               # RSS / Atom feed discovery + seeding
├── patterns.py            # glob + regex URL pattern matching
├── structured.py          # JSON-LD / Open Graph / table sidecar extraction
├── warc.py                # ISO 28500:2017 WARC writer
├── sqlite_export.py       # SQLite exporter (rows + embeddings)
├── report.py              # self-contained HTML crawl report
├── config_file.py         # sitesavvy.toml parsing + 6 built-in presets
├── budgets.py             # page / byte / time budget enforcement
├── wayback.py             # Wayback Machine submission
├── scope.py               # CSS selector scoping for discovery + extraction
├── screenshots.py         # full-page PNG capture (headless)
├── diff.py                # cross-crawl added/removed/changed diff
├── obsidian.py            # Obsidian vault exporter (wikilinks + frontmatter)
└── wizard.py              # interactive `sitesavvy new` wizard
```

Networking layer: `aiohttp` (primary) with an optional Playwright headless
browser for JS-rendered pages, and `httpx` as the LLM / embeddings client.
HTML parsing uses `beautifulsoup4` + `lxml`. `robots.txt` is parsed with
`reppy` when available, otherwise with the stdlib `urllib.robotparser`.

See the [Architecture docs](docs/architecture.md) for a deeper dive, including
a Mermaid flow diagram of a crawl.

---

## Troubleshooting

- **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
  and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
- **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
  first to estimate scope, and use `--resume` so an interruption doesn't waste
  work. `--max-pages`, `--max-bytes` and `--max-time` add hard budgets that
  stop the crawl cleanly and leave a resumable manifest behind.
- **PDF export fails** — WeasyPrint needs Pango/Cairo system libraries. On
  Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
  `brew install pango`. The other formats keep working even if PDF is missing.
- **Headless mode crashes** — run `playwright install chromium` once after
  installing the package. Without it, SiteSavvy transparently falls back to
  `aiohttp`.
- **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
  Add `--force` only if you have permission and accept responsibility.
- **AI features silently skipped / `No LLM API key set`** — export
  `SITESAVVY_LLM_API_KEY` (or set it in your shell profile). Run
  `sitesavvy info` to confirm the key is detected. AI flags are no-ops without
  a key; the crawl itself still completes normally.
- **`MCP server failed to start`** — the optional `mcp` package is missing.
  Install it with `pip install 'sitesavvy[mcp]'` and try again.
- **Proxy issues** — `--proxy` accepts `http://`, `https://` and `socks5://`
  URLs. For SOCKS proxies, make sure the `aiohttp-socks` extra is installed
  (`pip install aiohttp-socks`). If requests time out, check that the proxy
  allows CONNECT to port 443.
- **Windows: long path errors** — enable long paths
  (`New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1`)
  or use a shorter `--out-dir`.
- **`command not found: sitesavvy`** — make sure `pip install`'s `bin` /
  `Scripts` directory is on your `PATH`, or use `python -m sitesavvy`.

---

## Legal & ethics

SiteSavvy is provided for **personal, non-commercial use only**. Respect the
copyright, terms of service, and `robots.txt` of every site you crawl. AI
features send extracted page content to your configured LLM provider — make
sure you have the right to do so for the sites you crawl. The authors assume
no liability for misuse. Run `sitesavvy legal` to read the full disclaimer.
Licensed under the [MIT License](LICENSE).

---

## Contributing

Pull requests are welcome! Please run the full check suite before submitting:

```bash
ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing
```

Coverage must stay at or above **90 %**. See the [Developer Guide](docs/developer.md)
for the project layout, release process and binary-building instructions.

---

## Links

- **PyPI:** <https://pypi.org/project/sitesavvy/>
- **GitHub:** <https://github.com/Bloody-Crow/SiteSavvy>
- **Releases:** <https://github.com/Bloody-Crow/SiteSavvy/releases>
- **Issues:** <https://github.com/Bloody-Crow/SiteSavvy/issues>
- **Changelog:** [CHANGELOG.md](CHANGELOG.md)
- **License:** [MIT](LICENSE)
