Metadata-Version: 2.4
Name: sitesavvy
Version: 0.1.0
Summary: Capture the web, your way. A modern, async, cross-platform web scraper.
Project-URL: Homepage, https://github.com/your-org/sitesavvy
Project-URL: Documentation, https://your-org.github.io/sitesavvy/
Project-URL: Repository, https://github.com/your-org/sitesavvy
Project-URL: Issues, https://github.com/your-org/sitesavvy/issues
Project-URL: Changelog, https://github.com/your-org/sitesavvy/blob/main/CHANGELOG.md
Author: SiteSavvy Contributors
License-Expression: MIT
License-File: LICENSE
Keywords: async,crawler,epub,markdown,offline-reader,pdf,web-scraper
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: aiohttp>=3.9
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: ebooklib>=0.18
Requires-Dist: html2text>=2024.2.26
Requires-Dist: lxml>=5.0
Requires-Dist: markdownify>=0.13
Requires-Dist: playwright>=1.40
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7
Requires-Dist: tomlkit>=0.12
Requires-Dist: typer>=0.12
Requires-Dist: weasyprint>=60.0
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-httpserver>=1.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Provides-Extra: reppy
Requires-Dist: reppy>=2.0; extra == 'reppy'
Description-Content-Type: text/markdown

# SiteSavvy

> **Capture the web, your way.**

A modern, async, cross-platform web scraper that mirrors entire sites or
extracts their readable text — and exports the result as **HTML**, **Markdown**,
**plain text**, **PDF**, **EPUB** or a single **ZIP** archive.

Built with [`aiohttp`](https://docs.aiohttp.org/), `BeautifulSoup` + `lxml`,
`Typer` + `Rich`, with optional Playwright headless rendering for
JavaScript-heavy pages.

---

## Features

- **Two crawl modes**
  - `full` — recursively download every reachable resource (HTML, CSS, JS,
    images, PDFs, fonts, …) preserving the original directory hierarchy.
  - `text` — extract the readable text from each HTML page (strips scripts,
    navigation, ads) and store it in your chosen format.
- **Six output formats** (repeatable `--format`): `html`, `md`, `txt`, `pdf`,
  `epub`, `zip`.
- **Polite by default**: respects `robots.txt`, enforces a per-host delay, and
  auto-throttles on `429` / `5xx` responses.
- **Resume & incremental**: a JSON manifest records every fetched URL, its
  local path and `ETag` / `Last-Modified`; `--resume` skips completed work and
  `--incremental` re-downloads only what changed.
- **Concurrency control** with a global semaphore and per-host locks.
- **Dry-run** mode that lists the URLs that *would* be fetched.
- **Headless rendering** via Playwright (falls back to `aiohttp` automatically
  when the Playwright browser is not installed).
- **Fine-grained `--download-types`** filtering: `html,css,js,img,pdf,other`.
- **External-link gating** — stays on the start host unless you pass
  `--external`.
- **Rich CLI** with progress tables and coloured output.
- **Cross-platform** — runs on Linux, macOS and Windows; ships a CI matrix for
  all three.
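
The concurrency and politeness bullets above can be pictured with a short
sketch. It is an illustration only, not SiteSavvy's actual implementation:
`PoliteLimiter` and its `run` method are hypothetical names, and the defaults
simply mirror the documented `--concurrency 4` and `--delay 0.5`.

```python
import asyncio
import time
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteLimiter:
    """Hypothetical helper: a global concurrency cap plus a per-host delay."""

    def __init__(self, concurrency: int = 4, delay: float = 0.5) -> None:
        self._global = asyncio.Semaphore(concurrency)
        self._host_locks = defaultdict(asyncio.Lock)  # one lock per host
        self._last_hit = {}  # host -> monotonic time of the last request slot
        self._delay = delay

    async def run(self, url, fetch):
        """Await `fetch(url)` once the host delay and the global cap allow it."""
        host = urlsplit(url).netloc
        # The per-host lock serialises the spacing check for one host.
        async with self._host_locks[host]:
            last = self._last_hit.get(host, float("-inf"))
            wait = last + self._delay - time.monotonic()
            if wait > 0:
                await asyncio.sleep(wait)
            self._last_hit[host] = time.monotonic()
        # The global semaphore bounds how many fetches are in flight at once.
        async with self._global:
            return await fetch(url)
```

Requests to different hosts only compete for the global semaphore, while
requests to the same host are additionally spaced at least `delay` seconds
apart.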

---

## Installation

### From PyPI (once published)

```bash
pip install sitesavvy
```

### From source (development)

```bash
git clone https://github.com/your-org/sitesavvy.git
cd sitesavvy
python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
playwright install chromium   # optional, only for --headless
```

A plain `pip install -r requirements.txt` is also supported if you prefer to
skip the PEP 517 build.

---

## Quick start

### Full-site mirror → ZIP

```bash
sitesavvy crawl https://example.com --depth 2 --format html --format zip --out-dir ./out
```
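
For intuition about how a mirror can preserve the original directory
hierarchy, here is a minimal sketch of a URL-to-path mapping. The function
name `url_to_local_path`, the `<host>/<path>` layout and the `index.html`
default are assumptions for illustration; the real mapping in `url_utils.py`
may differ (for example in how it treats query strings, which this sketch
ignores).

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit


def url_to_local_path(url: str, index_name: str = "index.html") -> PurePosixPath:
    """Map a URL to a relative path of the form <host>/<path> (illustrative)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    # Directory-style URLs ("", "/" or "/docs/") get an index file name.
    if not path or path.endswith("/"):
        path += index_name
    # Drop empty, "." and ".." segments so the result stays inside the output dir.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return PurePosixPath(parts.hostname or "", *segments)


print(url_to_local_path("https://example.com/docs/"))  # example.com/docs/index.html
```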

### Text-only crawl → Markdown + EPUB

```bash
sitesavvy crawl https://example.com --mode text --format md --format epub --out-dir ./reader
```

### Dry-run (list URLs only)

```bash
sitesavvy crawl https://example.com --dry-run --depth 1
```
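
Conceptually, a dry run is a depth-bounded traversal that records URLs instead
of saving them. The sketch below shows the idea on an in-memory link graph.
`plan_crawl` and the `links` mapping are hypothetical stand-ins for real
fetching and link extraction, and counting the start page as depth 0 is an
assumption about how `--depth` is measured.

```python
from collections import deque
from urllib.parse import urlsplit


def plan_crawl(start, links, depth=0, external=False):
    """Return URLs in breadth-first order.

    `links` maps each URL to the URLs it links to; `depth=0` means unlimited.
    """
    start_host = urlsplit(start).netloc
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        order.append(url)
        if depth and level >= depth:
            continue  # depth limit reached: do not follow this page's links
        for nxt in links.get(url, ()):
            if nxt in seen:
                continue
            # Stay on the start host unless external links are allowed.
            if not external and urlsplit(nxt).netloc != start_host:
                continue
            seen.add(nxt)
            queue.append((nxt, level + 1))
    return order
```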

### Resume an interrupted crawl

```bash
sitesavvy crawl https://example.com --depth 3 --resume --manifest ./out/manifest.json --out-dir ./out
```
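
As a rough picture of what resuming involves, the sketch below filters a list
of candidate URLs against a manifest file. The manifest schema used here (a
JSON object keyed by URL, with a `status` field) and the `pending_urls` helper
are assumptions for illustration; the format SiteSavvy actually writes may
differ.

```python
import json
from pathlib import Path


def pending_urls(candidates, manifest_path):
    """Return the candidates not marked as done in the manifest (illustrative)."""
    path = Path(manifest_path)
    # A missing manifest simply means nothing has been fetched yet.
    entries = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    done = {url for url, entry in entries.items() if entry.get("status") == "done"}
    return [url for url in candidates if url not in done]
```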

### Only re-download changed resources

```bash
sitesavvy crawl https://example.com --incremental --manifest ./out/manifest.json --out-dir ./out
```
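
Incremental mode relies on HTTP conditional requests: the validators stored
for a URL are sent back, and the server answers `304 Not Modified` when
nothing changed. A minimal sketch follows, assuming a manifest entry is a dict
with optional `etag` and `last_modified` keys (hypothetical field names):

```python
def conditional_headers(entry):
    """Build conditional-GET request headers from stored validators."""
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def is_unchanged(status):
    """A 304 response means the local copy is still current."""
    return status == 304
```

With `aiohttp`, such a dict can be passed as the `headers=` argument of
`session.get(...)`.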

### Render JavaScript pages

```bash
sitesavvy crawl https://spa.example.com --headless --format html
```

---

## Command reference

| Flag | Default | Description |
| --- | --- | --- |
| `url` *(positional)* | — | Starting URL. |
| `--depth INT` | `0` | Max link depth (`0` = unlimited). |
| `--mode {full,text}` | `full` | Full-site download or text-only extraction. |
| `--format …` | `html` | Output format; repeat the flag to select several of `html`, `md`, `txt`, `pdf`, `epub`, `zip`. |
| `--out-dir PATH` | CWD | Destination folder. |
| `--concurrency N` | `4` | Simultaneous HTTP requests. |
| `--user-agent STR` | browser-like | Custom `User-Agent` header. |
| `--respect-robots` / `--no-respect-robots` | on | Obey `robots.txt`. |
| `--delay SECS` | `0.5` | Polite delay between same-host requests. |
| `--resume` | off | Skip URLs already completed in the manifest. |
| `--manifest FILE` | `<out-dir>/manifest.json` | Manifest path. |
| `--dry-run` | off | List URLs that would be fetched. |
| `--headless` | off | Render JS pages with Playwright. |
| `--rate-limit {auto,fixed}` | `auto` | `auto` backs off on 429/5xx; `fixed` always waits `--delay`. |
| `--download-types …` | all | Comma-separated: `html,css,js,img,pdf,other`. |
| `--incremental` | off | Re-download only changed resources (conditional GET). |
| `--external` | off | Follow cross-domain links. |
| `--force` | off | Proceed even if `robots.txt` disallows the start URL. |
| `--timeout SECS` | `30` | Per-request timeout. |
| `--verbose` / `-v` | off | Enable debug logging. |

Auxiliary commands:

```bash
sitesavvy legal     # print the legal / ethical disclaimer
sitesavvy info      # show which optional backends are installed
sitesavvy --version
```
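
A check like the one `sitesavvy info` performs can be approximated with the
standard library. This is a sketch, not the command's actual code: the
`OPTIONAL_BACKENDS` mapping and the `available_backends` helper are
assumptions, and `find_spec` only tells you that a module can be located, not
that its system libraries load.

```python
from importlib.util import find_spec

# Hypothetical feature -> top-level module mapping, for illustration only.
OPTIONAL_BACKENDS = {
    "pdf": "weasyprint",
    "epub": "ebooklib",
    "headless": "playwright",
    "robots": "reppy",
}


def available_backends(backends=OPTIONAL_BACKENDS):
    """Report which backend modules can be found, without importing them."""
    return {feature: find_spec(module) is not None
            for feature, module in backends.items()}
```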

---

## Export-format matrix

| Format | Mode `full` | Mode `text` | Backend |
| --- | --- | --- | --- |
| `html` | original bytes, hierarchy preserved | — | built-in |
| `md` | — | `markdownify` (ATX headings, links absolute) | `markdownify` |
| `txt` | — | `html2text` (no hard wrap) | `html2text` |
| `pdf` | — | WeasyPrint | `weasyprint` |
| `epub` | — | `ebooklib`, one chapter per page | `ebooklib` |
| `zip` | archive of the whole crawl | archive of the whole crawl | `zipfile` |

Sample Markdown output:

```markdown
# Page Title

## A heading

Some paragraph text with a [link](https://example.com/page).
```

---

## Architecture

```text
sitesavvy/
├── __init__.py          # package metadata
├── __main__.py          # python -m sitesavvy
├── __about__.py         # version
├── config.py            # CrawlConfig + enums
├── models.py            # CrawlItem, FetchResult, ManifestEntry
├── url_utils.py         # normalisation, link extraction, path mapping
├── robots.py            # async robots.txt (reppy or stdlib fallback)
├── conversions.py       # HTML → MD/TXT/PDF/EPUB + ZIP
├── manifest.py          # resume / incremental state
├── headless.py          # Playwright fetcher
├── crawler.py           # the Crawler engine
├── legal.py             # disclaimer text
├── cli.py               # Typer + Rich CLI
└── main.py              # console-script entry point
```

Networking layer: `aiohttp` (primary) with an optional Playwright headless
browser for JS-rendered pages. HTML parsing uses `beautifulsoup4` + `lxml`.
`robots.txt` is parsed with `reppy` when available, otherwise with the stdlib
`urllib.robotparser`.
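
To show what the stdlib fallback offers, here is a small, synchronous example
using `urllib.robotparser` on rules supplied as a string. The `build_parser`
helper and the `SiteSavvy` user-agent token are illustrative assumptions; the
async wrapper in `robots.py` is not reproduced here.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


RULES = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = build_parser(RULES)
print(parser.can_fetch("SiteSavvy", "https://example.com/private/page"))  # False
print(parser.can_fetch("SiteSavvy", "https://example.com/public"))        # True
print(parser.crawl_delay("SiteSavvy"))                                    # 2
```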

---

## Troubleshooting

- **`HTTP 429 Too Many Requests`** — lower `--concurrency`, raise `--delay`,
  and keep `--rate-limit auto` (default) so SiteSavvy backs off automatically.
- **Large sites** — set `--depth` to bound the crawl, run with `--dry-run`
  first to estimate scope, and use `--resume` so an interruption doesn't waste
  work.
- **PDF export fails** — WeasyPrint needs the Pango system libraries. On
  Debian/Ubuntu: `apt install libpango-1.0-0 libpangoft2-1.0-0`. On macOS:
  `brew install pango`. The other formats keep working even if PDF is missing.
- **Headless mode crashes** — run `playwright install chromium` once after
  installing the package. Without it, SiteSavvy transparently falls back to
  `aiohttp`.
- **`robots.txt disallows …`** — by default SiteSavvy honours `robots.txt`.
  Add `--force` only if you have permission and accept responsibility.
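
The automatic back-off mentioned for `429` / `5xx` responses can be pictured
as capped exponential delay. The sketch below is one common approach, not
necessarily the policy SiteSavvy uses: `backoff_delay`, its parameters and the
60-second cap are assumptions, and only the numeric (delta-seconds) form of
`Retry-After` is handled.

```python
def backoff_delay(status, attempt, base=0.5, cap=60.0, retry_after=None):
    """Seconds to wait before retrying, or None if no retry is called for.

    `attempt` starts at 0 for the first retry.
    """
    if status != 429 and not 500 <= status < 600:
        return None
    # Prefer the server's hint when it sends a numeric Retry-After header.
    if retry_after is not None:
        return min(float(retry_after), cap)
    # Otherwise double the delay on every attempt, up to the cap.
    return min(cap, base * 2 ** attempt)
```

A crawler would typically add some random jitter to the returned value so
that parallel workers do not retry in lockstep.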

---

## Legal & ethics

SiteSavvy is provided for **personal, non-commercial use only**. Respect the
copyright, terms of service, and `robots.txt` of every site you crawl. The
authors assume no liability for misuse. Run `sitesavvy legal` to read the full
disclaimer. Licensed under the [MIT License](LICENSE).

---

## Contributing

Pull requests are welcome! Please run the full check suite before submitting:

```bash
ruff check .
mypy sitesavvy
pytest --cov=sitesavvy --cov-report=term-missing
```

Coverage must stay at or above **90%**. See the [Developer Guide](docs/developer.md)
for the project layout, release process and binary-building instructions.
