Metadata-Version: 2.4
Name: crawlvox
Version: 0.1.0
Summary: Web scraping and content extraction tool
Author: CrawlVox Team
License: MIT
License-File: LICENSE
Requires-Python: >=3.11
Requires-Dist: aiofiles
Requires-Dist: aiolimiter
Requires-Dist: aiosqlite
Requires-Dist: docling
Requires-Dist: httpx
Requires-Dist: playwright
Requires-Dist: protego
Requires-Dist: pydantic-settings
Requires-Dist: pydantic>=2.0
Requires-Dist: pymupdf
Requires-Dist: readability-lxml
Requires-Dist: rich
Requires-Dist: structlog
Requires-Dist: tenacity
Requires-Dist: trafilatura
Requires-Dist: typer[all]
Requires-Dist: ultimate-sitemap-parser
Requires-Dist: url-normalize
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

<img src="assets/logo.png" alt="CrawlVox Logo" width="350">

# CrawlVox

**Extract clean, structured content from any website.**

A command-line web crawler that handles static HTML, JavaScript-heavy SPAs, and PDFs — built for content migration, web archiving, and research workflows.

[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-3776AB?logo=python&logoColor=white)](https://python.org)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)

---

**Crawl** websites at scale &nbsp;|&nbsp; **Extract** articles, metadata, images & PDFs &nbsp;|&nbsp; **Export** to JSONL, CSV or Markdown

</div>

---

> **Personal Use & Legal Notice**
>
> CrawlVox is intended for **personal and educational use only**. Always respect the laws of your jurisdiction, website terms of service, and robots.txt directives before crawling any website. The authors are not responsible for misuse. By using this software, you agree to comply with all applicable local, national, and international laws, including but not limited to data protection regulations (GDPR, CCPA), computer fraud laws, and intellectual property rights. **Do not use this tool to scrape websites without authorization.**

---

## Why CrawlVox?

Most web scrapers give you raw HTML and leave you to figure out the content. CrawlVox gives you **clean, readable text** with full metadata — ready to use.

```bash
crawlvox crawl https://example.com --max-pages 50 --export-jsonl results.jsonl
```

That's it. 50 pages crawled, content extracted, metadata captured, exported to a file.

### What Makes It Different

- **Dual extraction engine** — trafilatura for speed, readability-lxml as fallback. You get content, not boilerplate.
- **Smart JS rendering** — Only fires up a browser when static HTML doesn't have enough content. Saves time without missing SPAs.
- **PDF pipeline** — Finds PDFs during crawl, extracts text with OCR support. No separate tool needed.
- **Image deduplication** — Downloads images with two-tier dedup (URL + content hash). No duplicates across runs.
- **Resumable crawls** — Ctrl+C anytime. Resume later with `--resume`. State is saved automatically.
- **Ethical by default** — Respects robots.txt, rate-limits per domain, backs off on 429/503 errors.
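
The two-tier image dedup mentioned above boils down to a cheap URL check before fetching and a content-hash check after. A minimal sketch of that pattern (class and method names are hypothetical, not CrawlVox's actual internals; the `images` table does store a SHA256 hash):

```python
import hashlib


class ImageDeduper:
    """Skip an image if its URL was already seen, or if its bytes match a prior download."""

    def __init__(self) -> None:
        self.seen_urls: set[str] = set()
        self.seen_hashes: set[str] = set()

    def should_download(self, url: str) -> bool:
        # Tier 1: URL check avoids fetching anything at all.
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True

    def is_duplicate_content(self, data: bytes) -> bool:
        # Tier 2: content hash catches the same file served from different URLs.
        digest = hashlib.sha256(data).hexdigest()
        if digest in self.seen_hashes:
            return True
        self.seen_hashes.add(digest)
        return False
```

The URL tier saves bandwidth; the hash tier catches CDN mirrors and resized-path duplicates that the URL tier cannot.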

---

## Features

<table>
<tr>
<td width="50%">

### Crawling
- BFS traversal with depth control (1-20 levels)
- 1-100 concurrent workers
- Same-domain or cross-domain scope
- URL allowlist/denylist via regex
- Infinite crawl-trap avoidance (calendars, faceted search)
- sitemap.xml seeding for efficient discovery

</td>
<td width="50%">

### Content Extraction
- Clean article text via trafilatura + readability
- Title, description, canonical URL
- OpenGraph & Twitter Card metadata
- Language detection
- Link extraction with anchor text & classification
- Image extraction with alt text & dimensions

</td>
</tr>
<tr>
<td>

### JavaScript & Documents
- Playwright-based rendering (off / auto / always)
- Resource blocking for a 60-80% rendering speedup
- PDF text extraction via Docling
- OCR support (auto-detect or force)
- Multi-language OCR (en, fr, de, es, and more)

</td>
<td>

### Storage & Export
- SQLite with WAL mode for async safety
- Normalized URL deduplication
- Export to JSONL, CSV, or Markdown
- Run history with status tracking
- Filter exports by run, URL pattern, or status

</td>
</tr>
<tr>
<td>

### Reliability
- Resumable crawls with database-backed state
- Per-domain rate limiting (configurable)
- Exponential backoff on 429/503
- Retry logic with tenacity (3 attempts)
- Graceful shutdown — completes in-flight pages

</td>
<td>

### Ethics & Safety
- robots.txt respected by default
- Crawl-delay auto-adjustment
- Rate limiting prevents server overload
- Configurable User-Agent header
- Same-domain scope by default

</td>
</tr>
</table>

---

## Installation

**Requirements:** Python 3.11+

```bash
# Clone the repository
git clone https://github.com/your-username/crawlvox.git
cd crawlvox

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
source .venv/Scripts/activate    # Windows (Git Bash)
.venv\Scripts\activate           # Windows (CMD)

# Install CrawlVox
pip install -e .

# Install Playwright browser (for JavaScript rendering)
playwright install chromium
```

---

## Quick Start

```bash
# Crawl a website (100 pages by default)
crawlvox crawl https://example.com

# Crawl with limits and auto-export
crawlvox crawl https://example.com --max-pages 50 --max-depth 2 --export-jsonl results.jsonl

# Enable JavaScript rendering for SPAs
crawlvox crawl https://spa-site.com --dynamic always

# Download images too
crawlvox crawl https://example.com --download-images

# Process PDFs found during crawl
crawlvox crawl https://example.com --process-documents

# Resume an interrupted crawl
crawlvox crawl https://example.com --resume

# Check crawl history
crawlvox status

# Export data from database
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl
```

---

## Commands

### `crawlvox crawl`

Crawl one or more websites and extract content.

```bash
crawlvox crawl [OPTIONS] URLS...
```

#### Crawling Options

| Option | Default | Description |
|--------|---------|-------------|
| `-w, --workers` | 10 | Concurrent workers (1-100) |
| `-p, --max-pages` | 100 | Maximum pages to crawl |
| `--max-depth` | 3 | Maximum link depth (1-20) |
| `-t, --timeout` | 30.0 | Request timeout in seconds |
| `-r, --rate-limit` | 2.0 | Max requests/second per domain |
| `--no-robots` | off | Disable robots.txt respect |
| `--same-domain / --cross-domain` | same-domain | Domain scope control |
| `-i, --include` | none | Regex allowlist for URLs (repeatable) |
| `-e, --exclude` | none | Regex denylist for URLs (repeatable) |
| `--user-agent` | CrawlVox/0.1 | HTTP User-Agent header |
| `--cookie-file` | none | Path to cookie file (LWP format) |

#### JavaScript Rendering

| Option | Default | Description |
|--------|---------|-------------|
| `--dynamic` | auto | `off` = static only, `auto` = fallback on low content, `always` = render all |
| `--min-content-length` | 200 | Character threshold before triggering JS fallback |

```bash
# Force JS rendering on all pages
crawlvox crawl https://spa-site.com --dynamic always

# Static-only (fastest)
crawlvox crawl https://static-site.com --dynamic off
```
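
In `auto` mode the fallback decision reduces to a length check on the statically extracted text. A rough sketch of that logic (illustrative only; the real heuristic may weigh more signals than raw character count):

```python
def needs_js_rendering(extracted_text: str, min_content_length: int = 200) -> bool:
    """Fall back to a headless browser when static extraction looks too thin.

    Mirrors the --min-content-length threshold (default 200 characters).
    """
    return len(extracted_text.strip()) < min_content_length
```

Raising `--min-content-length` makes the crawler more eager to render; lowering it keeps more pages on the fast static path.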

#### Image Downloading

| Option | Default | Description |
|--------|---------|-------------|
| `--download-images` | off | Enable binary image downloading |
| `--image-dir` | images | Directory for downloaded images |
| `--image-scope` | same-domain | `same-domain` or `all` (includes CDNs) |
| `--max-image-size` | 10mb | Max image file size |

```bash
# Download images from any source
crawlvox crawl https://example.com --download-images --image-scope all
```

#### PDF/Document Processing

| Option | Default | Description |
|--------|---------|-------------|
| `--process-documents` | off | Enable PDF text extraction via Docling |
| `--ocr-mode` | auto | `off`, `auto`, or `always` |
| `--ocr-language` | en | OCR language code |
| `--max-document-size` | 50mb | Max PDF file size |
| `--max-document-pages` | 500 | Max PDF page count |

```bash
# Process PDFs with forced OCR
crawlvox crawl https://example.com --process-documents --ocr-mode always
```

#### Output & Resume

| Option | Default | Description |
|--------|---------|-------------|
| `--store-html` | off | Store raw HTML in database |
| `--export-jsonl` | none | Auto-export to JSONL after crawl |
| `-q, --quiet` | off | Suppress progress output |
| `-l, --log-level` | INFO | DEBUG, INFO, WARNING, ERROR |
| `--resume` | off | Resume an interrupted crawl |
| `--recrawl` | off | Re-process pages on resume |

```bash
# Start a large crawl (Ctrl+C to interrupt safely)
crawlvox crawl https://large-site.com --max-pages 1000

# Resume where you left off
crawlvox crawl https://large-site.com --max-pages 1000 --resume
```

---

### `crawlvox export`

Export crawl data to a file.

```bash
crawlvox export [OPTIONS]
```

**Formats:**

```bash
# JSONL (one JSON object per line)
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl

# CSV (flat table)
crawlvox export -d crawlvox.db -f csv -o output.csv

# Markdown (one .md file per page)
crawlvox export -d crawlvox.db -f markdown -o ./pages/
```

**Filtering:**

```bash
# By run ID
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --run-id abc123

# By URL pattern
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --url-pattern "%/blog/%"

# By status
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status ok
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --status error

# Include raw HTML
crawlvox export -d crawlvox.db -f jsonl -o output.jsonl --include-html
```

---

### `crawlvox status`

Show crawl run history and storage statistics.

```bash
crawlvox status
crawlvox status -d myproject.db
crawlvox status --limit 20
```

---

### `crawlvox purge`

Delete old crawl runs and their associated data.

```bash
# Purge a specific run
crawlvox purge --run abc123def456

# Purge runs older than 7 days
crawlvox purge --older-than 7d

# Purge runs older than 2 weeks
crawlvox purge --older-than 2w
```

Active (running) crawls are never purged.

---

## Output Format

### JSONL

Each line is a self-contained JSON object:

```json
{
  "type": "page",
  "url": "https://example.com/about",
  "final_url": "https://example.com/about",
  "title": "About Us",
  "description": "Learn more about our company",
  "text": "Extracted main content text...",
  "fetched_at": "2025-01-15T10:30:00+00:00",
  "status_code": 200,
  "fetch_method": "static",
  "extraction_method": "trafilatura",
  "canonical_url": "https://example.com/about",
  "og_title": "About Us | Example",
  "og_description": "Learn more about our company",
  "og_image": "https://example.com/og-about.jpg",
  "language": "en",
  "content_type": "text/html",
  "error": null,
  "images": [
    {
      "src": "https://example.com/team.jpg",
      "alt": "Our team",
      "local_path": null
    }
  ]
}
```
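
Because each line is an independent JSON object, exports stream with just the standard library. A sketch of a consumer (not part of CrawlVox; the field names follow the example record above):

```python
import json
from pathlib import Path


def iter_pages(path: str):
    """Yield one record per line, skipping blanks, without loading the whole file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Tiny demo file standing in for a real export.
sample = Path("sample.jsonl")
sample.write_text(
    '{"type": "page", "url": "https://example.com/", "title": "Home", "status_code": 200}\n'
)

# Collect titles of successfully fetched pages.
titles = [r["title"] for r in iter_pages("sample.jsonl") if r.get("status_code") == 200]
```

Streaming line by line keeps memory flat even for multi-gigabyte exports.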

### CSV

Flat table with columns: `url`, `final_url`, `title`, `description`, `text`, `status_code`, `content_type`, `fetch_method`, `extraction_method`, `language`, `fetched_at`, `error`.

### Markdown

One `.md` file per page with metadata header and extracted text body.

---

## URL Filtering

Control which URLs get crawled with regex patterns:

```bash
# Only crawl blog pages
crawlvox crawl https://example.com -i "/blog/"

# Skip admin and login pages
crawlvox crawl https://example.com -e "/admin/" -e "/login/"

# Combine: product pages only, skip archived
crawlvox crawl https://example.com -i "/products/" -e "/archived/"
```

Deny patterns (`-e`) always take priority over allow patterns (`-i`).
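
That precedence rule can be expressed as a short predicate. This is a sketch of the semantics described above, not CrawlVox's code (in particular, treating an empty allowlist as "allow everything" is an assumption):

```python
import re


def url_allowed(url: str, include: list[str], exclude: list[str]) -> bool:
    """Deny patterns win; if an allowlist exists, the URL must match at least one entry."""
    if any(re.search(p, url) for p in exclude):
        return False
    if include and not any(re.search(p, url) for p in include):
        return False
    return True
```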

---

## Database

All crawl data is stored in a local SQLite database (default: `crawlvox.db`) using WAL mode for safe concurrent access.

**Tables:**

| Table | Purpose |
|-------|---------|
| `pages` | URL, status code, extracted text, metadata, timestamps |
| `links` | Source page, target URL, anchor text, internal/external classification |
| `images` | Source page, image URL, alt text, local file path, SHA256 hash |
| `documents` | PDF/document processing results |
| `runs` | Crawl run metadata, status, configuration |
| `run_pages` | Maps pages to the runs that crawled them |

Query directly with any SQLite client:

```bash
sqlite3 crawlvox.db "SELECT url, title FROM pages LIMIT 10"
```
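
For programmatic analysis, the same file opens with Python's built-in `sqlite3` module. The snippet below uses an in-memory stand-in with a guessed slice of the schema (the table names come from the list above, but the exact column names are assumptions):

```python
import sqlite3

# In-memory stand-in for crawlvox.db; column names are illustrative guesses.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, url TEXT, title TEXT);
    CREATE TABLE links (source_page_id INTEGER, target_url TEXT, anchor_text TEXT);
    INSERT INTO pages VALUES (1, 'https://example.com/', 'Home');
    INSERT INTO links VALUES (1, 'https://example.com/about', 'About'),
                             (1, 'https://example.com/blog', 'Blog');
""")

# Outbound links per page -- the kind of join the pages/links split supports.
rows = conn.execute("""
    SELECT p.url, COUNT(l.target_url) AS outbound
    FROM pages p LEFT JOIN links l ON l.source_page_id = p.id
    GROUP BY p.id
""").fetchall()
```

Inspect the real column names first with `sqlite3 crawlvox.db ".schema pages"` before writing queries against an actual crawl database.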

---

## Architecture

```
URL Input
    |
    v
[Scope Checker] --> robots.txt / sitemap.xml
    |
    v
[Worker Pool] ---> [Fetcher] ---> httpx (static)
    |                  |
    |                  +---------> [Playwright] (dynamic fallback)
    |
    v
[Content Extractor] --> trafilatura (primary)
    |                       |
    |                       +--> readability-lxml (fallback)
    |
    +---> [Metadata Extractor] --> OG, Twitter, canonical, lang
    +---> [Link Extractor] ------> internal/external classification
    +---> [Image Extractor] -----> optional binary download + dedup
    +---> [Document Processor] --> PDF text extraction + OCR
    |
    v
[SQLite Storage] --> WAL mode, normalized URLs, run tracking
    |
    v
[Export] --> JSONL / CSV / Markdown
```

---

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run with verbose output
pytest -v

# Run a specific test file
pytest tests/test_fetcher.py

# Type checking
mypy src/crawlvox/
```

---

## Responsible Use

CrawlVox is designed with ethical crawling in mind:

- **robots.txt is respected by default** — disable only when you have explicit permission
- **Rate limiting prevents server overload** — default 2 req/sec per domain
- **Same-domain scope** prevents unintended cross-site crawling
- **Crawl-delay headers** are automatically honored

**Please use this tool responsibly:**

1. Only crawl websites you have permission to access
2. Respect `robots.txt` directives and website terms of service
3. Use appropriate rate limits to avoid overloading servers
4. Comply with all applicable laws in your jurisdiction, including GDPR, CCPA, CFAA, and local equivalents
5. Do not use extracted content in ways that violate copyright or intellectual property rights
6. When in doubt, ask the website owner for permission

**This tool is provided for personal, educational, and legitimate research purposes.** The authors assume no liability for misuse.

---

## License

[MIT](LICENSE)

---

<div align="center">

Built with Python, asyncio, and respect for the web.

</div>
