Metadata-Version: 2.4
Name: spidur
Version: 0.3.0
Summary: 🕷️ A lightweight, generic parallel runner for custom scrapers
License: MIT
License-File: LICENSE
Keywords: scraping,parallel,async,framework,runner
Author: ra0x3
Author-email: ractz@pm.me
Requires-Python: >=3.12,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Project-URL: Homepage, https://github.com/ra0x3/spidur
Project-URL: Repository, https://github.com/ra0x3/spidur
Description-Content-Type: text/markdown

# spidur 🕷️

[![PyPI version](https://img.shields.io/pypi/v/spidur.svg)](https://pypi.org/project/spidur/)
[![License](https://img.shields.io/github/license/ra0x3/spidur)](LICENSE)
[![Tests](https://github.com/ra0x3/spidur/actions/workflows/ci.yaml/badge.svg)](https://github.com/ra0x3/spidur/actions)

**spidur** is a lightweight, hackable framework for running **multiple custom scrapers in parallel** — even on the same domain.  
It coordinates your scrapers, filters out invalid URLs before any scraping work is done, and collects every result in one place.

---

## ✨ Core ideas

- **Multiple scrapers per domain** — handle different content types (articles, images, comments, etc.) simultaneously.
- **Parallel execution** — utilizes all CPU cores.
- **Async + multiprocessing safe** — works across async methods and process pools.
- **No opinions** — you control discovery, validation, and scraping logic.
- **Results collected automatically** — each scraper contributes to a single aggregated result set.

---

## 📦 Install

```bash
pip install spidur
```

or with Poetry:

```bash
poetry add spidur
```

---

## ⚡ Example

```python
from typing import Any

from spidur import Runner, Scraper, ScraperFactory, ScrapeResult, Target


# --- define your scrapers ---
# Implement three small hooks. The discover -> validate -> scrape loop is
# provided for you by Scraper.fetch_round(), so you never write it yourself.

class ArticleScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/articles/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/articles/1",
            "https://example.com/articles/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "article", "url": url, "data": f"Content of {url}"}


class CommentScraper(Scraper):
    def is_valid_url(self, url: str) -> bool:
        return url.startswith("https://example.com/comments/")

    async def discover_urls(self, page: Any, known: set[str]) -> list[str]:
        return [
            "https://example.com/comments/1",
            "https://example.com/comments/2",
        ]

    async def scrape_page(self, page: Any, url: str) -> ScrapeResult | None:
        return {"type": "comment", "url": url, "data": f"Comments from {url}"}


# --- register both scrapers ---

ScraperFactory.register("articles", ArticleScraper)
ScraperFactory.register("comments", CommentScraper)


# --- define your scrape targets ---

targets = [
    Target(name="articles", start_url="https://example.com/articles"),
    Target(name="comments", start_url="https://example.com/comments"),
]


# --- run them all in parallel ---
# Put the code above in a module (e.g. `my_scrapers.py`) and pass its name as
# `bootstrap` so each worker process re-registers the scrapers. This is
# required wherever multiprocessing does not use the `fork` start method
# (macOS and Windows use `spawn` by default), because those workers do not
# inherit the parent's registrations.

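# Guard the entry point: under `spawn`, each worker re-imports the main
# module, and an unguarded call would try to start the run again.
if __name__ == "__main__":
    results = Runner.run(targets, bootstrap=["my_scrapers"])

    for name, items in results.items():
        print(f"Results from {name}:")
        for item in items:
            print("  →", item)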
```

---

## 🖥️ Command line

`spidur` ships a small CLI. Point it at the module(s) that register your
scrapers and pass one or more `name=start_url` targets:

```bash
spidur --module my_scrapers articles=https://example.com/articles
```

Results are written to stdout as JSON. Use `-v` for info-level logging.
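
Start URLs can themselves contain `=` (for example in a query string), so a `name=start_url` argument is safest to split on the first `=` only. The helper below is a hypothetical sketch of that parsing rule, not the CLI's actual code; `parse_target` is an assumed name used only for illustration.

```python
def parse_target(arg: str) -> tuple[str, str]:
    """Split a 'name=start_url' argument on the first '=' only."""
    name, sep, start_url = arg.partition("=")
    if not sep or not name or not start_url:
        raise ValueError(f"expected name=start_url, got {arg!r}")
    return name, start_url


pairs = [
    parse_target(arg)
    for arg in [
        "articles=https://example.com/articles",
        # The '=' inside the query string stays part of the URL.
        "search=https://example.com/search?q=spiders",
    ]
]
```

Each pair could then feed a `Target(name=..., start_url=...)` like the ones in the example above.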

---

## 🧠 How it works

1. Each `Scraper` subclass defines three single-purpose hooks:
    - `is_valid_url(url)` — a pure predicate; keeps invalid URLs out of scope.
    - `discover_urls(page, known)` — finds new pages to scrape.
    - `scrape_page(page, url)` — extracts structured data.

   The discover → validate → scrape orchestration is provided once by the base
   class as `fetch_round(known)`, so subclasses never re-implement it.

2. You register scrapers in `ScraperFactory`.

3. The `Runner`:
    - Spawns multiple processes.
    - Executes all scrapers concurrently.
    - Aggregates their results into a single dictionary keyed by scraper name.

---

## 🧪 Development

Install the project with its dev tooling and run the full quality gate:

```bash
poetry install
poetry run ruff check .     # lint
poetry run black --check .  # formatting
poetry run mypy             # static types (strict)
poetry run pytest           # tests
```

`spidur` is fully type-annotated and ships a `py.typed` marker, so downstream
projects get type checking for free.

---

## 🧩 Why “spidur”?

Because it crawls the web — but cleanly, predictably, and in parallel. 🕸️

