Metadata-Version: 2.4
Name: scrapy-mcp
Version: 0.1.0
Summary: Headless web-scraping MCP server built on Scrapy: fetch, extract (CSS/XPath), links, tables, sitemaps, robots, and async crawls.
Project-URL: Homepage, https://github.com/eitan3/Scrapy_MCP_Scraper
Project-URL: Repository, https://github.com/eitan3/Scrapy_MCP_Scraper
Author: Eitan Hadar
License: MIT
License-File: LICENSE
Keywords: crawler,mcp,model-context-protocol,scraper,scrapy,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Python: >=3.10
Requires-Dist: markdownify>=0.13
Requires-Dist: mcp>=1.27
Requires-Dist: parsel>=1.9
Requires-Dist: protego>=0.3
Requires-Dist: scrapy>=2.16
Requires-Dist: w3lib>=2.1
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == 'dev'
Description-Content-Type: text/markdown

# scrapy-mcp

A headless web-scraping [MCP](https://modelcontextprotocol.io) server built on
[Scrapy](https://scrapy.org). It exposes Scrapy's scraping primitives — polite fetching,
CSS/XPath extraction, link and table extraction, sitemap and robots.txt reading, and bounded
asynchronous crawls — as MCP tools an agent can call over stdio.

- **Headless, no rendering.** Pages are fetched and parsed as HTML; no browser, no JavaScript
  execution. This keeps the footprint tiny — it runs comfortably on weak machines.
- **Reactor-safe.** Every operation runs in a short-lived Scrapy subprocess, so Twisted's
  reactor never lives inside the asyncio MCP server (no `ReactorNotRestartable`), and memory
  is reclaimed after each call.
- **Polite by default.** Obeys `robots.txt`, throttles with AutoThrottle, and enforces hard
  page/depth caps so a crawl can't run away.

## Install / run

Run straight from PyPI with [uv](https://docs.astral.sh/uv/) — no install step:

```bash
uvx scrapy-mcp
```

Or install it:

```bash
uv pip install scrapy-mcp
scrapy-mcp
```

The server speaks MCP over **stdio**. Point any MCP client at it. For Claude Desktop, add to
`claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "scrapy": {
      "command": "uvx",
      "args": ["scrapy-mcp"]
    }
  }
}
```

## Tools

| Tool | What it does |
|------|--------------|
| `fetch_page(url, format, max_bytes, obey_robots)` | Fetch one page as `markdown` (default), `text`, or `html`. |
| `extract(url, selectors, obey_robots)` | Pull structured fields with CSS/XPath selectors. |
| `extract_tables(url, max_tables, obey_robots)` | Extract every HTML `<table>` as `{headers, rows}`. |
| `extract_links(url, same_domain, pattern, limit, obey_robots)` | List de-duplicated links on a page. |
| `get_sitemap(url, limit, obey_robots)` | Read a sitemap (gzip + sitemap-index aware). |
| `check_robots(url, user_agent)` | Is a URL crawlable? Returns the crawl-delay and sitemaps. |
| `start_crawl(start_url, allow_patterns, deny_patterns, max_pages, max_depth, same_domain, selectors, ...)` | Start a bounded BFS crawl; returns a `job_id`. |
| `crawl_status(job_id)` | State + pages scraped for a crawl. |
| `crawl_results(job_id, cursor, limit)` | Page through a crawl's scraped items. |
| `cancel_crawl(job_id)` | Stop a running crawl; keep results so far. |

### Selector format (`extract` / `start_crawl`)

`selectors` maps an output field to a selector. Each value is either a **CSS string** (first
match) or an **object** for more control:

```json
{
  "title": "h1::text",
  "price": "span.price::text",
  "all_links": {"css": "a::attr(href)", "all": true},
  "first_heading": {"xpath": "//h1/text()"}
}
```

`"all": true` returns every match as a list; otherwise the first match is returned.

### Crawls are asynchronous

`start_crawl` returns immediately with a `job_id`. The crawl runs as a detached worker that
streams results to disk, so it survives a server restart. Poll `crawl_status(job_id)`, then
read items with `crawl_results(job_id)` (safe to call mid-crawl for partial results). Jobs are
stored under the system temp dir and reclaimed after 7 days (configurable).

## Configuration

All settings are optional environment variables (sensible, polite defaults tuned for a weak
host). They're how you tune a `uvx scrapy-mcp` deployment.

| Variable | Default | Meaning |
|----------|---------|---------|
| `SCRAPY_MCP_USER_AGENT` | `scrapy-mcp/<version> …` | User-Agent header. |
| `SCRAPY_MCP_OBEY_ROBOTS` | `true` | Obey `robots.txt`. |
| `SCRAPY_MCP_DOWNLOAD_DELAY` | `0.5` | Seconds between requests to a host. |
| `SCRAPY_MCP_CONCURRENT_REQUESTS` | `8` | Global concurrency. |
| `SCRAPY_MCP_CONCURRENT_REQUESTS_PER_DOMAIN` | `4` | Per-host concurrency. |
| `SCRAPY_MCP_DOWNLOAD_TIMEOUT` | `30` | Per-request timeout (s). |
| `SCRAPY_MCP_RETRY_TIMES` | `2` | Retries on transient failures. |
| `SCRAPY_MCP_AUTOTHROTTLE` | `true` | Adapt delay to server latency. |
| `SCRAPY_MCP_MAX_BYTES` | `50000` | Max characters returned per page (then truncated). |
| `SCRAPY_MCP_REQUEST_TIMEOUT` | `60` | Wall-clock cap for a blocking single fetch (s). |
| `SCRAPY_MCP_DEFAULT_MAX_PAGES` / `_MAX_PAGES_CAP` | `50` / `1000` | Crawl page default / hard cap. |
| `SCRAPY_MCP_DEFAULT_MAX_DEPTH` / `_MAX_DEPTH_CAP` | `2` / `10` | Crawl depth default / hard cap. |
| `SCRAPY_MCP_JOB_DIR` | `<tmp>/scrapy_mcp_jobs` | Where crawl jobs are stored. |
| `SCRAPY_MCP_JOB_TTL_DAYS` | `7` | Delete crawl jobs older than this (0 disables). |
| `SCRAPY_MCP_LOG_LEVEL` | `ERROR` | Scrapy log level (to stderr). |

## Development

```bash
uv venv
uv pip install -e ".[dev]"
uv run pytest          # unit tests (no network)
uv build               # build wheel + sdist into dist/
```

## License

MIT © Eitan Hadar
