Metadata-Version: 2.4
Name: scrapemcp
Version: 1.0.0
Summary: Web scraping MCP server with SSRF protection
Project-URL: Homepage, https://github.com/Medalcode/ScrapeMCP
Project-URL: Repository, https://github.com/Medalcode/ScrapeMCP
Project-URL: Bug Tracker, https://github.com/Medalcode/ScrapeMCP/issues
Author-email: Jonatthan Medalla <152304407+Medalcode@users.noreply.github.com>
License: MIT
License-File: LICENSE
Keywords: beautifulsoup,crawling,data-extraction,mcp,model-context-protocol,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4
Requires-Dist: html5lib
Requires-Dist: httpx
Requires-Dist: mcp
Requires-Dist: python-dotenv
Description-Content-Type: text/markdown

# ScrapeMCP — Web Scraping MCP Server

[![CI](https://github.com/Medalcode/ScrapeMCP/actions/workflows/test-and-lint.yml/badge.svg)](https://github.com/Medalcode/ScrapeMCP/actions/workflows/test-and-lint.yml)
[![PyPI](https://img.shields.io/pypi/v/scrapemcp.svg)](https://pypi.org/project/scrapemcp/)
[![Python](https://img.shields.io/badge/python-3.11%2B-blue.svg)]()
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

MCP server for structured web data extraction. Scrapes pages, tables, lists, sitemaps, and more. Includes built-in SSRF protection.

## Features

| Tool | Description |
|---|---|
| `scrape` | Extracts content from a URL using custom CSS selectors |
| `inspect` | Analyzes a page's structure (meta tags, headings, links, images, forms, scripts) |
| `tables` | Extracts all HTML tables from a page |
| `scrape_list` | Extracts a list of items with custom fields defined by CSS selectors |
| `scrape_recursive` | Follows linked pages recursively, extracting data from each |
| `sitemap` | Parses a website's sitemap.xml |
| `scrape_sitemap` | Scrapes all URLs listed in a sitemap |
| `export` | Exports data to CSV, Markdown, or JSON |

## SSRF Protection

The server automatically blocks access to:

- Private IPs (`10.0.0.0/8`, `172.16.0.0/12`, `192.168.0.0/16`)
- Localhost/loopback (`127.0.0.0/8`, `::1`)
- Link-local (`169.254.0.0/16`, `fe80::/10`)
- Blocked hostnames: `localhost`, `metadata.google.internal`, `169.254.169.254`
- `.internal` and `.local` domains

Only the `http://` and `https://` schemes are allowed.
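
The rules above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: `is_url_allowed` and the constant names are hypothetical, and the sketch only performs static checks on the URL. It does not resolve hostnames via DNS, which a complete check would also need to do.

```python
import ipaddress
from urllib.parse import urlsplit

# Ranges, hostnames, and suffixes copied from the list above.
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
    )
]
BLOCKED_HOSTS = {"localhost", "metadata.google.internal", "169.254.169.254"}
BLOCKED_SUFFIXES = (".internal", ".local")


def is_url_allowed(url: str) -> bool:
    """Hypothetical helper: True if the URL passes the static checks."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower().rstrip(".")
    if not host or host in BLOCKED_HOSTS or host.endswith(BLOCKED_SUFFIXES):
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. A complete check would resolve the name
        # and run the resulting addresses through the same test.
        return True
    return not any(ip in network for network in BLOCKED_NETWORKS)
```

For example, `is_url_allowed("http://10.1.2.3/")` is rejected by the private-range check, while `is_url_allowed("ftp://example.com/")` is rejected by the scheme check.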

## Tech Stack

- **Python** — `>=3.11`
- **Framework**: `mcp` (FastMCP) via stdio JSON-RPC
- **HTTP**: `httpx` (async) — replaced sync `requests` for non-blocking I/O
- **Parsing**: `beautifulsoup4` + `html5lib`
- **Export**: CSV (sanitized), Markdown, JSON
- **Config**: `python-dotenv` — reads `.env` file for all settings
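
The list above does not spell out what "sanitized" means for CSV export. One common interpretation, assumed here, is guarding against spreadsheet formula injection by prefixing risky cells with a single quote. The sketch below illustrates that idea with hypothetical helpers (`sanitize_cell`, `rows_to_csv`); the project's exporter may work differently.

```python
import csv
import io

# Leading characters that spreadsheet applications may interpret as a formula.
FORMULA_PREFIXES = ("=", "+", "-", "@")


def sanitize_cell(value: object) -> str:
    """Hypothetical helper: neutralize cells that look like formulas."""
    text = "" if value is None else str(value)
    if text.startswith(FORMULA_PREFIXES):
        return "'" + text  # the quote makes the cell plain text
    return text


def rows_to_csv(rows: list[dict]) -> str:
    """Hypothetical helper: serialize a list of dicts to CSV text."""
    if not rows:
        return ""
    fieldnames = list(rows[0].keys())  # columns come from the first row
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({key: sanitize_cell(row.get(key)) for key in fieldnames})
    return buffer.getvalue()
```

One trade-off of this approach is that legitimate negative numbers such as `-5` are also prefixed, so a stricter implementation might exempt values that parse as numbers.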

## Quick Start

```bash
# Install dependencies
pip install mcp httpx beautifulsoup4 html5lib python-dotenv

# Run the server
python server.py
```

### Examples

```python
# These calls assume `session` is an already-initialized MCP client
# session connected to the server, and that they run inside an async function.

# Scrape a full page
result = await session.call_tool("scrape", {"url": "https://example.com"})

# Scrape with custom selectors
result = await session.call_tool("scrape", {
    "url": "https://example.com",
    "selectors": {"title": "h1", "price": ".price"}
})

# Inspect page structure
result = await session.call_tool("inspect", {"url": "https://example.com"})

# Extract tables
result = await session.call_tool("tables", {
    "url": "https://example.com",
    "selector": "table"
})

# Scrape a list
result = await session.call_tool("scrape_list", {
    "url": "https://example.com/items",
    "item_selector": ".item",
    "fields": {"name": "h2", "price": ".price"}
})

# Scrape recursively
result = await session.call_tool("scrape_recursive", {
    "start_url": "https://example.com/blog",
    "link_selector": "a.post-link",
    "item_selector": "article",
    "fields": {"title": "h1", "content": "p"},
    "max_pages": 10
})

# Export to CSV (note that `data` is passed as a JSON string)
result = await session.call_tool("export", {
    "data": '[{"name": "Alice", "age": 30}]',
    "format": "csv"
})
```

## Project Structure

```
scrapemcp/
├── server.py              # MCP server entry point (tools)
├── scrapers/
│   ├── __init__.py
│   ├── base.py            # BaseScraper, ScrapeResult, SSRF validation
│   ├── page.py            # PageScraper (scrape, inspect)
│   ├── table.py           # TableScraper (tables)
│   ├── list_scraper.py    # ListScraper (scrape_list, scrape_recursive)
│   └── sitemap.py         # SitemapScraper (sitemap, scrape_sitemap)
├── exporters.py           # CSV, Markdown, JSON export
├── client.py              # Test client CLI
└── pyproject.toml
```

## 🔧 Recent Improvements

- **SSRF Bypass Fixed** — DNS resolution added: URL-encoded private IPs are now properly blocked
- **Async HTTP** — Replaced sync `requests` with `httpx.AsyncClient` (non-blocking, async-compatible)
- **State Race Fixed** — Removed shared mutable `_last_url`/`_last_soup` (no more cross-call contamination)
- **`.env` Support** — Reads `HTTP_TIMEOUT`, `SITEMAP_URL_LIMIT`, `MAX_RECURSIVE_PAGES` from environment
- **Sitemap Discovery** — Tries `robots.txt` for `Sitemap:` before falling back to `/sitemap.xml`
- **Recursive Timeout** — 120s wall-clock timeout prevents runaway crawling
- **Configurable Rate Limit** — Recursive crawl delay via `RECURSIVE_DELAY` env var
- **Export Size Limit** — Rejects data > 10MB before parsing
