Metadata-Version: 2.1
Name: fetchkit
Version: 0.3.1
Summary: Advanced web fetching, scraping, and content acquisition toolkit with crawl-scrape-download pipeline
Author-Email: Will <you@example.com>
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.13
Requires-Dist: pydantic>=2.0
Requires-Dist: httpx[http2]>=0.27
Requires-Dist: aiohttp>=3.9
Requires-Dist: tenacity>=8.0
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: click>=8.0
Requires-Dist: rich>=13.0
Requires-Dist: defusedxml>=0.7
Requires-Dist: pydantic-settings>=2.0
Provides-Extra: tui
Requires-Dist: textual>=0.40; extra == "tui"
Provides-Extra: metadata
Requires-Dist: extruct>=0.17; extra == "metadata"
Requires-Dist: w3lib>=2.0; extra == "metadata"
Provides-Extra: curl
Requires-Dist: curl_cffi>=0.7; extra == "curl"
Provides-Extra: cloudscraper
Requires-Dist: cloudscraper>=1.2; extra == "cloudscraper"
Provides-Extra: db
Requires-Dist: sqlalchemy[asyncio]>=2.0; extra == "db"
Requires-Dist: asyncpg>=0.31; extra == "db"
Requires-Dist: alembic>=1.14; extra == "db"
Requires-Dist: pydantic-settings>=2.0; extra == "db"
Provides-Extra: store
Requires-Dist: aioboto3>=13.0; extra == "store"
Requires-Dist: minio>=7.2; extra == "store"
Provides-Extra: pipeline
Requires-Dist: fetchkit[db,store]; extra == "pipeline"
Provides-Extra: downloaders
Requires-Dist: yt-dlp>=2024.0; extra == "downloaders"
Requires-Dist: gallery-dl>=1.27; extra == "downloaders"
Provides-Extra: extractors
Requires-Dist: trafilatura>=2.0; extra == "extractors"
Requires-Dist: readability-lxml>=0.8; extra == "extractors"
Requires-Dist: html2text>=2025.0; extra == "extractors"
Requires-Dist: markdownify>=1.0; extra == "extractors"
Requires-Dist: newspaper3k>=0.2; extra == "extractors"
Provides-Extra: media
Requires-Dist: mutagen>=1.47; extra == "media"
Requires-Dist: pymediainfo>=7.0; extra == "media"
Requires-Dist: exifread>=3.5; extra == "media"
Requires-Dist: pypdf>=6.0; extra == "media"
Requires-Dist: Pillow>=12.0; extra == "media"
Requires-Dist: ffmpeg-python>=0.2; extra == "media"
Provides-Extra: browser
Requires-Dist: playwright>=1.50; extra == "browser"
Requires-Dist: playwright-stealth>=2.0; extra == "browser"
Provides-Extra: observe
Requires-Dist: structlog>=25.0; extra == "observe"
Requires-Dist: opentelemetry-api>=1.30; extra == "observe"
Requires-Dist: opentelemetry-sdk>=1.30; extra == "observe"
Provides-Extra: feeds
Requires-Dist: feedparser>=6.0; extra == "feeds"
Requires-Dist: dateparser>=1.2; extra == "feeds"
Provides-Extra: text
Requires-Dist: ftfy>=6.0; extra == "text"
Requires-Dist: anyascii>=0.3; extra == "text"
Requires-Dist: tldextract>=5.0; extra == "text"
Requires-Dist: python-slugify>=8.0; extra == "text"
Provides-Extra: mcp
Requires-Dist: fastmcp>=2.0; extra == "mcp"
Provides-Extra: langchain
Requires-Dist: langchain-mcp-adapters>=0.1; extra == "langchain"
Requires-Dist: langchain-core>=0.3; extra == "langchain"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0; extra == "docs"
Requires-Dist: furo>=2024.0; extra == "docs"
Requires-Dist: sphinx-autodoc-typehints>=2.0; extra == "docs"
Provides-Extra: full
Requires-Dist: fetchkit[browser,cloudscraper,curl,downloaders,extractors,feeds,mcp,media,metadata,observe,pipeline,text,tui]; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: respx>=0.21; extra == "dev"
Requires-Dist: aioresponses>=0.7; extra == "dev"
Requires-Dist: pydantic-settings>=2.0; extra == "dev"
Requires-Dist: ruff>=0.15; extra == "dev"
Description-Content-Type: text/markdown

<div align="center">

# fetchkit

**Agentic web infrastructure for autonomous fetching, scraping, and content acquisition.**

Give AI agents the power to fetch, scrape, extract, and download anything on the web -- with realistic browser fingerprints, structured outputs, and a full crawl-scrape-download pipeline backed by Postgres and MinIO.

[![PyPI](https://img.shields.io/pypi/v/fetchkit?style=flat-square&logo=pypi&logoColor=white&color=blue)](https://pypi.org/project/fetchkit/)
[![Python](https://img.shields.io/pypi/pyversions/fetchkit?style=flat-square&logo=python&logoColor=white)](https://pypi.org/project/fetchkit/)
[![Docs](https://img.shields.io/github/actions/workflow/status/pr1m8/pyfetcher/pages.yml?branch=main&style=flat-square&logo=github&label=docs)](https://pr1m8.github.io/pyfetcher/)
[![CI](https://img.shields.io/github/actions/workflow/status/pr1m8/pyfetcher/ci.yml?branch=main&style=flat-square&logo=github&label=CI)](https://github.com/pr1m8/pyfetcher/actions/workflows/ci.yml)
[![License](https://img.shields.io/github/license/pr1m8/pyfetcher?style=flat-square&color=green)](https://github.com/pr1m8/pyfetcher/blob/main/LICENSE)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json&style=flat-square)](https://github.com/astral-sh/ruff)
[![PDM](https://img.shields.io/badge/pdm-managed-blueviolet?style=flat-square)](https://pdm-project.org)
[![Tests](https://img.shields.io/badge/tests-488_passed-brightgreen?style=flat-square)](#development)
[![MCP](https://img.shields.io/badge/MCP-16_tools-orange?style=flat-square)](#mcp-server-ai-agent-integration)

---

[MCP Server](#mcp-server-ai-agent-integration) | [Quick Start](#quick-start) | [Pipeline](#pipeline) | [CLI](#cli) | [Documentation](https://pr1m8.github.io/pyfetcher/) | [Examples](examples/)

</div>

## Why fetchkit?

**The problem**: AI agents need to interact with the web -- fetch pages, extract data, download files -- but existing tools aren't designed for autonomous operation. They lack structured outputs, realistic browser fingerprints, and pipeline orchestration.

**fetchkit solves this** by providing:

1. **MCP Server** -- 16 tools that any AI agent (Claude, LangChain, LangGraph) can call directly. Structured Pydantic outputs, not raw HTML.
2. **Realistic browser identity** -- 11 profiles with consistent UA + Client Hints + Sec-Fetch-\* headers. TLS fingerprinting via curl_cffi. Cloudflare bypass.
3. **Full pipeline** -- Event-driven crawl -> scrape -> download backed by Postgres job queues and MinIO object storage.
4. **Deep downloader integration** -- yt-dlp and gallery-dl Python APIs with progress hooks and metadata extraction.

```bash
pip install 'fetchkit[mcp]'     # AI agent integration
pip install 'fetchkit[full]'    # Everything
```

## Highlights

```bash
pip install fetchkit                   # Core: fetch, scrape, headers
pip install 'fetchkit[pipeline]'       # + Postgres job queue + MinIO storage
pip install 'fetchkit[full]'           # Everything including yt-dlp, Playwright, etc.
```

<table>
<tr>
<td width="50%">

**Fetch with realistic browser headers**

```python
from pyfetcher import fetch

response = fetch("https://example.com")
# Sends Chrome-like headers with Client Hints,
# Sec-Fetch-*, UA rotation automatically
```

</td>
<td width="50%">

**Scrape anything**

```python
from pyfetcher.scrape import (
    extract_links, extract_text,
    extract_readable_text,
)

links = extract_links(html, base_url=url)
titles = extract_text(html, "h1")
article = extract_readable_text(html)
```

</td>
</tr>
<tr>
<td>

**4 HTTP backends**

```python
from pyfetcher import FetchRequest, fetch

# TLS fingerprinting (bypass bot detection)
fetch(FetchRequest(url=url, backend="curl_cffi"))

# Cloudflare bypass
fetch(FetchRequest(url=url, backend="cloudscraper"))
```

</td>
<td>

**Download media with yt-dlp**

```python
from pyfetcher.downloaders.ytdlp import YtdlpDownloader

dl = YtdlpDownloader()
info = await dl.extract_info(video_url)
results = await dl.download(video_url,
    output_dir="./media")
```

</td>
</tr>
</table>

## Features

### Core Library

| Feature             | Description                                                                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
| **Browser Headers** | 11 profiles (Chrome/Firefox/Safari/Edge) across 5 platforms. Consistent UA + Client Hints + Sec-Fetch-\*. Market-share-weighted rotation. |
| **4 Backends**      | `httpx` (default, HTTP/2), `aiohttp` (async), `curl_cffi` (TLS fingerprint), `cloudscraper` (CF bypass)                                   |
| **Rate Limiting**   | Per-domain + global token bucket with configurable burst                                                                                  |
| **Retry**           | Exponential backoff via Tenacity with configurable status codes                                                                           |
| **Scraping**        | CSS selectors, link harvesting, form parsing, table extraction                                                                            |
| **Metadata**        | HTML meta, Open Graph, JSON-LD, microdata, RDFa, Dublin Core                                                                              |
| **CLI**             | `pyfetcher fetch`, `scrape`, `headers`, `user-agent`, `robots`, `download`                                                                |
| **TUI**             | Interactive Textual terminal UI for building and inspecting requests                                                                      |

### Infrastructure (optional extras)

| Feature          | Extra           | Description                                                                 |
| ---------------- | --------------- | --------------------------------------------------------------------------- |
| **Pipeline**     | `[pipeline]`    | Event-driven Crawl -> Scrape -> Download via Postgres LISTEN/NOTIFY         |
| **Database**     | `[db]`          | SQLAlchemy 2.0 async + Alembic. Jobs, pages, media, hosts, feeds, URL dedup |
| **Object Store** | `[store]`       | MinIO/S3 via aioboto3. Upload, download, presigned URLs                     |
| **Downloaders**  | `[downloaders]` | yt-dlp (progress hooks, info_dict) + gallery-dl (170+ sites)                |
| **Extractors**   | `[extractors]`  | trafilatura + readability-lxml fallback, html2text, markdownify             |
| **Media**        | `[media]`       | Audio (mutagen), video (pymediainfo), image (exifread), PDF (pypdf)         |
| **Browser**      | `[browser]`     | Playwright + stealth for JS-heavy sites                                     |
| **Feeds**        | `[feeds]`       | RSS/Atom monitoring with adaptive polling                                   |
| **Crawler**      | `[pipeline]`    | URL frontier, spider + router, dedup, politeness, sitemap discovery         |

## Installation

```bash
pip install fetchkit
```

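The distribution is published as `fetchkit`; the examples in this README import the package as `pyfetcher` and invoke the `pyfetcher` CLI.

Optional extras: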

```bash
pip install 'fetchkit[tui]'            # Textual TUI
pip install 'fetchkit[curl]'           # curl_cffi TLS fingerprinting
pip install 'fetchkit[cloudscraper]'   # Cloudflare bypass
pip install 'fetchkit[db]'             # Postgres + SQLAlchemy + Alembic
pip install 'fetchkit[store]'          # MinIO/S3 object storage
pip install 'fetchkit[pipeline]'       # db + store (full pipeline)
pip install 'fetchkit[downloaders]'    # yt-dlp + gallery-dl
pip install 'fetchkit[extractors]'     # trafilatura, readability, html2text
pip install 'fetchkit[media]'          # Audio/video/image/PDF metadata
pip install 'fetchkit[browser]'        # Playwright + stealth
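pip install 'fetchkit[feeds]'          # RSS/Atom feed parsing
pip install 'fetchkit[metadata]'       # extruct + w3lib structured metadata
pip install 'fetchkit[observe]'        # structlog + OpenTelemetry
pip install 'fetchkit[text]'           # ftfy, anyascii, tldextract, python-slugify
pip install 'fetchkit[mcp]'            # MCP server (fastmcp)
pip install 'fetchkit[langchain]'      # LangChain MCP adapters
pip install 'fetchkit[full]'           # All of the above except langchain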
```

## Quick Start

### Fetch

```python
import asyncio

from pyfetcher import afetch, fetch

# Sync
response = fetch("https://example.com")
print(response.status_code, response.ok)

# Async
response = asyncio.run(afetch("https://example.com"))
```

### Browser Profiles & Headers

```python
from pyfetcher.headers.browser import BrowserHeaderProvider
from pyfetcher.headers.rotating import RotatingHeaderProvider
from pyfetcher.headers.ua import random_user_agent
from pyfetcher.fetch.service import FetchService

# Fixed profile (Chrome on Windows)
service = FetchService(header_provider=BrowserHeaderProvider("chrome_win"))

# Rotating profiles weighted by real-world market share
service = FetchService(header_provider=RotatingHeaderProvider())

# Just need a user-agent string?
ua = random_user_agent(browser="firefox", platform="macOS")
```
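
To illustrate what "weighted by market share" means in practice, here is a minimal standard-library sketch. It is not fetchkit's implementation: apart from `chrome_win`, the profile names are hypothetical, and the weights are made-up numbers chosen only for illustration.

```python
import random

# Hypothetical profile names and weights, for illustration only.
PROFILE_WEIGHTS = {
    "chrome_win": 0.45,
    "safari_mac": 0.20,
    "edge_win": 0.15,
    "firefox_win": 0.12,
    "firefox_linux": 0.08,
}


def pick_profile(rng: random.Random) -> str:
    """Pick one profile name with probability proportional to its weight."""
    names = list(PROFILE_WEIGHTS)
    weights = list(PROFILE_WEIGHTS.values())
    return rng.choices(names, weights=weights, k=1)[0]


# Seeded generator so the tally below is reproducible.
rng = random.Random(0)
counts = {name: 0 for name in PROFILE_WEIGHTS}
for _ in range(10_000):
    counts[pick_profile(rng)] += 1
print(counts)
```

Over many requests the heavier profiles appear more often, which keeps the overall traffic mix closer to what a site would see from real visitors than uniform rotation would.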

### Scraping

```python
from pyfetcher.scrape import (
    extract_links, extract_text, extract_table,
    extract_forms, extract_readable_text,
)
from pyfetcher.scrape.robots import parse_robots_txt, is_allowed

# CSS selectors
titles = extract_text(html, "h1.title")
rows = extract_table(html, "table.data")

# Links with internal/external classification
links = extract_links(html, base_url=url, same_domain_only=True)

# Forms with field extraction
forms = extract_forms(html, base_url=url)
print(forms[0].action, forms[0].to_dict())

# Robots.txt
rules = parse_robots_txt(robots_content)
allowed = is_allowed(rules, "/admin", user_agent="MyBot")
```
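
For comparison, the same kind of robots.txt check can be sketched with Python's standard-library `urllib.robotparser`. This is a stand-in to show the idea, not how `parse_robots_txt` or `is_allowed` are implemented; the `can_fetch` helper and the sample rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Allow: /

User-agent: BadBot
Disallow: /
"""


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Parse robots.txt content and check whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_fetch(ROBOTS_TXT, "MyBot", "https://example.com/admin"))
print(can_fetch(ROBOTS_TXT, "MyBot", "https://example.com/blog/post"))
```

Note that `urllib.robotparser` applies the first matching rule within a group, so rule order in the file matters.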

### Rate-Limited Fetching

```python
from pyfetcher.fetch.service import FetchService
from pyfetcher.ratelimit.limiter import DomainRateLimiter, RateLimitPolicy

limiter = DomainRateLimiter(
    default_policy=RateLimitPolicy(requests_per_second=2.0, burst=5),
    domain_policies={
        "api.example.com": RateLimitPolicy(requests_per_second=0.5),
    },
)
service = FetchService(rate_limiter=limiter)
```
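
The per-domain token bucket idea behind these policies can be sketched in a few lines. This is a simplified, non-blocking illustration under stated assumptions, not the `DomainRateLimiter` source: the `TokenBucket` and `PerDomainLimiter` classes and the injectable `clock` parameter are hypothetical names introduced here.

```python
import time
from collections import defaultdict
from typing import Callable


class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = float(burst)
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerDomainLimiter:
    """Keeps one independent bucket per domain."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._buckets: dict[str, TokenBucket] = defaultdict(
            lambda: TokenBucket(rate, burst, clock)
        )

    def try_acquire(self, domain: str) -> bool:
        return self._buckets[domain].try_acquire()
```

With `rate=2.0` and `burst=2`, a domain can send two requests immediately and then one more every half second, while other domains are unaffected.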

### Content Extraction

```python
from pyfetcher.extractors.content import extract_article_text
from pyfetcher.extractors.convert import html_to_markdown, html_to_plaintext

# Article text (trafilatura with readability-lxml fallback)
article = extract_article_text(html, url="https://example.com/post")

# HTML -> Markdown
md = html_to_markdown(html)
```
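
To give a feel for what plain-text conversion involves, here is a rough standard-library sketch that drops script, style, and nav content and keeps the remaining text. It is far cruder than trafilatura or readability-lxml and is not what `pyfetcher.extractors` does internally; `html_to_text` is a hypothetical helper written for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside skipped elements."""

    SKIP = {"script", "style", "nav"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document, one fragment per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><nav>Menu</nav><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```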

### yt-dlp & gallery-dl

```python
import asyncio

from pyfetcher.downloaders.ytdlp import YtdlpDownloader
from pyfetcher.downloaders.gallerydl import GalleryDlDownloader


async def main() -> None:
    # yt-dlp with progress tracking
    url = "https://youtube.com/watch?v=dQw4w9WgXcQ"
    yt = YtdlpDownloader()
    info = await yt.extract_info(url)
    results = await yt.download(url, output_dir="./videos",
        progress_callback=lambda p: print(f"{p.status}: {p.percent}"))

    # gallery-dl for image galleries (170+ supported sites)
    gdl = GalleryDlDownloader()
    results = await gdl.download("https://imgur.com/gallery/...", output_dir="./images")


# The downloader methods are coroutines, so run them inside an event loop
asyncio.run(main())
```

## CLI

```bash
# Fetch with any backend
pyfetcher fetch https://example.com
pyfetcher fetch https://example.com -o json -b curl_cffi

# Preview generated headers
pyfetcher headers --profile chrome_win
pyfetcher headers --browser firefox -o json
pyfetcher headers --list

# Scrape content
pyfetcher scrape https://example.com --css "h1"
pyfetcher scrape https://example.com --links -o json
pyfetcher scrape https://example.com --text
pyfetcher scrape https://example.com --meta

# Random user-agents
pyfetcher user-agent --browser chrome --count 5
pyfetcher user-agent --mobile

# Check robots.txt
pyfetcher robots https://example.com -p /admin

# Download files
pyfetcher download https://example.com/file.pdf ./file.pdf
```

## Pipeline

The event-driven pipeline connects three stages via Postgres LISTEN/NOTIFY:

```
Seeds / RSS / Sitemap
       |
  [Crawl Stage]  ──NOTIFY──>  [Scrape Stage]  ──NOTIFY──>  [Download Stage]
       |                             |                             |
       v                             v                             v
  pages table                 pages (enriched)              media_assets
  + new crawl jobs            + download jobs               + MinIO objects
```
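
The hand-off between stages can be pictured with a small in-process sketch. Here `asyncio.Queue` objects stand in for the Postgres LISTEN/NOTIFY channels and a list stands in for storage; the stage functions and the fake HTML are assumptions for illustration, not the pipeline's actual code.

```python
import asyncio


async def crawl_stage(seeds, to_scrape):
    """Emit one 'page' per seed URL, then a sentinel to signal completion."""
    for url in seeds:
        # A real crawl stage would fetch and persist the page; this one fakes it.
        await to_scrape.put({"url": url, "html": f"<h1>{url}</h1>"})
    await to_scrape.put(None)


async def scrape_stage(to_scrape, to_download):
    """Enrich each page as it arrives and pass it downstream."""
    while (page := await to_scrape.get()) is not None:
        page["title"] = page["html"].removeprefix("<h1>").removesuffix("</h1>")
        await to_download.put(page)
    await to_download.put(None)


async def download_stage(to_download, results):
    """Record each enriched page; a real stage would store media objects."""
    while (page := await to_download.get()) is not None:
        results.append((page["url"], page["title"]))


async def run_pipeline(seeds):
    to_scrape, to_download = asyncio.Queue(), asyncio.Queue()
    results = []
    # All three stages run concurrently and react to work as it arrives.
    await asyncio.gather(
        crawl_stage(seeds, to_scrape),
        scrape_stage(to_scrape, to_download),
        download_stage(to_download, results),
    )
    return results


print(asyncio.run(run_pipeline(["https://example.com/a", "https://example.com/b"])))
```

Each stage only knows about its inbound and outbound channel, which is what lets the stages scale and restart independently.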

### Setup

```bash
make infra-up     # Start Postgres + MinIO
make migrate      # Run Alembic migrations
make pipeline     # Start all workers
```

### Programmatic

```python
import asyncio

from pyfetcher.pipeline.runner import PipelineRunner
from pyfetcher.config import PyfetcherConfig


async def main() -> None:
    runner = PipelineRunner(PyfetcherConfig(
        crawl_concurrency=10,
        scrape_concurrency=20,
        download_concurrency=5,
    ))
    await runner.start()


# start() is a coroutine, so it needs a running event loop
asyncio.run(main())
```

### Custom Spiders

```python
from pyfetcher.crawler.spider import Spider, SpiderResult

spider = Spider(name="my-spider")

@spider.router.add(r"/blog/\d{4}/")
async def handle_post(url, response):
    return SpiderResult(
        discovered_urls=[...],
        items=[{"title": "...", "content": "..."}],
    )
```
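
A decorator-based router like the one used above can be sketched as a list of compiled patterns checked in registration order. This is an illustrative guess at the general technique, not the `Spider.router` implementation; the `Router` class, its `resolve` method, and the handler signatures are hypothetical.

```python
import re
from typing import Callable, Optional

Handler = Callable[[str], str]


class Router:
    """Maps URL patterns to handlers; the first registered match wins."""

    def __init__(self) -> None:
        self._routes: list[tuple[re.Pattern[str], Handler]] = []

    def add(self, pattern: str) -> Callable[[Handler], Handler]:
        compiled = re.compile(pattern)

        def decorator(handler: Handler) -> Handler:
            self._routes.append((compiled, handler))
            return handler  # leave the function usable on its own

        return decorator

    def resolve(self, url: str) -> Optional[Handler]:
        for pattern, handler in self._routes:
            if pattern.search(url):
                return handler
        return None


router = Router()


@router.add(r"/blog/\d{4}/")
def handle_post(url: str) -> str:
    return f"post:{url}"


@router.add(r"/tag/")
def handle_tag(url: str) -> str:
    return f"tag:{url}"


handler = router.resolve("https://example.com/blog/2024/hello")
print(handler.__name__ if handler else None)
```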

## MCP Server (AI Agent Integration)

fetchkit ships as an **MCP server**, making all its capabilities available to AI agents (Claude, LangChain, LangGraph, and any MCP-compatible client). This turns fetchkit into **autonomous agentic infrastructure** -- LLMs can fetch, scrape, extract, and download without custom code.

### Why MCP?

Traditional scraping requires writing code for every site. With fetchkit's MCP server, an AI agent can:

- **Autonomously research topics** by fetching pages, extracting content, and following links
- **Audit websites** by checking metadata, robots.txt, sitemaps, and page structure
- **Extract structured data** from any page using CSS selectors, table parsing, or article extraction
- **Download media** with progress tracking and checksum verification
- **Generate realistic requests** using browser profiles that pass bot detection

All 16 tools return **structured Pydantic models** so the LLM gets clean, typed data -- not raw HTML.

### Quick Start

```bash
pip install 'fetchkit[mcp]'

# Run as stdio server (Claude Desktop / Claude Code)
pyfetcher-mcp

# Run as HTTP server (LangChain / remote agents)
pyfetcher-mcp --http 8000

# Or via Makefile
make mcp          # stdio
make mcp-http     # HTTP on port 8000
```

### Available Tools (16)

| Tool                | What it does                                                       |
| ------------------- | ------------------------------------------------------------------ |
| `fetch_url`         | Fetch any URL with browser headers, returns status + body + timing |
| `fetch_multiple`    | Batch fetch with concurrency control                               |
| `scrape_css`        | Extract content via CSS selectors                                  |
| `scrape_links`      | Harvest links with internal/external classification                |
| `scrape_text`       | Extract readable text (strips scripts, nav, etc.)                  |
| `scrape_metadata`   | Title, description, Open Graph, favicons                           |
| `scrape_forms`      | Parse forms with fields and default values                         |
| `scrape_table`      | Extract HTML table data as rows                                    |
| `check_robots`      | Check robots.txt rules for any path                                |
| `parse_sitemap`     | Parse XML sitemaps                                                 |
| `generate_headers`  | Preview full browser header sets                                   |
| `list_profiles`     | Show all 11 browser profiles                                       |
| `random_user_agent` | Generate random realistic UAs                                      |
| `extract_article`   | Article text + markdown via trafilatura                            |
| `convert_html`      | HTML -> markdown or plaintext                                      |
| `download_file`     | Download with checksum verification                                |
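
As a rough idea of what checksum verification after a download looks like, the following standard-library sketch hashes a file in chunks and compares it with an expected digest. SHA-256 is an assumption here (the README does not say which algorithm `download_file` uses), and `sha256_of`, `verify_checksum`, and `demo` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: Path, expected_sha256: str) -> bool:
    """Return True when the file's SHA-256 matches the expected hex digest."""
    return sha256_of(path) == expected_sha256.strip().lower()


def demo() -> tuple[bool, bool]:
    """Write a small file, then check it against a right and a wrong digest."""
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "file.pdf"
        target.write_bytes(b"example payload")
        good = hashlib.sha256(b"example payload").hexdigest()
        return verify_checksum(target, good), verify_checksum(target, "0" * 64)


print(demo())
```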

### Resources & Prompts

Resources expose data for context: `pyfetcher://profiles`, `pyfetcher://backends`, `pyfetcher://version`.

Prompts provide templates: `web_research`, `site_audit`, `scrape_guide`, `compare_pages`.

### Use with LangChain

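```python
import asyncio

from langchain_mcp_adapters.client import MultiServerMCPClient
# langgraph is not part of the [langchain] extra; install it separately
from langgraph.prebuilt import create_react_agent


async def build_agent(model):
    client = MultiServerMCPClient({
        "pyfetcher": {"transport": "http", "url": "http://localhost:8000/mcp"}
    })
    tools = await client.get_tools()  # 16 LangChain tools ready to use

    # Build an agent; `model` is any LangChain chat model you supply
    return create_react_agent(model, tools)


# agent = asyncio.run(build_agent(model))
```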

### Use with Claude Desktop

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "pyfetcher": {
      "command": "pyfetcher-mcp",
      "args": []
    }
  }
}
```

## Transport Backends

| Backend      | Sync | Async | Stream | TLS Fingerprint | CF Bypass | Install          |
| ------------ | :--: | :---: | :----: | :-------------: | :-------: | ---------------- |
| httpx        |  Y   |   Y   |   Y    |        -        |     -     | _(core)_         |
| aiohttp      |  -   |   Y   |   Y    |        -        |     -     | _(core)_         |
| curl_cffi    |  Y   |   Y   |   Y    |        Y        |     -     | `[curl]`         |
| cloudscraper |  Y   |   -   |   -    |        -        |     Y     | `[cloudscraper]` |

## Development

```bash
git clone https://github.com/pr1m8/pyfetcher.git
cd pyfetcher
make install-all              # pdm install with all deps
make test                     # run the test suite
make check                    # format + lint + test
make infra-up && make migrate # start Postgres + MinIO
```

### Makefile Targets

```
make help          Show all targets
make install-all   Install everything
make test          Run the test suite
make test-cov      Tests with coverage report
make fmt           Format with trunk
make lint          Lint with trunk
make check         Format + lint + test
make infra-up      Start Postgres + MinIO
make infra-down    Stop infrastructure
make migrate       Run Alembic migrations
make pipeline      Run crawl->scrape->download
make build         Build wheel + sdist
make publish       Publish to PyPI
make docs          Build Sphinx docs
make clean         Remove build artifacts
```

## Documentation

<div align="center">

**[pr1m8.github.io/pyfetcher](https://pr1m8.github.io/pyfetcher/)**

[Quick Start](https://pr1m8.github.io/pyfetcher/en/latest/quickstart.html) | [Headers](https://pr1m8.github.io/pyfetcher/en/latest/headers.html) | [Scraping](https://pr1m8.github.io/pyfetcher/en/latest/scraping.html) | [Pipeline](https://pr1m8.github.io/pyfetcher/en/latest/pipeline.html) | [Infrastructure](https://pr1m8.github.io/pyfetcher/en/latest/infrastructure.html) | [CLI](https://pr1m8.github.io/pyfetcher/en/latest/cli.html) | [API Reference](https://pr1m8.github.io/pyfetcher/en/latest/api/index.html)

</div>

## License

[MIT](LICENSE)
