Metadata-Version: 2.4
Name: docpull
Version: 4.1.0
Summary: Pull documentation from the web and convert to clean markdown
Author-email: Zachary Roth <support@raintree.technology>
Maintainer-email: Raintree Technology <support@raintree.technology>
License-Expression: MIT
Project-URL: Homepage, https://github.com/raintree-technology/docpull
Project-URL: Documentation, https://github.com/raintree-technology/docpull#readme
Project-URL: Repository, https://github.com/raintree-technology/docpull
Project-URL: Source Code, https://github.com/raintree-technology/docpull
Project-URL: Bug Tracker, https://github.com/raintree-technology/docpull/issues
Project-URL: Releases, https://github.com/raintree-technology/docpull/releases
Keywords: python,markdown,documentation,web-scraping,developer-tools,claude,ai-training-data
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Education
Classifier: Environment :: Console
Classifier: Topic :: Documentation
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Utilities
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Natural Language :: English
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: html2text>=2020.1.16
Requires-Dist: defusedxml>=0.7.1
Requires-Dist: extruct>=0.15.0
Requires-Dist: aiohttp>=3.14.0
Requires-Dist: idna>=3.15
Requires-Dist: regex>=2024.11.6
Requires-Dist: rich>=13.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: pydantic>=2.0
Requires-Dist: urllib3>=2.7.0
Provides-Extra: proxy
Requires-Dist: aiohttp-socks>=0.8.0; extra == "proxy"
Provides-Extra: normalize
Requires-Dist: url-normalize>=1.4.0; extra == "normalize"
Provides-Extra: trafilatura
Requires-Dist: trafilatura>=1.12.0; extra == "trafilatura"
Provides-Extra: tokens
Requires-Dist: tiktoken>=0.7.0; extra == "tokens"
Provides-Extra: mcp
Requires-Dist: mcp>=1.0.0; extra == "mcp"
Requires-Dist: pyjwt>=2.13.0; extra == "mcp"
Requires-Dist: python-multipart>=0.0.27; extra == "mcp"
Requires-Dist: starlette>=1.0.1; extra == "mcp"
Provides-Extra: parallel
Requires-Dist: parallel-web>=1.0.1; extra == "parallel"
Provides-Extra: llm
Requires-Dist: tiktoken>=0.7.0; extra == "llm"
Provides-Extra: all
Requires-Dist: aiohttp-socks>=0.8.0; extra == "all"
Requires-Dist: url-normalize>=1.4.0; extra == "all"
Requires-Dist: trafilatura>=1.12.0; extra == "all"
Requires-Dist: tiktoken>=0.7.0; extra == "all"
Requires-Dist: mcp>=1.0.0; extra == "all"
Requires-Dist: pyjwt>=2.13.0; extra == "all"
Requires-Dist: python-multipart>=0.0.27; extra == "all"
Requires-Dist: starlette>=1.0.1; extra == "all"
Requires-Dist: parallel-web>=1.0.1; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: bandit>=1.7.0; extra == "dev"
Requires-Dist: pip-audit>=2.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: types-requests>=2.31.0; extra == "dev"
Requires-Dist: types-beautifulsoup4>=4.12.0; extra == "dev"
Requires-Dist: types-defusedxml>=0.7.0; extra == "dev"
Requires-Dist: types-pyyaml>=6.0.0; extra == "dev"
Dynamic: license-file

# docpull

**Security-hardened, browser-free crawler that turns static documentation sites into clean, AI-ready Markdown — fast.**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyPI version](https://badge.fury.io/py/docpull.svg)](https://badge.fury.io/py/docpull)
[![Downloads](https://pepy.tech/badge/docpull)](https://pepy.tech/project/docpull)
[![License: MIT](https://img.shields.io/github/license/raintree-technology/docpull)](https://github.com/raintree-technology/docpull/blob/main/LICENSE)

<p align="center">
  <a href="https://docpull.raintree.technology">
    <img src="https://pub-e85a1abca36f4fd8b4300a6ec2d6f45f.r2.dev/marketing/docpull/1768954147343-iaiziy-docpull-terminal-hero.gif" alt="docpull demo" width="600">
  </a>
</p>

docpull uses async HTTP (not Playwright) to fetch server-rendered pages,
extracts main content, and writes clean Markdown with source-URL frontmatter —
in seconds, with a small install footprint. It won't render JavaScript, but for
the large class of docs that don't need it (API references, Python/Go stdlib,
most dev-tool docs, OpenAPI specs, Next.js and Docusaurus builds), it is a
fast, auditable, sandbox-friendly way to pipe documentation into an LLM context,
a RAG index, or an offline archive. SSRF, XXE, DNS-rebinding, and
CRLF-injection protections are on by default — a necessity when an AI agent
is choosing the URLs.

## Install

```bash
pip install docpull

# Optional extras
pip install 'docpull[llm]'           # tiktoken for token-accurate chunking
pip install 'docpull[trafilatura]'   # alternative extractor for noisy pages
pip install 'docpull[mcp]'           # run as an MCP server for AI agents
pip install 'docpull[all]'           # everything above
```

## Quick start

```bash
# Crawl and save Markdown
docpull https://docs.example.com

# One page, no crawl — the fast path for agents
docpull https://docs.example.com/guide --single

# LLM-ready NDJSON with 4k-token chunks streamed to stdout
docpull https://docs.example.com --profile llm --stream | jq .

# Mirror a site for offline use
docpull https://docs.example.com --profile mirror --cache
```

## Framework-aware extraction

docpull inspects each page before running the generic extractor and can pull
content directly from framework data feeds:

| Framework | Strategy |
|-----------|----------|
| Next.js   | Parses `__NEXT_DATA__` JSON |
| Mintlify  | `__NEXT_DATA__` with Mintlify tagging |
| OpenAPI   | Renders `openapi.json` / `swagger.json` into Markdown |
| Docusaurus| Detected and tagged; generic extractor produces Markdown |
| Sphinx    | Detected from generator metadata / Read the Docs hosts and tagged; generic extractor produces Markdown |

JS-only SPAs with no server-rendered content are detected and skipped with a
clear reason (or, with `--strict-js-required`, reported as an error so agents
can route elsewhere).

## Agent-friendly features

- **`--single`** — fetch a single URL without discovery. Designed for tool loops.
- **`--stream`** — NDJSON one-record-per-line, flushed on every page, pipeable.
- **`--max-tokens-per-file N`** — split each page into token-bounded chunks on
  heading boundaries (exact counts with tiktoken, estimate without).
- **`--emit-chunks`** — write one file or record per chunk instead of per page.
- **`--strict-js-required`** — hard-fail on JS-only pages instead of silently
  skipping.
- **`--extractor trafilatura`** — swap in [trafilatura](https://trafilatura.readthedocs.io/)
  for sites where the default heuristics struggle.

## Python API

```python
from docpull import fetch_one

ctx = fetch_one("https://docs.python.org/3/library/asyncio.html")
print(ctx.title, ctx.source_type)
print(ctx.markdown[:500])
```

Async streaming:

```python
import asyncio
from docpull import Fetcher, DocpullConfig, ProfileName, EventType

async def main():
    cfg = DocpullConfig(
        url="https://docs.example.com",
        profile=ProfileName.LLM,  # chunked NDJSON output
    )
    async with Fetcher(cfg) as fetcher:
        async for event in fetcher.run():
            if event.type == EventType.FETCH_PROGRESS:
                print(f"{event.current}/{event.total}: {event.url}")
        print(f"Done: {fetcher.stats.pages_fetched} pages")

asyncio.run(main())
```

Single-page from an agent tool:

```python
from docpull import Fetcher, DocpullConfig

async def tool_call(url: str) -> str:
    async with Fetcher(DocpullConfig(url=url)) as f:
        ctx = await f.fetch_one(url, save=False)
        return ctx.markdown or ctx.error or ""
```

## Profiles

```bash
docpull https://site.com --profile rag      # Default. Dedup, rich metadata.
docpull https://site.com --profile llm      # NDJSON + chunks + metadata; JS-only pages skip unless --strict-js-required is passed.
docpull https://site.com --profile mirror   # Full archive, polite, cached, hierarchical paths.
docpull https://site.com --profile quick    # Sampling: 50 pages, depth 2.
```

## Parallel context packs

`docpull[parallel]` adds an optional source-discovery and research layer on top
of Parallel web intelligence APIs. Use the core crawler when you already know
the docs URL. Use Parallel packs when an agent needs current web sources found,
extracted, researched, organized, and loaded as durable local context.

Parallel handles live Search, Extract, Task, Entity Search, FindAll, TaskGroup,
and Monitor calls. docpull turns those results into local, inspectable packs
with Markdown, NDJSON, manifests, hashes, source indexes, workflow metadata, and
an `AGENT_CONTEXT.md` load plan so agents do not need to reverse-engineer raw API
responses before using the pack.

Live calls use your own Parallel API key. docpull never proxies requests through
a shared account or writes the key into pack artifacts. For normal PyPI installs,
`docpull parallel init` stores the key in a local secrets file with restricted
permissions so it works across shells and agent sessions. Pack artifacts can
still contain source content, workflow inputs/outputs, selected URLs, and
metadata, so treat them as research data.

```bash
pip install 'docpull[parallel]'

# Recommended local setup. Prompts securely and writes ~/.config/docpull/secrets.env.
docpull parallel init
docpull parallel auth --json

# Optional project-local setup for agent worktrees. Writes .env.local and
# adds it to .gitignore.
docpull parallel init --project

# CI can still use a normal environment secret instead.
export PARALLEL_API_KEY="<your-parallel-api-key>"

docpull parallel context-pack "Compare AI web-search APIs for agents" \
  --query "AI web search API" \
  --query "agent web extraction API" \
  --include-domain parallel.ai \
  --exclude-domain onparallel.com \
  --output-dir ./packs/ai-web-search \
  --mode advanced \
  --extract-limit 8 \
  --max-estimated-cost 0.05 \
  --task-brief
```

`docpull parallel auth` checks local SDK/key presence without printing the key
value and reports the key source (`env`, `project_env`, `user_config`, or
`missing`). It does not make a live Parallel call or prove the key is valid. Use
`--json` when an agent or CI job needs machine-readable configuration status.

The pack contains:

- `AGENT_CONTEXT.md` — agent load plan with source order, pack signals, warnings, errors, and artifact map.
- `documents.ndjson` — chunked records for agents and RAG pipelines.
- `corpus.manifest.json` — stable docpull record manifest.
- `parallel.pack.json` — Parallel workflow metadata, selected URLs, errors, IDs, and usage.
- `sources.md` and `sources/*.md` — human-readable sources and extracted Markdown.
- `brief.md` when `--task-brief` is enabled.

That makes the Parallel layer more than a pass-through wrapper: Parallel finds
and enriches live web sources, while docpull normalizes them into a repeatable
context pack that can be scored, diffed, audited, and loaded offline.

For planning without spending credits, use `--dry-run`. For repeatable packs,
use `--include-domain`, `--exclude-domain`, `--after-date`, and `--max-search-results`
to pin the source policy that appears later in `parallel.pack.json`.
Use `--fetch-max-age-seconds`, `--fetch-timeout-seconds`,
`--disable-cache-fallback`, `--excerpt-chars-per-result`, and `--location`
when you need Parallel's freshness, excerpt-size, and geo controls.

Offline fixtures are supported for demos and tests without a Parallel account:

```bash
docpull parallel demo --output-dir ./packs/demo
```

When working from the source checkout, the same fixture can also be imported
directly with `docpull parallel import ./docs/examples/parallel-search-extract.json`.

The core workflow is Search + Extract + optional Task. Broader pack workflows
cover Search-only snapshots, docs discovery plans, Extract-known-URL packs,
core-fetch-with-Parallel-fallback packs, diff briefs, Task lifecycle snapshots,
Entity Search, FindAll lifecycle actions, TaskGroup batches, Monitor
metadata/events, API specs, and local source scoring.
See [docs/parallel.md](docs/parallel.md) for the Parallel product cross-reference
and the boundary between docpull, Parallel MCP, and planned workflows.

Additional Parallel-backed artifact commands:

```bash
# Readiness and API-key setup checks.
docpull parallel auth
docpull parallel init
docpull parallel init --project

# Search + Extract + optional Task brief.
docpull parallel context-pack "Compare AI web-search APIs for agents" \
  --query "AI web search API" \
  --include-domain parallel.ai \
  --dry-run

# Search-only, docs-discovery, known-URL extract, fallback, and structured Task packs.
docpull parallel search-pack "Parallel Search API docs" --query "Parallel Search API" --dry-run
docpull parallel discover-docs "Find Parallel API docs" \
  --query "Parallel Search API docs" \
  --include-domain docs.parallel.ai \
  --dry-run
docpull parallel extract-pack https://docs.parallel.ai/api-reference/search/search --dry-run
docpull parallel fallback-pack https://docs.parallel.ai/api-reference/search/search --dry-run
docpull parallel task-pack "Research Parallel Search API" \
  --source-include-domain docs.parallel.ai \
  --output-schema-json '{"type":"object","properties":{"summary":{"type":"string"}},"required":["summary"]}' \
  --dry-run

# Entity dossiers, FindAll lifecycle, batch enrichment, monitors, and API specs.
docpull parallel entity-pack "AI developer infrastructure companies" --entity-type companies
docpull parallel findall-pack "AI developer infrastructure companies" --generator preview --dry-run
docpull parallel findall-ingest-pack "Find companies with public API docs"
docpull parallel findall-result-pack findall_123 --output-dir ./packs/findall-result
docpull parallel taskgroup-pack ./companies.json --prompt-template "Research {company}" --dry-run
docpull parallel taskgroup-pack ./companies.json --prompt-template "Research {company}" --wait
docpull parallel monitor-pack create "New vendor pricing changes" --dry-run
docpull parallel monitor-pack create --type snapshot --task-run-id run_123 --dry-run
docpull parallel monitor-pack list --status active --output-dir ./packs/monitors
docpull parallel monitor-pack events monitor_123 --cursor next_cursor --output-dir ./packs/events
docpull parallel api-pack https://docs.parallel.ai/public-openapi.json --output-dir ./packs/parallel-openapi
docpull parallel run ./docs/examples/parallel-llms-api-pack.yaml --dry-run
docpull parallel run ./docs/examples/parallel-openapi-api-pack.yaml --dry-run
docpull pack score ./packs/parallel-openapi
docpull pack sources ./packs/parallel-openapi --require-domain docs.parallel.ai
docpull pack diff ./packs/old ./packs/new --markdown ./packs/changes.md
docpull parallel diff-brief ./packs/old ./packs/new --dry-run
```

## MCP server

docpull ships an MCP (Model Context Protocol) server so AI agents can call it
directly over stdio:

```bash
pip install 'docpull[mcp]'
docpull mcp  # starts the stdio server
```

Claude Code:

```bash
claude mcp add --transport stdio docpull -- docpull mcp
```

Cursor (`.cursor/mcp.json` in a project, or `~/.cursor/mcp.json` globally):

```json
{
  "mcpServers": {
    "docpull": {
      "type": "stdio",
      "command": "docpull",
      "args": ["mcp"]
    }
  }
}
```

Claude Desktop uses the same `mcpServers` shape in
`claude_desktop_config.json`.

Or, if you use Claude Code, install the plugin instead — it bundles the MCP
server, five slash commands (`/docs-add`, `/docs-search`, `/docs-list`,
`/docs-refresh`, `/docs-remove`), and a meta-skill that teaches Claude
when to reach for docpull automatically:

```bash
# 1. Install docpull with the MCP extra (required for the plugin)
pip install 'docpull[mcp]'
```

```
# 2. Then in Claude Code:
/plugin marketplace add raintree-technology/docpull
/plugin install docpull@docpull
```

See [plugin/README.md](plugin/README.md) for details.

Tools exposed (12 total — read tools advertise `readOnlyHint` so hosts that auto-approve safe tools won't prompt):

Read:
- `fetch_url(url, max_tokens?)` — one-shot fetch, no crawl. HTTPS-only, SSRF-validated.
- `list_sources(category?)` — show available aliases (react, nextjs, parallel, fastapi, …)
- `list_indexed()` — what has been fetched locally, with last-fetched age
- `grep_docs(pattern, library?, limit?, context?)` — regex search across fetched Markdown (length-capped + wall-clock budgeted to mitigate ReDoS)
- `read_doc(library, path, line_start?, line_end?)` — read a specific cached file, optionally line-sliced
- `pack_score(pack_dir, required_domains?)` — score a local context pack for agent readiness.
- `pack_diff(old_pack_dir, new_pack_dir)` — compare pack URL sets and content hashes.

Write:
- `ensure_docs(source, force?, profile?)` — fetch a named library (cached 7 days). Forwards progress to clients that supply a `progressToken`.
- `parallel_context_pack(...)` — build or dry-run a Parallel Search + Extract pack without shelling out.
- `parallel_api_pack(source, kind?, output_dir?)` — build a pack from `llms.txt` or OpenAPI.
- `add_source(name, url, description?, category?, max_pages?, force?)` — register a user alias (HTTPS-only, atomic write to `sources.yaml`).
- `remove_source(name, delete_cache?)` — drop a user alias and (optionally) its cached docs.

All schema-backed tools return `structuredContent` validated against an
`outputSchema` for clients that prefer typed output. `fetch_url` intentionally
returns Markdown text directly.

User-defined sources live in `~/.config/docpull-mcp/sources.yaml`:

```yaml
sources:
  mydocs:
    url: https://docs.example.com
    description: My internal docs
    category: internal
    maxPages: 200
```

### Supported MCP path

The supported MCP server is the Python stdio server started by `docpull mcp`.
That is the only MCP path covered by the `docpull` package release contract and
the one agents, plugin users, Claude Code, Cursor, and Claude Desktop should
use.

This repository also contains an `mcp/` directory with an internal TypeScript +
Bun lab for PostgreSQL/pgvector semantic search. It is not shipped by the Python
package, is not documented as a user install path, and should be ignored unless
you are explicitly developing that lab.

## Output

Markdown files with YAML frontmatter:

```markdown
---
title: "Getting Started"
source: https://docs.example.com/guide
source_type: "nextjs"
---

# Getting Started
…
```

NDJSON (one record per page or chunk):

```json
{"document_id": "doc_...", "chunk_id": "chunk_...", "url": "...", "title": "...", "content": "...", "hash": "...", "token_count": 842, "chunk_index": 0}
```

Every output format also writes `corpus.manifest.json` next to the generated
documents. The manifest records the run identity, output format, stable
`document_id` / `chunk_id` values, content hashes, relative output paths, and
chunk counts so regenerated corpora can be diffed and cited by agents.

## Security

- HTTPS-only, mandatory robots.txt compliance
- SSRF protection: blocks private/internal network IPs, DNS rebinding via
  connect-time address pinning
- XXE protection via `defusedxml` on sitemaps
- Path traversal and CRLF header injection guards
- Auth headers stripped on cross-origin redirects

When running with `--proxy`, DNS pinning is delegated to the proxy. Pass
`--require-pinned-dns` to refuse this configuration and keep the connector-
level SSRF guarantees in effect.

## Options

Run `docpull --help` for the full list. Highlights:

```
Core:
  --profile {rag,mirror,quick,llm}
  --single                Fetch one URL (no crawl)
  --format {markdown,json,ndjson,sqlite}
  --stream                Stream NDJSON to stdout

LLM / chunking:
  --max-tokens-per-file N
  --tokenizer NAME        tiktoken encoding (default cl100k_base)
  --emit-chunks           One file/record per chunk

Content extraction:
  --extractor {default,trafilatura}
  --no-special-cases      Disable framework extractors
  --strict-js-required    Error on JS-only pages

Cache:
  --cache                 Enable incremental updates
  --cache-dir DIR
  --cache-ttl DAYS

Crawl:
  --max-concurrent N      Global request concurrency
  --per-host-concurrent N Per-host request concurrency
```

## Performance

End-to-end numbers from `tests/benchmarks/test_10k_pages.py` against a
synthetic 10,000-page localhost site (RAG profile, `max_concurrent=50`,
`per_host_concurrent=50`, HTTP keep-alive, 5% injected duplicate content).
The benchmark emits progress every 1,000 pages plus a final JSON report for
trend tooling. Wall time varies by machine; the value below is the latest
local audit run.

| Metric | Value |
|---|---|
| Total wall time | ~309 s |
| Pages fetched / skipped / failed | 9,501 / 499 / 0 |
| Time to first saved page | ~119 ms |
| Per-page latency p50 / p95 / p99 | ~0 / 168 / 271 ms |
| Peak RSS delta from baseline | ~93 MB |
| Cache manifest size on disk | ~8.9 MB |
| Duplicates detected (5% injected) | 499 / 500 |

Reproduce with `make benchmark` (requires `aiohttp`; runs the gated
benchmark in `tests/benchmarks/` and prints progress plus a JSON line you can
pipe into trend tooling).

## Troubleshooting

```bash
docpull --doctor              # Check installation
docpull URL --verbose         # Verbose output
docpull URL --dry-run         # Test without downloading
docpull URL --preview-urls    # List URLs without fetching
```

## Links

- [Website](https://docpull.raintree.technology)
- [PyPI](https://pypi.org/project/docpull/)
- [GitHub](https://github.com/raintree-technology/docpull)
- [Changelog](https://github.com/raintree-technology/docpull/blob/main/docs/CHANGELOG.md)
- [Metrics](https://github.com/raintree-technology/docpull/blob/main/METRICS.md) — auto-refreshed daily (PyPI downloads, plugin installs via clone count, traffic)

## License

MIT
