Metadata-Version: 2.3
Name: apicrawl
Version: 0.1.0
Summary: Crawl API documentation (OpenAPI, Swagger, ReadMe, Mintlify, Fern, llms.txt, plain HTML) into structured, searchable markdown
License: Apache-2.0
Keywords: api,openapi,swagger,documentation,crawler,llm,markdown
Author: Juraj Bezdek
Author-email: juraj.bezdek@gmail.com
Requires-Python: >=3.11,<4.0
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Markup
Requires-Dist: PyYAML (>=6.0,<7.0)
Requires-Dist: aiohttp (>=3.14.1)
Requires-Dist: beautifulsoup4 (>=4.12.0,<5.0.0)
Requires-Dist: html2text (>=2025.4.15,<2026.0.0)
Requires-Dist: httpx (>=0.27.0)
Requires-Dist: langchain (>=1.2.15,<2.0.0)
Requires-Dist: langchain-decorators (>=1.2.0,<2.0.0)
Requires-Dist: langchain-google-genai (>=4.2.2,<5.0.0)
Requires-Dist: langchain-groq (>=1.1.2,<2.0.0)
Requires-Dist: langchain-openai (>=1.1.14,<2.0.0)
Requires-Dist: markdown (>=3.10.2,<4.0.0)
Requires-Dist: playwright (>=1.40.0,<2.0.0)
Requires-Dist: pydantic (>=2.12.5,<3.0.0)
Requires-Dist: python-dotenv (>=1.0.0)
Requires-Dist: rank-bm25 (>=0.2.2,<0.3.0)
Requires-Dist: requests (>=2.31.0,<3.0.0)
Requires-Dist: unidecode (>=1.4.0)
Project-URL: Homepage, https://github.com/jurajbezdek/apicrawl
Description-Content-Type: text/markdown

# ApiCrawl

Crawl API documentation into structured, searchable markdown.

Point it at any API docs URL — an OpenAPI/Swagger spec, a Swagger UI / Redoc /
Stoplight / Scalar page, ReadMe / Mintlify / Fern hosted docs, a Postman
collection, a Google Discovery document, an `llms.txt` index, or plain HTML
docs — and it discovers the underlying spec where one exists, crawls the
pages where one doesn't, classifies the content with an LLM, extracts
authentication instructions, and writes everything as a local markdown tree
you can grep, read, or feed to any tool.

```bash
pip install apicrawl
playwright install chromium   # used to render JS-heavy docs sites

export GOOGLE_API_KEY=...     # LLM access is required (Gemini primary)
export GROQ_API_KEY=...       # optional fallback provider

apicrawl https://petstore3.swagger.io --output ./api-docs
```

Output layout:

```
api-docs/<catalog_id>/
  index.md            # API name, metadata, description, auth instructions
  manifest.json       # listing of everything ingested (also the completion marker)
  sections/<slug>.md  # docs pages / spec tag groups (markdown + frontmatter)
  endpoints/<slug>.md # one file per endpoint: parameters, examples, TypeScript types
```
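
Because `manifest.json` doubles as the completion marker, a script can check for it before reading the rest of the tree. The following is a minimal sketch that relies only on the layout above. The helper name `summarize_catalog` is hypothetical, and no particular schema is assumed for the manifest contents.

```python
import json
from pathlib import Path


def summarize_catalog(catalog_dir):
    """Summarize one <catalog_id> directory laid out as shown above.

    Hypothetical helper, not part of apicrawl. It relies only on the
    file and directory names from the layout and does not assume any
    particular keys inside manifest.json.
    """
    root = Path(catalog_dir)
    manifest_path = root / "manifest.json"
    # manifest.json is the completion marker: without it, treat the
    # ingest as unfinished and do not trust the rest of the tree.
    if not manifest_path.is_file():
        return None
    return {
        "manifest": json.loads(manifest_path.read_text(encoding="utf-8")),
        "sections": sorted(p.name for p in (root / "sections").glob("*.md")),
        "endpoints": sorted(p.name for p in (root / "endpoints").glob("*.md")),
    }
```

Called on a catalog directory under `./api-docs`, it returns `None` until the manifest has been written, and a dictionary of the parsed manifest plus the section and endpoint file names afterwards.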

Library usage:

```python
import asyncio
from apicrawl import ingest_to_dir

result = asyncio.run(ingest_to_dir("https://docs.example.com/api", "./api-docs"))
print(result.entry.name, result.pages_ingested)
```

Custom storage — implement `IngestionSink` and receive the parsed catalog
entry, sections, and endpoints as plain pydantic models, streamed in batches:

```python
import asyncio
from apicrawl import IngestionSink, ingest

class MySink(IngestionSink):
    async def emit_sections(self, sections): ...
    async def emit_endpoints(self, endpoints): ...

asyncio.run(ingest("https://docs.example.com/api", MySink()))
```
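
As an illustration of the sink idea, here is a sketch of a class that accumulates the streamed batches in memory. It is written as a plain class so it runs without apicrawl installed. To use it with `ingest` it would subclass `IngestionSink`, and whether that base class requires further methods or constructor arguments is not covered here. The name `CollectingSink` is hypothetical, and plain strings stand in for the pydantic models.

```python
import asyncio


class CollectingSink:
    """Accumulates every batch it receives (illustrative, hypothetical name).

    Assumption: in real use this would subclass apicrawl's IngestionSink.
    Only the two method names are taken from the README.
    """

    def __init__(self):
        self.sections = []
        self.endpoints = []

    async def emit_sections(self, sections):
        # Batches arrive one at a time, so extend rather than overwrite.
        self.sections.extend(sections)

    async def emit_endpoints(self, endpoints):
        self.endpoints.extend(endpoints)


async def demo():
    # Stand-in strings replace the models apicrawl would pass.
    sink = CollectingSink()
    await sink.emit_sections(["intro"])
    await sink.emit_sections(["auth", "errors"])
    await sink.emit_endpoints(["GET /pets"])
    return sink


sink = asyncio.run(demo())
print(len(sink.sections), len(sink.endpoints))
```

Extending the lists on each call matters because the data is streamed in batches: assigning instead of extending would keep only the last batch.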

Notes:

- **LLM keys are required** — page classification and auth extraction are
  LLM-powered. Set `GOOGLE_API_KEY` (and optionally `GROQ_API_KEY`) in the
  environment or a `.env` file.
- **Node.js is optional** — if a `node` binary is on PATH, endpoint pages
  include generated TypeScript request/response types (via a bundled
  openapi-typescript). Without Node, ingestion still works; the TS sections
  are simply omitted.
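
The prerequisites in these notes can be checked before starting a crawl. This is a minimal sketch using only the standard library. The helper name `preflight` is hypothetical, and it looks at environment variables only, so keys kept in a `.env` file will not be detected by it.

```python
import os
import shutil


def preflight(env=None):
    """Report which prerequisites from the notes above are present.

    Hypothetical helper, not part of apicrawl. It inspects environment
    variables only; keys kept in a .env file are not seen here.
    """
    env = os.environ if env is None else env
    return {
        "google_key": bool(env.get("GOOGLE_API_KEY")),  # required
        "groq_key": bool(env.get("GROQ_API_KEY")),      # optional fallback
        "node": shutil.which("node") is not None,       # optional, enables TS types
    }
```

A missing `google_key` is the only result here that should block a run; the other two entries describe optional features.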

License: Apache-2.0

