Metadata-Version: 2.4
Name: langchain-thecrawler
Version: 0.1.0
Summary: An integration package connecting TheCrawler and LangChain
License: MIT
License-File: LICENSE
Author: manchittlab
Requires-Python: >=3.9,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Dist: langchain-core (>=0.3.15,<0.4.0)
Requires-Dist: requests (>=2.31,<3.0)
Project-URL: Homepage, https://www.miaibot.ai
Project-URL: Repository, https://github.com/manchittlab/langchain-thecrawler
Project-URL: Source Code, https://github.com/manchittlab/langchain-thecrawler
Description-Content-Type: text/markdown

# langchain-thecrawler

This package contains the LangChain integration with [TheCrawler](https://www.miaibot.ai), a web-scraping and structured-extraction API. TheCrawler runs the extraction LLM on its own GPU, so AI extraction is included on every page with no per-call surcharge.

## Installation

```bash
pip install -U langchain-thecrawler
```

Set your API key (get one at [miaibot.ai](https://www.miaibot.ai)):

```bash
export THECRAWLER_API_KEY="mai_live_..."
```

## Document Loader

`TheCrawlerLoader` loads one or more URLs as LangChain `Document` objects with boilerplate-stripped markdown as `page_content` and rich page metadata.

```python
from langchain_thecrawler import TheCrawlerLoader

loader = TheCrawlerLoader(
    ["https://example.com"],
    # api_key="mai_live_...",  # or set THECRAWLER_API_KEY
)

docs = loader.load()        # list[Document]
# or stream:
for doc in loader.lazy_load():
    print(doc.metadata["url"], len(doc.page_content))
```

PDF and DOCX URLs are handled server-side. A per-page failure does **not** raise an exception. Instead, the failed page comes back as a `Document` with empty `page_content` and `metadata["status"] == "error"` plus a structured `error_type`, so you can branch on it:

```python
ok = [d for d in docs if d.metadata.get("status") != "error"]
```
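
If you also want to inspect the failures, a small helper can separate them and tally the error kinds. This is a minimal sketch, not part of the package: `split_by_status` and `count_error_types` are hypothetical names, it assumes `error_type` is stored in `doc.metadata` alongside `status`, and the `"unknown"` fallback is an illustrative choice.

```python
from collections import Counter


def split_by_status(docs):
    """Separate successfully loaded documents from failed ones."""
    ok, failed = [], []
    for doc in docs:
        # Failed pages are marked with metadata["status"] == "error".
        if doc.metadata.get("status") == "error":
            failed.append(doc)
        else:
            ok.append(doc)
    return ok, failed


def count_error_types(failed):
    """Tally failures by error type (assumed to live in doc.metadata)."""
    return Counter(doc.metadata.get("error_type", "unknown") for doc in failed)
```

Calling `ok, failed = split_by_status(docs)` and then `count_error_types(failed)` gives a quick summary of what went wrong before you decide whether to retry any URLs.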

### Options

| Arg | Description |
| --- | --- |
| `urls` | A URL string or list of URLs (required) |
| `api_key` | TheCrawler key; falls back to `THECRAWLER_API_KEY` |
| `api_url` | API base URL (default `https://www.miaibot.ai/api/v1`) |
| `params` | Extra options merged into the crawl request (e.g. `{"usePlaywright": True}`) |
| `timeout` | Per-request HTTP timeout in seconds (default 120) |
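
The optional settings can be collected in one place before the loader is constructed. The sketch below uses a hypothetical helper, `loader_options`, and assumes that `params` and `timeout` are accepted as keyword arguments under the names listed in the table above.

```python
def loader_options(use_playwright=False, timeout=120):
    """Return keyword arguments for the optional loader settings."""
    options = {"timeout": timeout}
    if use_playwright:
        # Extra options are merged into the crawl request via `params`.
        options["params"] = {"usePlaywright": True}
    return options


# Usage (needs the package installed and an API key, so it is commented out):
# from langchain_thecrawler import TheCrawlerLoader
# loader = TheCrawlerLoader(
#     ["https://example.com"],
#     **loader_options(use_playwright=True, timeout=60),
# )
```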

## License

MIT

