Metadata-Version: 2.4
Name: langchain-crawlbase
Version: 0.1.0
Summary: LangChain document loader, tool, and retriever backed by the Crawlbase API.
Author-email: Crawlbase Team <support@crawlbase.com>
Maintainer-email: Crawlbase Team <support@crawlbase.com>
License: MIT
Project-URL: Homepage, https://crawlbase.com
Project-URL: Documentation, https://crawlbase.com/docs/crawling-api
Project-URL: Source, https://github.com/crawlbase/langchain-crawlbase
Project-URL: Issues, https://github.com/crawlbase/langchain-crawlbase/issues
Keywords: langchain,crawlbase,scraping,crawler,llm,retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain-core<1.0,>=0.3
Requires-Dist: requests>=2.28
Provides-Extra: dev
Requires-Dist: pytest>=7; extra == "dev"
Requires-Dist: responses>=0.24; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Dynamic: license-file

# langchain-crawlbase

LangChain primitives backed by the [Crawlbase](https://crawlbase.com) Crawling API.

Three drop-in classes:

- **`CrawlbaseLoader`** — a `BaseLoader` that fetches a list of URLs and returns clean Markdown `Document`s ready for chunking.
- **`CrawlbaseTool`** — a `BaseTool` that lets an LLM agent fetch a live web page mid-conversation.
- **`CrawlbaseRetriever`** — a `BaseRetriever` that fetches a fixed set of seed URLs and filters by query.

All three return GitHub-flavored Markdown via Crawlbase's `format=md` parameter, so you skip the HTML-stripping step entirely.
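Under the hood, each class calls the Crawling API with that parameter. A rough sketch of how such a request URL is assembled (the `api.crawlbase.com` endpoint is per the Crawling API docs; the helper name is ours, not part of this package):

```python
from urllib.parse import urlencode

def build_crawl_url(token: str, url: str, **extra: str) -> str:
    """Assemble a Crawling API request URL that asks for Markdown output.

    format=md tells Crawlbase to return GitHub-flavored Markdown
    instead of raw HTML, so no stripping step is needed downstream.
    """
    params = {"token": token, "url": url, "format": "md", **extra}
    return "https://api.crawlbase.com/?" + urlencode(params)

print(build_crawl_url("TOKEN", "https://example.com"))
# → https://api.crawlbase.com/?token=TOKEN&url=https%3A%2F%2Fexample.com&format=md
```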

## Installation

```bash
pip install langchain-crawlbase
```

## Setup

Get a token from your [Crawlbase dashboard](https://crawlbase.com/dashboard). Use your **normal token** for static pages, or your **JavaScript token** for SPA / JS-rendered pages — Crawlbase routes the request automatically based on which token you send.

```bash
export CRAWLBASE_TOKEN=your_token
```

## Usage

### Document loader

```python
import os
from langchain_crawlbase import CrawlbaseLoader

loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://en.wikipedia.org/wiki/Large_language_model",
        "https://en.wikipedia.org/wiki/Retrieval-augmented_generation",
    ],
)
docs = loader.load()
print(docs[0].page_content[:500])
print(docs[0].metadata)  # {'source': '...', 'pc_status': 200, ...}
```

For SPA pages, just use your JavaScript token instead — same interface:

```python
loader = CrawlbaseLoader(
    token=os.environ["CRAWLBASE_JS_TOKEN"],
    urls=["https://some-spa-site.com/page"],
)
```

### Agent tool

```python
import os
from langchain_crawlbase import CrawlbaseTool

tool = CrawlbaseTool(token=os.environ["CRAWLBASE_TOKEN"])

# Use directly:
markdown = tool.invoke({"url": "https://example.com"})

# Or bind to an LLM:
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-sonnet-4-5")

agent_llm = llm.bind_tools([tool])
```

### Retriever

```python
import os
from langchain_crawlbase import CrawlbaseRetriever

retriever = CrawlbaseRetriever(
    token=os.environ["CRAWLBASE_TOKEN"],
    urls=[
        "https://crawlbase.com/docs/crawling-api",
        "https://crawlbase.com/docs/crawling-api#parameters",
    ],
)
docs = retriever.invoke("how do I render JavaScript pages")
```

> v0.1 uses simple substring matching. For semantic retrieval, pair `CrawlbaseLoader` with a vector store of your choice.
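One plausible reading of that matching logic (a simplified sketch, not the package's actual code; `Document` here stands in for `langchain_core.documents.Document`):

```python
from dataclasses import dataclass

@dataclass
class Document:
    page_content: str
    metadata: dict

def filter_by_query(docs: list[Document], query: str) -> list[Document]:
    """Keep documents whose content contains any query term, case-insensitively."""
    terms = query.lower().split()
    return [d for d in docs if any(t in d.page_content.lower() for t in terms)]

docs = [
    Document("Use page_wait to render JavaScript pages.", {"source": "a"}),
    Document("Pricing and plans.", {"source": "b"}),
]
print([d.metadata["source"] for d in filter_by_query(docs, "JavaScript render")])
# → ['a']
```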

## Extra Crawlbase parameters

Pass any [Crawlbase API parameter](https://crawlbase.com/docs/crawling-api#parameters) via `extra_params`:

```python
loader = CrawlbaseLoader(
    token=token,
    urls=["https://example.com"],
    extra_params={"country": "US", "device": "mobile"},
)
```

## Development

```bash
pip install -e ".[dev]"
pytest tests/unit
ruff check .
```

Integration tests are gated on `CRAWLBASE_TOKEN`:

```bash
CRAWLBASE_TOKEN=xxx pytest tests/integration
```

## License

MIT — © Crawlbase Team. Contact: support@crawlbase.com
