Metadata-Version: 2.4
Name: langchain-spidra
Version: 0.1.0
Summary: LangChain integration for Spidra — AI-native web scraping for LLM workflows
Project-URL: Homepage, https://spidra.io
Project-URL: Documentation, https://docs.spidra.io
Project-URL: Repository, https://github.com/spidra-io/spidra-langchain
Project-URL: Bug Tracker, https://github.com/spidra-io/spidra-langchain/issues
Author-email: Spidra <support@spidra.io>
License: MIT
License-File: LICENSE
Keywords: ai,document-loader,langchain,llm,spidra,web-scraping
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Requires-Dist: langchain-core>=0.3.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: spidra>=0.2.0
Provides-Extra: dev
Requires-Dist: langchain-openai>=0.2.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=5.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# langchain-spidra

**LangChain integration for [Spidra](https://spidra.io) — AI-native web scraping for LLM workflows.**

[![PyPI version](https://img.shields.io/pypi/v/langchain-spidra.svg)](https://pypi.org/project/langchain-spidra/)
[![Python versions](https://img.shields.io/pypi/pyversions/langchain-spidra.svg)](https://pypi.org/project/langchain-spidra/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Spidra is **not** a traditional scraper that returns raw HTML. You describe the data you want in natural language, and Spidra uses AI to extract it and return clean, structured, LLM-ready content. This package brings that capability directly into LangChain pipelines.

---

## Installation

```bash
pip install langchain-spidra
```

Get your Spidra API key at [app.spidra.io](https://app.spidra.io), then export it as an environment variable:

```bash
export SPIDRA_API_KEY="spd_your_key_here"
```
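
Because the components read `SPIDRA_API_KEY` from the environment when no `api_key` is passed, a script can fail early with a clear message if the variable is missing. A minimal sketch; `require_api_key` is a hypothetical helper and not part of this package:

```python
import os


def require_api_key(env=None, name="SPIDRA_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of langchain-spidra.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running.")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any loader or tool is constructed.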

---

## Quick Start

```python
from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://example.com",
    prompt="Extract the main features and pricing",
    output="markdown",
)
docs = loader.load()
print(docs[0].page_content)
```

---

## Components

### `SpidraLoader` — Document Loader

A LangChain `BaseLoader` that supports three scraping modes:

| Mode | Description | Use case |
|---|---|---|
| `scrape` | AI-powered scrape of a single URL | Single-page Q&A, summarisation |
| `batch` | Scrape multiple URLs in parallel | Compare pages, bulk extraction |
| `crawl` | AI-guided crawl of an entire site | RAG, site-wide analysis |

#### Scrape mode (default)

```python
from langchain_spidra import SpidraLoader

loader = SpidraLoader(
    url="https://spidra.io/pricing",
    prompt="Extract all pricing plans with their names, prices, and features",
    output="json",       # json | markdown | text | table
)
docs = loader.load()
```
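
With `output="json"`, downstream code usually wants a Python object rather than a string. The sketch below assumes that `page_content` holds the extracted data as a JSON-encoded string, which this README does not state explicitly. `parse_page_content` is a hypothetical helper that falls back to the raw value if parsing fails, and the sample payload is illustrative, not real Spidra output.

```python
import json


def parse_page_content(page_content):
    """Parse JSON page content, falling back to the raw value.

    Assumption: with output="json", page_content is a JSON-encoded string.
    """
    try:
        return json.loads(page_content)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON: hand back what we were given.
        return page_content


# Illustrative payload, not real Spidra output.
sample = '{"plans": [{"name": "Starter", "price": "$0"}]}'
data = parse_page_content(sample)
print(data["plans"][0]["name"])
```

With a real loader, this would be applied as `parse_page_content(docs[0].page_content)`.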

#### Batch mode

```python
loader = SpidraLoader(
    urls=[
        "https://spidra.io",
        "https://spidra.io/blog",
        "https://competitor.com",
    ],
    mode="batch",
    prompt="Extract the main headline and product description",
)
docs = loader.load()  # one Document per URL
```

#### Crawl mode

```python
loader = SpidraLoader(
    url="https://spidra.io/blog",
    mode="crawl",
    crawl_instruction="Find all blog posts from 2024 and 2025",
    transform_instruction="Extract the title, publication date, and summary",
    max_pages=20,
)
docs = loader.load()  # one Document per crawled page
```

#### Async support

All three loader modes support async via `aload()` and `alazy_load()`:

```python
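import asyncio

async def main():
    docs = await loader.aload()  # load all documents at once
    async for doc in loader.alazy_load():  # or stream them one by one
        print(doc.page_content)

asyncio.run(main())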
```
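
Because `aload()` is a coroutine, several loaders can be awaited concurrently with `asyncio.gather`. The sketch below uses a stand-in class (`StubLoader`, hypothetical) in place of `SpidraLoader` so that it runs without network access or an API key. It returns plain strings rather than `Document` objects.

```python
import asyncio


class StubLoader:
    """Stand-in for SpidraLoader; only mimics the async aload() method."""

    def __init__(self, url):
        self.url = url

    async def aload(self):
        await asyncio.sleep(0)  # simulate waiting on I/O
        return [f"content of {self.url}"]


async def load_all(loaders):
    """Await every loader concurrently and flatten the results in order."""
    results = await asyncio.gather(*(loader.aload() for loader in loaders))
    return [doc for docs in results for doc in docs]


docs = asyncio.run(load_all([StubLoader("https://example.com/a"),
                             StubLoader("https://example.com/b")]))
print(docs)
```

`asyncio.gather` returns results in the order the awaitables were passed, so the flattened list follows the order of the loaders.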

#### Full parameter reference

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | — | URL to scrape (`scrape`/`crawl` modes) |
| `urls` | `List[str]` | — | URLs to scrape (`batch` mode) |
| `api_key` | `str` | `SPIDRA_API_KEY` env | Spidra API key |
| `mode` | `str` | `"scrape"` | `"scrape"`, `"batch"`, or `"crawl"` |
| `prompt` | `str` | `"Extract the main content..."` | What data to extract |
| `output` | `str` | `"markdown"` | `"json"`, `"markdown"`, `"text"`, `"table"` |
| `crawl_instruction` | `str` | — | Which pages to discover (crawl mode) |
| `transform_instruction` | `str` | — | What to extract per page (crawl mode) |
| `max_pages` | `int` | — | Max pages to crawl (crawl mode) |
| `use_proxy` | `bool` | — | Route through residential proxy |
| `proxy_country` | `str` | — | ISO country code for geo-targeted proxy |
| `extract_content_only` | `bool` | — | Strip nav/footer boilerplate |
| `cookies` | `str` | — | Raw cookie header string |
| `poll_options` | `PollOptions` | — | Custom polling timeout/interval |

---

### `SpidraScrapeTool` — LangChain Tool

Use Spidra as a tool in agent workflows. The agent decides when and what to scrape.

```python
from langchain_spidra import SpidraScrapeTool
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, ToolMessage

tool = SpidraScrapeTool()  # reads SPIDRA_API_KEY from env

# Bind to a model
model = ChatOpenAI(model="gpt-4o-mini").bind_tools([tool])

messages = [HumanMessage(
    content="What does Spidra cost? Check https://spidra.io/pricing"
)]
response = model.invoke(messages)

# Execute tool calls and get the final answer
if response.tool_calls:
    messages.append(response)
    for tc in response.tool_calls:
        result = tool.invoke(tc["args"])
        messages.append(ToolMessage(content=result, tool_call_id=tc["id"]))
    final = model.invoke(messages)
    print(final.content)
```
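
When several tools are bound to the model, each tool call has to be routed to the right tool by its `name` field. The sketch below shows that dispatch step in isolation, using a plain function as a stand-in for a tool. The tool name `spidra_scrape`, the `url` argument, and the stub output are assumptions for illustration, so check `tool.name` and the argument schema on the real `SpidraScrapeTool`.

```python
def run_tool_calls(tool_calls, tools_by_name):
    """Run each tool call and return (tool_call_id, result) pairs.

    tool_calls uses the LangChain shape: dicts with "name", "args", "id".
    tools_by_name maps a tool name to a callable taking the args dict.
    """
    results = []
    for call in tool_calls:
        tool = tools_by_name.get(call["name"])
        if tool is None:
            # Report unknown tools back to the model instead of crashing.
            results.append((call["id"], f"Unknown tool: {call['name']}"))
            continue
        results.append((call["id"], tool(call["args"])))
    return results


# Stand-in for a real tool's invoke method (hypothetical name and output).
def fake_scrape(args):
    return f"scraped {args['url']}"


calls = [{"name": "spidra_scrape", "args": {"url": "https://example.com"}, "id": "call_1"}]
print(run_tool_calls(calls, {"spidra_scrape": fake_scrape}))
```

With real tools, the mapping could be built as `{t.name: t.invoke for t in tools}`, and each pair turned into a `ToolMessage` as in the example above.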

---

### `SpidraRetriever` — LangChain Retriever

Drop-in retriever for RAG pipelines. Crawls a site with Spidra and uses your query as the AI extraction instruction.

```python
from langchain_spidra import SpidraRetriever
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = SpidraRetriever(
    url="https://spidra.io/docs",
    crawl_instruction="Find all documentation pages",
    max_pages=15,
)

# Use in a RAG chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | ChatPromptTemplate.from_template(
        "Answer based on context:\n{context}\n\nQuestion: {question}"
    )
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

answer = chain.invoke("How do I authenticate with the Spidra API?")
print(answer)
```
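
In the chain above, the retriever returns a list of `Document` objects, and the prompt template inserts that list using its default string form. A common LangChain pattern is to join the page contents into plain text first. The sketch below uses a minimal stand-in `Doc` class (hypothetical) so that it runs on its own; real documents expose the same `page_content` attribute.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs, separator="\n\n"):
    """Join non-empty page contents into one context string."""
    return separator.join(
        d.page_content.strip() for d in docs if d.page_content.strip()
    )


context = format_docs([Doc("First page."), Doc("  "), Doc("Second page.")])
print(context)
```

In an LCEL chain, the function can be piped after the retriever, for example `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.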

---

## Scrape + Chat in 10 lines

```python
from langchain_spidra import SpidraLoader
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

docs = SpidraLoader(
    url="https://spidra.io",
    prompt="Extract all key product information",
).load()

answer = ChatOpenAI(model="gpt-4o-mini").invoke([
    HumanMessage(content=f"Summarise in 3 bullets:\n\n{docs[0].page_content}")
])
print(answer.content)
```

---

## Examples

See the [`examples/`](./examples) directory:

| File | What it shows |
|---|---|
| `scrape_and_chat.py` | Scrape a URL → chat with the content |
| `chains.py` | LCEL chain: scrape → extract → format |
| `tool_calling_agent.py` | Agent with `SpidraScrapeTool` |
| `structured_extraction.py` | Typed Pydantic output from a scraped page |
| `batch_scrape.py` | Scrape multiple URLs at once |
| `rag_pipeline.py` | Full RAG: crawl → embed → vector store → Q&A |

---

## Development

```bash
git clone https://github.com/spidra-io/spidra-langchain
cd spidra-langchain
pip install -e ".[dev]"
pytest
```

---

## Links

- [Spidra Website](https://spidra.io)
- [Spidra Documentation](https://docs.spidra.io)
- [LangChain Documentation](https://docs.langchain.com)
- [PyPI Package](https://pypi.org/project/langchain-spidra/)

---

## License

MIT © [Spidra](https://spidra.io)
