Metadata-Version: 2.4
Name: scrapurrr
Version: 0.1.6
Summary: Agentic Web Scraper
Author-email: "Klyne Chrysler C. Dotarot" <klyne@inventivlabs.io>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: click>=8.0
Requires-Dist: httpx>=0.27
Requires-Dist: litellm>=1.0
Requires-Dist: markdownify>=0.13
Requires-Dist: playwright>=1.40
Requires-Dist: pydantic>=2.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0
Provides-Extra: dev
Requires-Dist: coverage>=7.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-httpserver>=1.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: types-beautifulsoup4>=4.12; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

<p align="center">
  <img src="public/COVER.png" alt="Scrapurrr" width="600" />
</p>

<p align="center">
  <strong>Agentic web scraper with schema-driven extraction.</strong>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/python-3.11%2B-blue" alt="Python" />
  <img src="https://img.shields.io/badge/license-MIT-green" alt="License" />
  <img src="https://img.shields.io/badge/version-0.1.6-orange" alt="Version" />
</p>

## What is Scrapurrr?

Define a Pydantic schema, point it at a URL, and get back typed data. Scrapurrr handles rendering, anti-detection, pagination, and extraction automatically.

**Core features:**

- Schema-driven extraction. Define what you want, get a typed object back.
- Interactive chat CLI. Talk to scrapurrr in natural language, navigate pages, extract elements.
- Element inspection. Get CSS selectors, XPath, full XPath, JS path, outerHTML, and styles for any element.
- Agent mode. Autonomous navigation, clicking, scrolling, and form-filling across pages.
- 100+ LLM providers. OpenAI, Anthropic, Groq, Ollama, or any LiteLLM-compatible endpoint.
- Smart fetching. HTTP-first with automatic browser fallback for JS-heavy pages.
- Stealth built-in. Fingerprint masking, human-like behavior, proxy rotation.
- Batch and pagination. Concurrent multi-URL extraction with auto-pagination.
- MCP server. Expose scraping as tools for AI assistants.

## Install

```bash
pip install scrapurrr
playwright install chromium
```

## Quick Start

```python
import asyncio
from pydantic import BaseModel
from scrapurrr import Scrapurrr

class Article(BaseModel):
    title: str
    author: str
    published: str

async def main():
    async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
        article = await scraper.extract("https://example.com/article", Article)
        print(article.title)

asyncio.run(main())
```

## Interactive Chat

Start an interactive scraping session from the terminal:

```bash
scrapurrr -p ollama/llama3 chat
```

```
scrapurrr v0.1.6

> go to https://shop.example.com
Navigated to https://shop.example.com

> find "price"
Found 3 elements matching "price":
  [0] span "$29.99"
      css: span.product-price
      xpath: //span[@class='product-price']

> get xpath of all buttons
  [0] //button[@class='add-to-cart']    "Add to Cart"
  [1] //button[@id='search']            "Search"

> what products are on this page?
There are 4 products listed: Widget Pro ($29.99), Widget Max ($49.99)...

> exit
```

The browser stays open between messages. Direct commands like `go to`, `find`, `get xpath`, `scroll`, `click`, and `back` run instantly without calling the LLM. Everything else goes through the LLM with full page context.

## Element Extraction

Extract CSS selectors, XPath, JS path, outerHTML, and computed styles for any element on a page.

```python
async with Scrapurrr(provider="ollama/llama3") as scraper:
    # All elements on a page
    elements = await scraper.extract_elements("https://shop.example.com")

    # Filter by tag or text
    buttons = await scraper.extract_elements("https://shop.example.com", tag="button")
    prices = await scraper.extract_elements("https://shop.example.com", text="price")

    # Single element lookup
    el = await scraper.find_element("Add to Cart", url="https://shop.example.com")
    print(el.css)        # "button.add-to-cart"
    print(el.xpath)      # "//button[@class='add-to-cart']"
    print(el.full_xpath) # "/html/body/div[2]/main/button[3]"
    print(el.js_path)    # "document.querySelector('button.add-to-cart')"
    print(el.outer_html) # "<button class='add-to-cart'>Add to Cart</button>"
    print(el.styles)     # {"color": "white", "backgroundColor": "#1a73e8", ...}
```

## Usage

### Extract from a single page

```python
class Product(BaseModel):
    name: str
    price: str
    rating: str

async with Scrapurrr(provider="ollama/llama3") as scraper:
    product = await scraper.extract("https://shop.example.com/item/42", Product)
```

### Extract a list of items

```python
class Job(BaseModel):
    title: str
    company: str
    location: str

async with Scrapurrr(provider="ollama/llama3") as scraper:
    jobs = await scraper.extract("https://jobs.example.com/python", list[Job])
```

### Agent mode

The agent navigates, clicks, scrolls, and fills forms autonomously.

```python
class SearchResult(BaseModel):
    title: str
    url: str
    snippet: str

async with Scrapurrr(provider="openai/gpt-4o", api_key="sk-...") as scraper:
    results = await scraper.agent(
        task="Go to https://news.ycombinator.com and collect the top 5 stories",
        schema=list[SearchResult],
        max_steps=15,
    )
```

### Batch extraction

```python
urls = ["https://shop.example.com/product/1", "https://shop.example.com/product/2", ...]

async with Scrapurrr(provider="ollama/llama3") as scraper:
    products = await scraper.extract_many(urls, Product, concurrency=10)
```

### Auto-pagination

```python
async with Scrapurrr(provider="ollama/llama3") as scraper:
    all_products = await scraper.extract_all_pages(
        "https://shop.example.com/products?page=1",
        schema=list[Product],
        max_pages=20,
    )
```

## Providers

Provider strings follow [LiteLLM format](https://docs.litellm.ai/docs/providers): `provider/model`.

```python
# OpenAI
scraper = Scrapurrr(provider="openai/gpt-4o", api_key="sk-...")

# Anthropic
scraper = Scrapurrr(provider="anthropic/claude-sonnet-4-20250514", api_key="sk-ant-...")

# Groq
scraper = Scrapurrr(provider="groq/llama-3.1-70b-versatile", api_key="gsk_...")

# Ollama (local, no key needed)
scraper = Scrapurrr(provider="ollama/llama3")

# Self-hosted (vLLM, LM Studio)
scraper = Scrapurrr(provider="openai/mistral-7b", base_url="http://localhost:8000/v1")
```

## Configuration

Copy the example config and point to it:

```bash
cp examples/scrapurrr.yaml scrapurrr.yaml
```

```python
from pathlib import Path
scraper = Scrapurrr(config_path=Path("scrapurrr.yaml"))
```

Constructor arguments override the config file. Environment variables are supported with the `env:` prefix:

```yaml
llm:
  provider: openai/gpt-4o
  api_key: env:OPENAI_API_KEY
```

## CLI

```bash
# Interactive chat session
scrapurrr -p ollama/llama3 chat

# Extract from a URL
scrapurrr extract "https://example.com/product" -s models:Product

# Save as CSV
scrapurrr extract "https://example.com/product" -s models:Product -o result.csv --format csv

# Agent mode
scrapurrr agent "Collect the top 10 products from https://shop.example.com" \
  -s models:Product --max-steps 30

# Batch extract from URL list
scrapurrr batch urls.txt -s models:Product --concurrency 10 -o results.json

# Start MCP server
scrapurrr serve
```
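For `batch`, the URL list is presumably a plain-text file with one URL per line (format assumed, not documented in this README):

```
https://shop.example.com/product/1
https://shop.example.com/product/2
https://shop.example.com/product/3
```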

The `-s` flag takes a `module:Class` reference: a Pydantic model importable from your current working directory.
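For the `-s models:Product` examples above, a minimal `models.py` in the working directory might look like this (the field names are illustrative, mirroring the earlier `Product` schema):

```python
# models.py -- schema module imported by the scrapurrr CLI via -s models:Product
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: str
    rating: str
```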

## License

MIT. See [LICENSE](LICENSE).
