Metadata-Version: 2.4
Name: firescraper
Version: 1.0.0
Summary: Python SDK for FireScraper — web scraping for AI pipelines.
Project-URL: Homepage, https://firescraper.com
Project-URL: Documentation, https://firescraper.com/docs/python-sdk
Project-URL: Repository, https://github.com/moloks-technologies/firescraper-python
Project-URL: Issues, https://github.com/moloks-technologies/firescraper-python/issues
Project-URL: Changelog, https://github.com/moloks-technologies/firescraper-python/blob/main/CHANGELOG.md
Author-email: "Moloks Technologies Inc." <support@firescraper.com>
License-Expression: MIT
License-File: LICENSE
Keywords: ai,crawling,data-extraction,firescraper,llm,rag,web-scraping
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: >=3.9
Requires-Dist: httpx>=0.24.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest-httpx>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-firescraper; extra == 'langchain'
Description-Content-Type: text/markdown

# FireScraper Python SDK

Official Python SDK for [FireScraper](https://firescraper.com) — web scraping built for AI pipelines.

Turn websites into clean, structured text for RAG, fine-tuning, and AI agent workflows.

## Installation

```bash
pip install firescraper
```

With LangChain integration:

```bash
pip install firescraper langchain-firescraper
```

## Quick Start

```python
from firescraper import FireScraper

client = FireScraper("fsk_your_api_key")

# Start a crawl
session = client.scrape(
    name="Docs crawl",
    urls=["https://docs.example.com/"],
    max_depth=2,
    scraper="article",
)

# Wait for completion
result = client.wait_for_completion(session.id)
print(f"Scraped {result.counts.success} pages")

# Download results
download = client.get_results(session.id, format="json")
with open("results.json", "wb") as f:
    f.write(download.data)
```

## Async Usage

```python
import asyncio

from firescraper import AsyncFireScraper

async def main() -> None:
    # "async with" is only valid inside a coroutine, so wrap the calls in main()
    async with AsyncFireScraper("fsk_your_api_key") as client:
        session = await client.scrape(
            name="Async crawl",
            urls=["https://example.com/"],
            max_depth=1,
        )
        result = await client.wait_for_completion(session.id)
        download = await client.get_results(session.id, format="markdown")

asyncio.run(main())
```

## LangChain Integration

```python
from langchain_firescraper import FireScraperLoader

loader = FireScraperLoader(
    api_key="fsk_your_api_key",
    urls=["https://docs.example.com/"],
    max_depth=2,
)

# Load all documents
docs = loader.load()
for doc in docs:
    print(doc.metadata["url"], len(doc.page_content))

# Or stream documents one at a time with lazy_load
for doc in loader.lazy_load():
    process(doc)  # process() stands in for your own handler
```

## API Reference

### `FireScraper(api_key, *, base_url, timeout)`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `api_key` | `str` | required | API key (starts with `fsk_`) |
| `base_url` | `str` | `https://firescraper.com` | API base URL |
| `timeout` | `float` | `30.0` | HTTP request timeout in seconds |

### Methods

#### `scrape(name, urls, max_depth=1, scraper="article", **kwargs)`

Start a new crawl session.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `name` | `str` | required | Human-readable session name |
| `urls` | `list[str]` | required | Seed URLs |
| `max_depth` | `int` | `1` | Link-hop depth (0 = seeds only) |
| `scraper` | `str` | `"article"` | `"article"` or `"full"` |
| `ignore_urls` | `list[str]` | `None` | URLs to exclude |
| `webhook_url` | `str` | `None` | Callback URL on completion |
| `extraction_schema` | `dict` | `None` | JSON Schema for structured extraction |
| `respect_robots_txt` | `bool` | `None` | Whether the crawler honors `robots.txt` rules |
| `content_selector` | `str` | `None` | CSS selector for extraction |

Returns a `ScrapeResponse` with `.id`, `.status`, `.message`.
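
The `extraction_schema` parameter takes a JSON Schema as a plain `dict`. A minimal sketch of what such a schema and the matching keyword arguments might look like follows; the property names (`title`, `price`, `in_stock`), the shop URL, and the `main` selector are hypothetical and only illustrate the shape.

```python
import json

# Hypothetical schema: the property names are illustrative, not defined by the SDK.
extraction_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["title"],
}

# Keyword arguments as they might be passed on: client.scrape(**scrape_kwargs)
scrape_kwargs = {
    "name": "Product crawl",
    "urls": ["https://shop.example.com/"],
    "max_depth": 0,  # 0 = seeds only
    "extraction_schema": extraction_schema,
    "content_selector": "main",  # hypothetical CSS selector
}

# Confirm the schema is JSON-serializable before sending it anywhere.
print(json.dumps(extraction_schema, indent=2))
```

Keeping the schema in a named variable makes it easy to reuse the same definition when validating the downloaded results.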

#### `get_session(session_id)`

Get current session status, including page counts and processing state.

#### `wait_for_completion(session_id, poll_interval=5, timeout=300, on_progress=None)`

Poll until the session reaches a terminal status (`done`, `error`, etc.).

```python
def progress(status):
    print(f"{status.counts.success}/{status.counts.total} pages")

result = client.wait_for_completion(session.id, on_progress=progress)
```
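
For intuition, polling until a terminal status can be sketched with the standard library alone. This is not the SDK's implementation: `poll_until_terminal` is a hypothetical helper, the statuses `queued` and `running` are made up for the demo, and only `done` and `error` are treated as terminal because those are the two named above.

```python
import time
from typing import Callable, Optional

# Assumed terminal statuses; the SDK may define more.
TERMINAL = {"done", "error"}

def poll_until_terminal(
    get_status: Callable[[], str],
    poll_interval: float = 5.0,
    timeout: float = 300.0,
    on_progress: Optional[Callable[[str], None]] = None,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Call get_status() until it returns a terminal status or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if on_progress is not None:
            on_progress(status)
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(poll_interval)

# Demo with a stubbed status source and a no-op sleep, so it finishes instantly.
statuses = iter(["queued", "running", "running", "done"])
seen = []
final = poll_until_terminal(
    lambda: next(statuses), on_progress=seen.append, sleep=lambda _: None
)
print(final, seen)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real delays.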

#### `list_results(session_id)`

List available result files for a completed session.

#### `get_results(session_id, format="json")`

Download results. Supported formats: `zip`, `csv`, `json`, `markdown`, `structured`, `manifest`, `documents`, `chunks`, `extracted`. Use `documents` for page-level JSONL output.
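
JSONL stores one JSON object per line, so the `documents` output can be parsed line by line. The sketch below assumes the download arrives as raw bytes (as the Quick Start suggests by writing `download.data` in binary mode); `parse_jsonl` is a hypothetical helper and the `url` and `text` field names in the sample are illustrative, not the documented record layout.

```python
import json
from typing import Any

def parse_jsonl(data: bytes) -> list[dict[str, Any]]:
    """Decode JSONL bytes into one dict per non-empty line."""
    # Split on "\n" only: JSON strings may legally contain other Unicode line separators.
    return [json.loads(line) for line in data.split(b"\n") if line.strip()]

# Sample payload; the field names are hypothetical.
sample = (
    b'{"url": "https://example.com/a", "text": "Alpha"}\n'
    b"\n"
    b'{"url": "https://example.com/b", "text": "Beta"}\n'
)
pages = parse_jsonl(sample)
print(len(pages), pages[0]["url"])
```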

#### `get_partial_results(session_id, format="csv")`

Download mid-crawl results while the session is still running.
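
Partial CSV output can be inspected with the standard `csv` module while the crawl continues. This is a sketch under assumptions: `parse_csv` is a hypothetical helper, the payload is assumed to be UTF-8 bytes with a header row, and the `url` and `status` columns in the sample are illustrative rather than the actual column set.

```python
import csv
import io

def parse_csv(data: bytes) -> list[dict[str, str]]:
    """Decode CSV bytes (with a header row) into one dict per data row."""
    text = data.decode("utf-8-sig")  # also strips a leading BOM if present
    return list(csv.DictReader(io.StringIO(text, newline="")))

# Sample payload; the column names are hypothetical.
sample = (
    b"url,status\r\n"
    b"https://example.com/,success\r\n"
    b"https://example.com/about,success\r\n"
)
rows = parse_csv(sample)
print(len(rows), rows[0]["url"])
```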

## Error Handling

```python
from firescraper import FireScraper, FireScraperError, AuthenticationError, RateLimitError

client = FireScraper("fsk_your_api_key")

try:
    session = client.scrape(name="Test", urls=["https://example.com"])
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Rate limited — try again later")
except FireScraperError as e:
    print(f"API error: {e.message} (code={e.code}, status={e.status})")
```
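
One common way to act on "try again later" is to retry with exponential backoff. The sketch below is an assumption about how a caller might do this, not SDK behavior: `with_backoff` is a hypothetical helper, and a stand-in `RateLimitError` class is defined so the example runs without the SDK installed. In real code, import the exception from `firescraper` instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

class RateLimitError(Exception):
    """Stand-in so the sketch runs on its own; use firescraper's class in real code."""

def with_backoff(
    call: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on RateLimitError with delays of base_delay * 2**attempt."""
    for attempt in range(retries):
        try:
            return call()
        except RateLimitError:
            sleep(base_delay * 2 ** attempt)
    return call()  # final attempt; any error propagates to the caller

# Demo: a stub that is rate limited twice, then succeeds.
attempts = {"n": 0}

def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError("slow down")
    return "ok"

delays = []
outcome = with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
print(outcome, delays)
```

With the SDK, the callable could be something like `lambda: client.scrape(name="Test", urls=["https://example.com"])`.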

## Advanced: Progress Tracking

```python
urls = ["https://docs.example.com/"]  # seed URLs for the crawl
session = client.scrape(name="Large crawl", urls=urls, max_depth=5)

result = client.wait_for_completion(
    session.id,
    poll_interval=3,
    timeout=600,
    on_progress=lambda s: print(
        f"[{s.session.status}] {s.counts.success} pages, "
        f"queue: {s.processing.queue_length}"
    ),
)
```

## License

MIT
