Metadata-Version: 2.4
Name: crewai-browserless
Version: 1.0.0
Summary: A CrewAI tool for smart web scraping via the Browserless /smart-scrape API.
Project-URL: Homepage, https://www.browserless.io
Project-URL: Repository, https://github.com/browserless/crewai-tool
Project-URL: Documentation, https://docs.browserless.io
Project-URL: Issues, https://github.com/browserless/crewai-browserless/issues
Author-email: Browserless <support@browserless.io>
License-Expression: SSPL-1.0
Keywords: automation,browserless,crewai,scraping,web-scraping
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Requires-Dist: crewai
Requires-Dist: requests
Description-Content-Type: text/markdown

# crewai-browserless

A [CrewAI](https://docs.crewai.com/) tool for scraping web pages using the [Browserless](https://www.browserless.io/) `/smart-scrape` API. Handles anti-bot detection, captchas, and proxying automatically.

## Installation

```bash
pip install crewai-browserless
```

Or with uv:

```bash
uv add crewai-browserless
```

## Setup

Get a Browserless API token at https://www.browserless.io/, then set the `BROWSERLESS_API_TOKEN` environment variable:

```bash
export BROWSERLESS_API_TOKEN="your-token"

# for private deployments:
# export BROWSERLESS_API_URL="https://your-browserless-instance.com"
```

## Usage

### With a CrewAI agent

```python
from crewai import Agent, Crew, Task
from crewai_browserless import BrowserlessSmartScrapeTool

agent = Agent(
    role="Web Researcher",
    goal="Scrape web pages and summarize their content.",
    backstory="An expert at extracting useful information from websites.",
    tools=[BrowserlessSmartScrapeTool()],
)

task = Task(
    description="Scrape https://www.browserless.io/blog/java-memory-leak and summarize what the page is about.",
    expected_output="A short summary of the page content.",
    agent=agent,
)

crew = Crew(agents=[agent], tasks=[task])
result = crew.kickoff()

print(result)
```

### Standalone

```python
from crewai_browserless import BrowserlessSmartScrapeTool

tool = BrowserlessSmartScrapeTool()

# Call the tool directly
result = tool.run(url="https://en.wikipedia.org/wiki/Headless_browser", formats=["markdown"])

print(result)
```

## Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url` | `str` | *required* | The URL to scrape (http/https only) |
| `formats` | `list[str]` | `["markdown"]` | Output formats: `markdown`, `html`, `screenshot`, `pdf`, `links` |
| `timeout` | `int \| None` | `None` | Timeout in milliseconds (uses server default if not set) |
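The constraints in the table above can be checked before invoking the tool. The following helper is an illustrative sketch only (the tool may validate its inputs differently); `validate_scrape_args` and `ALLOWED_FORMATS` are hypothetical names, not part of the package's API:

```python
# Client-side sanity checks mirroring the Parameters table above.
ALLOWED_FORMATS = {"markdown", "html", "screenshot", "pdf", "links"}

def validate_scrape_args(url: str, formats=None, timeout=None) -> dict:
    """Return a normalized argument dict, raising ValueError on bad input."""
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must use http or https")
    formats = list(formats or ["markdown"])  # default format, per the table
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    if timeout is not None and timeout <= 0:
        raise ValueError("timeout must be a positive number of milliseconds")
    args = {"url": url, "formats": formats}
    if timeout is not None:
        args["timeout"] = timeout
    return args
```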

## Environment Variables

| Variable | Required | Description |
|---|---|---|
| `BROWSERLESS_API_TOKEN` | Yes | API token for authentication |
| `BROWSERLESS_API_URL` | No | Base URL of a self-hosted Browserless instance; if unset, the Browserless cloud endpoint is used |
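Resolving these variables might look like the sketch below. Note that `resolve_config` is a hypothetical helper, and the fallback URL is a placeholder, not the tool's actual default:

```python
import os

def resolve_config() -> tuple[str, str]:
    """Read Browserless settings from the environment.

    BROWSERLESS_API_TOKEN is mandatory; BROWSERLESS_API_URL falls back to a
    placeholder cloud URL here (the real default is whatever the tool ships).
    """
    token = os.environ.get("BROWSERLESS_API_TOKEN")
    if not token:
        raise RuntimeError("BROWSERLESS_API_TOKEN is required")
    base_url = os.environ.get("BROWSERLESS_API_URL", "https://cloud.example-browserless.io")
    return base_url, token
```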

## How It Works

The tool sends a POST request to the Browserless `/smart-scrape` endpoint, which uses a cascading strategy pipeline:

1. **HTTP fetch** — fast, direct request
2. **HTTP fetch with proxy** — retries through a residential proxy
3. **Browser rendering** — headless browser for JavaScript-heavy pages
4. **Browser with captcha solving** — handles captcha challenges automatically

The first strategy that succeeds returns the result. If `screenshot` or `pdf` formats are requested, browser strategies are used automatically.
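For the curious, the request the tool issues can be sketched as a plain POST. The endpoint path comes from this README; the payload keys mirror the Parameters table, but the exact wire format and the query-parameter token auth are assumptions, not the documented API:

```python
def build_smart_scrape_request(base_url, token, url, formats=("markdown",), timeout=None):
    """Assemble the pieces of a /smart-scrape POST (assumed wire format)."""
    endpoint = f"{base_url.rstrip('/')}/smart-scrape"
    payload = {"url": url, "formats": list(formats)}
    if timeout is not None:
        payload["timeout"] = timeout
    params = {"token": token}  # assumption: token passed as a query parameter
    return endpoint, params, payload

# Sending it requires the requests package, a valid token, and network access:
# endpoint, params, payload = build_smart_scrape_request(
#     "https://your-browserless-instance.com", "your-token", "https://example.com"
# )
# resp = requests.post(endpoint, params=params, json=payload)
```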

## License

SSPL-1.0
