Metadata-Version: 2.4
Name: browsewright
Version: 0.1.0
Summary: Give an LLM a URL and a goal — it drives a real browser, fills forms, and returns structured data. The browser that scripts itself.
Author: krishnashakula
License: MIT
Project-URL: Homepage, https://github.com/krishnashakula/browsewright
Project-URL: Repository, https://github.com/krishnashakula/browsewright
Project-URL: Issues, https://github.com/krishnashakula/browsewright/issues
Keywords: llm,agent,ai-agent,browser-automation,web-scraping,form-filling,structured-extraction,playwright,mcp,anthropic
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.40.0
Requires-Dist: nodriver>=0.38
Requires-Dist: curl_cffi>=0.7.0
Requires-Dist: Pillow>=10.0.0
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: mcp
Requires-Dist: mcp>=1.27.2; extra == "mcp"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Dynamic: license-file

<div align="center">

# 🦾 Browsewright

### The browser that scripts itself.

**Give an LLM a URL and a goal. It drives a real Chrome, fills out forms, gets past
bot walls, and hands you structured data, not raw HTML.**

[![CI](https://github.com/krishnashakula/browsewright/actions/workflows/ci.yml/badge.svg)](https://github.com/krishnashakula/browsewright/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/browsewright?color=2563eb)](https://pypi.org/project/browsewright/)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)
[![PRs welcome](https://img.shields.io/badge/PRs-welcome-orange)](https://github.com/krishnashakula/browsewright/pulls)
[![Stars](https://img.shields.io/github/stars/krishnashakula/browsewright?style=social)](https://github.com/krishnashakula/browsewright)

</div>

---

> **Playwright automates a browser _you_ script.**
> **Browsewright is the browser that scripts _itself_.**

You don't write selectors. You don't maintain scrapers that break every time a
site ships a redesign. You give it intent — *"find the pricing"*, *"enrich this
lead"*, *"fill out this form"* — and an LLM drives a real browser to get it done.

```bash
pip install browsewright
bw "https://stripe.com" "what does this company do and who is it for"
```

```
============================================================
RESULT  [api]  412 tokens  3.1s
------------------------------------------------------------
Stripe is financial infrastructure for the internet. It provides
payment processing, billing, and treasury APIs for businesses from
startups to enterprises like Amazon and Shopify...
============================================================
```

---

## 🤯 It doesn't just read the web. It *does things* on the web.

Most "AI scrapers" hand you text. Browsewright **acts**. Point it at a real
government records form with no API, give it a profile, and walk away:

```bash
bw-tasks form \
  "https://registers.maryland.gov/RowNetWeb/Estates/frmEstateSearch2.aspx" \
  --profile examples/sample_profile.json
```

It read the field labels, mapped your profile onto the form with an LLM, picked
valid dropdown options, submitted the form, and came back with:

> **Page 1 of 815** results — real names and dates, extracted as JSON.

No selectors. No XPath. No API, because the form has none: it's a 20-year-old
ASP.NET page that plain HTTP scrapers can't operate. Browsewright drives it like a human.

---

## 💸 And it's almost free

> **Benchmark — 50 real, diverse websites in one run:**
> **50 / 50 extracted successfully · `$0.047` total · ~1,200 tokens & ~20s median per site.**
> 28% of sites were answered by the free API/archive shortcut with **no browser at all.**
> _(Reproduce it: `python examples/batch_test.py`.)_

It tries the **cheapest path first** (open APIs, RSS, public archives) and only
spins up Chrome when a page actually needs it. You pay pennies for pages a
shortcut can answer and a real browser only for the rest.
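
The cheapest-path-first ordering can be sketched as an ordered chain of fetch strategies that stops at the first hit. This is a minimal illustration, not Browsewright's actual implementation: `first_hit`, `try_open_api`, `try_browser`, and `CHAIN` are hypothetical names, and the two strategies are offline stand-ins.

```python
from typing import Callable, Optional

# A strategy takes a URL and returns extracted text, or None if it cannot help.
Strategy = Callable[[str], Optional[str]]


def first_hit(url: str, strategies: list[tuple[str, Strategy]]) -> tuple[str, Optional[str]]:
    """Try each (name, strategy) pair in order and return the first non-None result."""
    for name, strategy in strategies:
        result = strategy(url)
        if result is not None:
            return name, result  # stop early: later, costlier strategies never run
    return "blocked", None


# Hypothetical stand-ins, ordered cheapest first.
def try_open_api(url: str) -> Optional[str]:
    # Pretend only blog-like URLs expose a feed.
    return "feed content" if "blog" in url else None


def try_browser(url: str) -> Optional[str]:
    # Stand-in for the expensive path; it always succeeds in this sketch.
    return "rendered page content"


CHAIN = [("api", try_open_api), ("browser", try_browser)]

print(first_hit("https://example.com/blog", CHAIN))     # ('api', 'feed content')
print(first_hit("https://example.com/pricing", CHAIN))  # ('browser', 'rendered page content')
```

Because the chain returns as soon as one strategy succeeds, the cost of a request is set by the cheapest strategy that works for that page.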

---

## How it stacks up

| | **Browsewright** | Firecrawl | Browser-Use | Tavily |
|---|:---:|:---:|:---:|:---:|
| Returns structured JSON from intent | ✅ | ✅ | ⚠️ scripted | ✅ |
| **Fills & submits real forms** | ✅ | ❌ | ✅ | ❌ |
| Drives a **real** Chrome (human motor layer) | ✅ | ❌ | ✅ | ❌ |
| Gets past Cloudflare/DataDome bot walls | ✅ | ⚠️ | ⚠️ | ❌ |
| Free API/archive shortcut before any browser | ✅ | ❌ | ❌ | ❌ |
| Runs **fully local**, your own API key | ✅ | ❌ SaaS | ✅ | ❌ SaaS |
| 5 ready-made business tasks built in | ✅ | ❌ | ❌ | ❌ |
| MIT, self-hostable | ✅ | partial | ✅ | ❌ |

*Comparisons reflect typical default usage; all four are good tools. Browsewright's
bet is **intent in → action + structured data out**, run locally for pennies.*

---

## Install

```bash
pip install browsewright          # core
pip install "browsewright[mcp]"   # + MCP server (Claude Desktop / Code / any client)
```

Or from source:

```bash
git clone https://github.com/krishnashakula/browsewright && cd browsewright
python -m venv .venv && . .venv/bin/activate    # Windows: .venv\Scripts\activate
pip install -e .
```

Add your Anthropic API key:

```bash
cp .env.example .env
# edit .env and paste your key from https://console.anthropic.com/settings/keys
```

The first browser run launches Chrome via `nodriver` (Chrome must be installed).

> **`bw` "not recognized" after install?** pip put the scripts in a folder that
> isn't on your PATH (common on Windows). Use the module form, which always works:
> `python -m browsewright "<url>" "<goal>"` · `python -m browsewright.tasks_cli enrich "<url>"`

---

## Use it

### CLI

```bash
bw "https://news.ycombinator.com" "the top story right now"
bw "https://example.com" "find the pricing" --json
bw "https://example.com" "debug this" --no-headless --verbose
```

### Python

```python
import asyncio
from browsewright import search

res = asyncio.run(search("https://stripe.com", "what does this company do"))
print(res.answer)         # synthesized answer
print(res.stage)          # "api" | "browser" | "common_crawl" | "blocked" | "error"
print(res.tokens_total, res.elapsed_s)
```
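
For more than one URL, a bounded-concurrency batch is a natural next step. The sketch below is an assumption-laden illustration: `run_batch`, `summarize`, `FakeResult`, and `fake_search` are hypothetical helpers, and the stub stands in for `browsewright.search` so the example runs offline. Only the `stage` and `tokens_total` fields come from the usage shown above.

```python
import asyncio
from collections import Counter
from dataclasses import dataclass


@dataclass
class FakeResult:
    """Stand-in carrying only the fields this sketch reads from a search result."""
    stage: str
    tokens_total: int


async def fake_search(url: str, goal: str) -> FakeResult:
    # Hypothetical stub so the sketch runs offline; swap in browsewright.search.
    await asyncio.sleep(0)
    return FakeResult(stage="api" if "blog" in url else "browser", tokens_total=400)


async def run_batch(search_fn, urls, goal, limit=3):
    """Run search_fn over urls with at most `limit` in flight; results keep input order."""
    gate = asyncio.Semaphore(limit)

    async def one(url):
        async with gate:
            return await search_fn(url, goal)

    return await asyncio.gather(*(one(u) for u in urls))


def summarize(results):
    """Count results per stage and sum token usage."""
    stages = Counter(r.stage for r in results)
    return dict(stages), sum(r.tokens_total for r in results)


urls = ["https://a.example/blog", "https://b.example/pricing", "https://c.example/blog"]
results = asyncio.run(run_batch(fake_search, urls, "what does this company do"))
print(summarize(results))  # ({'api': 2, 'browser': 1}, 1200)
```

Passing the search function in as an argument keeps the batching logic testable without a network connection or an API key.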

### As an MCP tool (Claude Desktop / Claude Code / any MCP client)

```jsonc
{ "mcpServers": { "browsewright": { "command": "bw-mcp" } } }
```

Your LLM now has a `read_page(url, goal)` tool.

---

## The 5 built-in tasks — `bw-tasks`

One pipeline — **fetch → structured extract (JSON) → diff/aggregate → action** —
exposed as five business workflows. Each is a CLI subcommand and a library function.

| Task | Command | Output |
|---|---|---|
| 🕵️ Competitor watch | `bw-tasks watch <url>` | Baseline now, change alerts later |
| 🎯 Lead enrichment | `bw-tasks enrich <url>` | CRM fields + a personalized cold-email line |
| 📝 Agentic form fill | `bw-tasks form <url> --profile p.json` | Understands fields, fills, submits, reads results |
| 💰 Price/stock tracking | `bw-tasks track <url>` | Price & availability change alerts |
| 📣 Brand monitoring | `bw-tasks brand <name> <urls…>` | Mentions + sentiment digest |

Common flags: `--json`, `--out FILE`, `--slack <webhook>`, `--no-headless`, `--aggressive`.

**Real `enrich` output (trimmed):**

```json
{
  "company_name": "Tavily",
  "industry": "AI/SaaS - Developer Tools",
  "tech_stack_or_integrations": ["OpenAI", "Anthropic", "Groq", "Databricks"],
  "recent_news_or_signals": ["Raised $25M Series A", "Databricks MCP partnership"],
  "icp_fit_score_1_to_10": 7,
  "personalized_cold_email_first_line": "I noticed Tavily just partnered with Databricks on the MCP Marketplace—looks like you're doubling down on enterprise adoption after your $25M Series A."
}
```

### Build your own task with the core primitive

Every task is a thin wrapper over `extract_structured(url, schema)`. Define any
schema, get JSON back:

```python
import asyncio
from browsewright import extract_structured

schema = {"headline": "string",
          "open_roles": [{"title": "string", "team": "string", "location": "string"}]}
data = asyncio.run(extract_structured(
    "https://example.com/careers", schema,
    instruction="Extract the page headline and every open job posting."))
print(data["open_roles"])
```
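
Since the result is LLM-produced JSON, it can be worth checking its shape before using it. The helper below is a hypothetical sketch, not part of Browsewright: `conforms` is an invented name, and it assumes the schema convention shown above (dicts for objects, a one-element list for "list of", and type names such as `"string"`). The `"number"` type name is an assumption; only `"string"` appears in this README.

```python
def conforms(data, schema) -> bool:
    """Return True if data matches the shape described by schema."""
    if isinstance(schema, dict):
        # Every key named in the schema must be present and match recursively.
        return isinstance(data, dict) and all(
            key in data and conforms(data[key], sub) for key, sub in schema.items()
        )
    if isinstance(schema, list):
        if not isinstance(data, list):
            return False
        # A one-element list means "a list of items shaped like this element".
        return all(conforms(item, schema[0]) for item in data) if schema else True
    if schema == "string":
        return isinstance(data, str)
    if schema == "number":  # assumed type name, not confirmed by the README
        return isinstance(data, (int, float)) and not isinstance(data, bool)
    return True  # unknown type names are not checked


schema = {"headline": "string",
          "open_roles": [{"title": "string", "team": "string", "location": "string"}]}
good = {"headline": "Join us",
        "open_roles": [{"title": "SRE", "team": "Infra", "location": "Remote"}]}
bad = {"headline": "Join us", "open_roles": [{"title": "SRE"}]}

print(conforms(good, schema), conforms(bad, schema))  # True False
```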

### Scheduling

Tasks are single-shot; snapshot/diff state persists between runs, so change
detection works across invocations. Run on cron, n8n/Make/Zapier, or `/loop`:

```bash
# every 6h, alert on competitor pricing changes
0 */6 * * * bw-tasks watch "https://competitor.com/pricing" --slack https://hooks.slack.com/services/XXX
```
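
The snapshot/diff idea behind change detection can be sketched in a few lines. This is an illustration under stated assumptions, not Browsewright's storage format: `diff_snapshot` and the JSON state file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def diff_snapshot(state_file: Path, current: dict) -> dict:
    """Compare current against the saved snapshot, then save current.

    Returns {field: (old, new)} for changed fields; empty on the first run.
    """
    previous = json.loads(state_file.read_text()) if state_file.exists() else None
    state_file.write_text(json.dumps(current, sort_keys=True))
    if previous is None:
        return {}  # the first run only records a baseline
    keys = previous.keys() | current.keys()
    return {k: (previous.get(k), current.get(k))
            for k in sorted(keys) if previous.get(k) != current.get(k)}


state = Path(tempfile.mkdtemp()) / "pricing.json"
print(diff_snapshot(state, {"pro_plan": "$49", "free_tier": True}))  # {}
print(diff_snapshot(state, {"pro_plan": "$59", "free_tier": True}))  # {'pro_plan': ('$49', '$59')}
```

Each invocation is single-shot: it reads the previous snapshot, writes the new one, and reports the difference, which is why a plain scheduler is enough to drive it.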

---

## How it works

```
search(url, goal)
   │
   ├─ Polite gate ........ robots.txt check + per-host rate limit
   │
   ├─ Pre-flight pipeline (cheapest path first)
   │     1. Common Crawl ... public archive            (opt-in)
   │     2. Open API ....... RSS / wp-json / *.json     (no browser, ~1.5k tokens)
   │     3. Origin IP ...... CDN bypass                 (skipped in polite mode)
   │     4. Classifier ..... detect Cloudflare/Akamai/DataDome/…
   │
   └─ Browser session (only if no shortcut hit)
         • real headless Chrome via nodriver (native TLS fingerprint)
         • human motor layer — Bézier mouse, typing cadence, scroll pacing
         • LLM decides actions only at junctions (~1 call/page)
         • blind-scene shortcut: extract directly when the DOM scan is blocked
         • visual recovery: a vision call clears interstitials/challenges
```
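
To make the "Bézier mouse" item in the diagram concrete, here is a minimal sketch of sampling a cubic Bézier curve between two screen points. It illustrates the general technique only; `bezier_path`, its parameters, and the jitter scheme are assumptions, not Browsewright's motor layer.

```python
import random


def bezier_path(start, end, steps=20, jitter=80.0, rng=None):
    """Return steps + 1 points on a cubic Bezier curve from start to end."""
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    # Control points sit one third and two thirds along the straight line,
    # each nudged by a random offset so repeated paths differ.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-jitter, jitter)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-jitter, jitter)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights of the cubic Bezier form.
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


path = bezier_path((100, 200), (640, 360), steps=10, rng=random.Random(7))
print(len(path), path[0], path[-1])
```

The curve always starts and ends exactly on the requested points; only the route between them varies with the random control points.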

---

## Polite by default

**Polite mode is the default and what you should ship.** It checks `robots.txt`,
rate-limits per host, and does **not** bypass CDN bot protection. `--aggressive`
(`polite=False`) enables origin-IP discovery and ignores `robots.txt`; use it
**only** on targets you own or are authorized to test.
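
The two polite-mode checks, a `robots.txt` lookup and a per-host rate limit, can be sketched with the standard library alone. This is an illustration of the idea, not Browsewright's code: `allowed`, `PoliteGate`, and the one-second default interval are assumptions.

```python
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-fetched robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class PoliteGate:
    """Tracks the last request time per host and reports how long to wait."""

    def __init__(self, min_interval_s=1.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._last_hit = {}

    def wait_needed(self, url: str) -> float:
        """Return seconds to wait before hitting this URL's host, and record the hit."""
        host = urlparse(url).netloc
        now = self.clock()
        last = self._last_hit.get(host)
        wait = 0.0 if last is None else max(0.0, self.min_interval_s - (now - last))
        self._last_hit[host] = now + wait  # the time the request will actually go out
        return wait


ROBOTS = "User-agent: *\nDisallow: /private/"
print(allowed(ROBOTS, "my-bot", "https://example.com/private/report"))  # False
print(allowed(ROBOTS, "my-bot", "https://example.com/pricing"))         # True
```

A caller would sleep for `wait_needed(url)` seconds before each request; hosts are tracked separately, so a slow site does not hold up requests to a different one.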

> ⚠️ **You are responsible** for complying with each site's Terms of Service,
> applicable law (CFAA and equivalents), and data-protection rules (GDPR/CCPA).
> Browsewright is for authorized research, your own properties, and sites whose
> terms permit automated access. The authors accept no liability for misuse.

---

## ⭐ Star it / contribute

If Browsewright saved you a scraper, **drop a star** — it's the whole reason this
is open source. Issues and PRs welcome: pre-flight vendors, new tasks, more sites
in the benchmark.

**MIT licensed.** Built on [`nodriver`](https://github.com/ultrafunkamsterdam/nodriver)
+ [Anthropic Claude](https://www.anthropic.com/).
