Metadata-Version: 2.4
Name: browsercontrol
Version: 0.1.3
Summary: MCP server for browser automation with Set of Marks (SoM) - AI agents can see and interact with web pages using numbered element IDs
Project-URL: Homepage, https://github.com/adityasasidhar/browsercontrol
Project-URL: Repository, https://github.com/adityasasidhar/browsercontrol
Author: Aditya Sasidhar
License: MIT
License-File: LICENSE
Keywords: agent,ai,automation,browser,llm,mcp,playwright
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.11
Requires-Dist: fastmcp>=2.14.2
Requires-Dist: markdownify>=0.14.1
Requires-Dist: pillow>=11.0.0
Requires-Dist: playwright>=1.49.0
Description-Content-Type: text/markdown

<p align="center">
  <img src="https://raw.githubusercontent.com/adityasasidhar/browsercontrol/main/assets/logo.png" alt="BrowserControl" width="140">
</p>

<h1 align="center">BrowserControl</h1>

<p align="center">
  <strong>Give your AI agent real browser superpowers.</strong><br>
  <sub>Vision-first browser automation for any MCP-compatible AI agent.</sub>
</p>

<p align="center">
  <a href="https://pypi.org/project/browsercontrol/"><img src="https://img.shields.io/pypi/v/browsercontrol?color=blue&label=PyPI" alt="PyPI"></a>
  <a href="https://www.python.org/downloads/"><img src="https://img.shields.io/badge/python-3.11+-3776ab.svg?logo=python&logoColor=white" alt="Python 3.11+"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/license-MIT-green.svg" alt="License: MIT"></a>
  <a href="https://modelcontextprotocol.io/"><img src="https://img.shields.io/badge/MCP-compatible-7c3aed.svg" alt="MCP Compatible"></a>
  <a href="https://github.com/adityasasidhar/browsercontrol/stargazers"><img src="https://img.shields.io/github/stars/adityasasidhar/browsercontrol?style=social" alt="GitHub Stars"></a>
</p>

<p align="center">
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-the-secret-set-of-marks-som">How It Works</a> •
  <a href="#-available-tools">Tools</a> •
  <a href="#-configuration">Configuration</a> •
  <a href="#-examples">Examples</a> •
  <a href="#-contributing">Contributing</a>
</p>

---

> **Ever wished Claude or Gemini could actually browse the web?** Not just fetch URLs, but truly **see**, **click**, **type**, and **interact** with any website like a human?

**BrowserControl** is an MCP server that gives your AI agent full browser access with a **vision-first approach**—no CSS selectors, no XPath, no guessing. Just point at numbers.

<br>

## ✨ What Makes This Different

<table>
<tr>
<td width="50%">

### ❌ Traditional Approach

```
"Find the button with class 'btn-primary'
that contains 'Submit' and is inside
form#contact-form..."
```

- Parse complex DOM structures
- Guess at CSS selectors
- No JavaScript support
- No login persistence
- No debugging tools

</td>
<td width="50%">

### ✅ BrowserControl

```
"click(7)"
```

- See the **rendered page** with numbered elements
- Just say **"click 5"** or **"type in 3"**
- Full **dynamic JavaScript** support
- **Persistent sessions** across restarts
- Complete **DevTools access**

</td>
</tr>
</table>

<br>

## 🎯 The Secret: Set of Marks (SoM)

Every screenshot comes annotated with **numbered red boxes** on interactive elements:

```
Found 15 interactive elements:
  [1] button - Sign In
  [2] input - Search...
  [3] a - Products
  [4] a - Pricing
  [5] button - Get Started
```

Your agent sees the numbers and simply calls `click(1)` to sign in. **No CSS selectors. No XPath. No guessing.**
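
To make the idea concrete, here is a minimal sketch of how a numbered element list could map back to click coordinates. It is an illustration under assumptions, not BrowserControl's actual implementation: `MarkedElement`, `format_element_list`, and `resolve` are hypothetical names, and the bounding boxes are invented sample data.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MarkedElement:
    """One numbered box from an annotated screenshot (hypothetical shape)."""
    element_id: int
    tag: str
    label: str
    x: float        # left edge of the bounding box, in pixels
    y: float        # top edge of the bounding box, in pixels
    width: float
    height: float

    def center(self) -> Tuple[float, float]:
        """The point a click would target: the middle of the box."""
        return (self.x + self.width / 2, self.y + self.height / 2)


def format_element_list(elements: List[MarkedElement]) -> str:
    """Render the text summary that accompanies a screenshot."""
    lines = [f"Found {len(elements)} interactive elements:"]
    lines += [f"  [{e.element_id}] {e.tag} - {e.label}" for e in elements]
    return "\n".join(lines)


def resolve(elements: List[MarkedElement], element_id: int) -> Tuple[float, float]:
    """Translate a number such as 1 into click coordinates."""
    by_id = {e.element_id: e for e in elements}
    if element_id not in by_id:
        raise KeyError(f"No element numbered {element_id} in the last screenshot")
    return by_id[element_id].center()


# Invented sample data for illustration only.
elements = [
    MarkedElement(1, "button", "Sign In", x=1100, y=20, width=80, height=32),
    MarkedElement(2, "input", "Search...", x=400, y=20, width=300, height=32),
]
print(format_element_list(elements))
print(resolve(elements, 1))
```

The agent only ever handles the small integer; the coordinates stay on the server side, which is what keeps the exchange compact.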

<br>

## 🚀 Quick Start

### Installation

```bash
# Using pip
pip install browsercontrol

# Or with uv (recommended for faster installs)
uv add browsercontrol

# Chromium is auto-installed on first run—no extra steps needed!
```

### Run the Server

```bash
# Using the CLI
browsercontrol

# Or as a Python module
python -m browsercontrol

# Or with FastMCP
fastmcp run browsercontrol.server:mcp
```

### Connect to Claude Desktop

Add to your Claude configuration file:

<details>
<summary><b>📁 macOS</b> — <code>~/Library/Application Support/Claude/claude_desktop_config.json</code></summary>

```json
{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol"
    }
  }
}
```

</details>

<details>
<summary><b>📁 Linux</b> — <code>~/.config/Claude/claude_desktop_config.json</code></summary>

```json
{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol"
    }
  }
}
```

</details>

<details>
<summary><b>📁 Windows</b> — <code>%APPDATA%\Claude\claude_desktop_config.json</code></summary>

```json
{
  "mcpServers": {
    "browsercontrol": {
      "command": "browsercontrol"
    }
  }
}
```

</details>

Then ask Claude:

> _"Go to GitHub and star the browsercontrol repo"_

Claude will navigate, find the star button, and click it—showing you screenshots along the way!
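
If you already have other MCP servers configured, you may prefer to merge the entry rather than overwrite the file by hand. The following is a small sketch of that merge, assuming the config layout shown above. `add_browsercontrol` and `update_config_file` are hypothetical helpers, not part of this package, and the optional `env` block relies on Claude Desktop passing environment variables to the server process.

```python
import json
from pathlib import Path
from typing import Dict, Optional


def add_browsercontrol(config: dict, env: Optional[Dict[str, str]] = None) -> dict:
    """Return a copy of a Claude Desktop config with the server entry added."""
    merged = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    entry: dict = {"command": "browsercontrol"}
    if env:
        entry["env"] = dict(env)
    merged.setdefault("mcpServers", {})["browsercontrol"] = entry
    return merged


def update_config_file(path: Path, env: Optional[Dict[str, str]] = None) -> None:
    """Read the config at `path` (if any), merge the entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_browsercontrol(existing, env), indent=2) + "\n")


# Demo on an in-memory config; other servers are left untouched.
existing = {"mcpServers": {"other-server": {"command": "other"}}}
updated = add_browsercontrol(existing, env={"BROWSER_HEADLESS": "false"})
print(json.dumps(updated, indent=2))
```

Back up the existing file before calling `update_config_file` on a real config path, and restart Claude Desktop afterwards so it picks up the change.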

<br>

## 🥊 Head-to-Head Comparison

| Feature                       | **BrowserControl** | Playwright MCP |    Stagehand     |   Browser-Use    |     AgentQL      |
| ----------------------------- | :----------------: | :------------: | :--------------: | :--------------: | :--------------: |
| **Vision-First (SoM)**        | ✅ Numbered boxes  |  ❌ Text tree  |   ⚠️ AI vision   |   ⚠️ AI vision   |   ❌ Selectors   |
| **Multi-Tab Support**         |  ✅ Full control   |  ⚠️ Implicit   |   ⚠️ Implicit    |     ⚠️ Basic     |     ❌ None      |
| **Cookie Management**         |  ✅ Direct tools   |   ⚠️ JS only   |    ⚠️ JS only    |     ⚠️ Basic     |     ❌ None      |
| **File Uploads**              |   ✅ Native tool   |   ⚠️ Manual    |      ❌ No       |      ❌ No       |      ❌ No       |
| **Developer Tools**           |     ✅ 8 tools     |    ❌ None     |     ❌ None      |     ❌ None      |     ❌ None      |
| **Session Recording**         |    ✅ Built-in     |   ⚠️ Manual    |     ❌ None      |     ❌ None      |     ❌ None      |
| **Persistent Sessions**       |    ✅ Automatic    |   ⚠️ Manual    |     ❌ None      |     ❌ None      |     ❌ None      |
| **Token Efficiency**          |    ✅ Tiny IDs     | ⚠️ Large tree  |  ❌ Full images  |  ❌ Full images  | ⚠️ Query results |
| **100% Local/Offline**        |       ✅ Yes       |     ✅ Yes     | ❌ Needs LLM API | ❌ Needs LLM API |  ❌ Cloud only   |
| **Monthly Cost (1k actions)** |       **$0**       |       $0       |     ~$30-50      |     ~$20-40      |      ~$50+       |

<br>

## 💪 Key Advantages

### 1. Multi-Tab Orchestration

Other tools can get "lost" when a new window opens. BrowserControl exposes tabs explicitly:

- `list_tabs()` — See every open page, title, and URL
- `switch_tab(index)` — Multitask between different sites
- `create_tab(url)` — Open references or parallel workflows

### 2. Session & Cookie Management

Stop fighting with login forms. Inject or inspect session state directly:

- `set_cookie()` — Log in instantly by injecting an auth token
- `get_cookies()` — Debug session issues or export state
- `clear_cookies()` — Fresh start without clearing the whole profile
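
For reference, the sketch below builds a cookie dictionary in the shape that Playwright's `BrowserContext.add_cookies` accepts (`name`, `value`, `domain`, `path`, `expires`, `httpOnly`, `secure`, `sameSite`). How `set_cookie()` maps its own arguments onto these fields is an assumption here, and `build_cookie` is a hypothetical helper used only to show the shape.

```python
import time
from typing import Optional


def build_cookie(
    name: str,
    value: str,
    domain: str,
    path: str = "/",
    max_age_seconds: Optional[int] = None,
    secure: bool = True,
    http_only: bool = True,
) -> dict:
    """Build a cookie dict in the format Playwright's add_cookies expects."""
    cookie = {
        "name": name,
        "value": value,
        "domain": domain,      # Playwright needs either url, or domain plus path
        "path": path,
        "secure": secure,
        "httpOnly": http_only,
        "sameSite": "Lax",     # one of "Strict", "Lax", "None"
    }
    if max_age_seconds is not None:
        # Playwright takes expiry as Unix time in seconds.
        cookie["expires"] = time.time() + max_age_seconds
    return cookie


# Placeholder values; never commit real tokens.
session_cookie = build_cookie("session", "example-token", "example.com")
print(session_cookie)
```

Leaving `max_age_seconds` unset produces a session cookie with no `expires` field.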

### 3. Reliable File Uploads

Most AI agents fail when they hit an `<input type="file">`. BrowserControl uses native browser engine hooks:

- `upload_file(id, path)` — Just point at the button and the local file

### 4. Developer Tools Suite

Debug like a pro with tools no one else provides:

```python
get_console_logs()      # See browser errors
get_network_requests()  # Monitor API calls
get_page_errors()       # Catch JS exceptions
run_in_console(code)    # Debug in real-time
inspect_element(5)      # Get computed styles
get_page_performance()  # Core Web Vitals
```

### 5. Session Recording

```
start_recording()  →  Browse around  →  stop_recording()
                                              ↓
                               session_20260202.zip
                         (View with Playwright trace viewer)
```
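
Because a recording is saved as a zip archive, you can sanity-check it with the standard library before opening it in the trace viewer. This is only a sketch: `summarize_archive` is a hypothetical helper, and the entry names in the demo are made up rather than the real layout of a Playwright trace.

```python
import io
import zipfile


def summarize_archive(source):
    """Return (entry_names, total_uncompressed_bytes) for a zip archive.

    `source` may be a filesystem path or a binary file-like object.
    """
    if not zipfile.is_zipfile(source):
        raise ValueError("not a zip archive")
    with zipfile.ZipFile(source) as archive:
        infos = archive.infolist()
        return [i.filename for i in infos], sum(i.file_size for i in infos)


# Demo on an in-memory archive with invented entries.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("example-trace-entry.txt", "hello")
    demo.writestr("example-screenshot.bin", b"\x00" * 10)

names, total_bytes = summarize_archive(buffer)
print(names, total_bytes)
```

An empty name list or a `ValueError` suggests the recording was not written completely, for example because the server stopped before `stop_recording()` was called.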

### 6. Dynamic Viewport Control

Test responsive designs or emulate mobile screens on the fly:

- `set_viewport(width, height)` — Change resolution without restarting

### 7. True Persistence

| What Persists   | BrowserControl | Others |
| --------------- | :------------: | :----: |
| Cookies         |       ✅       |   ❌   |
| localStorage    |       ✅       |   ❌   |
| Session tokens  |       ✅       |   ❌   |
| Login state     |       ✅       |   ❌   |
| Browser history |       ✅       |   ❌   |

**Result**: Log in once, stay logged in across sessions.

<br>

## 🛠️ Available Tools

### Navigation

| Tool                        | Description               |
| --------------------------- | ------------------------- |
| `navigate_to(url)`          | Go to a URL               |
| `go_back()`                 | Navigate back             |
| `go_forward()`              | Navigate forward          |
| `refresh_page()`            | Reload the page           |
| `scroll(direction, amount)` | Scroll up/down/left/right |

### Interaction

| Tool                            | Description                           |
| ------------------------------- | ------------------------------------- |
| `click(element_id)`             | Click element by number               |
| `click_at(x, y)`                | Click at coordinates                  |
| `type_text(element_id, text)`   | Type into input field                 |
| `press_key(key)`                | Press keyboard key (Enter, Tab, etc.) |
| `hover(element_id)`             | Hover over element                    |
| `scroll_to_element(element_id)` | Scroll element into view              |
| `upload_file(element_id, path)` | Upload a file to an input             |
| `wait(seconds)`                 | Wait for page loading                 |

### Tab Management

| Tool                | Description                  |
| ------------------- | ---------------------------- |
| `create_tab(url)`   | Open a new browser tab       |
| `switch_tab(index)` | Switch to a tab by its index |
| `close_tab(index)`  | Close a specific tab         |
| `list_tabs()`       | List all open tabs and URLs  |

### Forms

| Tool                                 | Description            |
| ------------------------------------ | ---------------------- |
| `select_option(element_id, option)`  | Select dropdown option |
| `check_checkbox(element_id)`         | Toggle checkbox        |
| `upload_file(element_id, file_path)` | Upload file to input   |

### Content Extraction

| Tool                              | Description          |
| --------------------------------- | -------------------- |
| `get_page_content()`              | Get page as markdown |
| `get_text(element_id)`            | Get element text     |
| `get_page_info()`                 | Get URL and title    |
| `run_javascript(script)`          | Execute JavaScript   |
| `screenshot(annotate, full_page)` | Take screenshot      |

### Developer Tools

| Tool                           | Description               |
| ------------------------------ | ------------------------- |
| `get_console_logs()`           | Browser console output    |
| `get_network_requests()`       | API calls and responses   |
| `get_page_errors()`            | JavaScript errors         |
| `run_in_console(code)`         | Execute JS in console     |
| `inspect_element(id)`          | Element styles/properties |
| `get_cookies()`                | List browser cookies      |
| `set_cookie(name, value, ...)` | Set a cookie              |
| `delete_cookie(name)`          | Remove a cookie           |
| `clear_cookies()`              | Clear all cookies         |
| `set_viewport(width, height)`  | Change window size        |
| `get_page_performance()`       | Load times, Web Vitals    |

### Recording

| Tool                | Description             |
| ------------------- | ----------------------- |
| `start_recording()` | Begin session recording |
| `stop_recording()`  | Save recording          |
| `take_snapshot()`   | Save screenshot + HTML  |
| `list_recordings()` | View saved sessions     |

<br>

## ⚙️ Configuration

Configure via environment variables:

| Variable                  | Default                       | Description                |
| ------------------------- | ----------------------------- | -------------------------- |
| `BROWSER_HEADLESS`        | `true`                        | Run without visible window |
| `BROWSER_VIEWPORT_WIDTH`  | `1280`                        | Viewport width in pixels   |
| `BROWSER_VIEWPORT_HEIGHT` | `720`                         | Viewport height in pixels  |
| `BROWSER_TIMEOUT`         | `30000`                       | Navigation timeout (ms)    |
| `BROWSER_USER_DATA_DIR`   | `~/.browsercontrol/user_data` | Browser profile path       |
| `BROWSER_EXTENSION_PATH`  | —                             | Path to browser extension  |
| `LOG_LEVEL`               | `INFO`                        | Logging verbosity          |

**Examples:**

```bash
# Run with visible browser (for debugging)
BROWSER_HEADLESS=false browsercontrol

# Mobile viewport emulation
BROWSER_VIEWPORT_WIDTH=375 BROWSER_VIEWPORT_HEIGHT=812 browsercontrol

# Verbose logging
LOG_LEVEL=DEBUG browsercontrol
```
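
The table above can be read as a small settings object populated from the environment. The sketch below shows one way to parse those variables with the documented defaults. It is not the project's actual `config.py`: `BrowserSettings`, `load_settings`, and the set of strings accepted as "true" are assumptions.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


def _as_bool(raw: str) -> bool:
    """Interpret common truthy spellings; anything else counts as False."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class BrowserSettings:
    headless: bool
    viewport_width: int
    viewport_height: int
    timeout_ms: int
    user_data_dir: Path
    extension_path: Optional[Path]
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> BrowserSettings:
    """Read the documented variables, falling back to the documented defaults."""
    extension = env.get("BROWSER_EXTENSION_PATH")
    return BrowserSettings(
        headless=_as_bool(env.get("BROWSER_HEADLESS", "true")),
        viewport_width=int(env.get("BROWSER_VIEWPORT_WIDTH", "1280")),
        viewport_height=int(env.get("BROWSER_VIEWPORT_HEIGHT", "720")),
        timeout_ms=int(env.get("BROWSER_TIMEOUT", "30000")),
        user_data_dir=Path(
            env.get("BROWSER_USER_DATA_DIR", "~/.browsercontrol/user_data")
        ).expanduser(),
        extension_path=Path(extension).expanduser() if extension else None,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


# Mobile viewport with a visible browser; unspecified values use the defaults.
mobile = load_settings({
    "BROWSER_HEADLESS": "false",
    "BROWSER_VIEWPORT_WIDTH": "375",
    "BROWSER_VIEWPORT_HEIGHT": "812",
    "BROWSER_USER_DATA_DIR": "/tmp/browsercontrol-demo",
})
print(mobile)
```

Passing a plain dictionary instead of `os.environ` keeps the parsing easy to test without touching the real environment.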

<br>

## 📖 Examples

### Web Research

```
You: "Go to Wikipedia and find out when Python was created"

Claude: *navigates to wikipedia.org*
        *types "Python programming language" in search*
        *clicks search result*
        *reads the page*

        "Python was created by Guido van Rossum and first released in 1991."
```

### Debug a Web App

```
You: "Go to my app at localhost:3000 and check for any errors"

Claude: *navigates to localhost:3000*
        *calls get_console_logs()*

        "I found 2 errors:
         [ERROR] Uncaught TypeError: Cannot read property 'map' of undefined
         [ERROR] Failed to load resource: 404 /api/users"

        *calls get_network_requests()*

        "The /api/users endpoint returned 404. Check your API route."
```

### Automated Testing with Recording

```
You: "Test the login flow on my site. Start recording."

Claude: *calls start_recording()*
        *navigates to site*
        *types email in element 3*
        *types password in element 4*
        *clicks Login button (element 5)*
        *verifies successful redirect*
        *calls stop_recording()*

        "Login test passed! Recording saved to session_20260202.zip"
```

### Form Automation

```
You: "Fill out the contact form on example.com"

Claude: *navigates to example.com/contact*
        *types name in element 2*
        *types email in element 3*
        *types message in element 4*
        *clicks Submit (element 5)*

        "Form submitted successfully!"
```

<br>

## 🏗️ Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────┐
│   AI Agent      │────▶│  BrowserControl  │────▶│   Browser   │
│ (Claude/Gemini) │◀────│   MCP Server     │◀────│ (Chromium)  │
└─────────────────┘     └──────────────────┘     └─────────────┘
        │                        │                      │
        │   "click(5)"           │   mouse.click()      │
        │◀───────────────────────│◀─────────────────────│
        │   [annotated           │   [screenshot +      │
        │    screenshot]         │    element map]      │
```

### How It Works

1. **AI sends command** — `click(5)`
2. **Server finds element** — Looks up element #5 from the last screenshot
3. **Browser acts** — Clicks at the element's coordinates
4. **Capture state** — Takes new screenshot, detects elements
5. **Annotate** — Draws numbered boxes on interactive elements
6. **Return to AI** — Sends annotated image + element list
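
The six steps above can be sketched as a toy loop. This is a simplified model, not the real `BrowserManager`: `SomSession` is a hypothetical class, element detection and clicking are injected as plain callables, and screenshot annotation is left out. It does show one consequence of step 2: element numbers are only valid for the most recent screenshot.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # x, y, width, height


class SomSession:
    """Toy model of the click, capture, renumber cycle."""

    def __init__(
        self,
        detect: Callable[[], List[Box]],
        click_at: Callable[[float, float], None],
    ) -> None:
        self._detect = detect          # returns boxes of interactive elements
        self._click_at = click_at      # performs the low-level mouse click
        self._element_map: Dict[int, Box] = {}
        self.refresh()

    def refresh(self) -> Dict[int, Box]:
        """Steps 4-5: re-detect elements and renumber them from 1."""
        self._element_map = dict(enumerate(self._detect(), start=1))
        return dict(self._element_map)

    def click(self, element_id: int) -> Dict[int, Box]:
        """Steps 1-3, then 4-6: resolve the ID, click its center, return new state."""
        if element_id not in self._element_map:
            raise ValueError(f"Element {element_id} is not in the last screenshot")
        x, y, w, h = self._element_map[element_id]
        self._click_at(x + w / 2, y + h / 2)
        return self.refresh()


# Fake browser: the "page" changes after the first click.
clicks: List[Tuple[float, float]] = []
pages = [[(10, 10, 100, 40), (10, 60, 100, 40)], [(0, 0, 50, 20)]]
session = SomSession(
    detect=lambda: pages[min(len(clicks), len(pages) - 1)],
    click_at=lambda x, y: clicks.append((x, y)),
)
new_map = session.click(2)
print(clicks, new_map)
```

After the click, the second page has only one element, so a follow-up `session.click(2)` raises `ValueError` instead of clicking a stale location.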

<br>

## 📁 Project Structure

```
browsercontrol/
├── __init__.py          # Package exports
├── __main__.py          # CLI entry point
├── server.py            # MCP server setup
├── browser.py           # BrowserManager with SoM
├── config.py            # Environment configuration
└── tools/
    ├── navigation.py    # Navigation tools
    ├── interaction.py   # Click, type, hover tools
    ├── forms.py         # Form handling tools
    ├── content.py       # Content extraction tools
    ├── devtools.py      # Developer tools
    ├── recording.py     # Session recording tools
    └── tabs.py          # Tab management tools
```

<br>

## 🔧 Troubleshooting

<details>
<summary><b>"Missing X server" Error</b></summary>

Set `BROWSER_HEADLESS=true` or run with xvfb:

```bash
xvfb-run browsercontrol
```

</details>

<details>
<summary><b>Browser Not Starting</b></summary>

Chromium auto-installs on first run. If it fails, install manually:

```bash
python -m playwright install chromium
```

</details>

<details>
<summary><b>Session Not Persisting</b></summary>

Check that `BROWSER_USER_DATA_DIR` is writable:

```bash
ls -la ~/.browsercontrol/
```

</details>

<details>
<summary><b>Connection Refused</b></summary>

Ensure no other instance is running:

```bash
pkill -f browsercontrol
browsercontrol
```

</details>

<details>
<summary><b>View Session Recordings</b></summary>

Open recordings in the Playwright trace viewer:

```bash
python -m playwright show-trace ~/.browsercontrol/recordings/session.zip
```

</details>

<br>

## 🤝 Contributing

Contributions are welcome! Check out our [Contributing Guide](CONTRIBUTING.md) for details.

**Ideas for contributions:**

- [ ] Firefox/WebKit support
- [ ] DOM diffing (detect changes)
- [ ] Accessibility audit tools
- [ ] Mobile emulation presets
- [ ] Cookie import/export files

```bash
# Clone and install
git clone https://github.com/adityasasidhar/browsercontrol
cd browsercontrol
uv sync

# Run tests
uv run pytest

# Run in development
uv run fastmcp dev browsercontrol/server.py
```

<br>

## 📄 License

[MIT License](LICENSE) — Use it however you want.

<br>

## 🙏 Acknowledgments

- Vision-first approach inspired by **Google's Antigravity IDE**
- Built with [FastMCP](https://gofastmcp.com) and [Playwright](https://playwright.dev)
- Thanks to the MCP community for making AI-tool integration accessible

---

<p align="center">
  <strong>Built for AI agents that need to see the web.</strong>
</p>

<p align="center">
  <a href="https://github.com/adityasasidhar/browsercontrol">⭐ Star on GitHub</a> •
  <a href="https://github.com/adityasasidhar/browsercontrol/issues">🐛 Report Bug</a> •
  <a href="https://github.com/adityasasidhar/browsercontrol/issues">💡 Request Feature</a>
</p>
