Metadata-Version: 2.4
Name: multi-browser-crawler
Version: 0.5.2
Summary: Focused browser automation package for web scraping and content extraction
Author-email: Spider MCP Team <team@spider-mcp.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/spider-mcp/multi-browser-crawler
Project-URL: Documentation, https://multi-browser-crawler.readthedocs.io/
Project-URL: Repository, https://github.com/spider-mcp/multi-browser-crawler
Project-URL: Bug Reports, https://github.com/spider-mcp/multi-browser-crawler/issues
Keywords: browser,automation,crawling,scraping,playwright,selenium,web,testing,multiprocess,proxy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: patchright>=1.52.0
Requires-Dist: playwright>=1.40.0
Requires-Dist: undetected-chromedriver>=3.5.0
Requires-Dist: selenium>=4.15.0
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: html5lib>=1.1
Requires-Dist: redis>=6.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=1.0.0; extra == "docs"
Provides-Extra: performance
Requires-Dist: uvloop>=0.17.0; sys_platform != "win32" and extra == "performance"
Requires-Dist: orjson>=3.8.0; extra == "performance"
Dynamic: license-file

# Multi-Browser Crawler

A focused browser automation package for web scraping and content extraction.

## Features

- **Browser Pool Management**: Auto-scaling browser pools with session management
- **Proxy Support**: Built-in proxy rotation and management
- **Image Download**: Automatic image capture, with matching images saved to a local directory
- **API Discovery**: Network request capture and pattern matching
- **Session Persistence**: Stateful browsing with cookie/session support

## Installation

```bash
pip install multi-browser-crawler
```

## Quick Start

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Simple configuration
    config = BrowserConfig(headless=True, timeout=30)
    pool = BrowserPoolManager(config.to_dict())

    try:
        await pool.initialize()
        
        # Fetch HTML
        result = await pool.fetch_html(
            url="https://example.com",
            session_id="my_session"
        )

        if result['status']['success']:
            print(f"✅ Success! Title: {result.get('title', 'N/A')}")
            print(f"HTML size: {len(result.get('html', ''))} characters")
        else:
            print(f"❌ Error: {result['status'].get('error')}")

    finally:
        await pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
```
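
Transient failures are common when crawling, so a retry wrapper can be useful. The following is a minimal sketch that assumes only what the Quick Start shows: `fetch_html()` is awaitable and returns a dict with a `result['status']['success']` flag. The helper `fetch_with_retry`, its parameters, and the `FakePool` stand-in are hypothetical and not part of the package.

```python
import asyncio


async def fetch_with_retry(pool, url, attempts=3, delay=1.0, **kwargs):
    """Call pool.fetch_html() until it reports success or attempts run out.

    Hypothetical helper: it relies only on the documented
    result['status']['success'] flag.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = await pool.fetch_html(url=url, **kwargs)
        if result.get("status", {}).get("success"):
            return result
        if attempt < attempts:
            # Linear backoff: wait longer after each failed attempt
            await asyncio.sleep(delay * attempt)
    return result  # Last failed result, so the caller can inspect the error


class FakePool:
    """Stand-in for BrowserPoolManager, used only to exercise the helper."""

    def __init__(self, failures):
        self.failures = failures  # Number of calls that fail before one succeeds
        self.calls = 0

    async def fetch_html(self, url, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            return {"status": {"success": False, "error": "timeout"}}
        return {"status": {"success": True, "url": url}, "html": "<html></html>"}
```

With a real pool, the call would look like `result = await fetch_with_retry(pool, "https://example.com", session_id="my_session")`.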

## Configuration Options

```python
config = BrowserConfig(
    headless=True,              # Run in headless mode
    timeout=30,                 # Page load timeout (seconds)
    min_browsers=1,             # Minimum browsers in pool
    max_browsers=5,             # Maximum browsers in pool
    proxy_url="http://proxy:8080",  # Optional proxy URL
    download_images_dir="/tmp/images"  # Image download directory
)
```
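
Deployments often read these settings from the environment. Here is a minimal sketch that assumes only the keyword arguments shown above; the `CRAWLER_*` variable names and the function `config_kwargs_from_env` are hypothetical.

```python
import os


def _as_bool(value):
    """Interpret common truthy strings ('1', 'true', 'yes', 'on')."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def config_kwargs_from_env(env=None):
    """Build keyword arguments for BrowserConfig from environment variables.

    The CRAWLER_* names are illustrative. Only variables that are set
    are included, so BrowserConfig's own defaults apply otherwise.
    """
    env = os.environ if env is None else env
    readers = {
        "headless": ("CRAWLER_HEADLESS", _as_bool),
        "timeout": ("CRAWLER_TIMEOUT", int),
        "min_browsers": ("CRAWLER_MIN_BROWSERS", int),
        "max_browsers": ("CRAWLER_MAX_BROWSERS", int),
        "proxy_url": ("CRAWLER_PROXY_URL", str),
        "download_images_dir": ("CRAWLER_IMAGES_DIR", str),
    }
    kwargs = {}
    for name, (variable, convert) in readers.items():
        if variable in env:
            kwargs[name] = convert(env[variable])
    return kwargs
```

The result can then be unpacked into the constructor, for example `config = BrowserConfig(**config_kwargs_from_env())`.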

## API Methods

### fetch_html()

```python
result = await pool.fetch_html(
    url="https://example.com",
    session_id="optional_session",        # For persistent sessions
    timeout=30,                           # Request timeout
    api_patterns=["*/api/*"],             # Capture API calls
    images_to_capture=["*.jpg", "*.png"]  # Download images
)
```
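
The `api_patterns` and `images_to_capture` values above look like shell-style wildcards. The exact matching rules are not stated here, so the sketch below only illustrates how such patterns behave under Python's standard `fnmatch` module. It is an assumption that the package matches the same way. `matching_urls` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def matching_urls(urls, patterns):
    """Return the URLs that match at least one shell-style pattern."""
    return [
        url for url in urls
        if any(fnmatchcase(url, pattern) for pattern in patterns)
    ]


urls = [
    "https://example.com/api/users",
    "https://example.com/static/logo.png",
    "https://example.com/about",
]
print(matching_urls(urls, ["*/api/*"]))         # API-style URLs
print(matching_urls(urls, ["*.jpg", "*.png"]))  # Image URLs
```

Under `fnmatch` rules, `*` also matches `/`, so `*/api/*` matches any URL that contains `/api/` at any depth.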

**Response format:**
```python
{
    'status': {'success': True, 'url': '...', 'load_time': 1.23},
    'html': '<html>...</html>',
    'title': 'Page Title',
    'api_calls': [...],  # Captured API requests
    'images': [...]      # Downloaded images
}
```
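
When logging many fetches, it can help to condense each response into one line. This is a minimal sketch built on the response keys shown above, plus the `error` key used in the Quick Start. The function `summarize_result` is hypothetical.

```python
def summarize_result(result):
    """Condense a fetch_html() response dict into a one-line summary."""
    status = result.get("status", {})
    if not status.get("success"):
        return f"FAILED: {status.get('error', 'unknown error')}"
    return (
        f"{result.get('title', 'N/A')} | "
        f"{len(result.get('html') or '')} chars | "
        f"{len(result.get('api_calls') or [])} API calls | "
        f"{len(result.get('images') or [])} images | "
        f"{status.get('load_time', 0.0):.2f}s"
    )
```

Missing keys fall back to neutral defaults, so the helper also works on partial responses.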

## Session Management

```python
# Assumes `pool` is an initialized BrowserPoolManager (see Quick Start)

# Persistent session: reusing a session_id maintains cookies/state across requests
result1 = await pool.fetch_html(url="https://site.com/login", session_id="user1")
result2 = await pool.fetch_html(url="https://site.com/profile", session_id="user1")

# Non-persistent: session_id=None uses a fresh browser each time
result3 = await pool.fetch_html(url="https://site.com", session_id=None)
```
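
Multi-step flows such as login followed by a profile page depend on every earlier step succeeding. The sketch below fetches a list of URLs in order under one `session_id` and stops at the first failure. `fetch_in_session` and the `RecordingPool` stand-in are hypothetical and only mirror the `fetch_html()` call shape shown above.

```python
import asyncio


async def fetch_in_session(pool, urls, session_id):
    """Fetch URLs one after another under a single session_id.

    Stops at the first failure, since later pages usually depend on
    state (for example cookies) set by earlier ones.
    """
    results = []
    for url in urls:
        result = await pool.fetch_html(url=url, session_id=session_id)
        results.append(result)
        if not result.get("status", {}).get("success"):
            break
    return results


class RecordingPool:
    """Stand-in for BrowserPoolManager that records calls; illustration only."""

    def __init__(self, failing_urls=()):
        self.failing_urls = set(failing_urls)
        self.seen = []

    async def fetch_html(self, url, session_id=None, **kwargs):
        self.seen.append((url, session_id))
        return {"status": {"success": url not in self.failing_urls, "url": url}}


demo_pool = RecordingPool()
demo = asyncio.run(
    fetch_in_session(
        demo_pool,
        ["https://site.com/login", "https://site.com/profile"],
        session_id="user1",
    )
)
print(len(demo), demo_pool.seen)
```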

## Proxy Support

```python
# Single proxy
config = BrowserConfig(proxy_url="http://proxy:8080")

# The package integrates with rotating-mitmproxy for advanced proxy rotation
```
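
A malformed proxy URL is easier to diagnose before a browser starts. This minimal sketch checks the URL with the standard library before it is passed as `proxy_url`. The function `validate_proxy_url` is hypothetical, and the list of allowed schemes is an assumption rather than something taken from the package.

```python
from urllib.parse import urlsplit


def validate_proxy_url(proxy_url, allowed_schemes=("http", "https", "socks5")):
    """Return proxy_url unchanged if it looks usable, else raise ValueError.

    The allowed schemes are an assumption, not a list taken from the package.
    """
    parts = urlsplit(proxy_url)
    if parts.scheme not in allowed_schemes:
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("proxy URL has no host")
    if parts.port is None:
        raise ValueError("proxy URL has no port")
    return proxy_url
```

A validated value can be passed straight through, for example `BrowserConfig(proxy_url=validate_proxy_url("http://proxy:8080"))`.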

## Testing

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test categories
python -m pytest tests/ -m "not slow" -v
```

## License

MIT License - see LICENSE file for details.
