Metadata-Version: 2.4
Name: multi-browser-crawler
Version: 0.2.0
Summary: Enterprise-grade browser automation with advanced features
Author-email: Spider MCP Team <team@spider-mcp.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/spider-mcp/multi-browser-crawler
Project-URL: Documentation, https://multi-browser-crawler.readthedocs.io/
Project-URL: Repository, https://github.com/spider-mcp/multi-browser-crawler
Project-URL: Bug Reports, https://github.com/spider-mcp/multi-browser-crawler/issues
Keywords: browser,automation,crawling,scraping,playwright,selenium,web,testing,multiprocess,proxy
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP :: Browsers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: undetected-chromedriver>=3.5.0
Requires-Dist: aiohttp>=3.8.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: html5lib>=1.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: sphinx>=6.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.2.0; extra == "docs"
Requires-Dist: myst-parser>=1.0.0; extra == "docs"
Provides-Extra: performance
Requires-Dist: uvloop>=0.17.0; sys_platform != "win32" and extra == "performance"
Requires-Dist: orjson>=3.8.0; extra == "performance"
Dynamic: license-file

# Multi-Browser Crawler

A clean, focused browser automation package for web scraping and content extraction.

## 🎯 **Ultra-Clean Architecture**

This package provides **4 essential components** for browser automation (see the import sketch after this list):

- **BrowserPoolManager**: Browser pool management with undetected-chromedriver
- **ProxyManager**: Simple proxy management with Chrome-ready format
- **DebugPortManager**: Thread-safe debug port allocation
- **BrowserConfig**: Clean configuration management
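
The Quick Start below shows the documented wiring of `BrowserConfig` into `BrowserPoolManager`; as a minimal sketch, and assuming the other two components are exported from the package root in the same way, all four can be imported directly:

```python
# Minimal import sketch. BrowserPoolManager and BrowserConfig are used in the
# Quick Start below; importing ProxyManager and DebugPortManager from the
# package root is an assumption based on the component list above.
from multi_browser_crawler import (
    BrowserPoolManager,
    ProxyManager,       # assumed export
    DebugPortManager,   # assumed export
    BrowserConfig,
)

config = BrowserConfig(headless=True, browser_data_dir="tmp/browser-data")
pool = BrowserPoolManager(config.to_dict())  # pattern shown in the Quick Start
```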

## ✨ **Key Features**

- **Zero Redundancy**: Every line serves a purpose
- **Built-in Features**: Image download, API discovery, JS execution
- **Direct Usage**: No unnecessary wrapper layers
- **Chrome Integration**: Undetected-chromedriver for stealth browsing
- **Proxy Support**: Proxy lines parsed with a single regex and served in Chrome-ready format
- **Session Management**: Persistent and non-persistent browsers

## 📦 **Installation**

```bash
pip install multi-browser-crawler
```
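
The package also declares optional extras in its metadata (`dev`, `docs`, and `performance`), which install the usual way:

```bash
# Optional extras declared in the package metadata
pip install "multi-browser-crawler[dev]"          # pytest, black, flake8, mypy, pre-commit
pip install "multi-browser-crawler[docs]"         # sphinx, sphinx-rtd-theme, myst-parser
pip install "multi-browser-crawler[performance]"  # orjson, plus uvloop on non-Windows platforms
```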

## 🚀 **Quick Start**

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def main():
    # Create configuration
    config = BrowserConfig(
        headless=True,
        timeout=30,
        browser_data_dir="tmp/browser-data"
    )

    # Initialize browser pool
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # Fetch a webpage
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None  # Non-persistent browser
        )

        print(f"✅ Success!")
        print(f"   Title: {result.get('title', 'N/A')}")
        print(f"   Load time: {result.get('load_time', 0):.2f}s")
        print(f"   HTML size: {len(result.get('html', ''))} characters")

    finally:
        await browser_pool.shutdown()

if __name__ == "__main__":
    asyncio.run(main())
```

## ⚙️ **Configuration**

### **Basic Configuration**

```python
from multi_browser_crawler import BrowserConfig

config = BrowserConfig(
    headless=True,                    # Run in headless mode
    timeout=30,                       # Page load timeout in seconds
    browser_data_dir="tmp/browsers",  # Browser data directory
    proxy_file_path="proxies.txt",    # Optional proxy file
    min_browsers=1,                   # Minimum browsers in pool
    max_browsers=5,                   # Maximum browsers in pool
    idle_timeout=300,                 # Browser idle timeout (seconds)
    debug_port_start=9222,            # Debug port range start
    debug_port_end=9322,              # Debug port range end
)
```

### **Environment Variables**

```bash
export BROWSER_HEADLESS=true
export BROWSER_TIMEOUT=30
export BROWSER_DATA_DIR="/tmp/browsers"
export PROXY_FILE_PATH="/path/to/proxies.txt"
export MIN_BROWSERS=1
export MAX_BROWSERS=5
export DEBUG_PORT_START=9222
export DEBUG_PORT_END=9322
```
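
Whether `BrowserConfig` reads these variables automatically is not shown here; as a minimal sketch, assuming only the constructor parameters listed above, they can be mapped manually with `os.environ`:

```python
import os

from multi_browser_crawler import BrowserConfig

# Manual mapping of the environment variables above into BrowserConfig.
# This is a sketch; it assumes only the constructor parameters shown in the
# Basic Configuration example and does not rely on any built-in env support.
config = BrowserConfig(
    headless=os.environ.get("BROWSER_HEADLESS", "true").lower() == "true",
    timeout=int(os.environ.get("BROWSER_TIMEOUT", "30")),
    browser_data_dir=os.environ.get("BROWSER_DATA_DIR", "tmp/browsers"),
    proxy_file_path=os.environ.get("PROXY_FILE_PATH"),  # None if unset
    min_browsers=int(os.environ.get("MIN_BROWSERS", "1")),
    max_browsers=int(os.environ.get("MAX_BROWSERS", "5")),
    debug_port_start=int(os.environ.get("DEBUG_PORT_START", "9222")),
    debug_port_end=int(os.environ.get("DEBUG_PORT_END", "9322")),
)
```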

## 📝 **Proxy File Format**

Create a `proxies.txt` file with one proxy per line:

```
# Basic proxies
127.0.0.1:8080
192.168.1.100:3128
proxy.example.com:8080

# Proxies with authentication
user:pass@192.168.1.1:3128
admin:secret@proxy.example.com:9999

# Complex passwords (supported)
user:complex@pass@host.com:8080
```
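
`ProxyManager` parses these lines internally (the feature list mentions single-regex parsing). Purely as an illustration, and not necessarily the regex the package uses, one pattern that handles all of the shapes above, including passwords containing `@`, looks like this:

```python
import re
from typing import Optional

# Illustrative only: the host part cannot contain '@' or ':', so a greedy
# password group naturally absorbs any '@' characters inside the password.
PROXY_RE = re.compile(
    r"^(?:(?P<user>[^:\s]+):(?P<password>.+)@)?(?P<host>[^:@\s]+):(?P<port>\d+)$"
)

def parse_proxy_line(line: str) -> Optional[dict]:
    line = line.strip()
    if not line or line.startswith("#"):  # skip blank lines and comments
        return None
    match = PROXY_RE.match(line)
    return match.groupdict() if match else None

print(parse_proxy_line("user:complex@pass@host.com:8080"))
# {'user': 'user', 'password': 'complex@pass', 'host': 'host.com', 'port': '8080'}
```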

## 📚 **Usage Examples**

### **1. Basic Web Scraping**

```python
import asyncio
from multi_browser_crawler import BrowserPoolManager, BrowserConfig

async def basic_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data"
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None
        )

        print(f"Status: {result['status']}")
        print(f"HTML: {result['html'][:100]}...")

    finally:
        await browser_pool.shutdown()

asyncio.run(basic_scraping())
```

### **2. Using Proxies**

```python
async def proxy_scraping():
    config = BrowserConfig(
        headless=True,
        browser_data_dir="tmp/browser-data",
        proxy_file_path="proxies.txt"  # Use proxy file
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/ip",
            session_id=None,
            use_proxy=True  # Enable proxy usage
        )

        print(f"IP: {result['html']}")

    finally:
        await browser_pool.shutdown()
```

### **3. Persistent Sessions**

```python
async def persistent_session():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        # First request - set cookie
        result1 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies/set/test/value123",
            session_id="my_session"  # Persistent session
        )

        # Second request - check cookie (same session)
        result2 = await browser_pool.fetch_html(
            url="https://httpbin.org/cookies",
            session_id="my_session"  # Same session
        )

        print("Cookie persisted between requests!")

    finally:
        await browser_pool.shutdown()
```

### **4. JavaScript Execution**

```python
async def javascript_execution():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://httpbin.org/html",
            session_id=None,
            js_action="document.title = 'Modified by JS';"
        )

        print(f"Modified title: {result.get('title')}")

    finally:
        await browser_pool.shutdown()
```

### **5. Image Downloading**

```python
async def download_images():
    config = BrowserConfig(
        browser_data_dir="tmp/browser-data",
        download_images_dir="tmp/images"
    )

    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://example.com",
            session_id=None,
            download_images=True  # Enable image downloading
        )

        print(f"Downloaded images: {result.get('downloaded_images', [])}")

    finally:
        await browser_pool.shutdown()
```

### **6. API Discovery**

```python
async def api_discovery():
    config = BrowserConfig(browser_data_dir="tmp/browser-data")
    browser_pool = BrowserPoolManager(config.to_dict())

    try:
        result = await browser_pool.fetch_html(
            url="https://spa-app.example.com",
            session_id=None,
            capture_api_calls=True  # Enable API discovery
        )

        print(f"API calls: {result.get('api_calls', [])}")

    finally:
        await browser_pool.shutdown()
```

## 🖥️ **CLI Usage**

```bash
# Fetch a single URL
python -m multi_browser_crawler.browser_cli fetch https://example.com

# Fetch with proxy
python -m multi_browser_crawler.browser_cli fetch https://example.com --proxy-file proxies.txt

# Test proxies
python -m multi_browser_crawler.browser_cli test-proxies proxies.txt
```

## 🧪 **Testing**

Run the test suite:

```bash
# Run all tests
python tests/test_browser.py
python tests/test_proxy_manager.py
python tests/test_debug_port_manager.py

# Run usage examples
python examples/01_basic_usage.py
python examples/02_advanced_features.py
```
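
With the `dev` extra installed (it brings in pytest, pytest-asyncio, and pytest-cov), the suite can likely also be run through pytest, assuming the test files are pytest-compatible:

```bash
# Assumes pip install "multi-browser-crawler[dev]"; the direct invocations
# above remain the documented way to run the tests.
pytest tests/ --cov=multi_browser_crawler
```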

## 📄 **License**

MIT License - see LICENSE file for details.

## 🤝 **Contributing**

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📞 **Support**

- **GitHub Issues**: Report bugs and request features
- **Documentation**: [Coding Principles](docs/coding_principles.md) and [Examples](examples/)
- **Examples**: Runnable scripts demonstrating the usage patterns and coding principles
