Metadata-Version: 2.4
Name: thordata-mcp-server
Version: 0.6.0
Summary: Official MCP Server for Thordata - Give your AI Agents real-time web scraping superpowers.
Author-email: Thordata Developer Team <support@thordata.com>
License: MIT
Keywords: mcp,web-scraping,ai,llm,thordata,serp,browser-automation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: mcp[cli]>=1.0.0
Requires-Dist: sse-starlette>=1.6.1
Requires-Dist: thordata-sdk>=1.8.4
Requires-Dist: pydantic-settings
Requires-Dist: markdownify
Requires-Dist: html2text
Requires-Dist: python-dotenv
Requires-Dist: playwright>=1.40.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: anyio
Requires-Dist: starlette
Requires-Dist: uvicorn

# Thordata MCP Server

**Give your AI Agents real-time web scraping superpowers.**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![PyPI version](https://badge.fury.io/py/thordata-mcp-server.svg)](https://badge.fury.io/py/thordata-mcp-server)

A production-ready MCP (Model Context Protocol) server that provides AI agents with powerful web scraping capabilities. Optimized for LLM-friendly interactions with comprehensive error handling, batch operations, and intelligent tool selection.

## ✨ Features

### 🎯 Core Capabilities

- **🔍 Search Engine Tools**: High-level web search with LLM-friendly results
  - `search_engine`: Single-query search with light JSON results
  - `search_engine_batch`: Batch search with concurrent processing
  - Supports Google, Bing, Yandex with pagination

- **🌐 Universal Web Scraper**: Extract content from any webpage
  - `unlocker`: Universal page unlocking with JS rendering & anti-bot handling
  - `unlocker_batch`: Batch scraping with error isolation
  - Output formats: HTML, Markdown, PNG
  - Smart error handling for HTTP status codes

- **🤖 Browser Automation**: Full browser-level scraping
  - `browser`: Navigate and capture ARIA/DOM snapshots
  - JavaScript rendering support
  - Filtered accessibility tree for AI-friendly output

- **🧠 Smart Scraping**: Intelligent tool selection
  - `smart_scrape`: Auto-selects best scraper (SERP, Web Scraper, Unlocker)
  - Automatic fallback to universal scraper
  - Structured data extraction when available

- **📊 SERP API**: Low-level search result scraping
  - `serp`: Advanced SERP operations with full parameter control
  - Batch search support
  - Multiple output formats

### 🚀 Key Highlights

- **✅ Production Ready**: 100% test coverage with comprehensive error handling
- **🎯 LLM Optimized**: Clean tool surface designed for AI agents
- **⚡ High Performance**: Concurrent batch operations, optimized response times
- **🛡️ Robust Error Handling**: Detailed error messages with diagnostic information
- **📦 Batch Support**: Efficient batch processing for multiple URLs/queries
- **🌍 Multi-Engine**: Support for Google, Bing, Yandex search engines

## 📦 Installation

### Prerequisites

- Python 3.10 or higher
- Thordata API credentials ([Get your tokens](https://thordata.com))

### Install from PyPI

```bash
pip install thordata-mcp-server
```

### Install from Source

```bash
# Clone the repository
git clone https://github.com/thordata/thordata-mcp-server.git
cd thordata-mcp-server

# Install dependencies
pip install -e .

# Install Playwright browsers (for browser automation)
playwright install chromium
```

## 🔧 Configuration

### Environment Variables

Create a `.env` file in the root directory or set environment variables:

```env
# Required: Thordata API Credentials
THORDATA_SCRAPER_TOKEN=your_scraper_token
THORDATA_PUBLIC_TOKEN=your_public_token
THORDATA_PUBLIC_KEY=your_public_key

# Optional: Browser Automation (for browser tool)
THORDATA_BROWSER_USERNAME=your_browser_username
THORDATA_BROWSER_PASSWORD=your_browser_password
```

### Tool Exposure Control

Control which tools are exposed via environment variables:

```env
# Expose all tools (default: compact set)
THORDATA_MODE=pro

# Or specify tools explicitly
THORDATA_TOOLS=search_engine,search_engine_batch,unlocker,unlocker_batch,serp,browser,smart_scrape
```

## 🏃 Quick Start

### Running Locally (Stdio - Recommended)

Standard mode for MCP clients (Claude Desktop, Cursor, etc.):

```bash
thordata-mcp
```

Or using Python module:

```bash
python -m thordata_mcp.main --transport stdio
```

### Running with HTTP (SSE)

For remote debugging or HTTP-based clients:

```bash
thordata-mcp --transport streamable-http --port 8000
```

### Claude Desktop Configuration

Add to your `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_token",
        "THORDATA_PUBLIC_TOKEN": "your_token",
        "THORDATA_PUBLIC_KEY": "your_key"
      }
    }
  }
}
```

### Cursor Configuration

Add to your `~/.cursor/mcp.json`:

```json
{
  "mcpServers": {
    "thordata": {
      "command": "thordata-mcp",
      "env": {
        "THORDATA_SCRAPER_TOKEN": "your_token",
        "THORDATA_PUBLIC_TOKEN": "your_token",
        "THORDATA_PUBLIC_KEY": "your_key"
      }
    }
  }
}
```

## 🛠️ Available Tools

### Default Tools (Compact Surface)

By default, the server exposes a **compact, LLM-friendly tool set**:

#### 1. `search_engine` - Web Search

High-level web search wrapper optimized for LLMs.

**Parameters:**
- `q` (required): Search query string
- `engine` (default: "google"): Search engine ("google", "bing", "yandex")
- `num` (default: 10): Number of results (1-50)
- `start` (default: 0): Starting position for pagination
- `country`: Country code for geolocation (e.g., "US", "JP")
- `language`: Language code (e.g., "en", "ja")

**Example:**
```json
{
  "q": "Python web scraping",
  "engine": "google",
  "num": 10
}
```

**Response:**
```json
{
  "ok": true,
  "output": {
    "results": [
      {
        "title": "Web Scraping with Python",
        "link": "https://example.com",
        "description": "Learn web scraping..."
      }
    ],
    "meta": {
      "engine": "google",
      "q": "Python web scraping",
      "num": 10
    }
  }
}
```

#### 2. `search_engine_batch` - Batch Web Search

Batch search with concurrent processing and per-item error handling.

**Parameters:**
- `requests` (required): Array of search request objects
- `concurrency` (default: 5): Number of concurrent requests (1-20)
- `engine` (default: "google"): Default engine for all requests
- `num` (default: 10): Default number of results per request

**Example:**
```json
{
  "requests": [
    {"q": "Python programming"},
    {"q": "JavaScript frameworks"},
    {"q": "Machine learning"}
  ],
  "concurrency": 3
}
```

#### 3. `unlocker` - Universal Web Scraper

Extract content from any webpage with JavaScript rendering support.

**Parameters:**
- `url` (required): Target URL to scrape
- `js_render` (default: false): Enable JavaScript rendering
- `output_format` (default: "html"): Output format ("html", "markdown", "png")
- `country`: Country code for geolocation
- `wait_ms`: Wait time in milliseconds before capture
- `wait_for`: CSS selector or text to wait for
- `block_resources`: Block resource types ("script", "image", "video")

**Example:**
```json
{
  "url": "https://example.com",
  "js_render": true,
  "output_format": "markdown"
}
```

**Response:**
```json
{
  "ok": true,
  "output": {
    "markdown": "# Example Page\n\nContent here...",
    "format": "markdown"
  }
}
```

#### 4. `unlocker_batch` - Batch Web Scraping

Batch web scraping with concurrent processing and error isolation.

**Parameters:**
- `requests` (required): Array of request objects with `url` and optional parameters
- `concurrency` (default: 5): Number of concurrent requests (1-20)

**Example:**
```json
{
  "requests": [
    {"url": "https://example.com", "js_render": true},
    {"url": "https://example.org", "output_format": "markdown"}
  ],
  "concurrency": 3
}
```

#### 5. `browser` - Browser Scraper

Navigate and capture ARIA/DOM snapshots using Playwright.

**Parameters:**
- `url` (required): Target URL to navigate
- `filtered` (default: true): Return filtered ARIA snapshot
- `mode` (default: "accessibility"): Snapshot mode ("accessibility" or "dom")
- `max_items` (default: 100): Maximum items in snapshot (1-500)
- `max_chars` (default: 20000): Maximum characters in snapshot
- `include_dom` (default: false): Include DOM snapshot

**Example:**
```json
{
  "url": "https://example.com",
  "filtered": true,
  "max_items": 50
}
```

#### 6. `smart_scrape` - Intelligent Scraping

Automatically selects the best scraping method for any URL.

**Parameters:**
- `url` (required): Target URL to scrape
- `prefer_structured` (default: true): Prefer structured data extraction
- `preview` (default: true): Include raw HTML/JSON preview
- `preview_max_chars` (default: 20000): Maximum characters in preview
- `max_wait_seconds` (default: 300): Maximum wait time for task completion
- `unlocker_output` (default: "markdown"): Output format when using Unlocker fallback

**Example:**
```json
{
  "url": "https://amazon.com/dp/B08N5WRWNW",
  "prefer_structured": true
}
```

**Response:**
```json
{
  "ok": true,
  "output": {
    "tool_used": "amazon_product",
    "structured_data": {
      "title": "Product Title",
      "price": "$99.99",
      ...
    },
    "preview": "..."
  }
}
```

#### 7. `serp` - SERP API (Advanced)

Low-level SERP scraper with full parameter control.

**Parameters:**
- `action` (required): Action to perform ("search" or "batch_search")
- `params` (required): Parameters dictionary

**Example:**
```json
{
  "action": "search",
  "params": {
    "q": "Python programming",
    "engine": "google",
    "num": 10,
    "format": "light_json"
  }
}
```

## 🎯 Error Handling

The server provides comprehensive error handling with detailed diagnostic information:

### Error Response Format

```json
{
  "ok": false,
  "error": {
    "type": "not_found",
    "code": "E3003",
    "message": "HTTP 404 error: Page returned empty content...",
    "details": {
      "url": "https://example.com/not-found",
      "status_code": 404
    }
  },
  "request_id": "unique-request-id"
}
```

### Error Types

- **`validation_error`**: Invalid parameters (E4001)
- **`not_found`**: Resource not found (E3003)
- **`permission_denied`**: Access forbidden (E1004)
- **`upstream_internal_error`**: Server errors (E2106)
- **`timeout_error`**: Request timeout (E2003)
- **`network_error`**: Network issues (E2001)

### Special Features

- **Special Character Detection**: Automatically detects and reports problematic characters in search queries
- **HTTP Status Code Mapping**: Clear error messages for 404, 500, 403, etc.
- **Empty Result Hints**: Helpful notes for empty search results (e.g., Chinese query limitations)
- **Batch Error Isolation**: Individual request failures don't affect batch operations

## 📊 Performance

- **Response Time**: 0.4-2 seconds for most operations
- **Concurrent Processing**: Supports up to 20 concurrent requests
- **Batch Operations**: Efficient batch processing with error isolation
- **Resource Optimization**: Smart caching and request optimization

## 🧪 Testing

The server has been extensively tested with 60+ test scenarios:

- ✅ **HTTP Error Handling**: All status codes properly handled
- ✅ **Special Character Processing**: Automatic detection and clear error messages
- ✅ **Batch Operations**: Concurrent processing with error isolation
- ✅ **Empty Result Handling**: Helpful hints for empty results
- ✅ **Performance**: Optimized response times and resource usage

**Test Coverage**: 100% of reported issues resolved

## 🏗️ Architecture

```
thordata_mcp/
├── main.py              # Entry point
├── registry.py          # Tool registration
├── config.py            # Configuration management
├── context.py           # Server context (client, browser session)
├── utils.py             # Common utilities (error handling, responses)
├── browser_session.py   # Browser session management (Playwright)
├── aria_snapshot.py     # ARIA snapshot filtering
└── tools/
    ├── product_compact.py  # Main tool definitions (compact surface)
    ├── product.py          # Full product implementation
    └── data/               # Data plane tools
        ├── serp.py         # SERP backend integration
        ├── universal.py    # Universal scraper integration
        ├── browser.py      # Browser automation
        └── tasks.py        # Structured scraping tasks
```

## 🎯 Design Principles

1. **LLM-Friendly**: Clean tool surface optimized for AI agents
2. **Robust Error Handling**: Detailed error messages with diagnostic information
3. **Batch Support**: Efficient concurrent processing
4. **Performance Optimized**: Fast response times and resource efficiency
5. **Production Ready**: Comprehensive testing and error handling

## 🚀 Deployment

### Docker

```bash
docker build -t thordata-mcp-server .
docker run -e THORDATA_SCRAPER_TOKEN=... thordata-mcp-server
```

### Docker Compose

See `docker-compose.yml` for a complete setup with Caddy reverse proxy.

## 📝 License

MIT License. Copyright (c) 2026 Thordata.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## 📞 Support

- **Documentation**: [GitHub Wiki](https://github.com/thordata/thordata-mcp-server/wiki)
- **Issues**: [GitHub Issues](https://github.com/thordata/thordata-mcp-server/issues)
- **Email**: support@thordata.com

## 🙏 Acknowledgments

Built with:
- [MCP](https://modelcontextprotocol.io/) - Model Context Protocol
- [Thordata SDK](https://github.com/thordata/thordata-sdk) - Web scraping SDK
- [Playwright](https://playwright.dev/) - Browser automation

---

**Ready to give your AI agents web scraping superpowers?** 🚀

Install now: `pip install thordata-mcp-server`
