Metadata-Version: 2.4
Name: www-search-mcp
Version: 1.1.1
Summary: MCP server providing web search (DuckDuckGo), HTTP fetch, browser fetch (Playwright), and file download.
Project-URL: Homepage, https://github.com/naifs/www-search-mcp
Project-URL: Repository, https://github.com/naifs/www-search-mcp
Project-URL: Issues, https://github.com/naifs/www-search-mcp/issues
Author-email: Naifs <naifs.rage@gmail.com>
License: MIT
License-File: LICENSE
Keywords: duckduckgo,mcp,niquests,playwright,www-search-mcp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: ddgs>=9.11.3
Requires-Dist: markdownify>=0.13.1
Requires-Dist: mcp>=1.26.0
Requires-Dist: niquests>=3.10.0
Requires-Dist: playwright>=1.55.0
Requires-Dist: truststore>=0.10.4
Description-Content-Type: text/markdown

# www-search-mcp

MCP (Model Context Protocol) server providing **web search**, **HTTP fetch**, **browser-based fetch** (Playwright), **file download**, and **package search** (PyPI, GitHub).

Gives AI assistants (Claude Desktop, Cursor, etc.) the ability to search the web, read web pages, download files, and discover Python packages.

This server is optimized for **batching**: tools accept either a single string or a list. Agents should prefer passing lists (multi-query / multi-URL) to reduce round-trips.

## Project Structure

High-level layout:

- `src/www_search_mcp/server.py`: MCP entry point (FastMCP creation, tool registration, shutdown hooks)
- `src/www_search_mcp/info.py`: single source of truth for tool docs + `web_mcp_info`
- `src/www_search_mcp/tools/`: one file per MCP tool implementation
- `src/www_search_mcp/search.py`: DuckDuckGo search + GitHub/PyPI API logic
- `src/www_search_mcp/utils/`: shared reusable building blocks (no MCP wiring)

`utils/` modules (shared code):

- `utils/http_async.py`: shared `niquests.AsyncSession`, DNS resolver preset, global outbound request semaphore (`WEB_REQUEST_LIMIT`)
- `utils/fetch_processing.py`: response -> standardized payload helpers (fetch markdown, save-to-file, streamed download limits)
- `utils/paths.py`: URL/path validation and small normalizations (`validate_url`, `resolve_path`, `normalize_fetch_div`)
- `utils/search_exec.py`: common search execution logic (`execute_search`, `FIELD_MAPS`, clamping)
- `utils/cache.py`: TTL/LRU caches (sync + async-safe)
- `utils/throttle.py`: global throttling (sync + async)
- `utils/html.py`: HTML -> Markdown conversion helpers

Note: tools typically import from the `www_search_mcp.utils` facade, which re-exports the stable public helpers from these modules.

## System Requirements

| Requirement | Version |
|---|---|
| **uv** | [Install guide](https://docs.astral.sh/uv/getting-started/installation/) |
| **Python** | 3.10+ (managed automatically by `uv`) |
| **Playwright** | Chromium browser (installed via post-install script) |

### Install uv

**macOS / Linux:**
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

**Windows (PowerShell):**
```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

**macOS (Homebrew):**
```bash
brew install uv
```

## Installation

### Option 1: Run directly with `uvx` (recommended)

No clone needed. Runs from PyPI. Updates automatically on each run:

```bash
uvx www-search-mcp
```

MCP client config:

```json
{
  "mcpServers": {
    "www-search-mcp": {
      "command": "uvx",
      "args": ["www-search-mcp"]
    }
  }
}
```

### Option 2: Install as a `uv tool`

```bash
# From PyPI
uv tool install www-search-mcp

# After installation, the command is available globally:
www-search-mcp
```

To update or reinstall:

```bash
uv tool upgrade www-search-mcp
# or force reinstall latest:
uv tool install --force www-search-mcp@latest
```

MCP client config (global tool):

```json
{
  "mcpServers": {
    "www-search-mcp": {
      "command": "www-search-mcp"
    }
  }
}
```

### Option 3: Run from local source (for development)

```bash
git clone https://github.com/naifs/www-search-mcp.git
cd www-search-mcp
uv sync
uv run www-search-mcp
```

To update:

```bash
git pull && uv sync
```

MCP client config (local source):

```json
{
  "mcpServers": {
    "www-search-mcp": {
      "command": "uv",
      "args": [
        "run",
        "--project",
        "/absolute/path/to/www-search-mcp",
        "www-search-mcp"
      ]
    }
  }
}
```

### Option 4: Install from built wheel

```bash
cd /path/to/www-search-mcp
uv build
uv tool install dist/*.whl
```

To update:

```bash
uv build && uv tool install --force dist/*.whl
```

> **Note:** Playwright Chromium is installed automatically on first use when a browser tool is called. If auto-install fails (e.g. no network), run manually:
> ```bash
> uv run python -m playwright install chromium
> ```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `WEB_TIMEOUT` | Legacy total request timeout in seconds | `30` |
| `WEB_TIMEOUT_TOTAL` | Total request timeout in seconds (fallback: `WEB_TIMEOUT`) | `30` |
| `WEB_TIMEOUT_CONNECT` | Connect timeout in seconds | `5` |
| `WEB_TIMEOUT_READ` | Read timeout in seconds | `25` |
| `WEB_MAX_RESULTS` | Default max search results per query (1..25) | `5` |
| `WEB_MAX_FETCH_CHARS` | Max characters returned in fetch body | `200000` |
| `WEB_RETRIES` | Retry attempts on timeout/rate-limit (0..5) | `2` |
| `WEB_MIN_INTERVAL` | Minimum seconds between outbound requests (throttle) | `1.0` |
| `WEB_MAX_DOWNLOAD_MB` | Max size per downloaded file, in MB (1 MB = 1024·1024 B) | `50` |
| `WEB_DEBUG` | Enable debug logging (`1`/`true`/`yes`/`on`) | `false` |
| `WEB_SESSION_ENABLED` | Enable persistent cookies/session by default | `false` |
| `WEB_TRANSPORT` | MCP transport: `stdio`\|`sse`\|`streamable-http` | `stdio` |
| `WEB_HTTP_HOST` | Host for HTTP-based transports | `127.0.0.1` |
| `WEB_HTTP_PORT` | Port for HTTP-based transports | `8000` |
| `WEB_PROXY` | HTTP proxy URL (e.g. `http://proxy:8080`) | — |
| `WEB_REQUEST_LIMIT` | Max concurrent outbound requests (global limit) | `50` |
| `WEB_DNS_RESOLVER` | DNS preset: `google`\|`cloudflare`\|`yandex` | `google` |
| `WEB_SSL_VERIFY` | Verify TLS certificates (system trust store via `truststore`) | `true` |
| `WEB_SSL_PATH` | Optional path to extra CA certs (PEM file or OpenSSL-hashed dir), loaded **in addition to** the system trust store | — |
| `WEB_USER_AGENT` | Custom User-Agent string for HTTP requests | Chrome 135 UA |
| `WEB_GITHUB_TOKEN` | Optional GitHub token to increase API rate limits | — |
| `WEB_REQUEST_TOKEN` | Optional default Authorization token for `web_request` (used if `headers.Authorization` is not set). If token has no scheme, `Bearer` is assumed. | — |

## Provided Tools

### Search Tools

- `web_search` (general web pages)
  - Input:
    - `queries: str | list[str]`
    - `max_results: int` (typical 5..25)
  - Output:
    - `status`, `query`, `result_count`
    - `results[]` with `title`, `url`, `snippet`
  - Note: safe search is always disabled

- `web_search_images` (image URLs)
  - Input:
    - `queries: str | list[str]`
    - `max_results: int` (typical 5..25)
  - Output:
    - `status`, `query`, `result_count`
    - `results[]` with `title`, `image`, `url`, `thumbnail`, `height`, `width`, `source`
  - Note: safe search is always disabled

- `web_search_github` (GitHub repos via API)
  - Input:
    - `queries: str | list[str]`
    - `max_results: int` (typical 5..25)
  - Output:
    - `status`, `query`, `result_count`
    - `results[]` with `title`, `url`, `description`, `stars`, `forks`, `language`
  - Note: uses GitHub REST API (no token needed, but rate-limited to ~10 req/min)

- `web_search_pypi` (PyPI packages)
  - Input:
    - `queries: str | list[str]`
    - `max_results: int` (typical 5..25)
  - Output:
    - `status`, `query`, `result_count`
    - `results[]` with `name`, `version`, `summary`, `author`, `license`, `requires_python`, `url`, `repository`, `py_versions`, `dependencies`
  - Note: uses DuckDuckGo discovery + PyPI JSON API for enriched metadata

### Fetch & Download Tools

- `web_fetch` (read known URL(s) as markdown)
  - Input:
    - `urls: str | list[str]` (`http`/`https` only)
    - `fetch_div: str = ""` — optional CSS selector (e.g. `article`, `.post-body`)
    - `save_file: str = ""` — optional absolute file path with extension
    - `use_session: bool = False` — reuse cookies from previous requests
  - Output:
    - `status`, `http_status`, `url`, `truncated`, `title?`, `content_type?`, `body` (or `saved_to`/`bytes_written` when `save_file` is used)

- `web_fetch_browser` (browser-rendered fetch for JS/login/captcha)
  - Input:
    - `urls: str | list[str]` (`http`/`https` only)
    - `fetch_div: str = ""` — optional CSS selector
    - `save_file: str = ""` — optional absolute file path with extension
    - `headless: bool = True` — show browser window or run hidden
    - `wait_seconds: int = 0` — extra wait after page load
    - `use_session: bool = False` — reuse browser cookies/context
  - Output: same as `web_fetch` plus `title`
  - Use for JS-heavy pages or sites that block plain HTTP clients

- `web_download` (download bytes to disk)
  - Input:
    - `urls: str | list[str]` (`http`/`https` only)
    - `save_files: str | list[str]` — required file path(s) with extension
    - `use_session: bool = False` — reuse cookies from previous requests
  - Output:
    - `status`, `url`, `saved_to`, `bytes`, `content_type`

### API Tool

- `web_request` (call REST/GraphQL APIs, optionally load-test)
  - Input: `queries: dict | list[dict]` (one spec or a batch)
  - Shared spec fields:
    - `type: 'rest' | 'graphql'`
    - `method: 'GET'|'POST'|'PUT'|'PATCH'|'DELETE'|'HEAD'|'OPTIONS'`
    - `url: str` (http/https)
    - `headers: dict = {}` (optional)
    - `requests: int = 1`
    - `concurrency: int = 1` (async workers, not OS threads)
    - `time: float = 0`
      - `0`: send exactly `requests * concurrency` requests as fast as possible
      - `>0`: best-effort pacing at ~`requests * concurrency` requests/sec for `time` seconds
  - REST-specific:
    - `body: dict|list|str|None`
      - dict/list: sent as JSON
      - str: sent as raw body
  - GraphQL-specific:
    - either `query: str` (recommended) OR `body: str|dict`
    - optional `variables: dict`
    - optional `operationName: str` (only needed when the GraphQL document contains multiple operations)
  - Auth:
    - If `headers.Authorization` is not provided, `WEB_REQUEST_TOKEN` is used as default.
  - Output:
    - aggregated stats: `status_counts`, `http_status_counts`, `latency_ms` percentiles
    - for small runs (`total_requests <= 3`): includes `response_samples` (truncated)

## Quick Verification

```bash
# Search the web
uv run python -c "import asyncio; from www_search_mcp.tools.web_search import web_search; r=asyncio.run(web_search('python mcp protocol', max_results=3)); print(r['status'], r['result_count'])"

# Fetch a page
uv run python -c "import asyncio; from www_search_mcp.tools.web_fetch import web_fetch; r=asyncio.run(web_fetch('https://example.com')); print(r['status'], r['http_status'])"

# Search GitHub
uv run python -c "import asyncio; from www_search_mcp.tools.web_search_github import web_search_github; r=asyncio.run(web_search_github('fastapi', max_results=3)); print(r['status'], r['result_count'])"

# Search PyPI
uv run python -c "import asyncio; from www_search_mcp.tools.web_search_pypi import web_search_pypi; r=asyncio.run(web_search_pypi('httpx', max_results=3)); print(r['status'], r['result_count'])"

# Call a REST API
uv run python -c "import asyncio; from www_search_mcp.tools.web_request import web_request; r=asyncio.run(web_request({'type':'rest','method':'POST','url':'https://httpbin.org/post','body':{'hello':'world'},'requests':1,'concurrency':1,'time':0})); print(r['status'], r['http_status_counts'])"

# Call a GraphQL API
uv run python -c "import asyncio; from www_search_mcp.tools.web_request import web_request; r=asyncio.run(web_request({'type':'graphql','method':'POST','url':'https://countries.trevorblades.com/','query':'{ __typename }','requests':1,'concurrency':1,'time':0})); print(r['status'], r['http_status_counts'])"
```

## Local Development Commands

All commands are intended to be run from the repository root and use `uv run` (no manual venv activation).

### Install / Sync

```bash
uv sync --all-groups
```

### Format + Lint

```bash
uv run ruff format src/ tests/
uv run ruff check src/ tests/
```

### Type Check

```bash
uv run ty check src/
```

### Tests (xdist)

`pytest-xdist` is enabled by default via `pyproject.toml` (`-n auto`).

```bash
uv run python -m pytest tests/ -q

# Explicit override:
uv run python -m pytest tests/ -q -n auto
```

### Security Scans

```bash
uv run bandit -r src/
uv run pip-audit
```

### Build + Install Wheel Locally

```bash
rm -rf dist/
uv build
uv tool install --force dist/*.whl
```

### Run MCP Server Locally

```bash
uv run www-search-mcp
```

## Troubleshooting

### `uv` not found
Install `uv` and reopen your terminal. See [System Requirements](#system-requirements).

### Dependencies missing
```bash
uv sync
```

### Playwright browser not found
```bash
uv run python -m playwright install chromium
```

### GitHub API rate limit exceeded
The GitHub API allows ~10 requests/minute without authentication. To increase the limit, set a GitHub token:
```bash
export WEB_GITHUB_TOKEN=ghp_your_token_here
```

### Binary content error in `web_fetch`
`web_fetch` rejects binary content (images, PDFs, etc.). Use `web_download` instead to save binary files to disk.

### MCP tools not appearing in client
1. Check that the MCP client config JSON is valid.
2. Ensure the `--project` path is absolute and correct.
3. Reload the MCP client after config changes.
4. Check `WEB_DEBUG=true` for detailed logs.

### Wrong project path in config
The `--project` argument must point to the **root directory** of `www-search-mcp` (where `pyproject.toml` is located), not to the `src/` subdirectory.

## Session & Timeouts

### Session persistence (`use_session` / `WEB_SESSION_ENABLED`)

When session persistence is enabled, cookies are stored **per MCP session** (not globally) using the FastMCP-injected request `Context`.

- `web_fetch` / `web_download`: reuse a per-session `niquests.AsyncSession` (cookie jar + connection pool).
- `web_fetch_browser`: reuse a per-session Playwright `BrowserContext` (browser cookies/storage).

If session persistence is disabled, tools use ephemeral clients/contexts.

### Tool execution deadline

Tools are bounded by a tool-level deadline equal to `WEB_TIMEOUT_TOTAL` (fallback `WEB_TIMEOUT`).
This deadline is independent of HTTP connect/read timeouts and prevents long-running tool calls.

## HTTP Mode (Optional)

By default the server runs via `stdio` transport (best for desktop clients).

To run with HTTP-based transports, set `WEB_TRANSPORT`:

```bash
export WEB_TRANSPORT=streamable-http
export WEB_HTTP_HOST=127.0.0.1
export WEB_HTTP_PORT=8000
uv run www-search-mcp
```

Operational endpoints (HTTP transports only):

- `GET /healthz` -> `{ "status": "ok" }`
- `GET /readyz` -> `{ "status": "ready", "version": "..." }`

Note: FastMCP custom routes are **not** protected by server auth middleware by design; keep them non-sensitive.

## License

MIT
