Metadata-Version: 2.4
Name: www-search-mcp
Version: 1.3.2
Summary: MCP server providing web search (DuckDuckGo), HTTP fetch, browser fetch (Playwright), and file download.
Project-URL: Homepage, https://github.com/naifs/www-search-mcp
Project-URL: Repository, https://github.com/naifs/www-search-mcp
Project-URL: Issues, https://github.com/naifs/www-search-mcp/issues
Author-email: Naifs <naifs.rage@gmail.com>
License: MIT
License-File: LICENSE
Keywords: duckduckgo,mcp,niquests,playwright,www-search-mcp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: beautifulsoup4>=4.12
Requires-Dist: ddgs>=9.14.2
Requires-Dist: markdownify>=0.13.1
Requires-Dist: mcp>=1.27.1
Requires-Dist: niquests>=3.18.8
Requires-Dist: pydantic>=2.13.4
Requires-Dist: truststore>=0.10.4
Provides-Extra: all
Requires-Dist: playwright>=1.59.0; extra == 'all'
Provides-Extra: browser
Requires-Dist: playwright>=1.59.0; extra == 'browser'
Description-Content-Type: text/markdown

# www-search-mcp

MCP server for **web search**, **HTTP fetch**, **browser fetch** (Playwright), **file download**, **API requests**, and **package search** (PyPI, GitHub).

Optimized for **batching**: the search, fetch, download, and request tools accept lists (multi-query / multi-URL) to reduce round-trips.

## Quick Install

```bash
# Run without installing (recommended)
uvx www-search-mcp

# Or install as a global tool
uv tool install www-search-mcp
```

**VS Code / Cursor config:**
```json
{
  "mcpServers": {
    "www-search": {
      "command": "uvx",
      "args": ["www-search-mcp"]
    }
  }
}
```
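
The settings listed under Environment Variables can usually be passed through the client config. The snippet below is a minimal sketch that generates such a config. It assumes the client honors an `env` object inside the server entry (common among MCP clients, but check your client's documentation). The chosen values are illustrative only.

```python
import json

# Assumption: the MCP client accepts an "env" mapping per server entry.
config = {
    "mcpServers": {
        "www-search": {
            "command": "uvx",
            "args": ["www-search-mcp"],
            "env": {
                "WEB_MAX_RESULTS": "10",  # illustrative value, default is 5
                "WEB_DEBUG": "true",      # illustrative value, default is false
            },
        }
    }
}

# Serialize to the JSON text that would go into the client's config file.
config_json = json.dumps(config, indent=2)
print(config_json)
```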

## Tools Overview

| Tool | What it does |
|------|-------------|
| `web_search` | General web search (DuckDuckGo) |
| `web_search_images` | Image search |
| `web_search_github` | GitHub repo search |
| `web_search_pypi` | PyPI package search |
| `web_fetch` | Fetch URL as Markdown |
| `web_fetch_browser` | Browser-rendered fetch (JS sites) |
| `web_download` | Download files to disk |
| `web_request` | REST/GraphQL API calls + load tests |
| `web_mcp_info` | Server config and tool docs |
| `web_mcp_status` | Real-time diagnostics |

**Batching:** pass `queries=["q1", "q2"]` or `urls=["url1", "url2"]` instead of calling one-by-one.
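
To illustrate the argument shape, here is a small sketch that merges several single-query payloads into one batched payload. `as_list` is a hypothetical helper written for this example, not part of the server, and the query strings are placeholders.

```python
from typing import Union


def as_list(value: Union[str, list[str]]) -> list[str]:
    """Hypothetical helper: normalize a `str | list[str]` argument to a list."""
    return [value] if isinstance(value, str) else list(value)


# Three separate calls, one query each.
single_calls = [
    {"queries": q, "max_results": 3}
    for q in ("mcp spec", "niquests", "playwright")
]

# The same work expressed as one batched call.
batched_call = {
    "queries": [q for call in single_calls for q in as_list(call["queries"])],
    "max_results": 3,
}
print(batched_call)
```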

---

## Search Tools

### `web_search`
```
queries: str | list[str]
max_results: int = 5
```
Returns `title`, `url`, and `snippet` for each result. Safe search is disabled.

### `web_search_images`
```
queries: str | list[str]
max_results: int = 5
```
Returns `title`, `image`, `thumbnail`, `height`, `width`, `source`.

### `web_search_github`
```
queries: str | list[str]
max_results: int = 5
```
Returns `title`, `url`, `stars`, `forks`, `language`. Uses the GitHub REST API, which is rate-limited to about 10 requests per minute without a token; set `WEB_GITHUB_TOKEN` to raise the limit.

### `web_search_pypi`
```
queries: str | list[str]
max_results: int = 5
```
Returns `name`, `version`, `summary`, `author`, `license`, `requires_python`, `dependencies`.

---

## Fetch & Download Tools

### `web_fetch`
```
urls: str | list[str]
fetch_div: str = ""         # CSS selector (e.g. "article")
save_file: str = ""         # Absolute path to save
use_session: bool = False   # Reuse cookies
```
Returns Markdown body. Rejects binary content — use `web_download` for files.

### `web_fetch_browser`
```
urls: str | list[str]
fetch_div: str = ""
save_file: str = ""
headless: bool = True
wait_seconds: int = 0
use_session: bool = False
```
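Same output as `web_fetch`, but the page is rendered in a browser so JavaScript runs. Use it for SPAs, login walls, and bot-blocked sites. Requires Playwright, which is installed through the `browser` (or `all`) extra.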

### `web_download`
```
urls: str | list[str]
save_files: str | list[str]  # Required target path(s)
use_session: bool = False
```
Returns `saved_to`, `bytes`, `content_type`.
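
A possible way to build the arguments for a batched download is sketched below. It assumes that `urls` and `save_files` are paired by position and that target paths should be absolute (as stated for `web_fetch`'s `save_file`). Neither is confirmed for `web_download` here. `download_args` is a hypothetical helper, and the URL and path are placeholders.

```python
from pathlib import PurePosixPath


def download_args(pairs: dict[str, str], use_session: bool = False) -> dict:
    """Hypothetical helper: build `web_download` arguments from {url: target_path}."""
    for target in pairs.values():
        # Assumption: targets should be absolute paths (POSIX-style in this sketch).
        if not PurePosixPath(target).is_absolute():
            raise ValueError(f"target path should be absolute: {target}")
    return {
        # Assumption: urls[i] is saved to save_files[i].
        "urls": list(pairs.keys()),
        "save_files": list(pairs.values()),
        "use_session": use_session,
    }


args = download_args({"https://example.com/a.pdf": "/tmp/a.pdf"})
print(args)
```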

---

## API Tool

### `web_request`
```
queries: dict | list[dict]
```

Spec fields per query:
- `type`: `"rest" | "graphql"`
- `method`: `"GET" | "POST" | ...`
- `url`: target URL
- `headers`: optional dict
- `requests`: repeat count (default 1)
- `concurrency`: async workers (default 1)
- `time`: duration in seconds (0 = fixed count)

REST: `body: dict|list|str`
GraphQL: `query: str`, `variables: dict`, `operationName: str`

Auth: `WEB_REQUEST_TOKEN` is used as the default `Authorization` header value when the spec's `headers` do not provide one.

Output: `status_counts`, `http_status_counts`, `latency_ms` percentiles. Small runs (`<=3`) include response samples.
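
As a rough sketch of the spec fields above, the following builds one REST load-test spec and one GraphQL spec. `rest_spec` and `graphql_spec` are hypothetical helpers for this example, the `api.example.com` URLs are placeholders, and using `POST` for GraphQL is an assumption rather than documented behavior.

```python
from typing import Any, Optional


def rest_spec(method: str, url: str, body: Any = None,
              headers: Optional[dict] = None, requests: int = 1,
              concurrency: int = 1, time: int = 0) -> dict:
    """Hypothetical helper: build a REST spec for `web_request`."""
    spec = {"type": "rest", "method": method, "url": url,
            "requests": requests, "concurrency": concurrency, "time": time}
    if headers:
        spec["headers"] = headers
    if body is not None:
        spec["body"] = body
    return spec


def graphql_spec(url: str, query: str, variables: Optional[dict] = None,
                 operation_name: Optional[str] = None) -> dict:
    """Hypothetical helper: build a GraphQL spec for `web_request`."""
    # Assumption: GraphQL requests are sent with POST.
    spec = {"type": "graphql", "method": "POST", "url": url, "query": query}
    if variables:
        spec["variables"] = variables
    if operation_name:
        spec["operationName"] = operation_name
    return spec


queries = [
    # 100 GET requests spread over 10 async workers.
    rest_spec("GET", "https://api.example.com/health", requests=100, concurrency=10),
    graphql_spec("https://api.example.com/graphql",
                 "query Repo($n: String!) { repo(name: $n) { stars } }",
                 variables={"n": "www-search-mcp"}, operation_name="Repo"),
]
print(queries)
```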

---

## Diagnostic Tools

### `web_mcp_info`
Server configuration, tool descriptions, environment variables.

### `web_mcp_status`
Real-time diagnostics:
- `uptime_seconds`, `pid`, `python_version`
- `throttle`: last request, interval
- `sessions`: total, with browser, stale
- `connections`: niquests version, pool size
- `resources`: FD limit, memory RSS, event loop tasks
- `counters`: requests, errors, timeouts
- `config`: transport, timeouts
- `health`: DDGS, Playwright availability
- `metrics`: latency percentiles (p50/p95/p99), subtask stats

**Privileged status server**: A separate Starlette-based HTTP server runs on a daemon thread (default port 8081), providing `/status`, `/healthz`, and `/tasks` endpoints. This server remains responsive even under heavy load because it operates independently from the main event loop. Configure via `WEB_STATUS_PORT` (set to `0` to disable).

### Diagnostic HTTP Routes

When running in HTTP transport mode (`WEB_TRANSPORT=streamable-http`):

- `GET /healthz` — Health check
- `GET /readyz` — Readiness probe (200/503)
- `GET /status` — Same JSON as `web_mcp_status`
- `GET /tasks` — Active event loop tasks (debugging)
- `GET /memory` — Memory breakdown (RSS, arenas)
- `GET /error-types` — Error hierarchy for client introspection
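
These routes can be polled from a deployment script. Below is a minimal standard-library sketch. `build_url`, `probe`, and `is_ready` are hypothetical helper names, and the host and port in the usage comment are the documented defaults.

```python
import urllib.error
import urllib.request


def build_url(host: str, port: int, path: str) -> str:
    """Join host, port, and route into a plain HTTP URL."""
    return f"http://{host}:{port}/{path.lstrip('/')}"


def probe(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code, including non-2xx codes such as 503."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # urllib raises on 4xx/5xx; the code is still what we want to report.
        return exc.code


def is_ready(status_code: int) -> bool:
    """`/readyz` answers 200 when ready and 503 otherwise."""
    return status_code == 200


# Usage (requires a server running in HTTP mode):
#   ready = is_ready(probe(build_url("127.0.0.1", 8000, "/readyz")))
```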

---

## Configuration

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `WEB_TIMEOUT_TOTAL` | `30` | Total timeout (sec) |
| `WEB_TIMEOUT_CONNECT` | `5` | Connect timeout (sec) |
| `WEB_TIMEOUT_READ` | `25` | Read timeout (sec) |
| `WEB_MAX_RESULTS` | `5` | Default search results |
| `WEB_REQUEST_LIMIT` | `50` | Max concurrent requests |
| `WEB_MIN_INTERVAL` | `1.0` | Throttle gap (sec) |
| `WEB_RETRIES` | `2` | Retry attempts |
| `WEB_MAX_FETCH_CHARS` | `200000` | Max fetch body length |
| `WEB_MAX_DOWNLOAD_MB` | `50` | Max download size |
| `WEB_DEBUG` | `false` | Debug logging |
| `WEB_LOG_FORMAT` | `text` | `text` or `json` |
| `WEB_SESSION_ENABLED` | `false` | Persistent cookies |
| `WEB_TRANSPORT` | `stdio` | `stdio` or `streamable-http` |
| `WEB_HTTP_HOST` | `127.0.0.1` | HTTP bind host |
| `WEB_HTTP_PORT` | `8000` | HTTP bind port |
| `WEB_MCP_IDLE_LIFETIME` | `300` | stdio idle timeout (sec), `0` to disable |
| `WEB_DNS_RESOLVER` | `system` | `google`, `cloudflare`, `yandex`, `quad9`, `system` |
| `WEB_DNS_STRATEGY` | — | `only_ipv4`, `only_ipv6`, or dual-stack |
| `WEB_PROXY` | — | HTTP proxy URL |
| `WEB_SSL_VERIFY` | `true` | TLS verification |
| `WEB_SSL_PATH` | — | Extra CA certs |
| `WEB_USER_AGENT` | Chrome 135 | Custom UA |
| `WEB_UA_ROTATION` | `false` | Rotate UA pool |
| `WEB_UA_LIST` | — | Custom UA list (`\|\|\|` separated) |
| `WEB_GITHUB_TOKEN` | — | GitHub API token |
| `WEB_REQUEST_TOKEN` | — | Default auth token |
| `WEB_STATUS_PORT` | `8081` | Privileged status server port (0 to disable) |
| `WEB_MAX_URLS_PER_CALL` | `50` | Maximum URLs per `web_fetch` call |
| `WEB_FETCH_SUBTASK_TIMEOUT` | `15` | Per-subtask timeout for fetch (sec) |
| `WEB_SUBTASK_TIMEOUT` | `5` | Per-subtask timeout for generic ops (sec) |
| `WEB_SUBTASK_RETRIES` | `3` | Retry attempts for subtasks |
| `WEB_STREAM_CHUNK_SIZE` | `65536` | HTTP stream chunk size (bytes) |
| `WEB_SESSION_IDLE_TIMEOUT` | `5` | Session idle timeout (sec) |
| `WEB_SESSION_CLEANUP_INTERVAL` | `5` | Session cleanup interval (sec) |
| `WEB_MAX_LATENCY_SAMPLES` | `20000` | Maximum latency samples to track |
| `WEB_METRICS_MAX_SAMPLES` | `1000` | Maximum metrics samples to track |
| `WEB_SEARCH_CACHE_TTL` | `300` | Search cache TTL (sec) |
| `WEB_SEARCH_CACHE_MAXSIZE` | `100` | Search cache max entries |
| `WEB_FETCH_CACHE_TTL` | `60` | Fetch cache TTL (sec) |
| `WEB_FETCH_CACHE_MAXSIZE` | `50` | Fetch cache max entries |
| `WEB_API_CACHE_TTL` | `300` | API cache TTL (sec) |
| `WEB_API_CACHE_MAXSIZE` | `200` | API cache max entries |
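
For orientation, here is a sketch of how a few of these variables could be read with their documented defaults. It is not the server's actual loader: `Settings`, `load_settings`, and the accepted boolean spellings are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Assumption: these spellings count as true; the server may differ."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    timeout_total: float = 30.0
    max_results: int = 5
    min_interval: float = 1.0
    debug: bool = False
    transport: str = "stdio"


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read a subset of the WEB_* variables, falling back to table defaults."""
    env = os.environ if env is None else env
    return Settings(
        timeout_total=float(env.get("WEB_TIMEOUT_TOTAL", "30")),
        max_results=int(env.get("WEB_MAX_RESULTS", "5")),
        min_interval=float(env.get("WEB_MIN_INTERVAL", "1.0")),
        debug=_env_bool(env, "WEB_DEBUG", False),
        transport=env.get("WEB_TRANSPORT", "stdio"),
    )


print(load_settings({"WEB_MAX_RESULTS": "10", "WEB_DEBUG": "true"}))
```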

### Session Persistence

When `use_session=True` or `WEB_SESSION_ENABLED=true`:
- `web_fetch` / `web_download`: reuse per-session `niquests.AsyncSession`
- `web_fetch_browser`: reuse per-session Playwright `BrowserContext`

Sessions are scoped to the MCP session (FastMCP `Context`), not global.

---

## HTTP Mode

```bash
export WEB_TRANSPORT=streamable-http
export WEB_HTTP_HOST=127.0.0.1
export WEB_HTTP_PORT=8000
www-search-mcp
```

**Operational endpoints:**
- `GET /healthz` — `{ "status": "ok" }`
- `GET /readyz` — readiness probe (200/503)
- `GET /status` — same JSON as `web_mcp_status`
- `GET /error-types` — error hierarchy

**MCP endpoint:**
- `POST /mcp` — streamable-http

### Idle Timeout (stdio)

In stdio mode, the server process terminates automatically after `WEB_MCP_IDLE_LIFETIME` seconds of inactivity (default `300`, i.e. 5 minutes). Set it to `0` to disable the timeout.
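
The idle rule can be pictured with a small timer. This is a hypothetical sketch of the behavior described above, not the server's actual code. `IdleTimer` and its injectable clock exist only for this example.

```python
import time
from typing import Callable


class IdleTimer:
    """Hypothetical sketch of an idle-lifetime check."""

    def __init__(self, lifetime: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.lifetime = lifetime
        self._clock = clock
        self._last = clock()

    def touch(self) -> None:
        """Record activity, e.g. whenever a request arrives."""
        self._last = self._clock()

    def expired(self) -> bool:
        """True once `lifetime` seconds pass without activity; 0 disables."""
        if self.lifetime <= 0:
            return False
        return self._clock() - self._last >= self.lifetime


# Drive the timer with a fake clock so the example runs instantly.
now = [0.0]
timer = IdleTimer(300, clock=lambda: now[0])
now[0] = 299.0
print(timer.expired())
now[0] = 300.0
print(timer.expired())
```

Injecting the clock keeps the check deterministic in tests. A real process would use the default monotonic clock and exit when `expired()` becomes true.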

---

## Development

```bash
# Setup
git clone https://github.com/naifs/www-search-mcp.git
cd www-search-mcp
uv sync --all-groups

# Format + lint
uv run ruff format src/ tests/
uv run ruff check src/ tests/

# Type check
uv run ty check src/

# Tests
uv run pytest tests/ -q          # parallel (default)
uv run pytest tests/ -q -n0      # sequential

# Security
uv run bandit -r src/

# Build + install
rm -rf dist/
uv build
uv tool install --force dist/*.whl
```

### Troubleshooting

| Problem | Fix |
|---------|-----|
| `uv` not found | Install from [astral.sh](https://docs.astral.sh/uv/getting-started/installation/) |
| Browser not found | `uv run python -m playwright install chromium` |
| GitHub rate limit | Set `WEB_GITHUB_TOKEN` |
| Binary in `web_fetch` | Use `web_download` instead |
| Tools not showing | Check config JSON, reload client, enable `WEB_DEBUG` |

---

## License

MIT
