Metadata-Version: 2.4
Name: mcpaisuite-websearchmcp
Version: 1.0.3
Summary: Multi-engine web search and content extraction, exposed as an MCP server
Author: Gael
License: AGPL-3.0-or-later
Project-URL: Homepage, https://github.com/gashel01/websearchmcp
Project-URL: Repository, https://github.com/gashel01/websearchmcp
Project-URL: Issues, https://github.com/gashel01/websearchmcp/issues
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests>=2.31
Requires-Dist: click>=8.1
Provides-Extra: bs4
Requires-Dist: beautifulsoup4>=4.12; extra == "bs4"
Requires-Dist: lxml>=5.0; extra == "bs4"
Provides-Extra: browser
Requires-Dist: playwright>=1.40; extra == "browser"
Provides-Extra: api
Requires-Dist: fastapi>=0.100; extra == "api"
Requires-Dist: uvicorn>=0.23; extra == "api"
Provides-Extra: rerank
Requires-Dist: fastembed>=0.6; extra == "rerank"
Provides-Extra: extract
Requires-Dist: trafilatura>=1.8; extra == "extract"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: beautifulsoup4>=4.12; extra == "dev"
Requires-Dist: lxml>=5.0; extra == "dev"
Dynamic: license-file

# websearchmcp

> Multi-engine web search and content extraction, exposed as an MCP server

Part of the [MCP AI Suite](https://mcpaisuite.dev).

## Features

- **Multi-engine search** with priority fallback: SearXNG (self-hosted) -> DuckDuckGo -> Mojeek -> Brave
- **Parallel search + Reciprocal Rank Fusion** (optional) -- query all engines at once and fuse for better coverage
- **Cross-encoder reranking** (optional) -- reorder results by relevance to the query, no API key
- **Deep rerank** (optional) -- re-score the top candidates on their actual page content, not just snippets
- **Search + answer** (`search_with_answer`) -- ranked sources + a synthesized answer via a bring-your-own-LLM callback (the agent-facing surface of commercial search APIs, at zero search cost)
- **Passage trimming** -- return only the query-relevant passages of a page, not the whole thing (≈35% fewer tokens for the downstream LLM, with no answer loss in our benchmark)
- **Content extraction** via trafilatura (optional, state-of-the-art boilerplate removal) with regex/BeautifulSoup fallback
- **Playwright browser_fetch** for JavaScript-rendered pages with full DOM access and screenshots
- **Search + fetch caching** -- in-memory TTL caches avoid re-querying engines and re-downloading pages (search-result expiry is configurable)
- **Per-engine circuit breaker** -- stops retrying a failing engine for a cooldown period
- **Per-engine rate limiter** -- caps each engine at N requests per minute
- **CAPTCHA detection** -- auto-detects bot challenges and suggests browser_fetch fallback
- **Result deduplication** based on normalized URL domain+path
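
As an illustration of the fusion step named in the list above, here is a minimal sketch of Reciprocal Rank Fusion. The function name `rrf_fuse`, the constant `k=60`, and the example URLs are assumptions made for this example; they are not taken from websearchmcp's source.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked URL lists with Reciprocal Rank Fusion.

    Each URL earns 1 / (k + rank) from every list it appears in (rank is
    1-based), and the contributions are summed. k=60 is a commonly used
    default, assumed here for illustration.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical per-engine rankings for the same query.
duckduckgo = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
mojeek = ["https://example.com/b", "https://example.com/a", "https://example.com/d"]
brave = ["https://example.com/b", "https://example.com/c", "https://example.com/a"]

fused = rrf_fuse([duckduckgo, mojeek, brave])
print(fused)
```

A URL that several engines rank highly rises to the top even if no single engine put it first, which is why fusion tends to improve coverage.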

## Installation

```bash
pip install mcpaisuite-websearchmcp
# Optional extras:
pip install mcpaisuite-websearchmcp[bs4]       # BeautifulSoup for better content extraction
pip install mcpaisuite-websearchmcp[browser]   # Playwright for JS-rendered pages
pip install mcpaisuite-websearchmcp[rerank]    # fastembed cross-encoder for relevance reranking
pip install mcpaisuite-websearchmcp[extract]   # trafilatura for high-quality content extraction
pip install mcpaisuite-websearchmcp[dev]       # Development tools
```

> **Note:** BeautifulSoup (`beautifulsoup4` + `lxml`) is optional. Without it, websearchmcp uses a built-in regex extractor that works for most pages. Install the `bs4` extra for higher-quality extraction on complex HTML.

## Quick Start

```python
import asyncio

from websearchmcp import WebSearchFactory


async def main() -> None:
    pipeline = WebSearchFactory.from_env()
    results = await pipeline.search("Python 3.13 new features", max_results=5)
    for r in results:
        print(f"{r.title}: {r.url}")

asyncio.run(main())
```

## MCP Server

```bash
websearchmcp-server
```

## Robust backend: SearXNG (recommended)

By default websearchmcp scrapes DuckDuckGo/Mojeek/Brave HTML, which is fragile
(CAPTCHA, 403s, parser breakage). For a reliable, **key-free, self-hosted** backend,
run [SearXNG](https://docs.searxng.org/) — a metasearch engine with a clean JSON API.
When `SEARXNG_URL` is set, it's used as **Priority 1**, with scraping as fallback.

```bash
cd deploy/searxng       # docker-compose.yml + settings.yml provided
docker compose up -d
export SEARXNG_URL=http://localhost:8080
```

> **Already running SearXNG?** Its JSON API is **off by default**, so websearchmcp's
> `format=json` requests are rejected with a 403. Verify with
> `curl "http://localhost:8080/search?q=test&format=json"`; if the response is not JSON, add
> `search.formats: [html, json]` (and `server.limiter: false`) to your `settings.yml`
> and restart SearXNG. See [`deploy/searxng/`](deploy/searxng/) for a ready-made config.
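
The same check can be scripted. The sketch below uses only the standard library; the helper names (`searxng_search_url`, `parse_results`, `check_instance`) are hypothetical, and the response fields `results`, `title`, `url`, and `content` are assumed from SearXNG's usual JSON output rather than taken from websearchmcp's code.

```python
import json
import urllib.parse
import urllib.request


def searxng_search_url(base_url: str, query: str) -> str:
    """Build the same URL as the curl command above."""
    params = urllib.parse.urlencode({"q": query, "format": "json"})
    return f"{base_url.rstrip('/')}/search?{params}"


def parse_results(body: str) -> list[dict]:
    """Turn a JSON response body into title/url/snippet dicts.

    Raises ValueError when the body is not a JSON object, which is what an
    HTML page served in place of the JSON API would trigger.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError as exc:
        raise ValueError("not JSON; is 'json' listed in search.formats?") from exc
    if not isinstance(payload, dict):
        raise ValueError("unexpected JSON shape")
    return [
        {"title": r.get("title", ""), "url": r.get("url", ""), "snippet": r.get("content", "")}
        for r in payload.get("results", [])
    ]


def check_instance(base_url: str, query: str = "test") -> list[dict]:
    """Query a live instance; urlopen raises HTTPError on a 403."""
    with urllib.request.urlopen(searxng_search_url(base_url, query), timeout=10) as resp:
        return parse_results(resp.read().decode("utf-8"))


print(searxng_search_url("http://localhost:8080", "test"))
```

Calling `check_instance("http://localhost:8080")` against a running instance either returns a list of results or fails loudly, which separates a disabled JSON format from an unreachable server.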

## Configuration

| Variable | Default | Description |
|---|---|---|
| `SEARXNG_URL` | -- | Base URL for self-hosted SearXNG instance |
| `WEBSEARCH_ENGINES` | `duckduckgo,mojeek,brave` | Comma-separated engine list |
| `WEBSEARCH_MAX_LENGTH` | `8000` | Max content length for extraction |
| `WEBSEARCH_RERANK` | `false` | Enable cross-encoder result reranking (needs `[rerank]`) |
| `WEBSEARCH_RERANK_MODEL` | `Xenova/ms-marco-MiniLM-L-6-v2` | Reranker model override |
| `WEBSEARCH_TRAFILATURA` | `true` | Prefer trafilatura extraction when installed (needs `[extract]`) |

## API Reference

### WebSearchPipeline

Priority-based search pipeline with cache, circuit breaker, and deduplication.

```python
await pipeline.search(query, max_results=10, rerank=None,
                      deep_rerank=False, deep_rerank_k=5, parallel=False) -> list[SearchResult]
await pipeline.fetch(url, max_length=8000) -> FetchResult
await pipeline.browser_fetch(url, timeout_ms=30000, wait_until="networkidle",
                             screenshot=False) -> FetchResult
await pipeline.search_with_answer(query, max_results=5, answer_fn=None,
                                  fetch_content=False, rerank=None,
                                  trim_passages=True, passages_per_source=3) -> AnswerResult
```
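
To make the "priority-based pipeline with circuit breaker" idea concrete, here is a simplified, synchronous sketch. It is not websearchmcp's implementation (the real pipeline is async); `CircuitBreaker`, `search_with_fallback`, the thresholds, and the stub engines are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Blocks an engine for `cooldown_s` seconds after `max_failures` consecutive errors."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 60.0) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self._failures = 0
        self._opened_at: float | None = None

    def allow(self) -> bool:
        if self._opened_at is None:
            return True
        if time.monotonic() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: reset and let the next request through.
            self._opened_at = None
            self._failures = 0
            return True
        return False

    def record_success(self) -> None:
        self._failures = 0

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = time.monotonic()


Engine = Callable[[str], list[str]]


def search_with_fallback(
    query: str, engines: list[tuple[str, Engine]], breakers: dict[str, CircuitBreaker]
) -> tuple[str, list[str]]:
    """Try engines in priority order, skipping any whose breaker is open."""
    for name, engine in engines:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            results = engine(query)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        if results:
            return name, results
    return "", []


def broken(query: str) -> list[str]:  # stub engine that always fails
    raise RuntimeError("403 Forbidden")


def working(query: str) -> list[str]:  # stub engine that always succeeds
    return [f"https://example.com/?q={query}"]


engines = [("duckduckgo", broken), ("mojeek", working)]
breakers = {name: CircuitBreaker(max_failures=2) for name, _ in engines}
for _ in range(3):
    used, hits = search_with_fallback("python", engines, breakers)
print(used, hits)
```

After two failures the first engine's breaker opens, so the third call skips it entirely instead of waiting on another error.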

#### Reranking & answers (bring-your-own-LLM)

```python
from websearchmcp import WebSearchFactory

pipeline = WebSearchFactory.create(enable_rerank=True)  # cross-encoder relevance

# Reranked results (most relevant first), no LLM needed:
results = await pipeline.search("capital of australia", rerank=True)

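# Search + synthesized answer: you supply the LLM, we supply ranked+grounded sources.
# `my_llm` stands in for your own prompt -> text function; it is not part of websearchmcp.
def answer_fn(query, sources):           # sources: [{title, url, snippet, content?}]
    # Prefer fetched page content; fall back to the snippet when it is missing or empty.
    ctx = "\n".join(f"[{i+1}] {s['title']}: {s.get('content') or s['snippet']}"
                    for i, s in enumerate(sources))
    return my_llm(f"Answer with citations.\nQ: {query}\nSources:\n{ctx}")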

res = await pipeline.search_with_answer("capital of australia", answer_fn=answer_fn,
                                        fetch_content=True)
print(res.answer)       # "The capital of Australia is Canberra [1][3]..."
print(res.synthesized)  # True (LLM); False = extractive snippet fallback
```

> **Honest scope:** websearchmcp aggregates *free* engines (no proprietary index), so
> raw result quality/freshness depends on those engines. What this layer adds is the
> agent-facing surface — relevance reranking + cited answer synthesis — plus a focus on
> **token economy**: trafilatura extraction and passage trimming mean the downstream
> LLM reads only the relevant text (≈35% fewer tokens with no answer loss in
> `benchmarks/quality_bench.py`), at **zero search-API cost and no key/lock-in**. It is
> not a drop-in replacement for a paid search index; it's the open, self-hosted
> alternative.

#### Cost / token economy

The cost of agentic search is mostly the tokens your LLM ingests. websearchmcp
minimizes that:

- **trafilatura extraction** strips menus/ads/cookie banners → less boilerplate per page.
- **reranking** lets you return top-3 instead of top-10 and still have the answer.
- **passage trimming** returns only the query-relevant passages of a page.
- **fetch + search caches** avoid paying twice for the same page/query.

Run `python benchmarks/quality_bench.py` to measure relevance@3 and the token saving
on live queries (no LLM calls, so the benchmark itself is free).
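
To show where the token saving from passage trimming comes from, the sketch below keeps only the paragraphs that share words with the query. Keyword overlap is a stand-in chosen for this example; the scoring websearchmcp actually uses is not described here, and `trim_passages` and its parameters are hypothetical.

```python
import re


def trim_passages(text: str, query: str, max_passages: int = 3) -> list[str]:
    """Keep up to `max_passages` paragraphs that overlap with the query terms.

    Paragraphs are split on blank lines, scored by the number of distinct
    query words they contain, and returned in their original page order.
    """
    terms = set(re.findall(r"\w+", query.lower()))
    passages = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    scored = []
    for index, passage in enumerate(passages):
        overlap = len(terms & set(re.findall(r"\w+", passage.lower())))
        if overlap:
            scored.append((overlap, index, passage))
    # Pick the best-scoring passages, then restore reading order.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_passages]
    return [passage for _, _, passage in sorted(top, key=lambda item: item[1])]


page = (
    "Subscribe to our newsletter.\n\n"
    "Canberra is the capital of Australia.\n\n"
    "Cookie settings and privacy.\n\n"
    "The capital hosts the federal parliament."
)
kept = trim_passages(page, "capital of australia")
print(kept)
```

Boilerplate paragraphs with no query overlap are dropped, so the downstream LLM receives two short passages instead of the whole page.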

### WebSearchFactory

```python
WebSearchFactory.from_env()                          # Build from environment variables
WebSearchFactory.create(searxng_url=..., engines=...) # Explicit config
```

## Architecture

WebSearchPipeline implements a priority-based search strategy: SearXNG (if configured) is tried first as a reliable self-hosted option, then the pipeline falls back through the scraped engines (DuckDuckGo, Mojeek, and Brave by default). Each engine has its own circuit breaker and rate limiter. Results are deduplicated by normalized URL and cached with a TTL. Content extraction converts raw HTML into clean text suitable for LLM consumption, using trafilatura when it is installed and WebExtractor (regex-based or BeautifulSoup) otherwise.
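
The deduplication step can be pictured as follows. This is a sketch under stated assumptions: the key is built from the lowercased host plus the path, and stripping a leading `www.`, the trailing slash, and the query string are choices made for this example, not confirmed details of websearchmcp's normalizer. `normalize` and `dedupe` are hypothetical names.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to a 'domain+path' key (scheme, query, and fragment ignored)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    return f"{host}{path}"


def dedupe(urls: list[str]) -> list[str]:
    """Keep the first URL seen for each normalized key, preserving order."""
    seen: set[str] = set()
    unique: list[str] = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


urls = [
    "https://www.example.com/docs/",
    "http://example.com/docs?utm_source=x",
    "https://example.com/blog",
]
unique = dedupe(urls)
print(unique)
```

Keeping the first occurrence means the highest-priority engine's copy of a result survives when several engines return the same page.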

## Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

AGPL-3.0-or-later; see [LICENSE](LICENSE).

Open source for individuals and open-source projects. For commercial use in closed-source products, a commercial license is available — contact [gaeldev@gmail.com](mailto:gaeldev@gmail.com).
