Metadata-Version: 2.4
Name: mcp-webs
Version: 1.0.3
Summary: MCP Web Search service for AI ecosystem
License-File: LICENSE
Requires-Python: >=3.12
Requires-Dist: beautifulsoup4>=4.13
Requires-Dist: bleach>=6.0
Requires-Dist: ddgs>=8.0
Requires-Dist: fastmcp>=3.2
Requires-Dist: httpx>=0.28
Requires-Dist: instructor>=1.7
Requires-Dist: langchain>=0.3
Requires-Dist: langgraph>=0.2
Requires-Dist: prometheus-client>=0.20
Requires-Dist: pydantic-settings>=2.8
Requires-Dist: readability-lxml>=0.8
Requires-Dist: redis>=5.2
Requires-Dist: structlog>=25.1
Requires-Dist: tavily>=0.4
Requires-Dist: trafilatura>=2.0
Description-Content-Type: text/markdown

# MCP Web Search

[![Python](https://img.shields.io/badge/Python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Build Status](https://img.shields.io/badge/build-placeholder-gray.svg)](https://github.com/M0M0S/mcp-webs/actions)
[![ruff](https://img.shields.io/badge/lint-ruff-ff69b4.svg)](https://github.com/astral-sh/ruff)
[![mypy](https://img.shields.io/badge/typecheck-mypy-white.svg)](https://mypy.readthedocs.io/)

An MCP service for web search and content extraction, built on the **Model Context Protocol** with FastMCP.

## Features

Four MCP tools:

1. **`search`** — web search with smart filtering and fallback chain
2. **`content`** — clean text extraction from URLs with SSRF protection
3. **`webfetch`** — agent-based search via LangGraph StateGraph + LLM-as-Judge
4. **`llm_health`** — LLM model health status in failover chain

## Architecture

```
FastMCP (primary server)
├── search tool    → DuckDuckGo + fallback chain + smart filtering
├── content tool   → Trafilatura + SSRF protection + cache
└── webfetch tool  → LangGraph StateGraph (8 nodes) + LLM-as-Judge
```

## Installation

```bash
# Clone the repository
git clone https://github.com/M0M0S/mcp-webs.git
cd mcp-webs

# Install dependencies
uv sync

# Configure environment variables
cp .env.example .env
# Fill in .env (LLM_API_KEY, LLM_BASE_URL, etc.)
```

## Usage

### Start MCP Server

```bash
uv run python -m app.main
```

### Connect to Claude Desktop (example)

Add to `claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "web-search": {
      "command": "uv",
      "args": ["run", "python", "-m", "app.main"],
      "env": {
        "LLM_API_KEY": "your-key",
        "LLM_BASE_URL": "https://api.openai.com/v1"
      }
    }
  }
}
```

### MCP Tools

| Tool | Description | Parameters |
|------|-------------|------------|
| `search` | Web search with fallback chain | `query`, `max_results`, `provider` |
| `content` | Extract text content from URL | `url`, `token_limit` |
| `webfetch` | Agent-based search via LangGraph | `query`, `max_concurrent` |
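
On the wire, an MCP client invokes any of these tools with a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that message with the standard library. The helper name `build_tool_call` is hypothetical, and the argument types (a string `query`, an integer `max_results`) are assumptions, since the table lists parameter names only.

```python
import json
from itertools import count

# JSON-RPC ids only need to be unique within one client session.
_request_ids = count(1)


def build_tool_call(name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    return json.dumps(message)


# Argument values are illustrative; the types are assumed, not documented.
request = build_tool_call(
    "search", {"query": "model context protocol", "max_results": 5}
)
print(request)
```

In practice an MCP client library sends this message for you; the sketch is only meant to show how the tool name and the parameter columns map onto the request.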

## Development

### Project Standards

- [CONTRIBUTING.md](CONTRIBUTING.md) — how to contribute, process, standards
- [SECURITY.md](SECURITY.md) — security policy, SSRF, secret handling
- [docs/standards/](docs/standards/) — detailed standards reference

### Commands

```bash
# Tests
uv run pytest tests/ -v

# Coverage
uv run pytest tests/ --cov=app --cov-report=term-missing

# Linting
uv run ruff check app/ tests/

# Formatting
uv run ruff format app/ tests/

# Type checking
uv run mypy app/

# Security scan
uv run bandit -r app/
```

### Configuration

Environment variables are documented in [docs/standards/configuration.md](docs/standards/configuration.md).

## Search Logic

### `search` — search with fallback chain:
1. Cache lookup (Redis cache-aside)
2. DuckDuckGo → SearxNG → Tavily → Google (fallback chain)
3. Smart filtering (SEO spam, clickbait, blacklist)
4. Result caching
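
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: a plain `dict` stands in for Redis, the providers are arbitrary callables, and `search_with_fallback`, `AllProvidersFailed`, and the `is_spam` predicate are hypothetical names. Treating an empty result list as a failure is also an assumption.

```python
from typing import Callable

SearchFn = Callable[[str], list[dict]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain errored or returned nothing."""


def search_with_fallback(
    query: str,
    providers: list[tuple[str, SearchFn]],
    cache: dict[str, list[dict]],
    is_spam: Callable[[dict], bool],
) -> list[dict]:
    # Step 1: cache-aside lookup (a dict stands in for Redis here).
    if query in cache:
        return cache[query]

    errors: list[str] = []
    for name, provider in providers:
        # Step 2: try providers in order, moving on when one fails.
        try:
            results = provider(query)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if not results:
            # Assumption: an empty result also triggers the next provider.
            errors.append(f"{name}: no results")
            continue

        # Step 3: drop results the filter flags.
        filtered = [item for item in results if not is_spam(item)]

        # Step 4: store the filtered results for later calls.
        cache[query] = filtered
        return filtered

    raise AllProvidersFailed("; ".join(errors))
```

Passing the providers as an ordered list keeps the chain order (DuckDuckGo, SearxNG, Tavily, Google) a matter of configuration rather than control flow.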

### `content` — content extraction:
1. SSRF protection (whitelist + private IP check)
2. Trafilatura → readability-lxml → bs4 (fallback chain)
3. HTML sanitization (bleach)
4. Caching (TTL: 24h)
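
A minimal sketch of the SSRF check in step 1, using only the standard library. The names `is_url_allowed`, `resolve_host`, and `allowed_hosts` are hypothetical, and the exact rules the project applies are not stated here, so this shows one common approach: restrict the scheme, optionally enforce a host whitelist, then reject any host that resolves to a non-public address.

```python
import ipaddress
import socket
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def resolve_host(host: str) -> list[str]:
    """Return every IP address the host name resolves to."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(
    url: str,
    allowed_hosts: Optional[Iterable[str]] = None,
    resolver: Callable[[str], list[str]] = resolve_host,
) -> bool:
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False

    host = parts.hostname.lower()
    # Whitelist check: when a whitelist is given, the host must be on it.
    if allowed_hosts is not None and host not in set(allowed_hosts):
        return False

    try:
        addresses = resolver(host)
    except OSError:
        return False  # unresolvable hosts are rejected

    # Private IP check: every resolved address must be publicly routable,
    # which rules out loopback, RFC 1918, and link-local ranges.
    return bool(addresses) and all(
        ipaddress.ip_address(raw).is_global for raw in addresses
    )
```

A check like this guards against DNS rebinding only if the subsequent HTTP request connects to the address that was validated, so the resolved IP should be reused rather than resolved a second time.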

### `webfetch` — agent-based search:
1. **Stage 1**: Generate queries via LLM
2. **Stage 2**: Parallel searches (6 concurrent)
3. **Stage 3**: Select URLs for extraction
4. **Stage 4**: Judge URLs (LLM-as-Judge, threshold ≥0.85)
5. **Stage 5**: Fetch content (Trafilatura)
6. **Stage 6**: Generate features (Pydantic models)
7. **Stage 7**: Judge features (threshold ≥0.92)
8. **Fallback**: Simple search on agent failure
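
Two ideas from this pipeline can be shown without LangGraph: bounding the parallel searches of Stage 2 and applying a judge threshold as in Stage 4. The sketch below uses plain `asyncio`; `run_searches` and `keep_above_threshold` are hypothetical helpers, the search coroutine is supplied by the caller, and the scores are assumed to come from an LLM judge that is not shown.

```python
import asyncio
from typing import Awaitable, Callable


async def run_searches(
    queries: list[str],
    search: Callable[[str], Awaitable[list[str]]],
    max_concurrent: int = 6,
) -> list[str]:
    """Run one search per query, with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(query: str) -> list[str]:
        async with semaphore:
            return await search(query)

    batches = await asyncio.gather(
        *(bounded(query) for query in queries), return_exceptions=True
    )

    urls: list[str] = []
    for batch in batches:
        if isinstance(batch, BaseException):
            continue  # one failed query should not sink the others
        urls.extend(batch)
    # Deduplicate while keeping first-seen order.
    return list(dict.fromkeys(urls))


def keep_above_threshold(
    scores: dict[str, float], threshold: float = 0.85
) -> list[str]:
    """Keep the items whose judge score meets the threshold."""
    return [item for item, score in scores.items() if score >= threshold]
```

The same `keep_above_threshold` helper would cover Stage 7 by passing `threshold=0.92`.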

## Prometheus Metrics

Implemented metrics (via `app/core/metrics.py`):

| Metric | Type | Description |
|--------|------|-------------|
| `provider_search_total` | Counter | Search attempts per provider |
| `provider_search_failure_total` | Counter | Failed searches per provider |
| `provider_health_score` | Gauge | Provider health (0.0–1.0) |
| `provider_chain_position` | Gauge | Provider position in fallback chain |
| `llm_failover_total` | Counter | LLM failover events (from→to model) |
| `llm_failover_duration_seconds` | Histogram | Failover duration |
| `llm_model_health_score` | Gauge | LLM model health (0.0–1.0) |
| `llm_active_model_index` | Gauge | Active LLM model index |
| `webfetch_checkpoint_save_total` | Counter | WebFetch checkpoint saves |
| `webfetch_checkpoint_resume_total` | Counter | WebFetch checkpoint resumes |
| `webfetch_checkpoint_size_bytes` | Histogram | Checkpoint payload size |
| `webfetch_active_checkpoints` | Gauge | Active checkpoints per tenant |
| `cache_ttl_distribution_seconds` | Histogram | Cache TTL distribution |
| `cache_stale_invalidations_total` | Counter | Cache stale invalidations |
| `cache_freshness_avg` | Gauge | Average cache freshness |
| `knowledge_graph_concepts_count` | Gauge | KG concepts count |
| `knowledge_graph_terms_count` | Gauge | KG related terms count |
| `kg_expansion_applied_total` | Counter | KG expansion events |
| `kg_enriched_concepts_total` | Counter | KG enriched concepts |

## See Also

- [CHANGELOG.md](./CHANGELOG.md) — version history
- [pyproject.toml](./pyproject.toml) — dependencies and configuration
