Metadata-Version: 2.4
Name: doc-builder-mcp
Version: 0.1.3
Summary: MCP Server for intelligent documentation scraping, vectorization, and semantic search with dynamic ontology extraction
Project-URL: Homepage, https://github.com/Hexecu/mcp-doc-builder
Project-URL: Documentation, https://github.com/Hexecu/mcp-doc-builder#readme
Project-URL: Repository, https://github.com/Hexecu/mcp-doc-builder
Project-URL: Issues, https://github.com/Hexecu/mcp-doc-builder/issues
Author: NeuralCode Team
License-Expression: MIT
Keywords: documentation,knowledge-graph,llm,mcp,neo4j,scraping,vector-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: google-auth>=2.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: litellm>=1.40.0
Requires-Dist: lxml>=5.0.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: neo4j>=5.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: questionary>=2.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tenacity>=8.2.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: trafilatura>=1.8.0
Requires-Dist: uvicorn>=0.30.0
Provides-Extra: all
Requires-Dist: mypy>=1.9.0; extra == 'all'
Requires-Dist: playwright>=1.40.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest>=8.0.0; extra == 'all'
Requires-Dist: ruff>=0.3.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.9.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.3.0; extra == 'dev'
Provides-Extra: js
Requires-Dist: playwright>=1.40.0; extra == 'js'
Description-Content-Type: text/markdown

# MCP Doc Builder

Intelligent Documentation Scraping, Vectorization, and Semantic Search for AI Coding Assistants.

## Overview

MCP Doc Builder is a Model Context Protocol (MCP) server that provides:

- **Intelligent Web Scraping**: LLM-guided crawler that decides which documentation pages to index
- **Semantic Vectorization**: Gemini text-embedding-004 for semantic search across documentation
- **Dynamic Ontology**: Automatically extracts concepts and relationships from documentation
- **Knowledge Graph**: Neo4j-based storage with full graph traversal capabilities
- **Hybrid Search**: Combined vector similarity and fulltext search for optimal results

## Features

### Intelligent Crawling
- LLM-powered link evaluation decides which pages to follow
- Respects rate limits to avoid overwhelming documentation servers
- Configurable depth (1-5 hops from root URL)
- Smart content extraction with trafilatura
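
The crawl limits above can be illustrated with a minimal breadth-first sketch. This is not the project's crawler: `crawl`, `get_links`, and `should_follow` are hypothetical names, the link graph is an in-memory stand-in for real HTTP fetches, and the keyword check stands in for the LLM's link evaluation.

```python
import time
from collections import deque
from typing import Callable, Iterable


def crawl(
    root: str,
    get_links: Callable[[str], Iterable[str]],
    should_follow: Callable[[str], bool],
    max_depth: int = 2,
    max_pages: int = 500,
    rate_limit: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Breadth-first crawl bounded by depth, page count, and request rate."""
    visited: list[str] = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if visited:
            sleep(rate_limit)  # wait between consecutive requests
        visited.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if link not in seen and should_follow(link):
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical site structure used in place of real HTTP requests.
SITE = {
    "https://example.com/docs": [
        "https://example.com/docs/api",
        "https://example.com/blog",
    ],
    "https://example.com/docs/api": ["https://example.com/docs/api/hooks"],
    "https://example.com/docs/api/hooks": ["https://example.com/docs/api/hooks/deep"],
}

pages = crawl(
    "https://example.com/docs",
    get_links=lambda url: SITE.get(url, []),
    should_follow=lambda url: "/docs" in url,  # stand-in for the LLM decision
    max_depth=2,
    rate_limit=0.0,
)
print(pages)
```

With `max_depth=2` the page three hops from the root is never fetched, and the blog link is rejected by the evaluation callback.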

### Semantic Search
- Gemini text-embedding-004 for 768-dimensional vectors
- Neo4j Vector Index for fast similarity search
- Fulltext search with Lucene
- Hybrid search combining both methods
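
This README does not state how the vector and fulltext result lists are merged. One common approach is reciprocal rank fusion (RRF), sketched below as an assumption rather than a description of the server's actual scoring; the function name and chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several ranked ID lists; items ranked high in any list score higher."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


vector_hits = ["chunk-a", "chunk-b", "chunk-c"]    # ordered by vector similarity
fulltext_hits = ["chunk-b", "chunk-d", "chunk-a"]  # ordered by fulltext score
fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print([chunk_id for chunk_id, _ in fused])
```

RRF only needs ranks, so it avoids having to normalize cosine similarities against Lucene scores, which live on different scales.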

### Dynamic Ontology
- Automatic concept extraction (APIs, patterns, entities)
- Relationship inference (uses, extends, requires, etc.)
- Chunk-to-concept linking
- Concept co-occurrence analysis
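
As a rough illustration of co-occurrence analysis, the sketch below counts how often two concepts are mentioned in the same chunk. The data layout (a mapping from chunk ID to concept names) and the function name are assumptions made for the example, not the project's internal representation.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(chunk_concepts: dict[str, set[str]]) -> Counter:
    """Count unordered concept pairs that appear together in at least one chunk."""
    pairs: Counter = Counter()
    for concepts in chunk_concepts.values():
        # Sorting makes (a, b) and (b, a) map to the same key.
        for first, second in combinations(sorted(concepts), 2):
            pairs[(first, second)] += 1
    return pairs


mentions = {
    "chunk-1": {"useState", "useEffect"},
    "chunk-2": {"useState", "useEffect", "useRef"},
    "chunk-3": {"useRef"},
}
counts = concept_cooccurrence(mentions)
print(counts.most_common(1))
```

Pairs with high counts are natural candidates for a relationship between two concepts.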

### MCP Integration
- 6 tools for complete documentation management
- Resources for graph exploration
- Workflow prompts for common tasks

## Quick Start

### 1. Prerequisites

- Python 3.11+
- Docker (for Neo4j)
- LiteLLM Gateway or Gemini API key

### 2. Installation

You can install `doc-builder-mcp` globally using `pipx` (recommended) or in a local virtual environment.

#### Option 1: One-Line Install (Recommended)

```bash
# Install the package
pipx install doc-builder-mcp

# Run the interactive Setup Wizard
doc-mcp-setup
```

The wizard will:
1.  Check for Docker and Neo4j.
2.  Ask for your **LiteLLM / Gemini Credentials**.
3.  Configure the **LLM Mode** (LiteLLM vs Gemini Direct).
4.  Generate a secure `.env` file.

<details>
<summary>❓ Don't have <code>pipx</code>? Click here to install it</summary>

**macOS:**
```bash
brew install pipx
pipx ensurepath
```

**Windows:**
```bash
python -m pip install --user pipx
python -m pipx ensurepath
```

**Linux (Debian/Ubuntu):**
```bash
sudo apt install pipx
pipx ensurepath
```

*Restart your terminal after installing pipx.*

</details>

#### Alternative: Standard Pip

If you prefer not to use pipx:
```bash
pip install doc-builder-mcp
doc-mcp-setup
```

#### Option 2: Manual Development Setup

If you want to contribute or modify the code:

```bash
git clone https://github.com/Hexecu/mcp-doc-builder.git
cd mcp-doc-builder
make full-setup
```

### 3. Setup

If you did not run the setup wizard during installation, run it now:

```bash
doc-mcp-setup
```

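Or configure manually by copying the template. The relative paths below assume you are in the `server/` directory of a cloned repository: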

```bash
cp ../.env.example ../.env
# Edit .env with your configuration
```

### 4. Start Neo4j

Start the Neo4j database with Docker, either directly or through the provided Makefile target:

```bash
make neo4j-up
```

*This uses the `docker-compose.yml` to start the Neo4j instance.*

### 5. Run the Server

```bash
# STDIO mode (for IDE integration)
make server-stdio

# HTTP mode (for API access)
make server
```

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `NEO4J_URI` | Neo4j connection URI | `bolt://localhost:7688` |
| `NEO4J_USERNAME` | Neo4j username | `neo4j` |
| `NEO4J_PASSWORD` | Neo4j password | - |
| `LLM_MODE` | `litellm`, `gemini_direct`, or `both` | `litellm` |
| `LITELLM_BASE_URL` | LiteLLM Gateway URL | - |
| `LITELLM_API_KEY` | LiteLLM API key | - |
| `LITELLM_MODEL` | Model name | `gemini-2.5-flash` |
| `CRAWLER_MAX_DEPTH` | Maximum crawl depth | `2` |
| `CRAWLER_RATE_LIMIT` | Seconds between requests | `1.0` |
| `CRAWLER_MAX_PAGES` | Max pages per source | `500` |
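
The defaults in the table can be read as in the following sketch. It uses only the standard library and a hypothetical `Settings` dataclass; the package depends on `pydantic-settings`, so its real configuration code most likely looks different.

```python
import os
from dataclasses import dataclass
from typing import Mapping

VALID_LLM_MODES = {"litellm", "gemini_direct", "both"}


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_username: str
    neo4j_password: str
    llm_mode: str
    litellm_model: str
    crawler_max_depth: int
    crawler_rate_limit: float
    crawler_max_pages: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Apply the documented defaults and validate the few constrained values."""
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise ValueError("NEO4J_PASSWORD has no default and must be set")
    llm_mode = env.get("LLM_MODE", "litellm")
    if llm_mode not in VALID_LLM_MODES:
        raise ValueError(f"LLM_MODE must be one of {sorted(VALID_LLM_MODES)}")
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7688"),
        neo4j_username=env.get("NEO4J_USERNAME", "neo4j"),
        neo4j_password=password,
        llm_mode=llm_mode,
        litellm_model=env.get("LITELLM_MODEL", "gemini-2.5-flash"),
        crawler_max_depth=int(env.get("CRAWLER_MAX_DEPTH", "2")),
        crawler_rate_limit=float(env.get("CRAWLER_RATE_LIMIT", "1.0")),
        crawler_max_pages=int(env.get("CRAWLER_MAX_PAGES", "500")),
    )


# An explicit mapping is passed here so the example does not depend on the real environment.
settings = load_settings({"NEO4J_PASSWORD": "secret", "CRAWLER_MAX_DEPTH": "3"})
print(settings.neo4j_uri, settings.crawler_max_depth)
```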

## MCP Tools

### doc_ingest
Ingest and index a documentation website.

```json
{
  "url": "https://nextjs.org/docs",
  "name": "Next.js Docs",
  "max_depth": 2
}
```
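
MCP clients normally build the request for you. For reference, the arguments shown for each tool in this section travel inside a JSON-RPC 2.0 `tools/call` request, which can be sketched as follows. The helper name `tool_call_request` is hypothetical; the envelope fields follow the MCP specification.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per call


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialize an MCP tool invocation as a JSON-RPC 2.0 request."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


request = tool_call_request(
    "doc_ingest",
    {"url": "https://nextjs.org/docs", "name": "Next.js Docs", "max_depth": 2},
)
print(request)
```

The same helper works for the other tools by changing the tool name and the arguments object.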

### doc_search
Search indexed documentation.

```json
{
  "query": "how to use React hooks",
  "limit": 10,
  "search_mode": "hybrid"
}
```

### doc_context
Get comprehensive context for a topic.

```json
{
  "topic": "authentication in Next.js",
  "include_related": true
}
```

### doc_sources
List all indexed documentation sources.

### doc_refresh
Refresh/re-index a documentation source.

```json
{
  "source_id": "source_abc123",
  "force": false
}
```

### doc_ontology
Explore extracted concepts and relationships.

```json
{
  "source_id": "source_abc123",
  "concept": "useState"
}
```

## IDE Integration

You can use this MCP server with any compatible AI coding assistant.

### Antigravity (Google DeepMind)

1. Open Antigravity Settings or locate your configuration file.
2. Add the MCP server configuration:
```json
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
```

#### Recommended Antigravity Custom Rule
To maximize the utility of `doc-builder`, add this specific rule to your Antigravity global profile or project guidelines (e.g., in `.gemini/rules.md` or global settings):

> **Documentation Strategy Rule:**
> "Whenever you encounter a new, unknown, or recently updated library, framework, or API that is not fully covered by your base training, you MUST proactively use the `doc_ingest` MCP tool to scrape and index its official documentation. Before writing complex implementation code for unfamiliar tools, always query `doc_search` or `doc_context` to understand the best practices and latest syntax."

### Cursor

Cursor supports MCP natively. To add the server:

1. Open **Cursor Settings** (Cmd/Ctrl + Shift + J) > **Features** > **MCP**.
2. Click **+ Add new MCP server**.
3. Set the Type to `command`.
4. Set the Name to `doc-builder`.
5. Set the Command to `doc-mcp` (assuming you installed via `pipx`).
6. Add the necessary environment variables (`NEO4J_PASSWORD`, `LITELLM_API_KEY`, etc.) directly in the Cursor UI environment section.

### VS Code (with Claude Dev / Roo Code)

If you use Claude Dev (now Cline), Roo Code, or a similar MCP client in VS Code:

1. Open the MCP configuration file (usually found at `~/Library/Application Support/Code/User/globalStorage/saoudrizwan.claude-dev/settings/cline_mcp_settings.json` on macOS).
2. Add the server entry:

```json
{
  "mcpServers": {
    "doc-builder": {
      "command": "doc-mcp",
      "args": [],
      "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key"
      }
    }
  }
}
```
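
If the configuration file already lists other servers, the entry has to be merged rather than pasted over the file. The sketch below shows one way to do that with the standard library; `add_mcp_server` is a hypothetical helper, and the demo writes to a temporary file rather than to a real IDE configuration.

```python
import json
import tempfile
from pathlib import Path

DOC_BUILDER_ENTRY = {
    "command": "doc-mcp",
    "args": [],
    "env": {
        "NEO4J_URI": "bolt://localhost:7688",
        "NEO4J_PASSWORD": "your-password",
        "LITELLM_API_KEY": "your-key",
    },
}


def add_mcp_server(config_path: Path, name: str, entry: dict) -> dict:
    """Insert or replace one server under "mcpServers", keeping the others."""
    text = config_path.read_text().strip() if config_path.exists() else ""
    config = json.loads(text) if text else {}
    config.setdefault("mcpServers", {})[name] = entry
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo against a throwaway file that already contains another server.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "mcp_settings.json"
    path.write_text('{"mcpServers": {"other": {"command": "other-mcp"}}}')
    merged = add_mcp_server(path, "doc-builder", DOC_BUILDER_ENTRY)
    on_disk = json.loads(path.read_text())

print(sorted(merged["mcpServers"]))
```

Back up the real configuration file before pointing a script like this at it.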

## Architecture

```
mcp-doc-builder/
├── docker-compose.yml        # Neo4j container
├── .env.example              # Configuration template
└── server/
    ├── pyproject.toml        # Python package
    └── src/doc_builder/
        ├── main.py           # MCP server entry
        ├── config.py         # Settings
        ├── cli/              # Setup wizard & status
        ├── crawler/          # Web scraping
        │   ├── spider.py     # Async crawler
        │   ├── parser.py     # HTML parsing
        │   └── agent.py      # LLM link evaluation
        ├── vector/           # Vectorization
        │   ├── embedder.py   # Gemini embeddings
        │   ├── chunker.py    # Smart chunking
        │   └── indexer.py    # Neo4j vector index
        ├── ontology/         # Knowledge extraction
        │   ├── extractor.py  # Concept extraction
        │   ├── metatag.py    # Metatag processing
        │   └── linker.py     # Relationship building
        ├── kg/               # Neo4j graph
        │   ├── neo4j.py      # Async client
        │   ├── repo.py       # Query repository
        │   └── schema.cypher # Database schema
        ├── llm/              # LLM integration
        │   ├── client.py     # LiteLLM wrapper
        │   └── prompts/      # Prompt templates
        ├── mcp/              # MCP protocol
        │   ├── tools.py      # Tool definitions
        │   ├── resources.py  # Resource handlers
        │   └── prompts.py    # Workflow prompts
        └── security/         # Auth & validation
```

## Graph Schema

### Nodes (Doc* prefixed for namespace separation)

- **DocSource**: Documentation root (URL, name, status)
- **DocPage**: Individual pages with metadata
- **DocChunk**: Vectorized content chunks with embeddings
- **DocConcept**: Extracted concepts (APIs, patterns, entities)
- **DocMetatag**: Page metatags (`og:*`, `twitter:*`, etc.)
- **DocCrawlJob**: Crawl job tracking

### Relationships

- `(DocSource)-[:CONTAINS]->(DocPage)`
- `(DocPage)-[:LINKS_TO]->(DocPage)`
- `(DocPage)-[:HAS_CHUNK]->(DocChunk)`
- `(DocChunk)-[:MENTIONS]->(DocConcept)`
- `(DocConcept)-[:RELATES_TO]->(DocConcept)`
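
The relationships above can be traversed with ordinary Cypher. The sketch below lists the most frequently mentioned concepts for one source using the official `neo4j` Python driver. The property names `name` on `DocSource` and `DocConcept` are assumptions, since this README does not list node properties, and `top_concepts` is a hypothetical helper.

```python
# Follows CONTAINS -> HAS_CHUNK -> MENTIONS from a source down to its concepts.
# The `name` properties are assumed; check schema.cypher for the real ones.
CONCEPTS_FOR_SOURCE = """
MATCH (s:DocSource {name: $source_name})-[:CONTAINS]->(:DocPage)
      -[:HAS_CHUNK]->(:DocChunk)-[:MENTIONS]->(c:DocConcept)
RETURN c.name AS concept, count(*) AS mentions
ORDER BY mentions DESC
LIMIT $limit
"""


def top_concepts(
    uri: str, user: str, password: str, source_name: str, limit: int = 10
) -> list[tuple[str, int]]:
    """Run the query against a live database and return (concept, mentions) pairs."""
    # Imported lazily so the module loads even where the driver is not installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        # execute_query needs a recent 5.x driver (it is not in 5.0).
        records, _summary, _keys = driver.execute_query(
            CONCEPTS_FOR_SOURCE, source_name=source_name, limit=limit
        )
        return [(record["concept"], record["mentions"]) for record in records]


# Example call (requires the Neo4j container to be running):
# top_concepts("bolt://localhost:7688", "neo4j", "your-password", "Next.js Docs")
```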

## CLI Commands

```bash
# Interactive setup
doc-mcp-setup

# Health check
doc-mcp-status --doctor

# Run server
doc-mcp
```

## Development

```bash
# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy src/

# Linting
ruff check src/
```

## License

MIT

## Related Projects

- [MCP KG Memory](../mcp-kg-memory): Knowledge graph memory for AI coding assistants
- [Model Context Protocol](https://modelcontextprotocol.io): MCP specification
