Metadata-Version: 2.4
Name: confluence-scraper-mcp
Version: 0.1.3
Summary: A Model Context Protocol (MCP) server for Confluence RAG with ChromaDB vector search
Project-URL: Homepage, https://github.com/akhilthomas236/confluence-scraper-mcp
Project-URL: Repository, https://github.com/akhilthomas236/confluence-scraper-mcp
Project-URL: Documentation, https://github.com/akhilthomas236/confluence-scraper-mcp#readme
Project-URL: Issues, https://github.com/akhilthomas236/confluence-scraper-mcp/issues
Author-email: Akhil Thomas <akhilthomas236@example.com>
License: MIT
License-File: LICENSE
Keywords: ai,chromadb,confluence,llm,mcp,rag,vector-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Documentation
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Requires-Dist: anyio>=4.0.0
Requires-Dist: atlassian-python-api>=4.0.0
Requires-Dist: beautifulsoup4>=4.13.0
Requires-Dist: chromadb<2.0.0,>=0.4.0
Requires-Dist: fastapi>=0.110.0
Requires-Dist: httpx>=0.28.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: numpy<2.0.0,>=1.21.0
Requires-Dist: pydantic-settings<3.0.0,>=2.2.0
Requires-Dist: pydantic<3.0.0,>=2.6.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: requests>=2.32.0
Requires-Dist: scikit-learn<2.0.0,>=1.0.0
Requires-Dist: sentence-transformers<6.0.0,>=4.0.0
Requires-Dist: starlette>=0.36.3
Requires-Dist: typing-extensions>=4.8.0
Requires-Dist: uvicorn[standard]>=0.27.0
Provides-Extra: dev
Requires-Dist: black>=24.2.0; extra == 'dev'
Requires-Dist: isort>=5.13.0; extra == 'dev'
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.2.1; extra == 'dev'
Requires-Dist: pytest-mock>=3.14.1; extra == 'dev'
Requires-Dist: pytest>=8.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# Confluence RAG Data Pipeline with MCP Protocol

A Model Context Protocol (MCP) server that retrieves relevant context from Confluence pages using Retrieval-Augmented Generation (RAG).

[![PyPI version](https://badge.fury.io/py/confluence-scraper-mcp.svg)](https://badge.fury.io/py/confluence-scraper-mcp)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## 🚀 Quick Start

```bash
# Install from PyPI
pip install confluence-scraper-mcp

# Set environment variables
export CONFLUENCE_BASE_URL="https://your-domain.atlassian.net"
export CONFLUENCE_TOKEN="your-api-token"
export CONFLUENCE_SPACE_KEY="your-space-key"

# Run as MCP server
confluence-scraper-mcp

# Or run as web server
confluence-scraper-mcp --web
```

## Features

- 🔍 **Semantic Search**: Uses ChromaDB for vector-based document retrieval
- 🔗 **MCP Integration**: Full Model Context Protocol implementation
- 📚 **Confluence Native**: Direct integration with Confluence API
- 🏷️ **Smart Filtering**: Filter by spaces, labels, and metadata
- 📎 **Rich Content**: Handles attachments and comments
- 🌐 **Dual Mode**: Run as MCP server or REST API
- 📦 **Easy Install**: Available on PyPI

## Requirements

- Python 3.9 or higher
- Confluence API access token
- ChromaDB for vector storage

## Installation

1. **Install from PyPI (Recommended):**
   ```bash
   pip install confluence-scraper-mcp
   ```

2. **Install uv (only needed for the development setup):**
   ```bash
   curl -LsSf https://astral.sh/uv/install.sh | sh
   ```

3. **Clone and Setup Project (Development):**
   ```bash
   git clone https://github.com/akhilthomas236/confluence-scraper-mcp.git
   cd confluence-scraper-mcp
   # Create virtual environment
   uv venv .venv
   # Activate virtual environment
   source .venv/bin/activate
   # Install dependencies
   uv pip install -r requirements.txt
   ```

4. **Configure Environment:**
   - Create a `.env` file in the project root:
   ```bash
   touch .env
   ```
   - Add the following configuration (adjust values as needed):
   ```bash
   # Required settings
   CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
   CONFLUENCE_TOKEN=your-api-token

   # Optional: Confluence space key
   CONFLUENCE_SPACE_KEY=your-space-key
   
   # Optional settings (with defaults)
   INITIAL_CRAWL=false
   CHROMA_PERSIST_DIR=./data/chroma
   EMBEDDING_MODEL="all-MiniLM-L6-v2"
   MAX_PAGES=1000
   INCLUDE_ATTACHMENTS=true
   INCLUDE_COMMENTS=true
   ```
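
Before starting the server, it can help to check that the required values are set and to see which defaults apply. The following is a minimal sketch, not the package's own settings loader: `resolve_settings` is a hypothetical helper, and treating only `CONFLUENCE_BASE_URL` and `CONFLUENCE_TOKEN` as mandatory is an assumption based on the listing above.

```python
from typing import Dict, Mapping

# Assumption: these two are mandatory; the space key is treated as optional.
REQUIRED = ("CONFLUENCE_BASE_URL", "CONFLUENCE_TOKEN")

# Optional settings with the defaults listed in the .env example above.
DEFAULTS = {
    "INITIAL_CRAWL": "false",
    "CHROMA_PERSIST_DIR": "./data/chroma",
    "EMBEDDING_MODEL": "all-MiniLM-L6-v2",
    "MAX_PAGES": "1000",
    "INCLUDE_ATTACHMENTS": "true",
    "INCLUDE_COMMENTS": "true",
}


def resolve_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return the effective settings, raising if a required value is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("Missing required settings: " + ", ".join(missing))

    # Start from the defaults, then apply any overrides found in the environment.
    settings = dict(DEFAULTS)
    settings.update({key: env[key] for key in DEFAULTS if key in env})
    for name in REQUIRED:
        settings[name] = env[name]
    if env.get("CONFLUENCE_SPACE_KEY"):
        settings["CONFLUENCE_SPACE_KEY"] = env["CONFLUENCE_SPACE_KEY"]
    return settings
```

Calling `resolve_settings(os.environ)` after exporting the variables (or loading `.env` with `python-dotenv`) returns the merged dictionary or raises a `ValueError` that names the missing variables.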

## Usage

### Command Line Interface (After PyPI Installation)

```bash
# Run as MCP server (stdio mode) - default
confluence-scraper-mcp

# Run as web server
confluence-scraper-mcp --web
```

### Development Mode

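1. **Using uv (Recommended):**
   ```bash
   # `uv run` uses the project's virtual environment; `uvx` runs a tool in an
   # isolated environment without the project's dependencies.

   # Development mode with auto-reload
   uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   
   # Run tests
   uv run pytest
   
   # Code formatting and checks
   uvx black .
   uvx isort .
   uv run mypy .
   ```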

2. **Alternative: Using Virtual Environment:**
   ```bash
   # Activate virtual environment
   source .venv/bin/activate
   
   # Then run commands as usual
   uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
   ```

3. **Initial Setup:**
   ```bash
   # Start initial crawl of Confluence pages
   curl -X POST http://localhost:8000/crawl
   
   # Verify server health
   curl http://localhost:8000/health
   ```

4. **Use the MCP API:**
   ```bash
   # Get context for an LLM query
   curl -X POST http://localhost:8000/mcp/context \
     -H "Content-Type: application/json" \
     -d '{
       "messages": [{"role": "user", "content": "Tell me about project X"}],
       "query": "project X documentation",
       "max_context_length": 1000
     }'
   
   # The response will include relevant context from your Confluence pages
   ```

5. **Monitor and Maintain:**
   ```bash
   # View logs
   tail -f logs/app.log
   
   # Re-crawl Confluence (e.g., after updates)
   curl -X POST http://localhost:8000/crawl
   ```

## API Endpoints

- `GET /health`: Health check endpoint
- `POST /crawl`: Trigger Confluence crawl
- `POST /mcp/context`: Get relevant context for a query
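
The same endpoints can be called from Python with only the standard library. This is a sketch under assumptions: the web server is running on `http://localhost:8000` as in the examples above, responses are JSON (the response schema is not documented here), and `build_request`, `context_request`, and `send` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # assumption: local web server from this README


def build_request(method: str, path: str,
                  payload: Optional[Dict[str, Any]] = None,
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a request; a payload, if given, is sent as a JSON body."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def context_request(question: str, query: str,
                    max_context_length: int = 1000) -> urllib.request.Request:
    """Build the POST /mcp/context request shown in the curl example."""
    payload = {
        "messages": [{"role": "user", "content": question}],
        "query": query,
        "max_context_length": max_context_length,
    }
    return build_request("POST", "/mcp/context", payload)


def send(request: urllib.request.Request, timeout: float = 30.0) -> Any:
    """Send the request and decode the body, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_request("GET", "/health"))` checks health, `send(build_request("POST", "/crawl"))` triggers a crawl, and `send(context_request("Tell me about project X", "project X documentation"))` fetches context.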

## MCP (Model Context Protocol) Configuration

This server implements the Model Context Protocol (MCP), so AI assistants and LLM clients that support MCP can use it directly.

### Quick MCP Setup

1. **Install the package:**
   ```bash
   pip install confluence-scraper-mcp
   ```

2. **Copy the MCP configuration:**
   ```bash
   # Copy the example configuration
   cp examples/mcp-client-config.json ~/.config/your-mcp-client/
   ```

3. **Update environment variables in the config:**
   ```json
   {
     "mcpServers": {
       "confluence-scraper-mcp": {
         "command": "confluence-scraper-mcp",
         "args": [],
         "env": {
           "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
           "CONFLUENCE_TOKEN": "your-api-token",
           "CONFLUENCE_SPACE_KEY": "your-space-key"
         }
       }
     }
   }
   ```
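
If the client already has other servers configured, the entry can be generated and merged programmatically rather than edited by hand. This is a sketch only: `build_client_config` and `merge_into` are hypothetical helpers, and the `mcpServers` layout is taken from the example above; where your client stores its configuration file is not covered here.

```python
from typing import Any, Dict, Optional


def build_client_config(base_url: str, token: str,
                        space_key: Optional[str] = None,
                        server_name: str = "confluence-scraper-mcp") -> Dict[str, Any]:
    """Build an `mcpServers` entry matching the JSON example above."""
    env = {"CONFLUENCE_BASE_URL": base_url, "CONFLUENCE_TOKEN": token}
    if space_key:
        env["CONFLUENCE_SPACE_KEY"] = space_key
    return {
        "mcpServers": {
            server_name: {
                "command": "confluence-scraper-mcp",
                "args": [],
                "env": env,
            }
        }
    }


def merge_into(existing: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Any]:
    """Add or replace server entries without dropping other configured servers."""
    merged = dict(existing)
    servers = dict(existing.get("mcpServers", {}))
    servers.update(new["mcpServers"])
    merged["mcpServers"] = servers
    return merged
```

Passing the result to `json.dumps(..., indent=2)` produces text that can be written back to the client's configuration file.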

### MCP Tools Available

The server provides several MCP tools:

- **`confluence_search`**: Search Confluence pages using semantic search
- **`confluence_get_page`**: Retrieve specific page content by ID or title
- **`confluence_crawl`**: Trigger crawling and indexing of content

### Example MCP Tool Usage

```json
{
  "method": "tools/call",
  "params": {
    "name": "confluence_search",
    "arguments": {
      "query": "API authentication methods",
      "space_key": "DEV",
      "max_results": 3,
      "include_attachments": true
    }
  }
}
```
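
MCP messages are JSON-RPC 2.0, so on the wire the call above also carries `jsonrpc` and `id` fields, and over the stdio transport each message is sent as a single line of JSON. The sketch below only shows how such a request could be framed; `tool_call` and `encode_line` are hypothetical helpers, and a real client must complete the MCP `initialize` handshake before calling tools (an MCP client library normally handles both).

```python
import itertools
import json
from typing import Any, Dict

# Monotonically increasing request ids, as JSON-RPC requires unique ids per request.
_request_ids = itertools.count(1)


def tool_call(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a tool invocation in a JSON-RPC 2.0 request envelope."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def encode_line(message: Dict[str, Any]) -> str:
    """Serialize a message as one newline-terminated line of JSON."""
    return json.dumps(message, separators=(",", ":")) + "\n"


request = tool_call(
    "confluence_search",
    {
        "query": "API authentication methods",
        "space_key": "DEV",
        "max_results": 3,
        "include_attachments": True,
    },
)
wire_line = encode_line(request)
```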

### MCP Configuration Files

The package includes example configuration files:

- **`examples/mcp.json`**: Complete MCP server specification
- **`examples/mcp-client-config.json`**: Simple client configuration

See the [MCP specification](https://spec.modelcontextprotocol.io/) for more details on the protocol.

## 🤖 GitHub Copilot Integration

### Quick Setup for Copilot

1. **Install the package:**
   ```bash
   pip install confluence-scraper-mcp
   ```

2. **Configure VS Code Settings:**
   Open VS Code settings (`Cmd+,` on macOS, `Ctrl+,` on Windows/Linux) and add the following to your `settings.json`:
   ```json
   {
     "github.copilot.chat.mcpServers": {
       "confluence-rag": {
         "command": "confluence-scraper-mcp",
         "args": [],
         "env": {
           "CONFLUENCE_BASE_URL": "https://your-domain.atlassian.net",
           "CONFLUENCE_TOKEN": "your-api-token",
           "CONFLUENCE_SPACE_KEY": "your-space-key"
         }
       }
     }
   }
   ```

3. **Initial Setup:**
   ```bash
   # Start server and crawl content
   confluence-scraper-mcp --web &
   curl -X POST http://localhost:8000/crawl
   ```

4. **Test with Copilot:**
   Open Copilot Chat and ask: *"How do we handle authentication in our system?"*

### Detailed Setup Guide

For complete setup instructions, see: **[📖 Copilot Setup Guide](docs/COPILOT_SETUP.md)**

## Using with Code Assistants

This MCP server specializes in Confluence documentation and uses RAG (Retrieval Augmented Generation) with ChromaDB:

**Key Features:**
- 🔗 **Confluence Integration**: Direct API integration with page, attachment, and comment handling
- 🔍 **Semantic Search**: ChromaDB vector search for meaning-based retrieval
- 🏷️ **Smart Filtering**: Filter by space keys, labels, content types
- 📊 **Metadata Preservation**: Maintains Confluence structure and relationships

An example MCP configuration file with two endpoints, each filtered by space key and labels:

```json
{
  "endpoints": [
    {
      "name": "API Documentation",
      "url": "http://localhost:8000/mcp/context",
      "options": {
        "max_context_length": 2000,
        "filter": {
          "space_key": "API",
          "labels": ["technical-docs", "api-reference"],
          "include_comments": true,
          "include_attachments": false,
          "semantic_ranking": {
            "weight": 0.7,
            "model": "all-MiniLM-L6-v2"
          }
        }
      },
      "authentication": {
        "type": "none"
      }
    },
    {
      "name": "Architecture Docs",
      "url": "http://localhost:8000/mcp/context",
      "options": {
        "max_context_length": 3000,
        "filter": {
          "space_key": "ARCH",
          "labels": ["architecture", "design"],
          "include_comments": false,
          "include_attachments": true,
          "semantic_ranking": {
            "weight": 0.8,
            "model": "all-MiniLM-L6-v2"
          }
        }
      },
      "authentication": {
        "type": "none"
      }
    }
  ],
  "default_endpoint": "API Documentation"
}
```

- Add the path to this file in VS Code settings under "Copilot Chat: MCP Configuration File"
- See `examples/mcp.json` for a full example with multiple endpoints and filtering options

3. **Usage with Copilot:**
   - In VS Code, open Copilot Chat (Cmd+I)
   - Your queries will now include relevant context from your Confluence pages
   - Example: "How do I implement feature X?" will include context from related Confluence documentation
   - You can also use `/doc` command in Copilot Chat to explicitly search documentation

4. **Tips for Better Results:**
   - Keep Confluence pages well-organized and up-to-date
   - Use descriptive titles and labels in Confluence
   - Re-crawl after significant documentation updates:
     ```bash
     curl -X POST http://localhost:8000/crawl
     ```

## Development

1. **Install Development Dependencies:**
   ```bash
   uv pip install -r requirements.txt
   # Test and lint tools are declared in the package's `dev` extra
   uv pip install -e ".[dev]"
   ```

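2. **Using `uv run` and `uvx` for Development:**
   `uv run` executes a command inside the project's virtual environment without activating it. `uvx` runs a tool in an isolated, temporary environment that does not include the project's dependencies, so it only suits standalone tools such as formatters:
   ```bash
   # Run the FastAPI server
   uv run uvicorn app.main:app --reload
   
   # Run tests
   uv run pytest
   
   # Code formatting
   uvx black .
   uvx isort .
   
   # Type checking (needs the project's dependencies)
   uv run mypy .
   ```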

3. **Environment Configuration:**
   The project uses environment variables for configuration. Copy `.env.example` to `.env` and update the values:
   ```bash
   CONFLUENCE_BASE_URL=https://your-domain.atlassian.net
   CONFLUENCE_TOKEN=your-api-token
   CONFLUENCE_SPACE_KEY=your-space-key
   CHROMA_PERSIST_DIR=data/chroma
   CHROMA_COLLECTION_NAME=confluence_docs
   EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
   CHUNK_SIZE=512
   CHUNK_OVERLAP=50
   TOP_K=3
   SIMILARITY_THRESHOLD=0.7
   ```
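
The package's own chunking and ranking code is not shown in this README, so the following is only a sketch of how settings such as `CHUNK_SIZE`, `CHUNK_OVERLAP`, `TOP_K`, and `SIMILARITY_THRESHOLD` typically interact in a RAG pipeline. It assumes character-based chunks and similarity scores where higher is better; `chunk_text` and `select_hits` are hypothetical names, not functions from this package.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def select_hits(scored: List[Tuple[str, float]], top_k: int = 3,
                threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Keep hits scoring at least `threshold`, best first, at most `top_k` of them."""
    kept = [hit for hit in scored if hit[1] >= threshold]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:top_k]
```

A larger overlap repeats more text across neighboring chunks, which helps when an answer spans a chunk boundary but increases the number of stored vectors. A higher threshold trades recall for precision before `TOP_K` caps the result count.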

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Make your changes:
   - Use `uvx black .` and `uvx isort .` to format code
   - Use `uv run mypy .` for type checking
   - Add tests for new features
   - Update documentation as needed
4. Run tests (`uv run pytest`)
5. Commit your changes (`git commit -m 'Add some amazing feature'`)
6. Push to the branch (`git push origin feature/amazing-feature`)
7. Open a Pull Request

## License

MIT License. See [LICENSE](LICENSE) for more information.
