Metadata-Version: 2.4
Name: iflow-mcp_yairwein-mcp-doc-indexer
Version: 0.1.0
Summary: MCP server for local document indexing and search using LanceDB
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: asyncio
Requires-Dist: fastmcp>=0.1.0
Requires-Dist: lancedb>=0.4.0
Requires-Dist: numpy
Requires-Dist: ollama>=0.1.7
Requires-Dist: pandas
Requires-Dist: psutil>=7.0.0
Requires-Dist: pyarrow
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: python-docx>=1.0.0
Requires-Dist: python-dotenv
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: tiktoken
Requires-Dist: watchdog>=3.0.0
Description-Content-Type: text/markdown

# MCP Document Indexer

A Python-based MCP (Model Context Protocol) server for local document indexing and search using LanceDB vector database and local LLMs.

## Features

- **Real-time Document Monitoring**: Automatically indexes new and modified documents in configured folders
- **Multi-format Support**: Handles PDF, Word (docx/doc), text, Markdown, and RTF files
- **Local LLM Integration**: Uses Ollama for document summarization and keyword extraction; nothing leaves your computer
- **Vector Search**: Semantic search using LanceDB and sentence transformers
- **MCP Integration**: Exposes search and catalog tools via Model Context Protocol
- **Incremental Indexing**: Only processes changed files to save resources
- **Performance Optimized**: Designed for decent performance on standard laptops (e.g. M1/M2 MacBook)

## Installation

### Prerequisites

1. **Python 3.10+** installed
2. **uv** package manager:
```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

3. **Ollama** (for local LLM):
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model (e.g., llama3.2:3b)
ollama pull llama3.2:3b
```

### Install MCP Document Indexer

```bash
# Clone the repository
git clone https://github.com/yairwein/mcp-doc-indexer.git
cd mcp-doc-indexer

# Install with uv
uv sync

# Or install as a package
uv add mcp-doc-indexer
```

## Configuration

Configure the indexer using environment variables or a `.env` file:

```bash
# Folders to monitor (comma-separated)
WATCH_FOLDERS="/Users/me/Documents,/Users/me/Research"

# LanceDB storage path
LANCEDB_PATH="./vector_index"

# Ollama model for summarization
LLM_MODEL="llama3.2:3b"

# Text chunking settings
CHUNK_SIZE=1000
CHUNK_OVERLAP=200

# Embedding model (sentence-transformers)
EMBEDDING_MODEL="all-MiniLM-L6-v2"

# File types to index
FILE_EXTENSIONS=".pdf,.docx,.doc,.txt,.md,.rtf"

# Maximum file size in MB
MAX_FILE_SIZE_MB=100

# Ollama API URL
OLLAMA_BASE_URL="http://localhost:11434"
```
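
As a rough illustration of how these variables could be read, the sketch below parses them into a typed settings object. `IndexerSettings`, `split_csv`, and `load_settings` are hypothetical names, and the fallback values simply reuse the example above; the project's actual configuration code and defaults may differ.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class IndexerSettings:
    """Hypothetical settings container; the project's own config class may differ."""

    watch_folders: list[str]
    lancedb_path: str
    llm_model: str
    chunk_size: int
    chunk_overlap: int
    embedding_model: str
    file_extensions: list[str]
    max_file_size_mb: int
    ollama_base_url: str


def split_csv(value: str) -> list[str]:
    """Split a comma-separated value, trimming spaces and dropping empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def load_settings(env: Mapping[str, str] = os.environ) -> IndexerSettings:
    """Read the documented variables from `env`.

    Fallbacks reuse the example values above; they are assumed, not confirmed defaults.
    """
    settings = IndexerSettings(
        watch_folders=split_csv(env.get("WATCH_FOLDERS", "")),
        lancedb_path=env.get("LANCEDB_PATH", "./vector_index"),
        llm_model=env.get("LLM_MODEL", "llama3.2:3b"),
        chunk_size=int(env.get("CHUNK_SIZE", "1000")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        embedding_model=env.get("EMBEDDING_MODEL", "all-MiniLM-L6-v2"),
        file_extensions=split_csv(
            env.get("FILE_EXTENSIONS", ".pdf,.docx,.doc,.txt,.md,.rtf")
        ),
        max_file_size_mb=int(env.get("MAX_FILE_SIZE_MB", "100")),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
    )
    # An overlap as large as the chunk itself would never advance through the text.
    if not 0 <= settings.chunk_overlap < settings.chunk_size:
        raise ValueError("CHUNK_OVERLAP must be non-negative and smaller than CHUNK_SIZE")
    return settings


settings = load_settings({"WATCH_FOLDERS": "/Users/me/Documents, /Users/me/Research"})
print(settings.watch_folders)
```

Passing a plain mapping instead of `os.environ` keeps the parsing easy to test without touching the real environment.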

## Usage

### Run as Standalone Service

```bash
# Set environment variables
export WATCH_FOLDERS="/path/to/documents"
export LANCEDB_PATH="./my_index"

# Run the indexer
uv run python -m src.main
```

### Integrate with Claude Desktop

Add the following to your Claude Desktop configuration file (on macOS, `~/Library/Application Support/Claude/claude_desktop_config.json`):

```json
{
  "mcpServers": {
    "doc-indexer": {
      "command": "uv",
      "args": [
        "run",
        "--directory",
        "/path/to/mcp-doc-indexer",
        "python",
        "-m",
        "src.main"
      ],
      "env": {
        "WATCH_FOLDERS": "/Users/me/Documents,/Users/me/Research",
        "LANCEDB_PATH": "/Users/me/.mcp-doc-index",
        "LLM_MODEL": "llama3.2:3b"
      }
    }
  }
}
```
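
If the configuration file already lists other MCP servers, the entry needs to be merged in rather than pasted over them. A minimal sketch of that merge, operating on an in-memory dictionary, might look like this. `add_doc_indexer` is a hypothetical helper, not part of the project, and the `other-server` entry is made up for the example.

```python
import json
from copy import deepcopy


def add_doc_indexer(config: dict, repo_dir: str, env: dict[str, str]) -> dict:
    """Return a copy of `config` with a "doc-indexer" entry added or replaced."""
    merged = deepcopy(config)  # leave the caller's dictionary untouched
    servers = merged.setdefault("mcpServers", {})
    servers["doc-indexer"] = {
        "command": "uv",
        "args": ["run", "--directory", repo_dir, "python", "-m", "src.main"],
        "env": dict(env),
    }
    return merged


# Hypothetical existing configuration with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
updated = add_doc_indexer(
    existing,
    "/path/to/mcp-doc-indexer",
    {"WATCH_FOLDERS": "/Users/me/Documents", "LANCEDB_PATH": "/Users/me/.mcp-doc-index"},
)
print(json.dumps(updated, indent=2))
```

Reading and writing the actual file is left out on purpose; back up the original before overwriting it.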

## MCP Tools

The indexer exposes the following tools via MCP:

### `search_documents`
Search for documents using natural language queries.
- **Parameters**:
  - `query`: Search query text
  - `limit`: Maximum number of results (default: 10)
  - `search_type`: "documents" or "chunks"

### `get_catalog`
List all indexed documents with summaries.
- **Parameters**:
  - `skip`: Number of documents to skip (default: 0)
  - `limit`: Maximum documents to return (default: 100)
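
The `skip` and `limit` parameters allow a client to page through a large catalog. The sketch below shows one way to do that. `fetch_page` is a hypothetical stand-in for whatever function actually invokes `get_catalog`, and the assumption that each call returns a list of document records is not confirmed by this README.

```python
from typing import Callable, Iterator


def iter_catalog(
    fetch_page: Callable[[int, int], list[dict]], page_size: int = 100
) -> Iterator[dict]:
    """Yield every catalog entry by requesting pages until a short page arrives."""
    skip = 0
    while True:
        page = fetch_page(skip, page_size)
        yield from page
        if len(page) < page_size:  # a short (or empty) page means we reached the end
            break
        skip += page_size


# Fake catalog standing in for the real tool call.
fake_catalog = [{"file_path": f"/docs/file_{i}.pdf"} for i in range(250)]


def fake_fetch(skip: int, limit: int) -> list[dict]:
    return fake_catalog[skip : skip + limit]


print(sum(1 for _ in iter_catalog(fake_fetch)))
```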

### `get_document_info`
Get detailed information about a specific document.
- **Parameters**:
  - `file_path`: Path to the document

### `reindex_document`
Force reindexing of a specific document.
- **Parameters**:
  - `file_path`: Path to the document to reindex

### `get_indexing_stats`
Get current indexing statistics.
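
MCP clients such as Claude Desktop build these calls for you, but it can help to see what travels over the wire. Under the Model Context Protocol, a tool invocation is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below constructs such a request for `search_documents`; `tool_call_request` is a hypothetical helper and the argument values are examples only.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need a unique id per request


def tool_call_request(name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 `tools/call` request for the named MCP tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


request = tool_call_request(
    "search_documents",
    {"query": "machine learning", "limit": 5, "search_type": "chunks"},
)
print(json.dumps(request, indent=2))
```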

## Example Usage in Claude

Once configured, you can use the indexer in Claude:

```
"Search my documents for information about machine learning"
"Show me all PDFs I've indexed"
"What documents mention Python programming?"
"Get details about /Users/me/Documents/report.pdf"
"Reindex the latest version of my thesis"
```

## Architecture

```
┌─────────────────┐     ┌──────────────┐     ┌─────────────┐
│  File Monitor   │────▶│   Document   │────▶│  Local LLM  │
│   (Watchdog)    │     │    Parser    │     │  (Ollama)   │
└─────────────────┘     └──────────────┘     └─────────────┘
                               │                      │
                               ▼                      ▼
                        ┌──────────────┐     ┌─────────────┐
                        │   LanceDB    │◀────│  Embeddings │
                        │   Storage    │     │  (ST Model) │
                        └──────────────┘     └─────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │  FastMCP     │
                        │   Server     │
                        └──────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │    Claude    │
                        │   Desktop    │
                        └──────────────┘
```

## File Processing Pipeline

1. **File Detection**: Watchdog monitors configured folders for changes
2. **Document Parsing**: Extracts text from PDF, Word, plain-text, Markdown, and RTF files
3. **Text Chunking**: Splits documents into overlapping chunks for better retrieval
4. **LLM Processing**: Generates summaries and extracts keywords using Ollama
5. **Embedding Generation**: Creates vector embeddings using sentence transformers
6. **Vector Storage**: Stores documents and chunks in LanceDB
7. **MCP Exposure**: Makes search and catalog tools available via MCP
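
To make step 3 concrete, here is a minimal sketch of overlapping chunking driven by `CHUNK_SIZE` and `CHUNK_OVERLAP`. It counts characters for simplicity; the project may well count tokens instead (it depends on `tiktoken`), so treat `chunk_text` as an illustration rather than the actual implementation.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` characters.

    Each chunk starts `chunk_size - chunk_overlap` characters after the previous
    one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):  # this chunk already reaches the end
            break
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(chunk) for chunk in chunks])  # [1000, 1000, 900]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing and embedding some text twice.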

## Performance Considerations

- **Incremental Indexing**: Only changed files are reprocessed
- **Async Processing**: Processes multiple documents concurrently
- **Batch Operations**: Efficient batch indexing for multiple files
- **Debouncing**: Prevents duplicate processing of rapidly changing files
- **Size Limits**: Configurable maximum file size to prevent memory issues
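
One common way to implement incremental indexing is to store a content hash per file and skip files whose hash has not changed. The sketch below illustrates that idea with SHA-256. It is an assumption about the general technique, not a description of how this project detects changes (it could equally rely on modification times), and `file_digest` and `files_to_reindex` are hypothetical names.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def files_to_reindex(paths: list[Path], known_digests: dict[str, str]) -> list[Path]:
    """Return the files that are new or changed, updating `known_digests` in place."""
    changed = []
    for path in paths:
        current = file_digest(path)
        if known_digests.get(str(path)) != current:
            known_digests[str(path)] = current
            changed.append(path)
    return changed
```

In a real indexer the digest map would be persisted alongside the index so that unchanged files are also skipped after a restart.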

## Troubleshooting

### Ollama Not Available
If Ollama is not running or the model isn't available, the indexer falls back to simple text extraction without summarization.

```bash
# Check Ollama status
ollama list

# Pull required model
ollama pull llama3.2:3b
```
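
The same check can be scripted. Ollama's HTTP API lists locally available models at `GET /api/tags`; the sketch below queries that endpoint and reports whether a given model is present. `model_is_pulled` and `ollama_has_model` are hypothetical helpers, and the base URL is assumed to match `OLLAMA_BASE_URL`.

```python
import json
import urllib.request


def model_is_pulled(tags_response: dict, model: str) -> bool:
    """Check whether `model` appears in a parsed /api/tags response."""
    names = {entry.get("name") for entry in tags_response.get("models", [])}
    return model in names


def ollama_has_model(base_url: str, model: str, timeout: float = 2.0) -> bool:
    """Return True if Ollama answers at `base_url` and lists `model`.

    Connection failures, HTTP errors, and malformed JSON all count as "not available".
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as response:
            return model_is_pulled(json.load(response), model)
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    print(ollama_has_model("http://localhost:11434", "llama3.2:3b"))
```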

### Permission Issues
Ensure the indexer has read access to monitored folders:
```bash
chmod -R u+rX /path/to/documents
```

### Memory Usage
For large document collections, consider:
- Reducing `CHUNK_SIZE` to create smaller chunks
- Limiting `MAX_FILE_SIZE_MB` to skip very large files
- Using a smaller embedding model

## Development

### Running Tests
```bash
uv run pytest tests/
```

### Code Formatting
```bash
uv run black src/
uv run ruff check src/
```

### Building Package
```bash
uv build
```

## License

MIT License - See LICENSE file for details

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Add tests for new functionality
4. Submit a pull request

## Support

For issues or questions:
- Open an issue on GitHub
- Check the troubleshooting section
- Review logs in the console output
