Metadata-Version: 2.4
Name: cocoindex_code_mcp_server
Version: 0.2.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Rust
Classifier: Operating System :: OS Independent
Classifier: Development Status :: 4 - Beta
Classifier: Topic :: Software Development
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Requires-Dist: cocoindex[embeddings]>=0.1.63
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: psycopg[pool]>=3.2.9
Requires-Dist: psycopg[binary]>=3.1.0
Requires-Dist: psycopg-pool>=3.1.0
Requires-Dist: pgvector>=0.4.1
Requires-Dist: numpy>=2.3.1
Requires-Dist: astchunk>=0.1.0
Requires-Dist: lark-parser>=0.12.0
Requires-Dist: prompt-toolkit>=3.0.0
Requires-Dist: cachetools>=6.1.0
Requires-Dist: click>=8.2.1
Requires-Dist: mcp>=1.12.0
Requires-Dist: tree-sitter>=0.23.0,<0.24.0
Requires-Dist: tree-sitter-python>=0.23.6,<0.24.0
Requires-Dist: tree-sitter-c-sharp>=0.23.1,<0.24.0
Requires-Dist: tree-sitter-java>=0.23.5,<0.24.0
Requires-Dist: tree-sitter-typescript>=0.23.2,<0.24.0
Requires-Dist: tree-sitter-c>=0.21.4,<0.23.0
Requires-Dist: tree-sitter-cpp>=0.22.3,<0.23.0
Requires-Dist: tree-sitter-rust>=0.21.2,<0.23.0
Requires-Dist: tree-sitter-kotlin>=1.0.0
Requires-Dist: tree-sitter-javascript>=0.23.1,<0.24.0
Requires-Dist: maturin>=1.9.6
Requires-Dist: typing-extensions>=4.15.0
Requires-Dist: pytest>=8.4.1 ; extra == 'test'
Requires-Dist: pytest-cov>=7.0.0 ; extra == 'test'
Requires-Dist: pytest-asyncio>=1.1.0 ; extra == 'test'
Requires-Dist: pytest-mock>=3.10.0 ; extra == 'test'
Requires-Dist: pytest-timeout>=2.1.0 ; extra == 'test'
Requires-Dist: pytest-xdist>=3.4.0 ; extra == 'test'
Requires-Dist: pytest-mypy>=1.0.1 ; extra == 'test'
Requires-Dist: coverage>=7.9.2 ; extra == 'test'
Requires-Dist: asyncpg>=0.29.0 ; extra == 'test'
Requires-Dist: cocoindex-code-mcp-server[test] ; extra == 'test'
Requires-Dist: maturin>=1.0,<2.0 ; extra == 'build'
Requires-Dist: mypy>=1.10.0 ; extra == 'build'
Requires-Dist: monkeytype>=23.3.0 ; extra == 'build'
Requires-Dist: pytest-monkeytype>=1.1.0 ; extra == 'build'
Requires-Dist: auto-type-annotate>=1.1.2 ; extra == 'build'
Requires-Dist: isort>=6.0.1 ; extra == 'build'
Requires-Dist: autoflake8>=0.4.1 ; extra == 'build'
Requires-Dist: autopep8>=2.0.0 ; extra == 'build'
Requires-Dist: flake8>=7.3.0 ; extra == 'build'
Requires-Dist: autopep8>=2.0.0 ; extra == 'build'
Requires-Dist: pipdeptree>=2.8.0 ; extra == 'build'
Requires-Dist: deptry>=0.14.0 ; extra == 'build'
Requires-Dist: pydocstyle>=6.3.0 ; extra == 'build'
Requires-Dist: ruff>=0.4.0 ; extra == 'build'
Requires-Dist: pylint>=4.0.2 ; extra == 'build'
Requires-Dist: cocoindex-code-mcp-server[build] ; extra == 'build'
Requires-Dist: pre-commit>=4.3.0 ; extra == 'build'
Requires-Dist: twine>=6.2.0 ; extra == 'act'
Requires-Dist: maturin[patchelf]>=1.0,<2.0 ; extra == 'act'
Requires-Dist: auditwheel>=6.2.0 ; extra == 'act'
Requires-Dist: abi3audit>=0.0.22 ; extra == 'act'
Requires-Dist: cibuildwheel>=3.2.1 ; extra == 'act'
Requires-Dist: delvewheel>=1.11.2 ; extra == 'act'
Requires-Dist: delocate>=0.8.1 ; extra == 'act'
Requires-Dist: cocoindex-code-mcp-server[act] ; extra == 'act'
Provides-Extra: test
Provides-Extra: build
Provides-Extra: act
License-File: LICENSE
Summary: RAG based on cocoindex as MCP server (streamingHttp), with Haskell support
Author-email: aanno <aanno@users.noreply.github.com>
Requires-Python: >=3.11
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/aanno/cocoindex-code-mcp-server
Project-URL: Issues, https://github.com/aanno/cocoindex-code-mcp-server/issues

# CocoIndex Code MCP Server

A Model Context Protocol (MCP) server that provides a RAG (Retrieval Augmented Generation) tool with hybrid search capabilities combining vector similarity and keyword metadata search for code retrieval. Built on the [CocoIndex](https://cocoindex.io) data transformation framework with specialized support for multiple programming languages.

This RAG MCP server enables AI tools (LLMs) to retrieve relevant code snippets from large codebases efficiently and in real-time, leveraging CocoIndex's incremental indexing, tree-sitter based chunking, and smart language-specific embeddings. It enhances the performance of code generation, code completion, and code understanding by virtually enlarging the context window available to the AI models.

Currently uses PostgreSQL + pgvector as the vector database backend, but can be adapted to other backends supported by CocoIndex.

## Table of Contents

- [Quickstart](#quickstart)
- [Command Line Arguments](#command-line-arguments)
- [Features](#features)
- [Supported Languages](#supported-languages)
- [Smart Embedding](#smart-embedding)
- [Development](#development)
- [Contributing](#contributing)

## Quickstart

### 1. Clone the Repository (optional)

```bash
git clone --recursive https://github.com/aanno/cocoindex-code-mcp-server.git
cd cocoindex-code-mcp-server
```

Checking out the sources is _not_ strictly necessary if you just want to use the MCP server, as it can be installed
from PyPI. However, there are some scripts e.g. for starting the pgvector database that are missing from the PyPI
package.

### 2. Install

Build from source using maturin:

```bash
# Install dependencies from PyPI
uv sync
uv sync --all-extras

# And build from source
maturin develop
```

Or simple install from PyPI:

```bash
pip install cocoindex-code-mcp-server
```

I provide native wheels for many systems (including Linux, Windows and MacOS) on PyPI, so no build should be necessary
in most cases. cocoindex-code-mcp-server needs Python 3.11+ (and I prefer to build abi3 wheels for better
compatibility).

### 3. Start the PostgreSQL Database

In one terminal on your local machine, start the pgvector database:

```bash
cd cocoindex-code-mcp-server
./scripts/cocoindex-postgresql.sh
# Maybe you need to install pgvector extension once
./scripts/install-pgvector.py
```

Using the scripts is optional, however you need a running PostgreSQL + pgvector database for the MCP server to work.

### 4. Configure the MCP Server (DB Connection)

cocoindex_code_mcp_server uses the `COCOINDEX_DATABASE_URL` environment variable to connect to the database.
It reads the `.env` file in the current directory if present. You can copy the provided `.env.template` to `.env` and
adjust the connection string if needed.

The current directory does not need to be the directory that you want to scan (see section 'Command Line Arguments'
below for details).

```bash
cp .env.template .env
```

### 5. Start the MCP Server

In another terminal, start the cocoindex_code_mcp_server:

```bash
cd cocoindex-code-mcp-server
python -m cocoindex_code_mcp_server.main_mcp_server --rescan --port 3033 <path_to_code_directory>
```

The server will index the code in the specified directory and start serving requests. This will take some time. It is ready when you see something like:

```text
CodeEmbedding.files (batch update): 1505 source rows NO CHANGE
```

The PyPI package does provide starting server with `cocoindex-code-mcp-server <options> <root-source-dir>`. Remember
that you need a running PostgreSQL + pgvector database for this to work.

### 6. Use the MCP Server

You can now use the RAG server running at `http://localhost:3033` as a streaming HTTP MCP server. For example, with Claude Code, use the following snippet within `"mcpServers"` in your `.mcp.json` file:

```json
{
  "cocoindex-rag": {
    "command": "pnpm",
    "args": [
      "dlx",
      "mcp-remote@next",
      "http://localhost:3033/mcp"
    ]
  }
}
```

## Command Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `paths` | positional | - | Path(s) to code directory/directories to index (can specify multiple) |
| `--paths` | option | - | Alternative way to specify paths (can use multiple times) |
| `--no-live` | flag | false | Disable live update mode |
| `--poll` | int | 60 | Polling interval in seconds for live updates |
| `--default-embedding` | flag | false | Use default CocoIndex embedding instead of smart embedding |
| `--default-chunking` | flag | false | Use default CocoIndex chunking instead of tree-sitter/AST chunking |
| `--default-language-handler` | flag | false | Use default CocoIndex language handling |
| `--chunk-factor-percent` | int | 100 | Chunk size scaling factor as percentage (100=default, <100=smaller, >100=larger) |
| `--port` | int | 3000 | Port to listen on for HTTP |
| `--log-level` | string | INFO | Logging level (DEBUG, INFO, WARNING, ERROR) |
| `--json-response` | flag | false | Enable JSON responses instead of SSE streams |
| `--rescan` | flag | false | Clear database and tracking tables before starting to force re-indexing |

### Examples

```bash
# Index a single directory with live updates
python -m cocoindex_code_mcp_server.main_mcp_server /path/to/code

# Index multiple directories
python -m cocoindex_code_mcp_server.main_mcp_server /path/to/code1 /path/to/code2

# Force re-indexing with custom port
python -m cocoindex_code_mcp_server.main_mcp_server --rescan --port 3033 /path/to/code

# Disable live updates (one-time indexing)
python -m cocoindex_code_mcp_server.main_mcp_server --no-live /path/to/code

# Custom chunk size (50% smaller chunks)
python -m cocoindex_code_mcp_server.main_mcp_server --chunk-factor-percent 50 /path/to/code
```

## Features

- **CocoIndex Backend**: Uses [CocoIndex](https://cocoindex.io) as the embedding and vector database backend with PostgreSQL + pgvector
- **Multiple Language Support**: Specialized support for 20+ programming languages with language-specific parsers and embeddings
- **Streaming HTTP MCP Server**: Real-time code retrieval via Model Context Protocol over HTTP
- **Code Change Detection**: Incremental indexing with automatic detection of file changes
- **Tree-sitter Chunking**: Advanced code parsing and chunking using tree-sitter AST for better code understanding
- **Smart Embedding**: Multiple embedding models automatically selected based on programming language (see [Smart Embedding](#smart-embedding))
- **Hybrid Search**: Combines vector similarity search with keyword/metadata filtering for precise results
  + **Vector Search**: Semantic similarity using language-specific code embeddings
  + **Keyword Search**: Exact matching on metadata fields (functions, classes, imports, etc.)
  + **Hybrid Search**: Weighted combination of both approaches with configurable weights

## Supported Languages

The server supports multiple programming languages with varying levels of integration:

| Language | Extensions | Embedding Model | AST Chunking | Tree-sitter | Remarks |
|----------|------------|-----------------|--------------|-------------|---------|
| **Python** | `.py` | GraphCodeBERT | ✅ astchunk | ✅ python | Custom (not using visitor), <br/>metadata extraction: `language_handlers/python_handler.py`, <br/>analyser: `lang/python/tree_sitter_python_analyzer.py`, <br/>(fallback: `lang/python/python_code_analyzer.py`), <br/>TODO: unify this with visitor approach |
| **Rust** | `.rs` | UniXcoder | ? | ✅ rust | Full metadata support with specialized visitor: `language_handlers/rust_visitor.py` |
| **JavaScript** | `.js`, `.mjs`, `.cjs` | GraphCodeBERT | ?astchunk? | ✅ javascript | Full metadata support with specialized visitor: `language_handlers/javascript_visitor.py` |
| **TypeScript** | `.ts` | UniXcoder | ✅ astchunk | ✅ typescript | Extends javascript visitor: `language_handlers/typescript_visitor.py` |
| **TSX** | `.tsx` | UniXcoder | ✅ astchunk | ?typescript? | ?see typescript? |
| **Java** | `.java` | GraphCodeBERT | ✅ astchunk | ✅ java | Full metadata support with specialized visitor: `language_handlers/java_visitor.py` |
| **Kotlin** | `.kt`, `.kts` | UniXcoder | ? | ✅ kotlin | Full metadata support with specialized visitor: `language_handlers/kotlin_visitor.py` |
| **C** | `.c`, `.h` | GraphCodeBERT | ? | ✅ c | Full metadata support with specialized visitor: `language_handlers/c_visitor.py` |
| **C++** | `.cpp`, `.cc`, `.cxx`,`.hpp` | GraphCodeBERT | ? | ✅ cpp | Extends C visitor: `language_handlers/cpp_visitor.py` |
| **C#** | `.cs` | UniXcoder | ✅ astchunk | ❌ | Tree-sitter parsing/chunking only |
| **Haskell** | `.hs`, `.lhs` | all-mpnet-base-v2 | ✅ | ✅ | Custom maturin extension with specialized visitor, <br/>chunker: `lang/haskell/haskell_ast_chunker.py`, <br/>metadata extraction: `language_handlers/haskell_handler.py` |
| **Other Languages** | see `mappers.py` | all-mpnet-base-v2 | ❌ | ❌ ?regex? | cocoindex defaults (baseline) |

### Legend

- **Embedding Model**: The embedding model automatically selected for the language
- **AST Chunking**: Advanced chunking using [ASTChunk](https://github.com/codelion/astchunk) or custom implementations (based on ideas from ASTChunk and using tree-sitter for the language).
- **Tree-sitter**: Language has tree-sitter parser configured for AST analysis. (python tree-sitter bindings, except for Haskell which uses a Maturin/Rust extension based on rust bindings cargos `tree-sitter` and `tree-sitter-haskell`.)
- **Remarks**: Additional notes about support level
- **Other Languages**: Files recognized but only basic text embedding and chunking applied (cocoindex defaults). <br/>
  This includes: Go, PHP, Ruby, Swift, Scala, Dart, CSS, HTML, JSON, Markdown, YAML, TOML, SQL, R, Fortran, Pascal, XML

## Smart Embedding

The server uses **language-aware code embeddings** that automatically select the optimal embedding model based on the programming language. This approach provides better semantic understanding of code compared to generic text embeddings.

### How It Works

The smart embedding system uses different specialized models optimized for different programming languages:

1. **GraphCodeBERT** (`microsoft/graphcodebert-base`)
   + **Optimized for:** Python, Java, JavaScript, PHP, Ruby, Go, C, C++
   + Pre-trained on code from these languages with graph-based code understanding
   + Best for languages with explicit structure and common patterns

2. **UniXcoder** (`microsoft/unixcoder-base`)
   + **Optimized for:** Rust, TypeScript, C#, Kotlin, Scala, Swift, Dart
   + Unified cross-lingual model for multiple languages
   + Best for modern statically-typed languages

3. **Fallback Model** (`sentence-transformers/all-mpnet-base-v2`)
   + Used for: Languages not specifically supported by code models
   + General-purpose text embedding for broader language support
   + 768-dimensional embeddings matching code-specific models

### Automatic Selection

The embedding model is automatically selected based on file extension:

```python
# Example: Python file automatically uses GraphCodeBERT
file: main.py → language: python → model: microsoft/graphcodebert-base

# Example: Rust file automatically uses UniXcoder
file: lib.rs → language: rust → model: microsoft/unixcoder-base

# Example: Haskell file uses fallback model
file: Main.hs → language: haskell → model: sentence-transformers/all-mpnet-base-v2
```

### Benefits

- **Better Code Understanding**: Code-specific models understand programming constructs better than generic text models
- **Language-Specific Optimization**: Each language gets embeddings from models trained on that language
- **Consistent Search Quality**: Similar code snippets in the same language produce similar embeddings
- **Zero Configuration**: Automatic model selection requires no manual configuration

### Implementation Details

The smart embedding system is implemented as an external wrapper around CocoIndex's `SentenceTransformerEmbed` function, located in `python/cocoindex_code_mcp_server/smart_code_embedding.py`. This approach:

- Does not modify CocoIndex source code
- Uses CocoIndex as a pure dependency
- Provides drop-in compatibility with existing workflows
- Can be easily updated independently

For more technical details, see:

- [`docs/claude/Embedding-Selection.md`](docs/claude/Embedding-Selection.md)
- [`docs/cocoindex/smart-embedding.md`](docs/cocoindex/smart-embedding.md)

## Development

### Prerequisites

- Rust (latest stable version)
- Python 3.11+
- Maturin (build tool for Python extensions in Rust)
- PostgreSQL with pgvector extension
- Tree-sitter language parsers (automatically installed via pyproject.toml)

### Run tests

```bash
# Run tests to verify installation
pytest -c pytest.ini tests/
```

### Code Quality

The project uses mypy for type checking. Use the provided scripts:

```bash
# Type check main source code
./scripts/mypy-check.sh

# Type check tests
./scripts/mypy-check-tests.sh
```

### Project Structure

- **`python/cocoindex_code_mcp_server/`**: Main MCP server implementation
  + `main_mcp_server.py`: MCP server entry point
  + `cocoindex_config.py`: CocoIndex flow configuration
  + `smart_code_embedding.py`: Language-aware embedding selection
  + `mappers.py`: Language and field mappings
  + `tree_sitter_parser.py`: Tree-sitter parsing utilities
  + `db/`: Database abstraction layer
    - `pgvector/`: PostgreSQL + pgvector backend
  + `lang/`: Language-specific handlers
    - `python/`: Python code analyzer
    - `haskell/`: Haskell support (via Rust extension)
- **`tests/`**: Pytest test suite
- **`docs/`**: Documentation
  + `claude/`: Development notes and architecture docs
  + `cocoindex/`: CocoIndex-specific documentation
  + `instructions/`: Task instructions and guides
- **`rust/`**: Rust components
  + `src/lib.rs`: Haskell tree-sitter Rust extension
- **`astchunk/`**: ASTChunk submodule for advanced code chunking

### Running Tests

```bash
# Run all tests
pytest -c pytest.ini tests/

# Run specific test file
pytest -c pytest.ini tests/test_hybrid_search_integration.py

# Run with coverage
pytest -c pytest.ini tests/ --cov=python/cocoindex_code_mcp_server --cov-report=html
```

## Contributing

Contributions are welcome! Please open issues and pull requests on the [GitHub repository](https://github.com/aanno/cocoindex-code-mcp-server).

### Development Workflow

1. Fork the repository
2. Create a feature branch
3. Make your changes with tests
4. Run type checking: `./scripts/mypy-check.sh`
5. Run tests: `pytest tests/`
6. Submit a pull request

### Areas for Contribution

- Additional language support (parsers, embeddings, chunking)
- Enhanced metadata extraction for existing languages
- Performance optimizations
- Documentation improvements
- Bug fixes and issue resolution

## License

AGPL-3.0 or later

## Links

- **CocoIndex Framework**: <https://cocoindex.io>
- **GitHub Repository**: <https://github.com/aanno/cocoindex-code-mcp-server>
- **Model Context Protocol**: <https://modelcontextprotocol.io>
- **ASTChunk**: <https://github.com/codelion/astchunk>

## Acknowledgments

Built on top of the excellent [CocoIndex](https://cocoindex.io) framework for incremental data transformation and the [Model Context Protocol](https://modelcontextprotocol.io) for AI tool integration.

