Metadata-Version: 2.4
Name: iflow-mcp_rdondeti-codegrok-mcp
Version: 0.2.0
Summary: MCP server for semantic code understanding and search
Project-URL: Homepage, https://github.com/rdondeti/CodeGrok_mcp
Project-URL: Repository, https://github.com/rdondeti/CodeGrok_mcp
Project-URL: Issues, https://github.com/rdondeti/CodeGrok_mcp/issues
Author: rdondeti
License: MIT
License-File: LICENSE
Keywords: code-search,code-understanding,mcp,semantic-search,tree-sitter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: chromadb>=1.3.0
Requires-Dist: einops>=0.7.0
Requires-Dist: fastmcp>=2.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: torch>=2.0.0
Requires-Dist: tree-sitter-languages>=1.10.0
Requires-Dist: tree-sitter==0.21.3
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: flake8>=7.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

<div align="center">

# CodeGrok MCP

**Semantic Code Search for AI Assistants**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![MCP](https://img.shields.io/badge/MCP-Compatible-green.svg)](https://modelcontextprotocol.io/)

*Give your AI assistant the power to truly understand your codebase*

[Features](#features) • [Quick Start](#quick-start) • [Capabilities](#-what-codegrok-can-do) • [Limitations](#-what-codegrok-cannot-do) • [Integrations](#-ai-tool-integrations) • [Use Cases](#-use-cases)

</div>

---

## What is CodeGrok MCP?

CodeGrok MCP is a **Model Context Protocol (MCP) server** that enables AI assistants to intelligently search and understand codebases using **semantic embeddings** and **Tree-sitter parsing**.

Unlike simple text search, CodeGrok understands code structure - it knows what functions, classes, and methods are, and can find relevant code even when you describe it in natural language.

```
You: "Where is authentication handled?"
CodeGrok: Returns auth middleware, login handlers, JWT validation code...
```

### Why Use CodeGrok?

**The Problem:** AI assistants have limited context windows. Sending your entire codebase is expensive and often impossible.

**The Solution:** CodeGrok indexes your code once, then AI can query semantically and receive only the 5-10 most relevant code snippets—**10-100x token reduction** vs naive "read all files" approaches.

---

## Features

- **Semantic Code Search** - Find code by meaning, not just keywords
- **9 Languages Supported** - Python, JavaScript, TypeScript, C, C++, Go, Java, Kotlin, Bash
- **28 File Extensions** - Comprehensive coverage including `.jsx`, `.tsx`, `.mjs`, `.hpp`, etc.
- **Fast Parallel Indexing** - 3-5x faster with multi-threaded parsing
- **Incremental Updates** - Only re-index changed files (auto mode)
- **Local & Private** - All data stays on your machine in `.codegrok/` folder
- **Zero LLM Dependencies** - Lightweight, focused tool (no API keys required)
- **GPU Acceleration** - Auto-detects CUDA for faster embeddings
- **Works with Any MCP Client** - Claude, Cursor, Cline, and more

---

## ✅ What CodeGrok CAN Do

### For Live Coding (AI-Assisted Development)

| Capability | Description |
|------------|-------------|
| **Semantic Code Search** | Natural language queries → vector similarity search against indexed code |
| **Find Code by Purpose** | Query "How does auth work?" → Returns relevant auth files with line numbers |
| **Symbol Extraction** | Extracts functions, classes, methods with signatures, docstrings, calls, imports |
| **Incremental Updates** | `learn` with auto mode only re-indexes modified files (uses file modification time) |
| **Persistent Storage** | Index survives restarts in `.codegrok/` folder |
| **Load Existing Index** | `learn` with `mode='load_only'` instantly loads previously indexed codebase |

### For Learning a New Codebase

| Capability | Description |
|------------|-------------|
| **Entry Point Discovery** | Query "main entry point" to find where execution starts |
| **Architecture Understanding** | Query "database connection" to find DB layer |
| **Domain Concepts** | Query "user authentication flow" to find auth logic |
| **Index Statistics** | See files parsed, symbols extracted, timing info |

---

## ❌ What CodeGrok CANNOT Do

> **Important:** Understanding limitations helps you use the tool effectively.

### Not Designed For

| Limitation | Explanation |
|------------|-------------|
| **Code Execution** | Pure indexing/search - no interpreter, no running tests |
| **Code Modification** | Read-only search - doesn't write or edit files |
| **Real-time File Watching** | No daemon mode - manually call `learn` again to update index |
| **Cross-repository Search** | Single codebase per index - can't search multiple projects simultaneously |
| **Find All Usages** | Finds definitions, not references (no "who calls this function?") |
| **Type Inference / LSP** | No language server - no jump-to-definition, no autocomplete |
| **Git History Analysis** | Indexes current state only - no commit history or blame |
| **Regex/Exact Search** | Semantic only - use `grep` or `ripgrep` for exact string matching |
| **Code Metrics** | No complexity scoring, no linting, no coverage data |

### Technical Constraints

| Constraint | Impact |
|------------|--------|
| **First index is slow** | ~50 chunks/second (~3-4 min for 10K symbols) |
| **Memory usage** | Embedding models use 500MB-2GB RAM |
| **Model download** | First run downloads ~500MB model from HuggingFace |
| **Query latency** | ~50-100ms per search |

---

## Quick Start

### Installation

```bash
# Clone the repository
git clone https://github.com/rdondeti/CodeGrok_mcp.git
cd CodeGrok_mcp

# Option 1: Use setup script (recommended)
./setup.sh              # Linux/macOS
# or
.\setup.ps1             # Windows PowerShell

# Option 2: Manual install
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
pip install -e .

# Verify installation
codegrok-mcp --help
```

**Setup script options:**
| Flag | Description |
|------|-------------|
| `--clean` | Remove existing venv before creating new |
| `--prod` | Install production dependencies only |
| `--no-verify` | Skip verification step |

### First Index

Once integrated with your AI tool (see below), ask your assistant:

```
"Learn my codebase at /path/to/my/project"
```

Then search:

```
"Find how API endpoints are defined"
"Where is error handling implemented?"
"Show me the database models"
```

---

## 🎯 Use Cases

### Use Case 1: Live Coding with AI

**How CodeGrok Saves Tokens:**

```
Without CodeGrok:
  AI tries to read entire codebase → exceeds context window → fails or costs $$

With CodeGrok:
  AI: "I need to add a new route"
    ↓ calls get_sources("Express route definition")
  CodeGrok: Returns routes/api.js:15, routes/auth.js:8
    ↓ AI reads only those 2 files
  Result: 10-100x fewer tokens, faster responses
```

### Use Case 2: Learning a New Codebase

```
Step 1: "Learn my codebase at ~/projects/big-app"
Step 2: "Where is the main entry point?"
Step 3: "How is authentication implemented?"
Step 4: "Find the database connection logic"
Step 5: "Show me how API errors are handled"
```

### Use Case 3: Code Review Assistance

```
"Find all functions that handle user input"
"Where is validation performed?"
"Show me error handling patterns"
```

---

## 🔌 AI Tool Integrations

### Claude Code (CLI)

The easiest way to add CodeGrok to Claude Code:

```bash
# Add the MCP server
claude mcp add codegrok-mcp -- codegrok-mcp
```

Or manually add to your settings (`~/.claude/settings.json`):

```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

**Usage in Claude Code:**
```
> learn my codebase at ./my-project
> find authentication logic
> where is the main entry point?
```

---

### Claude Desktop

Add to your Claude Desktop configuration:

| Platform | Config File Location |
|----------|---------------------|
| macOS | `~/Library/Application Support/Claude/claude_desktop_config.json` |
| Windows | `%APPDATA%\Claude\claude_desktop_config.json` |
| Linux | `~/.config/Claude/claude_desktop_config.json` |

```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp",
      "args": []
    }
  }
}
```

Restart Claude Desktop after saving.

---

### Cursor

Cursor supports MCP servers through its extension system:

1. **Open Settings** → Extensions → MCP
2. **Add Server Configuration**:

```json
{
  "codegrok": {
    "command": "codegrok-mcp",
    "transport": "stdio"
  }
}
```

Or add to `.cursor/mcp.json` in your project:

```json
{
  "servers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

---

### Windsurf (Codeium)

Windsurf supports MCP through Cascade:

1. Open **Cascade Settings**
2. Navigate to **MCP Servers**
3. Add configuration:

```json
{
  "codegrok": {
    "command": "codegrok-mcp",
    "transport": "stdio"
  }
}
```

---

### Cline (VS Code)

Add to Cline's MCP settings in VS Code:

1. Open Command Palette (`Ctrl+Shift+P` / `Cmd+Shift+P`)
2. Search "Cline: Open MCP Settings"
3. Add:

```json
{
  "mcpServers": {
    "codegrok": {
      "command": "codegrok-mcp"
    }
  }
}
```

---

### Zed Editor

Zed supports MCP through its assistant panel. Add to settings:

```json
{
  "assistant": {
    "mcp_servers": {
      "codegrok": {
        "command": "codegrok-mcp"
      }
    }
  }
}
```

---

### Continue (VS Code/JetBrains)

Add to your Continue configuration (`~/.continue/config.json`):

```json
{
  "mcpServers": [
    {
      "name": "codegrok",
      "command": "codegrok-mcp"
    }
  ]
}
```

---

### Generic MCP Client

For any MCP-compatible client, use stdio transport:

```bash
# Command to run
codegrok-mcp

# Transport
stdio (stdin/stdout)

# Protocol
Model Context Protocol (MCP)
```

---

## MCP Tools Reference

CodeGrok provides **4 tools** for AI assistants:

| Tool | Description | Key Parameters |
|------|-------------|----------------|
| `learn` | Index a codebase (smart modes) | `path` (required), `mode` (auto/full/load_only), `file_extensions`, `embedding_model` |
| `get_sources` | Semantic code search | `question` (required), `n_results` (1-50, default: 10), `language`, `symbol_type` |
| `get_stats` | Get index statistics | None |
| `list_supported_languages` | List supported languages | None |

**Learn modes:**
- `auto` (default): Smart detection - incremental reindex if exists, full index if new
- `full`: Force complete re-index (destroys existing index)
- `load_only`: Just load existing index without any indexing

### Tool Examples

#### Learn a Codebase
```json
{
  "tool": "learn",
  "arguments": {
    "path": "/home/user/my-project",
    "mode": "auto"
  }
}
```

**Response:**
```json
{
  "success": true,
  "message": "Indexed 150 files with 1,247 symbols",
  "stats": {
    "total_files": 150,
    "total_symbols": 1247,
    "total_chunks": 2834,
    "indexing_time": 12.5
  }
}
```

#### Search for Code
```json
{
  "tool": "get_sources",
  "arguments": {
    "question": "How is user authentication implemented?",
    "n_results": 5
  }
}
```

**Response:**
```json
{
  "sources": [
    {
      "file": "src/auth/middleware.py",
      "symbol": "authenticate_request",
      "type": "function",
      "line": 45,
      "content": "def authenticate_request(request):\n    ...",
      "score": 0.89
    }
  ]
}
```

#### Incremental Update (using learn with auto mode)
```json
{
  "tool": "learn",
  "arguments": {
    "path": "/home/user/my-project",
    "mode": "auto"
  }
}
```

**Response (when index exists):**
```json
{
  "success": true,
  "mode_used": "incremental",
  "files_added": 2,
  "files_modified": 5,
  "files_deleted": 1
}
```

---

## Supported Languages

| Language | Extensions | Parser |
|----------|------------|--------|
| **Python** | `.py`, `.pyi`, `.pyw` | tree-sitter-python |
| **JavaScript** | `.js`, `.jsx`, `.mjs`, `.cjs` | tree-sitter-javascript |
| **TypeScript** | `.ts`, `.tsx`, `.mts`, `.cts` | tree-sitter-typescript |
| **C** | `.c`, `.h` | tree-sitter-c |
| **C++** | `.cpp`, `.cc`, `.cxx`, `.hpp`, `.hh`, `.hxx` | tree-sitter-cpp |
| **Go** | `.go` | tree-sitter-go |
| **Java** | `.java` | tree-sitter-java |
| **Kotlin** | `.kt`, `.kts` | tree-sitter-kotlin |
| **Bash** | `.sh`, `.bash`, `.zsh` | tree-sitter-bash |

**Total: 9 languages, 28 file extensions**

---

## How It Works

### Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                     MCP Client                               │
│        (Claude, Cursor, Cline, etc.)                        │
└─────────────────────────┬───────────────────────────────────┘
                          │ MCP Protocol (stdio)
                          ▼
┌─────────────────────────────────────────────────────────────┐
│                   CodeGrok MCP Server                        │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Parsers   │  │  Embeddings │  │   Vector Storage    │  │
│  │ (Tree-sitter)│  │ (Sentence   │  │    (ChromaDB)       │  │
│  │             │  │ Transformers)│  │                     │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
```

### Indexing Pipeline

```
Source Files → Tree-sitter Parser → Symbol Extraction →
Code Chunks → Embeddings → ChromaDB Storage
```

1. **Parse**: Tree-sitter extracts functions, classes, methods with signatures
2. **Chunk**: Code is split into semantic chunks with context (docstrings, imports, calls)
3. **Embed**: Sentence-transformers create vector embeddings
4. **Store**: ChromaDB persists vectors locally in `.codegrok/`

### Search Pipeline

```
Query → Embedding → Vector Similarity → Ranked Results
```

1. **Embed Query**: Convert natural language to vector
2. **Search**: Find similar vectors in ChromaDB
3. **Return**: Top-k results with file paths, line numbers, and code snippets

### Storage

All data is stored locally in your project:

```
your-project/
└── .codegrok/
    ├── chroma/           # Vector database
    └── metadata.json     # Index metadata (stats, file mtimes)
```

---

## Configuration

### Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `CODEGROK_EMBEDDING_MODEL` | Embedding model to use | `nomic-embed-code` |
| `CODEGROK_DEVICE` | Compute device (cpu/cuda/mps) | Auto-detect |

### Embedding Models

| Model | Size | Best For |
|-------|------|----------|
| `coderankembed` | 768d / 137M | **Code (default, recommended)** - uses `nomic-ai/CodeRankEmbed` |

The default model (`nomic-ai/CodeRankEmbed`) is optimized for code retrieval with:
- 768-dimensional embeddings
- 8192 max sequence length
- State-of-the-art performance on CodeSearchNet benchmarks

### Security Note: `trust_remote_code`

The default embedding model (`nomic-ai/CodeRankEmbed`) requires `trust_remote_code=True` when loading via SentenceTransformers. This flag allows execution of custom Python code bundled with the model.

**Why it's required:**
- The model uses a custom Nomic BERT architecture that isn't part of the standard HuggingFace model library
- Custom files: `modeling_hf_nomic_bert.py` (model architecture), `configuration_hf_nomic_bert.py` (config)

**Security audit:**
The custom code has been reviewed and contains:
- Standard PyTorch neural network definitions
- No `exec()`, `eval()`, or dynamic code execution
- No subprocess or shell commands
- No network requests beyond HuggingFace's standard model download APIs
- Only imports from trusted libraries (torch, transformers, einops, safetensors)

**For maximum security:**
- Review the model code yourself: [nomic-ai/CodeRankEmbed on HuggingFace](https://huggingface.co/nomic-ai/CodeRankEmbed/tree/main)
- Pin to a specific model revision in production deployments
- Consider using Microsoft CodeBERT (`microsoft/codebert-base`) as an alternative that doesn't require `trust_remote_code` (with potential quality trade-offs)

---

## Development

### Setup

```bash
# Clone
git clone https://github.com/rdondeti/CodeGrok_mcp.git
cd CodeGrok_mcp

# Run setup script
./setup.sh              # Linux/macOS (includes dev dependencies)
.\setup.ps1             # Windows PowerShell

# For clean reinstall:
./setup.sh --clean
```

### Testing

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src/codegrok_mcp --cov-report=term-missing

# Run specific test categories
pytest tests/unit/ -v          # Fast unit tests
pytest tests/integration/ -v   # Integration tests (uses real embeddings)
pytest tests/mcp/ -v           # MCP protocol simulation tests
```

### Code Quality

```bash
# Format code
black src/

# Type checking
mypy src/

# Linting
flake8 src/
```

---

## FAQ & Troubleshooting

<details>
<summary><strong>Server won't start</strong></summary>

```bash
# Check installation
pip show codegrok-mcp

# Check Python version (need 3.10+)
python --version

# Reinstall
pip install -e .
```
</details>

<details>
<summary><strong>Indexing is slow</strong></summary>

- Large codebases (>10k files) take longer on first index
- Use `learn` again after first index for incremental updates (auto mode)
- Close other heavy applications
- Consider indexing a subdirectory first
</details>

<details>
<summary><strong>Search returns irrelevant results</strong></summary>

- Be more specific in queries (e.g., "JWT token validation" instead of "auth")
- Re-index if codebase changed significantly
- Check that the code type you're searching exists
</details>

<details>
<summary><strong>Out of memory</strong></summary>

- Index smaller portions of the codebase
- The default `coderankembed` model uses ~500MB-2GB RAM
- Close other applications
</details>

<details>
<summary><strong>"No index loaded" error</strong></summary>

Use `learn` tool first:
```
"Learn my codebase at /path/to/project"
```
</details>

---

## Comparison with Other Tools

| Feature | CodeGrok MCP | grep/ripgrep | GitHub Search | Sourcegraph |
|---------|--------------|--------------|---------------|-------------|
| Semantic Search | ✅ | ❌ | Partial | ✅ |
| Local/Private | ✅ | ✅ | ❌ | ❌ |
| MCP Support | ✅ | ❌ | ❌ | ❌ |
| No API Keys | ✅ | ✅ | ❌ | ❌ |
| Multi-language | ✅ | ✅ | ✅ | ✅ |
| Code Structure Aware | ✅ | ❌ | Partial | ✅ |
| Offline | ✅ | ✅ | ❌ | ❌ |

---

## Contributing

Contributions are welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Make your changes
4. Run tests (`pytest`)
5. Format code (`black src/`)
6. Submit a Pull Request

### Development Guidelines

- Follow Black formatting (line length 100)
- Add type hints to all functions
- Write tests for new features
- Update documentation

---

## License

MIT License - see [LICENSE](LICENSE) for details.

---

## Related Projects

- **[Model Context Protocol](https://modelcontextprotocol.io/)** - The protocol that powers this integration
- **[Tree-sitter](https://tree-sitter.github.io/tree-sitter/)** - Fast, accurate code parsing
- **[ChromaDB](https://www.trychroma.com/)** - Vector database for embeddings
- **[Sentence Transformers](https://www.sbert.net/)** - State-of-the-art embeddings

---

## Support

- **Issues**: [GitHub Issues](https://github.com/rdondeti/CodeGrok_mcp/issues)
- **Discussions**: [GitHub Discussions](https://github.com/rdondeti/CodeGrok_mcp/discussions)

---

<div align="center">

**Made with ❤️ for developers who want AI that truly understands their code**

[⬆ Back to Top](#codegrok-mcp)

</div>
