Metadata-Version: 2.4
Name: mcp-server-thoth
Version: 0.2.2
Summary: MCP server for persistent codebase memory with semantic search
Project-URL: Homepage, https://github.com/braininahat/thoth
Project-URL: Bug Tracker, https://github.com/braininahat/thoth/issues
Project-URL: Source Code, https://github.com/braininahat/thoth
Author-email: Varun Shijo <varunshi@buffalo.edu>
Maintainer-email: Varun Shijo <varun.shijo@gmail.com>
License: MIT
License-File: LICENSE
Keywords: analysis,codebase,mcp,memory,semantic-search,visualization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.13,>=3.10
Requires-Dist: aiosqlite>=0.20.0
Requires-Dist: chromadb>=0.4.0
Requires-Dist: click>=8.1.0
Requires-Dist: mcp>=1.1.0
Requires-Dist: networkx>=3.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: vllm>=0.8.5
Provides-Extra: cache
Requires-Dist: redis>=5.0.0; extra == 'cache'
Provides-Extra: dashboard
Requires-Dist: gradio>=5.0.0; extra == 'dashboard'
Requires-Dist: plotly>=5.0.0; extra == 'dashboard'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Description-Content-Type: text/markdown

# Thoth

MCP server providing persistent codebase memory with semantic search for AI assistants.

<p align="center">
  <a href="https://pypi.org/project/mcp-server-thoth/">
    <img src="https://img.shields.io/pypi/v/mcp-server-thoth.svg" alt="PyPI">
  </a>
  <a href="https://github.com/braininahat/thoth/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License">
  </a>
  <a href="https://pypi.org/project/mcp-server-thoth/">
    <img src="https://img.shields.io/pypi/pyversions/mcp-server-thoth.svg" alt="Python Versions">
  </a>
</p>

## Overview

Thoth indexes code repositories using AST parsing and provides tools for symbol lookup, cross-repository navigation, and architecture visualization. Since v0.2.0, Thoth also includes **semantic search** powered by local embeddings, so natural-language queries can find relevant code without exact keyword matches.

The index persists in `~/.thoth/`, giving Claude and other MCP-compatible assistants memory across conversations.

## Features

- 🔍 **Semantic Search**: Find code using natural language queries with vLLM and Qwen3 embeddings
- 🧠 **Persistent Memory**: Code understanding persists between conversations
- 🔗 **Cross-Repository**: Navigate dependencies across multiple related repositories
- 📊 **Visualizations**: Generate architecture diagrams and dependency graphs
- ⚡ **Fast Indexing**: AST-based parsing with incremental updates
- 🎯 **Precise Navigation**: Jump to exact definitions, find all callers
- 🔧 **Local-First**: All processing happens locally, no cloud dependencies

## Installation

### Requirements

- Python 3.10-3.12 (Python 3.13 not yet supported due to vLLM dependencies)
- For semantic search: ~2GB of free disk space for the embedding model (the model download itself is ~1.2GB)

### Claude Desktop

Add Thoth to your Claude Desktop configuration file:
- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`
- Linux: `~/.config/claude/claude_desktop_config.json`

#### Configuration:
```json
{
  "mcpServers": {
    "thoth": {
      "command": "uvx",
      "args": ["--python", "3.12", "mcp-server-thoth"]
    }
  }
}
```
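
If the configuration file already lists other MCP servers, the entry above has to be merged in rather than pasted over them. A minimal sketch of that merge follows; the helper name `add_thoth_server` is hypothetical and not part of Thoth, and the function simply writes the same `command` and `args` shown above.

```python
import json
from pathlib import Path


def add_thoth_server(config_path: Path, python_version: str = "3.12") -> dict:
    """Merge a "thoth" entry into a Claude Desktop config file.

    Hypothetical helper for illustration; Thoth does not ship this function.
    """
    # Start from the existing config so other MCP servers are preserved.
    text = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    servers = config.setdefault("mcpServers", {})
    servers["thoth"] = {
        "command": "uvx",
        "args": ["--python", python_version, "mcp-server-thoth"],
    }

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Pass the path for your platform from the list above, and back up the file first if it contains other servers you care about.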

To index repositories, either:
1. Use the CLI: `thoth-cli index myrepo /path/to/repo`
2. Use the `index_repository` tool from within Claude

### Command Line

```bash
# Install globally
uv tool install --python 3.12 mcp-server-thoth

# Index a repository
thoth-cli index myproject /path/to/repo

# Search symbols
thoth-cli search "database connection"

# Start MCP server
mcp-server-thoth
```

## Tools

### Core Tools
- `find_definition` - Locate symbol definitions
- `get_file_structure` - Extract functions, classes, imports from a file
- `search_symbols` - Search symbols by name pattern
- `get_callers` - Find callers of a function
- `list_repositories` - List indexed repositories
- `index_repository` - Index a new repository
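
To give a feel for what AST-based extraction of "functions, classes, imports" involves, here is a small sketch built on Python's standard `ast` module. It is an illustration only: the function name `file_structure` and the shape of the returned dictionary are assumptions, not the actual output format of `get_file_structure`.

```python
import ast


def file_structure(source: str) -> dict:
    """Collect top-level functions, classes (with methods), and imports."""
    tree = ast.parse(source)
    structure = {"functions": [], "classes": {}, "imports": []}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            structure["functions"].append((node.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            # Record each method with the line where it is defined.
            structure["classes"][node.name] = [
                (item.name, item.lineno)
                for item in node.body
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
            ]
        elif isinstance(node, ast.Import):
            structure["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Keep leading dots so relative imports stay recognizable.
            structure["imports"].append("." * node.level + (node.module or ""))
    return structure


# Hypothetical input file used only for this example.
SAMPLE = '''\
import os
from pathlib import Path

class Indexer:
    def run(self):
        pass

async def main():
    pass
'''

print(file_structure(SAMPLE))
```

Because each entry carries a line number, a lookup can point at the exact definition instead of returning the whole file.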

### Semantic Search (v0.2.0+)
- `semantic_search` - Natural language code search using embeddings
  - Example: "function that handles user authentication"
  - Returns relevant symbols ranked by semantic similarity

### Visualization Tools
- `generate_module_diagram` - Generate Mermaid dependency diagrams
- `generate_system_architecture` - Visualize cross-repository relationships
- `trace_api_flow` - Trace client-server communication paths
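
A Mermaid dependency diagram is plain text, so producing one from import edges is straightforward. The sketch below shows the general idea under stated assumptions: `to_mermaid` and the module names are hypothetical, and the real output of `generate_module_diagram` may be laid out differently.

```python
def to_mermaid(edges: list[tuple[str, str]]) -> str:
    """Render (importer, imported) pairs as a Mermaid flowchart."""
    lines = ["graph TD"]
    ids: dict[str, str] = {}
    for src, dst in edges:
        for name in (src, dst):
            # Use generated node ids and keep the module name as the label,
            # since dotted names are awkward as Mermaid identifiers.
            if name not in ids:
                ids[name] = f"n{len(ids)}"
        lines.append(f'    {ids[src]}["{src}"] --> {ids[dst]}["{dst}"]')
    return "\n".join(lines)


# Hypothetical module names, for illustration only.
print(to_mermaid([
    ("thoth.server", "thoth.storage"),
    ("thoth.server", "thoth.parser"),
]))
```

The printed text can be pasted into any Mermaid renderer to view the graph.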

## Architecture

### Storage Backend

Thoth uses a hybrid storage approach:
- **SQLite** (`~/.thoth/index.db`): Source of truth for structured data
  - `symbols` - Functions, classes, methods with location and parent relationships
  - `imports` - Import statements with cross-repository resolution
  - `calls` - Function call graph (caller → callee mapping)
  - `files` - File metadata and content hashes for incremental updates

- **ChromaDB** (`~/.thoth/chroma/`): Vector storage for semantic search
  - Stores embeddings for all indexed symbols
  - Enables natural language queries

- **NetworkX**: In-memory graph for fast relationship traversal
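
The role of content hashes in incremental updates can be sketched with the standard `sqlite3` and `hashlib` modules. The two-column `files` table, the choice of SHA-256, and the `needs_reindex` helper are assumptions made for this example; the actual schema in `~/.thoth/index.db` is likely richer.

```python
import hashlib
import sqlite3


def needs_reindex(conn: sqlite3.Connection, path: str, content: bytes) -> bool:
    """Return True (and record the new hash) if the file content changed."""
    digest = hashlib.sha256(content).hexdigest()
    row = conn.execute(
        "SELECT content_hash FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == digest:
        return False  # unchanged since the last index run, skip parsing
    conn.execute(
        "INSERT OR REPLACE INTO files (path, content_hash) VALUES (?, ?)",
        (path, digest),
    )
    conn.commit()
    return True


# In-memory database with a simplified, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, content_hash TEXT NOT NULL)")

print(needs_reindex(conn, "demo.py", b"x = 1\n"))  # first sighting of the file
print(needs_reindex(conn, "demo.py", b"x = 1\n"))  # same bytes again
```

The first call reports that the file needs indexing and the second reports that it does not, which is what lets a re-index of a large repository touch only the files that changed.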

### Embedding Model

Semantic search uses **Qwen3-Embedding-0.6B** via vLLM:
- Lightweight (600M parameters, ~1.2GB on disk)
- Code-aware embeddings with instruction support
- Fast inference with GPU acceleration (optional)
- Falls back to TF-IDF when vLLM is unavailable
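
The TF-IDF fallback can be illustrated as follows. This is a standard-library sketch, not Thoth's code: the package declares scikit-learn as a dependency and may well use its `TfidfVectorizer` instead, and `pick_backend`, `tfidf_rank`, and the sample symbols are all hypothetical. It also assumes the fallback is chosen by checking whether `vllm` is importable.

```python
import importlib.util
import math
import re
from collections import Counter


def pick_backend() -> str:
    """Choose the embedding backend; assumes an import check is sufficient."""
    return "vllm" if importlib.util.find_spec("vllm") is not None else "tfidf"


def tokenize(text: str) -> list[str]:
    # Split identifiers such as send_update or sendUpdate into lowercase words.
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[A-Za-z]+", spaced)]


def tfidf_rank(query: str, symbols: dict[str, str], top_k: int = 3) -> list[tuple[str, float]]:
    """Rank symbols against the query by cosine similarity of TF-IDF vectors."""
    tokens = {name: tokenize(f"{name} {doc}") for name, doc in symbols.items()}
    n_docs = len(tokens)
    doc_freq = Counter(term for toks in tokens.values() for term in set(toks))
    # Smoothed IDF so terms present in every document keep a nonzero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    def vectorize(toks: list[str]) -> dict[str, float]:
        counts = Counter(t for t in toks if t in idf)
        return {t: c * idf[t] for t, c in counts.items()}

    def cosine(a: dict[str, float], b: dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
        return dot / norm if norm else 0.0

    query_vec = vectorize(tokenize(query))
    scored = [(name, cosine(query_vec, vectorize(toks))) for name, toks in tokens.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Hypothetical symbols and docstrings, for illustration only.
SYMBOLS = {
    "authenticate_user": "Check a user's password and create a login session",
    "send_update": "Push a dashboard update to subscribed websocket clients",
    "parse_config": "Load YAML configuration from disk",
}

print(pick_backend())
print(tfidf_rank("user login", SYMBOLS))
```

Unlike embeddings, TF-IDF only matches on shared words, so a query such as "sign in" would not find `authenticate_user` here. That trade-off is the reason to treat it as a fallback rather than an equivalent.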

## Performance

- **Indexing**: ~10K symbols/minute
- **Semantic Search**: <100ms for typical queries
- **Memory**: ~2GB for model + ~100MB per 100K symbols
- **Accuracy**: 0.7-0.9 relevance scores for code search
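
For capacity planning, the approximate figures above can be turned into a back-of-the-envelope estimate. The function below is only arithmetic on those numbers (treating 2GB as 2,000MB); `estimate_resources` is a hypothetical helper and real usage will vary with hardware and repository contents.

```python
def estimate_resources(symbol_count: int) -> dict:
    """Rough estimate from the approximate figures in the list above."""
    # ~10K symbols indexed per minute
    indexing_minutes = symbol_count / 10_000
    # ~2GB for the model plus ~100MB per 100K symbols
    memory_mb = 2_000 + 100 * symbol_count / 100_000
    return {"indexing_minutes": indexing_minutes, "memory_mb": memory_mb}


print(estimate_resources(250_000))
# → {'indexing_minutes': 25.0, 'memory_mb': 2250.0}
```

So a 250K-symbol codebase would take on the order of 25 minutes to index and about 2.25GB of memory, most of which is the model itself.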

## Advanced Usage

### Pre-indexing Large Repositories
For large monorepos, index from the command line before adding Thoth to Claude:
```bash
thoth-cli index myrepo /path/to/large-repo
```

### Using Redis Cache (Optional)
For improved performance with multiple users:
```bash
# Install with Redis support
uv tool install --python 3.12 "mcp-server-thoth[cache]"

# Requires a Redis server running locally
```

### Dashboard (Coming Soon)
A separate `thoth-dashboard` package will provide:
- Web UI for exploring indexed code
- Interactive dependency graphs
- Real-time search interface

## Development

```bash
git clone https://github.com/braininahat/thoth
cd thoth
uv pip install -e ".[dev]"

# Run tests
pytest

# Type checking
mypy thoth
```

## Token Efficiency

Thoth dramatically reduces the tokens needed for code navigation (roughly 25x by the estimates below):

- **Without Thoth**: Multiple searches + reading entire files = ~50K tokens
- **With Thoth**: Semantic search + precise results = ~2K tokens

Example:
```
User: "How does the dashboard update in real-time?"

Without Thoth:
- grep "dashboard" → 50 results
- grep "update" → 200 results
- Read 10+ files to understand

With Thoth semantic search:
- Returns: WebSocketHandler.send_update(), Dashboard.subscribe_to_changes(), etc.
- Ranked by relevance
```

## Troubleshooting

### Python Version Issues
If you see errors about `xformers` or build failures, the server is probably running on an unsupported Python version. Pin Python 3.12 explicitly:
```bash
# Ensure Python 3.12 is used
uvx --python 3.12 mcp-server-thoth
```

### GPU Memory
For systems with limited GPU memory:
- Embeddings are automatically moved to CPU after computation
- Set `CUDA_VISIBLE_DEVICES=-1` to force CPU-only mode

### Model Download
The first run downloads the embedding model (~1.2GB). Subsequent runs use the cached model.

## License

MIT

## Contributing

Contributions are welcome! Please check the [issues](https://github.com/braininahat/thoth/issues) page before opening a new one.

## Acknowledgments

- [MCP](https://modelcontextprotocol.io/) by Anthropic
- [vLLM](https://github.com/vllm-project/vllm) for fast inference
- [Qwen](https://github.com/QwenLM/Qwen) for lightweight embeddings