Metadata-Version: 2.4
Name: mcp-code-index
Version: 1.0.4
Summary: SQLite-backed code index for Claude Code, exposed via MCP
Project-URL: Homepage, https://github.com/achreftlili/code-index
Project-URL: Repository, https://github.com/achreftlili/code-index
Project-URL: Issues, https://github.com/achreftlili/code-index/issues
Author: Achref Tlili
License-Expression: MIT
License-File: LICENSE
Requires-Python: >=3.10
Requires-Dist: click>=8.1.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: numpy<2,>=1.26.0; sys_platform == 'darwin' and platform_machine == 'x86_64'
Requires-Dist: numpy>=1.26.0; sys_platform != 'darwin' or platform_machine != 'x86_64'
Requires-Dist: pathspec>=0.12.0
Requires-Dist: sentence-transformers<5,>=3.0
Requires-Dist: sqlite-vec>=0.1.0
Requires-Dist: tree-sitter-go>=0.23.0
Requires-Dist: tree-sitter-python>=0.23.0
Requires-Dist: tree-sitter-rust>=0.23.0
Requires-Dist: tree-sitter-typescript>=0.23.0
Requires-Dist: tree-sitter>=0.23.0
Requires-Dist: watchdog>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.5.0; extra == 'dev'
Description-Content-Type: text/markdown

# code-index

<!-- mcp-name: io.github.achreftlili/code-index -->

A local, SQLite-backed code index for Claude Code, exposed over MCP. It
replaces blind `Read` / `Grep` / `Glob` exploration with targeted retrieval —
"where is `parseAuthToken` defined", "what calls `Indexer.reindex_all`", "find
the rate-limiting code" — answered in milliseconds against an offline index.

**No API keys. No external services. The embedder runs locally on your machine.**

## How it works (30-second tour)

1. **Parse** your repo with tree-sitter (Python, TypeScript/JavaScript, Go, Rust).
2. **Chunk** code per symbol and expand identifiers (`getUserAuthToken` → `get user auth token`) so search matches both styles.
3. **Embed** each chunk locally with `jina-embeddings-v2-base-code` (768-dim) via sentence-transformers.
4. **Store** symbols, chunks, vectors, and call/import edges in `.claude/index.db` (SQLite + sqlite-vec + FTS5).
5. **Serve** 14 retrieval tools + 1 admin tool over MCP (see [Tools](#tools)).
6. **Stay fresh** via an optional `PostToolUse` hook that incrementally re-indexes touched files.

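Step 2's identifier expansion can be sketched in a few lines. This is a minimal illustration, not the project's `chunker.py`: the function name `expand_identifier` and the exact splitting rules are assumptions.

```python
import re

# Hypothetical splitting rule: break between a lowercase letter or digit and an
# uppercase letter (fooBar), and between an acronym and the capitalized word
# that follows it (HTTPServer -> HTTP, Server).
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def expand_identifier(name: str) -> str:
    """Split camelCase, PascalCase, and snake_case names into lowercase words."""
    words = []
    # First split on underscores and any other non-word characters (e.g. dots).
    for part in re.split(r"[_\W]+", name):
        if part:
            # Then split each piece at its case boundaries.
            words.extend(_CAMEL_BOUNDARY.split(part))
    return " ".join(word.lower() for word in words if word)


print(expand_identifier("getUserAuthToken"))  # get user auth token
```

Indexing both the original identifier and its expanded form is what would let a query such as "user auth token" match a symbol named `getUserAuthToken`.
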
## Tools

### Retrieval

| Tool                | Purpose                                                                                                |
| ------------------- | ------------------------------------------------------------------------------------------------------ |
| `code_search`       | Hybrid (vector + FTS) search for **conceptual** queries (e.g., "auth flow", "where do we parse JSON"). |
| `symbol_lookup`     | Exact-name lookup of functions / classes / methods / types. Prefer over `code_search` for identifiers. |
| `file_outline`      | Symbols (with signatures) in a file, in source order. Use instead of `Read` when you only need shape.  |
| `module_outline`    | Symbols across a directory subtree in one call. Use instead of looping `file_outline`.                 |
| `where_am_i`        | Given `path` + `line`, returns the innermost symbol and the full enclosing chain.                      |
| `get_symbol_body`   | Full chunk for a `symbol_id` from `symbol_lookup` / `code_search` / `file_outline`.                    |
| `get_symbol_bodies` | Batch version of `get_symbol_body` (up to 20 ids per call).                                            |
| `callers`           | Symbols that CALL the given symbol. `depth` (1-5) expands transitively.                                |
| `callees`           | Symbols that the given symbol CALLS. `depth` (1-5) expands transitively.                               |
| `references`        | Non-call uses (subclasses, free identifier references). Companion to `callers` / `callees`.            |
| `trace`             | Build a call-graph tree from an entry symbol; `flat=true` returns nodes/edges for cheap LLM scans.     |
| `file_imports`      | Files this file imports (`direction=imports`) or that import it (`direction=imported_by`).             |
| `recent_changes`    | Files touched in the last N git commits.                                                               |
| `propose_rename`    | v1: same-file rename. Returns an edit list the agent applies via its own `Edit` tool; refuses on clash. |

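To make the `depth` parameter of `callers` concrete, here is a rough sketch of a bounded breadth-first walk over call edges. The edge representation (a list of `(caller, callee)` name pairs) and the function name `transitive_callers` are assumptions for illustration; the real tool works on `symbol_id`s stored in the index.

```python
from collections import deque


def transitive_callers(call_edges, target, depth=1):
    """Return {caller: distance} for every caller within `depth` hops of target.

    `call_edges` is an iterable of (caller, callee) pairs (hypothetical shape).
    """
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")

    # Reverse adjacency: callee -> set of direct callers.
    callers_of = {}
    for caller, callee in call_edges:
        callers_of.setdefault(callee, set()).add(caller)

    found = {}
    queue = deque([(target, 0)])
    while queue:
        symbol, dist = queue.popleft()
        if dist == depth:
            continue  # do not expand past the requested depth
        for caller in sorted(callers_of.get(symbol, ())):
            # Skip symbols already seen so that call cycles terminate.
            if caller not in found and caller != target:
                found[caller] = dist + 1
                queue.append((caller, dist + 1))
    return found


edges = [
    ("handle_request", "parse_auth_token"),
    ("main", "handle_request"),
    ("cli", "main"),
]
print(transitive_callers(edges, "parse_auth_token", depth=2))
# {'handle_request': 1, 'main': 2}
```

`callees` would be the same walk over the forward adjacency (caller to callees) instead of the reversed one.
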
### Admin

| Tool / op                  | Purpose                                                                                          |
| -------------------------- | ------------------------------------------------------------------------------------------------ |
| `admin op=init`            | Build or refresh the index. Incremental by default; `force=true` rebuilds from scratch.          |
| `admin op=setup_check`     | Diagnose hook wiring + embedder + host. Round-trip-tests the hook end-to-end.                    |
| `admin op=install_hook`    | Wire the auto-reindex `PostToolUse` hook into `.claude/settings.json`. Idempotent.               |
| `admin op=stats`           | Read-only: file counts by language, symbol totals, embed model fingerprint, last-index time.     |
| `admin op=verify`          | Integrity sweep: orphan rows, parse-failure files, dangling edges.                               |

`embed_query_debug` is a dev-only ranking diagnostic, hidden from `list_tools`
unless `CODE_INDEX_DEBUG=1` is set.

All tools return bounded JSON. Large bodies are fetched on demand with
`get_symbol_body` rather than inlined as whole files.

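As an illustration of what `where_am_i` returns, the sketch below finds the enclosing chain for a line from a list of symbol ranges. The `Symbol` fields and the function name `enclosing_chain` are hypothetical, and the sketch assumes symbol ranges are properly nested; the real tool's response shape may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    # Hypothetical record: a name plus an inclusive 1-based line range.
    name: str
    start_line: int
    end_line: int


def enclosing_chain(symbols, line):
    """Return the symbols containing `line`, outermost first, innermost last."""
    containing = [s for s in symbols if s.start_line <= line <= s.end_line]
    # With properly nested ranges, a wider span always encloses a narrower one.
    containing.sort(key=lambda s: s.end_line - s.start_line, reverse=True)
    return containing


symbols = [
    Symbol("Indexer", 1, 50),
    Symbol("Indexer.reindex_all", 10, 30),
    Symbol("helper", 60, 70),
]
chain = enclosing_chain(symbols, 12)
print([s.name for s in chain])  # ['Indexer', 'Indexer.reindex_all']
```
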
## Requirements

- **Python 3.10+** with **loadable SQLite extension support** (required by `sqlite-vec`).
  - Python 3.13 has this enabled by default.
  - On 3.10–3.12, install via the python.org installer **or** via pyenv with
    `PYTHON_CONFIGURE_OPTS=--enable-loadable-sqlite-extensions pyenv install 3.12.x`.
  - Homebrew Python often ships **without** the extension hook — use one of the
    two methods above instead.
- **`uv` / `uvx`** ([install](https://docs.astral.sh/uv/getting-started/installation/)) — recommended runner. Or `pip` if you prefer a permanent install.
- **~600 MB free disk** for the embedding model on first init.

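If you are unsure whether your interpreter meets the first requirement, a quick check along these lines can help. It only uses the standard-library `sqlite3` module; the function name is an assumption, and it tests the extension hook itself rather than loading `sqlite-vec`.

```python
import sqlite3


def supports_loadable_extensions() -> bool:
    """Return True if this Python build can load SQLite extensions."""
    conn = sqlite3.connect(":memory:")
    try:
        # Builds compiled without extension support lack this method entirely.
        if not hasattr(conn, "enable_load_extension"):
            return False
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except (sqlite3.Error, AttributeError):
        return False
    finally:
        conn.close()


print("loadable extensions:", supports_loadable_extensions())
```

Run it with the same interpreter that will launch the server; a `False` result points at the python.org or pyenv options listed above.
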
## Quick start (Claude Code)

One command, no API keys:

```bash
claude mcp add-json -s user code-index "$(cat <<'JSON'
{
  "type": "stdio",
  "command": "uvx",
  "args": ["--refresh", "--from", "mcp-code-index", "code-index-mcp"]
}
JSON
)"
```

Then open Claude Code in any repo and ask:

> _"Build the code index for this repo."_

Claude calls the `admin` tool with `op=init`, which writes `.claude/index.db`.
From then on, ask things like _"where is `parseAuthToken` defined?"_ or _"what
calls `Indexer.reindex_all`?"_ and Claude routes them through `symbol_lookup` /
`callers` / `code_search` instead of grepping.

> **What `--refresh` does** — fetches the latest PyPI release on every Claude
> Code launch. Convenient during preview; drop it once you want to pin a
> version (saves ~1s of startup).
>
> **Project-only install**: drop `-s user` to register the server for the
> current project only (Claude Code's default `local` scope) instead of for every project.
>
> **First-run model download** — the first `init` pulls
> `jina-embeddings-v2-base-code` (~600 MB) into `~/.cache/huggingface` and
> caches it forever. Subsequent runs are fully offline. If your network
> blocks Hugging Face, pre-warm the cache from a machine that has access.
>
> **Already installed without `--refresh`?** Run `claude mcp remove code-index`
> first, then re-run the command above.

### Alternative: permanent install (no uvx)

```bash
pip install mcp-code-index
claude mcp add -s user code-index -- code-index-mcp
```

### Optional: keep the index live as you edit

Without a hook, the index drifts as soon as files change and stays stale until
you call `init` again. With one, every `Edit` / `Write` / `MultiEdit` Claude
performs triggers an incremental reindex of the touched file. Changes made
outside the agent (`mv`, `git checkout`, IDE saves) are not seen by the hook
and still need an `init`.

**Easiest path: ask Claude.** On first use in a new project, ask _"set up the
code-index"_ — Claude calls `setup_check` → `install_hook` → `init`. The hook
command is derived from how the MCP server was launched (uvx-aware), so it
uses the same Python toolchain. Hook output goes to `.claude/code-index-hook.log`
so failures are debuggable.

**Manual install** — add this block to the project's `.claude/settings.json`
under `hooks.PostToolUse` (the version you want depends on how you launch the
server — `install_hook` derives the right one for you):

```json
{
  "matcher": "Edit|Write|MultiEdit",
  "hooks": [
    {
      "type": "command",
      "command": "uvx --with 'sentence-transformers<5' --with 'numpy<2' --from mcp-code-index code-index-hook"
    }
  ]
}
```
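
For scripted setups, the manual step can be approximated with a small merge function that adds the block only once. This is a sketch of the idea behind an idempotent install, not the code `install_hook` runs: the function name is hypothetical, and the command string is the uvx variant shown above.

```python
import copy

# The uvx-based hook command from the block above; adjust for a pip install.
HOOK_COMMAND = (
    "uvx --with 'sentence-transformers<5' --with 'numpy<2' "
    "--from mcp-code-index code-index-hook"
)


def add_post_tool_use_hook(settings: dict, command: str = HOOK_COMMAND) -> dict:
    """Return a copy of `settings` with the hook block present exactly once."""
    merged = copy.deepcopy(settings)
    entries = merged.setdefault("hooks", {}).setdefault("PostToolUse", [])
    for entry in entries:
        if any(h.get("command") == command for h in entry.get("hooks", [])):
            return merged  # already wired, nothing to add
    entries.append(
        {
            "matcher": "Edit|Write|MultiEdit",
            "hooks": [{"type": "command", "command": command}],
        }
    )
    return merged


once = add_post_tool_use_hook({"env": {"EXAMPLE": "1"}})
twice = add_post_tool_use_hook(once)
print(once == twice)  # True
```

To apply it, load `.claude/settings.json` with `json.load`, pass the resulting dict through the function, and write it back with `json.dump`.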

### In other MCP-compatible agents

The server speaks standard MCP over stdio, so any client that supports MCP
servers works (Cursor, Continue, Cody, Zed, etc.). Configure the client to
launch `uvx --refresh --from mcp-code-index code-index-mcp` (or
`code-index-mcp` after `pip install mcp-code-index`). Once connected, call the
`admin` tool with `op=init` from inside the client to bootstrap the index. Drop
`--refresh` when you want to pin a version instead of always pulling the latest.

### From source (development)

```bash
git clone https://github.com/achreftlili/code-index
cd code-index
pip install -e .
code-index init        # CLI alternative to the `init` MCP tool
code-index-mcp         # starts the MCP server on stdio (for manual wiring)
```

## Configuration

All settings are optional — the defaults work out of the box. Override them via
environment variables. Inside Claude Code, set them in the `env` block of your
`code-index` server entry in `~/.claude.json` (then reconnect the MCP server).

**Common knobs (most users only ever touch these):**

| Var | Default | When to set it |
|---|---|---|
| `CODE_INDEX_EMBED_DEVICE` | _auto_ | Force the torch device: `cpu`, `mps`, or `cuda`. Set `cpu` on Apple Silicon if `init` fails with **MPS out-of-memory**. |
| `CODE_INDEX_EMBED_BATCH` | `32`   | Encode batch size. Lower (e.g. `8` or `4`) to cut peak GPU memory while staying on `mps`/`cuda`. |
| `CODE_INDEX_DB`          | `.claude/index.db` | Override the SQLite index path (e.g. to share an index across sibling worktrees). |

**Advanced (rarely needed):**

| Var | Default | Notes |
|---|---|---|
| `CODE_INDEX_EMBEDDER`    | `jina` | Only `jina` (local sentence-transformers) is supported today; the variable exists for future expansion. |
| `CODE_INDEX_EMBED_MODEL` | `jinaai/jina-embeddings-v2-base-code` | Hugging Face model id. Only override with a model whose output dimension matches `CODE_INDEX_EMBED_DIM` (768 by default). |
| `CODE_INDEX_EMBED_DIM`   | `768` | Must match the embedding model's output dimension. |

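The tables above amount to a handful of environment lookups with defaults. The sketch below shows one way such settings could be resolved; the `EmbedConfig` class and `load_config` function are hypothetical names, and only the variable names and default values come from the tables.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EmbedConfig:
    device: Optional[str]  # None stands for the "auto" default
    batch_size: int
    db_path: str
    model: str
    dim: int


def load_config(env: Mapping[str, str] = os.environ) -> EmbedConfig:
    """Resolve settings from the environment, falling back to the documented defaults."""
    device = env.get("CODE_INDEX_EMBED_DEVICE") or None
    if device is not None and device not in {"cpu", "mps", "cuda"}:
        raise ValueError(f"unsupported CODE_INDEX_EMBED_DEVICE: {device!r}")
    return EmbedConfig(
        device=device,
        batch_size=int(env.get("CODE_INDEX_EMBED_BATCH", "32")),
        db_path=env.get("CODE_INDEX_DB", ".claude/index.db"),
        model=env.get("CODE_INDEX_EMBED_MODEL", "jinaai/jina-embeddings-v2-base-code"),
        dim=int(env.get("CODE_INDEX_EMBED_DIM", "768")),
    )


# Example: force CPU and a smaller batch, as suggested for MPS out-of-memory errors.
print(load_config({"CODE_INDEX_EMBED_DEVICE": "cpu", "CODE_INDEX_EMBED_BATCH": "8"}))
```
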
## Troubleshooting

**`init` fails with `MPS backend out of memory` on Apple Silicon.** A large
file produced a chunk batch that needs more memory than the GPU has free. The
quickest fix is to re-run on CPU (slower, but not subject to the MPS memory limit):

```json
"env": {
  "CODE_INDEX_EMBED_DEVICE": "cpu"
}
```

To stay on the GPU, shrink the batch instead: `"CODE_INDEX_EMBED_BATCH": "8"`.
Reconnect the MCP server (`/mcp` → reconnect, or restart Claude Code) so the
new env takes effect. `init` is incremental — already-embedded files are
skipped on the retry.

**`init` fails with a Hugging Face network error on first run.** Your network
is blocking model downloads. Pre-warm the cache on a machine that has access:

```bash
huggingface-cli download jinaai/jina-embeddings-v2-base-code
# then copy ~/.cache/huggingface/ to the offline machine
```

**`sqlite3.OperationalError: not authorized` or `sqlite-vec` fails to load.**
Your Python build doesn't have loadable SQLite extensions. See
[Requirements](#requirements) — install via python.org or a pyenv build with
`PYTHON_CONFIGURE_OPTS=--enable-loadable-sqlite-extensions`.

**`code_search` / `symbol_lookup` returns stale paths after a refactor or
branch checkout.** The auto-reindex hook only fires on Claude's `Edit` /
`Write` / `MultiEdit`. After bulk file moves outside the agent (`mv`,
`git checkout`, IDE rename), re-run `init` (it's incremental). Or wire up the
[hook](#optional-keep-the-index-live-as-you-edit) so the index keeps up with
agent edits automatically.

## Layout

```
src/code_index/
  db.py           SQLite schema, connection, sqlite-vec loading
  parser.py       Tree-sitter wrapper, symbol + edge extraction
  imports.py      Per-language import target → file path resolution
  chunker.py      Per-symbol chunks, identifier expansion
  embedder.py     Local Jina (sentence-transformers) backend
  indexer.py      Pipeline: walk → parse → chunk → embed → write
  reindexer.py    Per-root engine cache; one entry point for "reindex one file"
  retriever.py    Hybrid search (vector + FTS5) with RRF
  watcher.py      File watcher (watchdog)
  admin.py        setup_check / install_hook / init logic (pure, no MCP state)
  mcp_server.py   MCP wiring, shared helpers, schema fragments
  tool_registry.py  Shared `@_tool` decorator + `_TOOLS` registry
  tools/          Per-domain MCP handlers (graph, paths, refactor, …)
  hook.py         `code-index-hook` console script — the PostToolUse entry point
  cli.py          init / reindex / watch / stats
```
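
The layout lists `retriever.py` as hybrid search with RRF (reciprocal rank fusion). As a rough illustration of that technique, the sketch below merges a vector ranking and an FTS ranking by summing `1 / (k + rank)` per result. The function name and the constant `k=60` (a commonly used default) are assumptions, not values taken from the project.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first.

    Each id scores the sum of 1 / (k + rank) over the lists it appears in,
    with rank starting at 1. `k` here is an assumed, commonly used constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


vector_hits = ["a", "b", "c"]  # hypothetical ids from the vector search
fts_hits = ["b", "d", "a"]     # hypothetical ids from the FTS5 search
print(reciprocal_rank_fusion([vector_hits, fts_hits]))  # ['b', 'a', 'd', 'c']
```

Because only ranks are used, the two searches do not need comparable score scales, which is the usual reason to pick RRF for combining vector and keyword results.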
