Metadata-Version: 2.4
Name: chunkhound
Version: 4.0.1
Summary: Local-first semantic code search with vector and regex capabilities for AI assistants via MCP
Project-URL: Homepage, https://github.com/chunkhound/chunkhound
Project-URL: Repository, https://github.com/chunkhound/chunkhound
Project-URL: Issues, https://github.com/chunkhound/chunkhound/issues
Project-URL: Documentation, https://github.com/chunkhound/chunkhound#readme
Project-URL: Changelog, https://github.com/chunkhound/chunkhound/releases
Author: Ofri Wolfus
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP :: Indexing/Search
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: <3.14,>=3.10
Requires-Dist: aiohttp>=3.12.15
Requires-Dist: duckdb>=1.4.0
Requires-Dist: fastmcp>=2.0.0
Requires-Dist: httpx>=0.28.1
Requires-Dist: lancedb>=0.25.2
Requires-Dist: loguru>=0.6.0
Requires-Dist: mcp>=1.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: openai>=1.0.0
Requires-Dist: pandas>=2.3.0
Requires-Dist: pathspec>=0.12.1
Requires-Dist: psutil>=5.8.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.11.0
Requires-Dist: pygit2>=1.12.0
Requires-Dist: pylance>=0.31.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rapidyaml>=0.10.0
Requires-Dist: readchar>=4.2.1
Requires-Dist: rich>=13.0.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: tiktoken>=0.9.0
Requires-Dist: tree-sitter-bash>=0.25.0
Requires-Dist: tree-sitter-c>=0.24.1
Requires-Dist: tree-sitter-cpp>=0.23.4
Requires-Dist: tree-sitter-go>=0.25.0
Requires-Dist: tree-sitter-groovy>=0.1.2
Requires-Dist: tree-sitter-haskell>=0.23.1
Requires-Dist: tree-sitter-java>=0.23.5
Requires-Dist: tree-sitter-javascript>=0.25.0
Requires-Dist: tree-sitter-json>=0.24.8
Requires-Dist: tree-sitter-kotlin>=1.1.0
Requires-Dist: tree-sitter-language-pack>=0.7.3
Requires-Dist: tree-sitter-make>=0.1.0
Requires-Dist: tree-sitter-markdown>=0.5.1
Requires-Dist: tree-sitter-php>=0.24.0
Requires-Dist: tree-sitter-python>=0.25.0
Requires-Dist: tree-sitter-rust>=0.24.0
Requires-Dist: tree-sitter-toml>=0.7.0
Requires-Dist: tree-sitter-typescript>=0.23.2
Requires-Dist: tree-sitter-zig>=1.1.0
Requires-Dist: tree-sitter>=0.25.0
Requires-Dist: voyageai>=0.2.0
Requires-Dist: watchdog>=4.0.0
Requires-Dist: xxhash>=3.0.0
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.6.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: test
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'test'
Requires-Dist: pytest-timeout>=2.1.0; extra == 'test'
Requires-Dist: pytest>=7.4.0; extra == 'test'
Description-Content-Type: text/markdown

<p align="center">
  <a href="https://chunkhound.github.io">
    <picture>
      <source media="(prefers-color-scheme: dark)" srcset="public/wordmark-centered-dark.svg">
      <img src="public/wordmark-centered.svg" alt="ChunkHound" width="400">
    </picture>
  </a>
</p>

<p align="center">
  <strong>Deep Research for Code & Files</strong>
</p>

<p align="center">
  <a href="https://github.com/chunkhound/chunkhound/actions/workflows/smoke-tests.yml"><img src="https://github.com/chunkhound/chunkhound/actions/workflows/smoke-tests.yml/badge.svg" alt="Tests"></a>
  <a href="https://opensource.org/licenses/MIT"><img src="https://img.shields.io/badge/license-MIT-blue.svg" alt="License: MIT"></a>
  <img src="https://img.shields.io/badge/100%25%20AI-Generated-ff69b4.svg" alt="100% AI Generated">
  <a href="https://discord.gg/BAepHEXXnX"><img src="https://img.shields.io/badge/Discord-Join_Community-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
</p>

Transform your codebase into a searchable knowledge base for AI assistants using [semantic search via cAST algorithm](https://arxiv.org/pdf/2506.15655) and regex search. Integrates with AI assistants via the [Model Context Protocol (MCP)](https://spec.modelcontextprotocol.io/).

## Features

- **[cAST Algorithm](https://arxiv.org/pdf/2506.15655)** - Research-backed semantic code chunking
- **[Multi-Hop Semantic Search](https://chunkhound.github.io/under-the-hood/#multi-hop-semantic-search)** - Discovers interconnected code relationships beyond direct matches
- **Semantic search** - Natural language queries like "find authentication code"
- **Regex search** - Pattern matching without API keys
- **Local-first** - Your code stays on your machine
- **29 languages** with structured parsing
  - **Programming** (via [Tree-sitter](https://tree-sitter.github.io/tree-sitter/)): Python, JavaScript, TypeScript, JSX, TSX, Java, Kotlin, Groovy, C, C++, C#, Go, Rust, Haskell, Swift, Bash, MATLAB, Makefile, Objective-C, PHP, Vue, Zig
  - **Configuration** (via Tree-sitter): JSON, YAML, TOML, HCL, Markdown
  - **Text-based** (custom parsers): Text files, PDF
- **[MCP integration](https://spec.modelcontextprotocol.io/)** - Works with Claude, VS Code, Cursor, Windsurf, Zed, etc

## Documentation

**Visit [chunkhound.github.io](https://chunkhound.github.io) for complete guides:**
- [Tutorial](https://chunkhound.github.io/tutorial/)
- [Configuration Guide](https://chunkhound.github.io/configuration/)
- [Architecture Deep Dive](https://chunkhound.github.io/under-the-hood/)

## Requirements

- Python 3.10+
- [uv package manager](https://docs.astral.sh/uv/)
- API key for semantic search (optional - regex search works without any keys)
  - [OpenAI](https://platform.openai.com/api-keys) | [VoyageAI](https://dash.voyageai.com/) | [Local with Ollama](https://ollama.ai/)

## Installation

```bash
# Install uv if needed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install ChunkHound
uv tool install chunkhound
```

## Quick Start

1. Create `.chunkhound.json` in project root file
```json
{
  "embedding": {
    "provider": "openai",
    "api_key": "your-api-key-here"
  }
}
```
2. Index your codebase
```bash
chunkhound index
```

**For configuration, IDE setup, and advanced usage, see the [documentation](https://chunkhound.github.io).**

## YAML Parsing Benchmarks

Use the reproducible benchmark harness to compare PyYAML, tree-sitter/cAST, and RapidYAML bindings on representative YAML workloads.

```bash
# Default synthetic cases with all available backends
uv run python scripts/bench_yaml.py

# Use your own fixtures or disable specific backends
uv run python scripts/bench_yaml.py \
  --cases-dir ./benchmarks/yaml \
  --backends pyyaml_safe_load tree_sitter_universal \
  --iterations 10
```

## Real-Time Indexing

**Automatic File Watching**: MCP servers monitor your codebase and update the index automatically as you edit files. No manual re-indexing required.

**Smart Content Diffs**: Only changed code chunks get re-processed. Unchanged chunks keep their existing embeddings, making updates efficient even for large codebases.

**Seamless Branch Switching**: When you switch git branches, ChunkHound automatically detects and re-indexes only the files that actually changed between branches.

**Live Memory Systems**: Index markdown notes or documentation that updates in real-time while you work, creating a dynamic knowledge base.

## Why ChunkHound?

**Research Foundation**: Built on the [cAST (Chunking via Abstract Syntax Trees)](https://arxiv.org/pdf/2506.15655) algorithm from Carnegie Mellon University, providing:
- **4.3 point gain** in Recall@5 on RepoEval retrieval
- **2.67 point gain** in Pass@1 on SWE-bench generation
- **Structure-aware chunking** that preserves code meaning

**Local-First Architecture**:
- Your code never leaves your machine
- Works offline with [Ollama](https://ollama.ai/) local models
- No per-token charges for large codebases

**Universal Language Support**:
- Structured parsing for 29 languages (Tree-sitter + custom parsers)
- Same semantic concepts across all programming languages

**Intelligent Code Discovery**:
- Multi-hop search follows semantic relationships to find related implementations
- Automatically discovers complete feature patterns: find "authentication" to get password hashing, token validation, session management
- Convergence detection prevents semantic drift while maximizing discovery

## License

MIT

## Startup profile (discovery diagnostics)

Use `--profile-startup` to emit a JSON block with discovery and startup timing diagnostics to stderr. This works for both simulate and full index runs.

Examples:

```bash
# Simulate (discovery only) — file list on stdout, JSON profile on stderr
CHUNKHOUND_NO_RICH=1 \
chunkhound index --simulate . --sort path --profile-startup 2>profile.json

# Full run (no embeddings) — JSON profile on stderr
CHUNKHOUND_NO_RICH=1 \
chunkhound index . --no-embeddings --profile-startup 2>profile.json
```

Fields in `startup_profile` (JSON):

- `discovery_ms` — discovery time in milliseconds
- `cleanup_ms` — orphan cleanup time in milliseconds
- `change_scan_ms` — change-scan time in milliseconds
- `resolved_backend` — discovery backend actually used: `python | git | git_only`
- `resolved_reasons` — reasons for the decision (e.g., `no_repos`, `all_repos`, `mixed`, `explicit`)
- `git_rows_tracked` — number of paths from `git ls-files` (tracked)
- `git_rows_others` — number of paths from `git ls-files --others --exclude-standard`
- `git_rows_total` — sum of the two above
- `git_pathspecs` — number of pathspecs (`:(glob) ...`) pushed down to Git for pre-filtering
  - CAP: set `CHUNKHOUND_INDEXING__GIT_PATHSPEC_CAP` (default: 128). If the number of synthesized specs would exceed the cap, ChunkHound falls back to a subtree-only pathspec to guarantee coverage. The profile reflects the actual `git_pathspecs` used; an optional `git_pathspecs_capped: true` may appear.

Notes:

- `git_*` counters appear only when the backend is `git` or `git_only`.
- In `auto` mode, the backend is chosen heuristically (`git_only` for all‑repo trees, `git` for mixed trees, `python` when no repos are found).
- For scripting, set `CHUNKHOUND_NO_RICH=1` and read stderr; each JSON block appears on its own line near the end of the run.

Example snippet:

```json
{
  "startup_profile": {
    "discovery_ms": 154.2,
    "cleanup_ms": 12.7,
    "change_scan_ms": 3.1,
    "resolved_backend": "git_only",
    "resolved_reasons": ["all_repos"],
    "git_rows_tracked": 420,
    "git_rows_others": 17,
    "git_rows_total": 437,
    "git_pathspecs": 4
  }
}
```

## Exclusions (gitignore, config, defaults)

ChunkHound combines repository–aware ignores with safe defaults. The behavior depends on how you set `indexing.exclude` in `.chunkhound.json`:

- Not set (default) → gitignore only
  - The `.gitignore` files inside repositories are honored (repo‑aware engine). Default ChunkHound excludes (e.g., `.git/`, `node_modules/`, `.chunkhound/`, caches) always apply to prevent self‑indexing and noise.
- String sentinel `.gitignore` → gitignore only
  - Same as the default: only `.gitignore` rules are used as the exclusion source (plus ChunkHound’s default excludes).
- Explicit list (array) → combined (gitignore + config) [default]
  - Your glob patterns in `indexing.exclude` are layered on top of `.gitignore` rather than replacing it. ChunkHound’s default excludes are also applied. This avoids surprising loss of `.gitignore` behavior when you accept prompts to add slow files to excludes.
  - To restore legacy behavior, set `indexing.exclude_mode: "config_only"`. To force only gitignore even when a list exists, set `indexing.exclude_mode: "gitignore_only"` (rare).

Workspace overlay for non‑repo paths (default: on)
- When the directory you index contains non‑repo subtrees, ChunkHound can apply the root workspace `.gitignore` only to those non‑repo paths. This is controlled by `indexing.workspace_gitignore_nonrepo` (default: `true`).
- Repository subtrees always use their own `.gitignore` and Git’s native semantics.

Examples

```jsonc
// Default: gitignore only (+ safe defaults)
{
  "indexing": {
    // exclude omitted
    "workspace_gitignore_nonrepo": true
  }
}

// Gitignore only (explicit sentinel)
{
  "indexing": {
    "exclude": ".gitignore",
    "workspace_gitignore_nonrepo": true
  }
}

// Explicit list layered ON TOP of gitignore (default)
{
  "indexing": {
    "exclude": ["**/dist/**", "**/*.min.js"],
    "workspace_gitignore_nonrepo": false
  }
}

// Legacy behavior: config only (gitignore ignored)
{
  "indexing": {
    "exclude": ["**/dist/**", "**/*.min.js"],
    "exclude_mode": "config_only"
  }
}

// Force gitignore-only even with a list (rare)
{
  "indexing": {
    "exclude": ["**/dist/**"],
    "exclude_mode": "gitignore_only"
  }
}

### Root semantics for config patterns

- Config `include` and `exclude` patterns are always evaluated relative to the ChunkHound root (the directory you pass to `chunkhound index`).
- Git’s own `.gitignore` patterns remain repo‑aware (anchored to their respective repository roots), but your config overlay applies uniformly from the CH root across all subtrees (including Git repos).
- Examples:
  - CH root is `/workspaces`; Git repo lives under `/workspaces/monorepo`. To exclude a file inside that repo using config, prefer a CH‑root‑relative path (e.g., `"**/monorepo/path/inside/repo/file.txt"`).
  - When using anchored includes like `"src/**/*.ts"`, ensure the anchor is correct from the CH root perspective (e.g., `"monorepo/src/**/*.ts"` when the repo is nested).
```

CLI toggle for the workspace overlay
- `--nonrepo-gitignore` enables the root `.gitignore` overlay for non‑repo paths for the current run.
- To disable overlay persistently, set `"workspace_gitignore_nonrepo": false` in `.chunkhound.json`.

## Simulate and diagnostics

Simulate a discovery run without writing to the database. Useful for verifying include/exclude rules, sorting, and sizes.

```bash
# List discovered files (sorted by path)
chunkhound index --simulate . --sort path

# Show sizes and sort by size (descending)
chunkhound index --simulate . --show-sizes --sort size_desc

# Emit JSON instead of plain text
chunkhound index --simulate . --json > files.json

# Add discovery timing/profile to stderr (JSON)
CHUNKHOUND_NO_RICH=1 chunkhound index --simulate . --profile-startup 2>profile.json

# Print debug info about ignores (to stderr): CH root, sources, first N defaults
chunkhound index --simulate . --debug-ignores
```

Diagnostics (ignore decisions):

```bash
# Compare ChunkHound’s ignore decision vs Git for the current tree
chunkhound index --check-ignores --vs git --json > ignore_diff.json
```

Notes:
- When piping simulate output to tools like `head`, BrokenPipe is handled gracefully; prefer `CHUNKHOUND_NO_RICH=1` for easy JSON parsing.
