Metadata-Version: 2.4
Name: emb-codescout
Version: 0.1.0
Summary: Semantic code search + LLM answers for your TypeScript/React codebase — runs entirely on your machine.
Project-URL: Homepage, https://github.com/nishntr/CodeScout
Project-URL: Repository, https://github.com/nishntr/CodeScout
Project-URL: Issues, https://github.com/nishntr/CodeScout/issues
Project-URL: Changelog, https://github.com/nishntr/CodeScout/releases
Author: emb-codescout contributors
License: MIT License
        
        Copyright (c) 2026 embeddify contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: code-search,developer-tools,embeddings,llm,mcp,rag,semantic-search,tree-sitter
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: faiss-cpu>=1.8
Requires-Dist: numpy>=1.24
Requires-Dist: pathspec>=0.12
Requires-Dist: requests>=2.31
Requires-Dist: rich>=13.0
Requires-Dist: sentence-transformers>=3.0
Requires-Dist: tree-sitter-typescript>=0.23
Requires-Dist: tree-sitter<0.26,>=0.22
Provides-Extra: dev
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Description-Content-Type: text/markdown

# codescout

**Semantic code search + LLM answers for your TypeScript/React codebase — runs entirely on your machine.**

[![PyPI version](https://img.shields.io/pypi/v/emb-codescout)](https://pypi.org/project/emb-codescout/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

[Quickstart](#quickstart) • [MCP Server](#mcp-server) • [CLI](#commands) • [Embedding models](#embedding-models) • [How it works](#how-it-works) • [Configuration](#configuration)

---

CodeScout indexes TypeScript and React codebases with tree-sitter AST parsing, generates local embeddings via sentence-transformers, and answers natural-language questions with an LLM — all with zero external infrastructure. No Docker, no database server, no GPU, no API key for embeddings.

Run it as an MCP server and any agent (GitHub Copilot, Claude Code, Cursor) can retrieve relevant, cited code chunks from your codebase before writing a single line of code.

---

## Quickstart

```bash
pip install emb-codescout
export OPENROUTER_API_KEY=sk-or-...   # free tier at openrouter.ai — only needed for ask
codescout ask "how does authentication work?"
```

On first run, codescout automatically initializes `.codescout/` in your project, downloads the embedding model (~90 MB, cached), indexes every file, then answers your question. No other setup needed.

---

## Main Features

- **AST-aware chunking** — parses TypeScript and TSX with tree-sitter; every function, component, hook, type, and interface becomes its own chunk with the right semantic label
- **Enriched embeddings** — each chunk is embedded with its natural-language description + imports + source, so queries match on meaning, not just keywords
- **Incremental indexing** — files are SHA256-hashed; only changed files are re-processed on subsequent runs
- **Fully local index** — embeddings are generated by sentence-transformers on CPU and stored in FAISS + SQLite inside `.codescout/`; indexing and search never leave your machine (only `codescout ask` sends the retrieved chunks to OpenRouter)
- **MCP server** — expose `search_codebase` and `index_status` as MCP tools; Copilot, Claude Code, and Cursor call them automatically
- **Direct LLM answers** — `codescout ask` retrieves relevant chunks and returns a cited, plain-English answer via [OpenRouter](https://openrouter.ai) (not just a list of snippets)
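
The incremental-indexing idea above can be sketched in a few lines of Python. This is a minimal illustration, not codescout's implementation: the helper names `sha256_of` and `changed_files` and the `known_hashes` mapping are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(paths: list[Path], known_hashes: dict[str, str]) -> list[Path]:
    """Return the files whose current digest differs from the stored one.

    `known_hashes` maps a path string to the digest recorded on the previous
    run (a hypothetical layout, used here only for illustration).
    New files have no stored digest, so they are always reported as changed.
    """
    return [p for p in paths if known_hashes.get(str(p)) != sha256_of(p)]
```

Only the paths returned by `changed_files` would need to be re-parsed and re-embedded; everything else keeps its existing vectors.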

---

## MCP Server

Run codescout as an MCP server so your agent searches your codebase directly — before writing or modifying code.

### Setup

```bash
pip install 'emb-codescout[mcp]'
codescout index          # index first
codescout mcp-init       # generates config + agent instructions (run once per project)
```

`mcp-init` creates:

| File | Purpose |
|---|---|
| `.vscode/mcp.json` | Tells VS Code how to launch the MCP server |
| `.github/copilot-instructions.md` | Instructs GitHub Copilot to call `search_codebase` before tasks |
| `CLAUDE.md` | Same instructions for Claude Code |

Reload VS Code after running `mcp-init`. The generated instruction files then direct the agent to call `search_codebase` before each task.

### Manual config

```json
{
  "servers": {
    "codescout": {
      "type": "stdio",
      "command": "codescout",
      "args": ["mcp-serve"]
    }
  }
}
```

### Available tools

| Tool | Description |
|---|---|
| `search_codebase(query, top_k?)` | Return the most relevant code chunks for a natural-language query |
| `index_status()` | Report how many files and chunks are indexed and when the last run was |
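
For reference, MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds the message an agent might send for `search_codebase`; the helper name `build_tool_call` is hypothetical, and the argument names are taken from the table above.

```python
import json


def build_tool_call(request_id: int, tool: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a single-line JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no newlines by default, which suits the
    # newline-delimited framing used by the stdio transport.
    return json.dumps(message)


request = build_tool_call(
    1,
    "search_codebase",
    {"query": "how does authentication work?", "top_k": 5},
)
print(request)
```

In practice your agent's MCP client builds these messages for you; the sketch is only meant to show what travels over stdio.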

---

## Commands

```bash
codescout init                          # Create .codescout/ config in current project
codescout index                         # Scan and embed (incremental by default)
codescout index --full                  # Force complete re-index from scratch
codescout index --verbose               # Show per-file chunk counts while indexing
codescout ask "your question"           # Semantic search + LLM answer
codescout ask "..." --show-context      # Also print retrieved code chunks
codescout ask "..." --top-k 10          # Retrieve more chunks (default: 5)
codescout ask "..." --model openai/gpt-4o   # Override LLM model
codescout status                        # Show index stats (files, chunks, size)
codescout config                        # View all config values
codescout config top_k 10               # Set a config value
codescout mcp-init                      # Generate MCP config + agent instruction files
codescout mcp-serve                     # Start MCP server (stdio)
```

---

## Embedding Models

Embeddings are generated **entirely locally** with [sentence-transformers](https://www.sbert.net/) — no API key, no internet after the first download. Models are cached at `~/.cache/huggingface/hub/`.

| Model | Config value | Dims | Size | Code quality |
|---|---|---|---|---|
| **all-MiniLM-L6-v2** *(default)* | `all-MiniLM-L6-v2` | 384 | ~90 MB | ⭐⭐⭐ Good starting point |
| **nomic CodeRankEmbed** *(recommended)* | `nomic-ai/CodeRankEmbed` | 768 | ~520 MB | ⭐⭐⭐⭐⭐ Best open-source for code |
| **Jina v2 Code** | `jinaai/jina-embeddings-v2-base-code` | 768 | ~610 MB | ⭐⭐⭐⭐ Strong, 8k context |
| **Salesforce CodeT5+** | `Salesforce/codet5p-110m-embedding` | 256 | ~420 MB | ⭐⭐⭐⭐ Compact index size |

All models run on **CPU**. No GPU required.

### Switching models

Set `model` and `embedding_dim` in `.codescout/config.json`:

```json
{
  "model": "nomic-ai/CodeRankEmbed",
  "embedding_dim": 768
}
```

```bash
codescout index --full   # always re-index after changing models
```

> `embedding_dim` must match the model's output dimension. Mixing embeddings from two different models in the same index produces wrong results.
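
A quick way to catch a mismatch before re-indexing is to compare the config against the table above. The sketch below is a hypothetical helper (`check_embedding_dim` is not a codescout command); the dimensions are copied from this README.

```python
# Output dimensions as listed in the model table of this README.
KNOWN_DIMS = {
    "all-MiniLM-L6-v2": 384,
    "nomic-ai/CodeRankEmbed": 768,
    "jinaai/jina-embeddings-v2-base-code": 768,
    "Salesforce/codet5p-110m-embedding": 256,
}


def check_embedding_dim(config: dict) -> None:
    """Raise ValueError if `embedding_dim` disagrees with the table above."""
    model = config["model"]
    expected = KNOWN_DIMS.get(model)
    if expected is None:
        return  # model not in the table: nothing to compare against
    actual = config["embedding_dim"]
    if actual != expected:
        raise ValueError(
            f"{model} outputs {expected}-dim vectors, "
            f"but embedding_dim is set to {actual}"
        )


check_embedding_dim({"model": "nomic-ai/CodeRankEmbed", "embedding_dim": 768})
```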

---

## How It Works

1. **Scan** — walks the project tree, respects `.gitignore`, filters by configured extensions
2. **Parse** — tree-sitter extracts top-level declarations (functions, components, hooks, types, interfaces, classes)
3. **Chunk** — each declaration becomes one chunk; a natural-language description is generated and prepended before embedding
4. **Embed** — chunks are encoded in one batched call that processes mini-batches of 32 to keep memory bounded
5. **Store** — vectors go into FAISS (`index.faiss`), metadata + source into SQLite (`metadata.db`), both inside `.codescout/`
6. **Query** — question is embedded, FAISS finds nearest vectors, source is fetched from SQLite, sent to LLM
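
Steps 4 and 6 can be sketched without any dependencies. This is a simplified illustration, not codescout's code: it swaps FAISS for a brute-force scan in pure Python, assumes cosine similarity as the metric, and uses hypothetical helper names.

```python
import math
from typing import Iterator, Sequence


def mini_batches(items: Sequence[str], size: int = 32) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items (step 4)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 5) -> list[int]:
    """Return the indices of the k vectors most similar to `query` (step 6).

    Brute force stands in for FAISS here; the result is a list of row ids
    that would then be looked up in the metadata store.
    """
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]
```

A vector index does the same job as `top_k` far more efficiently, which is why the real pipeline stores vectors in FAISS rather than scanning them.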

### Chunk classification

| Type | Detection |
|---|---|
| `component` | JSX in body + uppercase first letter |
| `hook` | name starts with `use` + uppercase |
| `function` | any other named function or arrow |
| `type` / `interface` | TypeScript type alias or interface |
| `class` | class or abstract class declaration |
| `constant` | exported non-function declaration |
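
The table translates into a small decision function. The sketch below is illustrative only: the `classify` signature, the `kind` values, and the order in which the rules are tried are assumptions, not codescout's actual code.

```python
from typing import Optional


def classify(kind: str, name: str, has_jsx: bool = False, exported: bool = False) -> Optional[str]:
    """Map a top-level declaration to a chunk type, following the table above.

    `kind` is a hypothetical label for the syntax node:
    "function", "type", "interface", "class", or "variable".
    Returns None for declarations that do not become a chunk.
    """
    if kind in ("type", "interface", "class"):
        return kind
    if kind == "function":
        if has_jsx and name[:1].isupper():
            return "component"
        # "use" must be followed by an uppercase letter: useAuth, not user.
        if name.startswith("use") and name[3:4].isupper():
            return "hook"
        return "function"
    if kind == "variable" and exported:
        return "constant"
    return None
```

Note that `user` is classified as a plain function, since the character after `use` is lowercase.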

### Why descriptions improve recall

A chunk for `useAuthToken()` gets the description *"hook useAuthToken — Uses: token, setToken, userId"*. A query for *"authentication token handling"* matches this description even if the words "authentication" or "token" never appear in the function body. Interfaces get their field names extracted; hooks and components get their destructured state variables listed. Tools that embed raw source alone miss this signal.
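
A rough sketch of how such a description and the final embedding input might be assembled. The function names and the newline-joined layout are assumptions; only the description format and the "description + imports + source" ordering come from this README.

```python
def describe(chunk_type: str, name: str, uses: list[str]) -> str:
    """Build a description in the format of the useAuthToken example above."""
    separator = " \u2014 "  # the dash separator used in the README's example
    text = f"{chunk_type} {name}"
    if uses:
        text += f"{separator}Uses: {', '.join(uses)}"
    return text


def embedding_text(description: str, imports: list[str], source: str) -> str:
    """Join description, imports, and source into one string to embed.

    Joining with newlines is an assumption made for this sketch.
    """
    return "\n".join([description, *imports, source])


description = describe("hook", "useAuthToken", ["token", "setToken", "userId"])
print(embedding_text(
    description,
    ["import { useState } from 'react'"],
    "export function useAuthToken() { /* ... */ }",
))
```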

---

## Configuration

`.codescout/config.json` (created by `codescout init`):

```json
{
  "model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "top_k": 5,
  "extensions": [".ts", ".tsx", ".js", ".jsx"],
  "exclude": ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"],
  "llm_model": "anthropic/claude-sonnet-4",
  "max_chunk_lines": 80,
  "min_chunk_lines": 3,
  "max_context_chars": 12000
}
```
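
If you want to read the config from your own tooling, a loader along these lines should work. It is a sketch under assumptions: `load_config` is a hypothetical helper, and falling back to the defaults shown above for missing keys is a design choice made here, not documented codescout behavior.

```python
import json
from pathlib import Path

# Defaults copied from the config shown above.
DEFAULTS = {
    "model": "all-MiniLM-L6-v2",
    "embedding_dim": 384,
    "top_k": 5,
    "extensions": [".ts", ".tsx", ".js", ".jsx"],
    "exclude": ["node_modules", "dist", ".next", "build", "*.test.ts", "*.spec.ts"],
    "llm_model": "anthropic/claude-sonnet-4",
    "max_chunk_lines": 80,
    "min_chunk_lines": 3,
    "max_context_chars": 12000,
}


def load_config(project_root: Path) -> dict:
    """Read .codescout/config.json and fill in any missing keys from DEFAULTS."""
    path = project_root / ".codescout" / "config.json"
    overrides = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    return {**DEFAULTS, **overrides}
```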

Read or write any value:

```bash
codescout config                   # show all
codescout config top_k             # read one
codescout config top_k 10          # write one
```

---

## Storage

Everything lives in `.codescout/` inside the project root:

```
.codescout/
├── config.json       # project config
├── index.faiss       # FAISS vector index (~1.5 MB per 1,000 chunks at 384 dims)
├── metadata.db       # SQLite: chunk source, file hashes, line numbers
└── .gitignore        # auto-generated; prevents committing the index to git
```
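
The size estimate for `index.faiss` follows from simple arithmetic, assuming the vectors are stored uncompressed as 32-bit floats (the README does not state which FAISS index type is used, so treat this as an estimate). `flat_index_bytes` is a hypothetical helper:

```python
def flat_index_bytes(num_chunks: int, dim: int, bytes_per_float: int = 4) -> int:
    """Approximate size of an uncompressed float32 vector store, in bytes."""
    return num_chunks * dim * bytes_per_float


# 1,000 chunks x 384 dims x 4 bytes = 1,536,000 bytes, i.e. about 1.5 MB
print(flat_index_bytes(1000, 384) / 1e6)
# A 768-dim model doubles that to 3,072,000 bytes
print(flat_index_bytes(1000, 768) / 1e6)
```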

Inspect directly:

```bash
sqlite3 .codescout/metadata.db \
  "SELECT name, chunk_type, file_path, start_line FROM chunks LIMIT 20;"
```
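
The same kind of inspection can be done from Python with the standard `sqlite3` module. The sketch below runs against an in-memory stand-in whose columns are taken from the query above; the real `chunks` table may have more columns, and `chunk_counts` is a hypothetical helper.

```python
import sqlite3


def chunk_counts(conn: sqlite3.Connection) -> dict[str, int]:
    """Return the number of indexed chunks per chunk_type."""
    rows = conn.execute(
        "SELECT chunk_type, COUNT(*) FROM chunks "
        "GROUP BY chunk_type ORDER BY chunk_type"
    )
    return dict(rows.fetchall())


# In-memory stand-in with only the columns used in the query above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE chunks (name TEXT, chunk_type TEXT, file_path TEXT, start_line INTEGER)"
)
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?, ?, ?)",
    [
        ("useAuthToken", "hook", "src/hooks/auth.ts", 4),
        ("useTheme", "hook", "src/hooks/theme.ts", 1),
        ("LoginForm", "component", "src/LoginForm.tsx", 10),
    ],
)
print(chunk_counts(conn))
```

To run it against a real index, open the database with `sqlite3.connect(".codescout/metadata.db")` instead of the in-memory connection.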

---

## Installation

```bash
pip install emb-codescout            # core
pip install 'emb-codescout[mcp]'     # + MCP server
```

Requires **Python 3.10+**. FAISS, sentence-transformers, and tree-sitter are installed automatically as dependencies.

---

## License

MIT
