Metadata-Version: 2.4
Name: context_rag_cli
Version: 0.1.0
Summary: Privacy-preserving CLI AI agent for querying local code repositories using RAG
Author: Developer
License: MIT
Project-URL: Homepage, https://github.com/<owner>/context_cli
Project-URL: Repository, https://github.com/<owner>/context_cli
Project-URL: Bug Tracker, https://github.com/<owner>/context_cli/issues
Keywords: rag,cli,llm,code-search,ai-assistant
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: typer>=0.12.0
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers>=2.7.0
Requires-Dist: chromadb>=0.5.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: torch>=2.2.0
Requires-Dist: tiktoken>=0.7.0
Requires-Dist: whatthepatch>=1.0.0
Requires-Dist: pathspec>=0.12.0
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: watchdog>=4.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: hypothesis>=6.0; extra == "dev"
Requires-Dist: pytest-mock>=3.0; extra == "dev"
Provides-Extra: publish
Requires-Dist: build>=1.0; extra == "publish"
Requires-Dist: twine>=5.0; extra == "publish"

# Autonomous RAG Agent

A privacy-preserving command-line AI assistant for your local code repositories. Index any codebase, search it with natural language, ask questions in an interactive chat session, and generate code patches — all without uploading your source files to the cloud.

Only semantically selected code snippets are sent to the LLM. Raw repository files never leave your machine.

---

## How it works

```
Your repo  →  Chunk  →  Embed (local CPU)  →  ChromaDB (local disk)
                                                      ↓
Your query  →  Embed  →  Retrieve top-K chunks  →  Gemini / GPT / Claude
                                                      ↓
                                               Answer + sources
```

1. **Index** — scans your repo, splits files into overlapping chunks, generates embeddings locally using `all-MiniLM-L6-v2` (runs on CPU, no GPU needed), stores everything in ChromaDB on your disk.
2. **Chat / Ask / Search / Patch** — your query is embedded the same way, the most relevant chunks are retrieved, and only those chunks are sent to the LLM along with your question.

---

## Installation

### Prerequisites

- Python 3.10 or newer
- pip

### Install (recommended — pipx keeps it isolated)

```bash
pipx install context_rag_cli
```

### Install (standard pip)

```bash
pip install context_rag_cli
```

### Install from source

```bash
git clone <repo-url>
cd project_cli
pip install -e .
```

Verify the install:

```bash
agent --help
```

> **CPU-only PyTorch (saves ~1 GB):** The default `torch` wheel from PyPI includes CUDA support. To install the lighter CPU-only build, see [INSTALL.md](INSTALL.md).

> The agent stores all its data in `~/.agent/` — it never writes files to your project directory.

---

## Quick start — using it on your own project

### Step 1 — Configure your LLM provider

Pick one of the three supported providers and set your API key:

```bash
# Gemini (Google)
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_GEMINI_API_KEY

# OpenAI
agent config set provider openai
agent config set model gpt-4o-mini
agent config set api_key YOUR_OPENAI_API_KEY

# Anthropic (Claude)
agent config set provider anthropic
agent config set model claude-3-5-sonnet-20241022
agent config set api_key YOUR_ANTHROPIC_API_KEY
```

Confirm everything looks right:

```bash
agent config show
```

### Step 2 — Index your project

Navigate to any project directory and index it:

```bash
cd /path/to/your/project
agent index .
```

Or index a specific path from anywhere:

```bash
agent index "/path/to/your/project"
```

This step runs entirely locally. It may take a minute on the first run while the embedding model is downloaded (~90 MB, cached after that). All subsequent runs are fast.

### Step 3 — Start a chat session (recommended)

The best way to explore a codebase is the interactive chat mode. The embedding model loads once and you type questions directly. The agent remembers the last 6 turns by default, so follow-up questions work naturally:

```bash
cd /path/to/your/project
agent chat
```

```
RAG Agent — Interactive Chat
  Project : /path/to/your/project
  Provider: gemini / gemini-1.5-flash
  Top-K   : 5  |  Memory: 6 turns

You: How does authentication work?
Agent: The authentication flow starts in auth.py...

You: What errors can it throw?          ← follow-up works
Agent: Based on our discussion of the auth flow,
       it raises AuthError on 401/403...

You: /search JWT token
# Shows raw search results without LLM

You: /top-k 10
# Now retrieves 10 chunks per question

You: /clear-history
# Forget conversation history without restarting

You: /exit
```

### Step 4 — Or use single commands

```bash
agent ask "How does authentication work in this project?"
agent ask "Where is the database connection configured?"
agent search "error handling"
agent search "database query" --top-k 10
```

### Step 5 — Generate a patch

```bash
agent patch "Add input validation to the login endpoint"
agent patch "Add docstrings to all public functions in utils.py" --dry-run
```

---

## All commands

### `agent chat` ⭐ recommended

Start an interactive chat session. The embedding model and vector store load **once** — no startup delay on every question. Conversation memory is built in — the agent remembers your last N turns so follow-up questions work naturally.

```bash
agent chat                    # default top-k 5, 6 turns memory, buffered responses
agent chat --stream           # stream tokens as they arrive
agent chat --top-k 8          # retrieve 8 chunks per question
agent chat --history 10       # keep last 10 turns in memory
agent chat --history 0        # stateless mode (no memory)
agent chat --verbose          # show source files after each answer
```

**In-session commands:**

| Command | Description |
|---|---|
| `/exit` or `/quit` | End the session |
| `/search <query>` | Search without calling the LLM |
| `/snippet <file> [question]` | Load an entire file as pinned context |
| `/top-k <N>` | Change chunk count for the rest of the session |
| `/clear` | Clear the screen |
| `/clear-history` | Forget conversation history without restarting |
| `/help` | Show available commands |
| `Ctrl+C` | Interrupt a slow response without killing the session |

**Asking about a specific file or code snippet — `/snippet`:**

The `/snippet` command loads any file directly into the prompt as the highest-priority context. No indexing required. This is the reliable way to ask about a specific file on Windows — avoids the terminal paste problem where multiline text fires as separate questions:

```
You: /snippet agent/llm_client.py What does TokenUsage do?
Agent: TokenUsage is a dataclass that tracks token usage...

You: /snippet mycode.py
Question about this file: Explain the retry logic
Agent: ...

You: /snippet config.yaml Is this configuration valid?
Agent: ...
```

The file content is injected first (highest priority), followed by RAG-retrieved chunks for supporting codebase context.

**Paste mode — short snippets typed manually:**

Start your message with ` ``` ` to enter paste mode. Type or paste line by line, then close with ` ``` ` on its own line:

```
You: ```
  > def hello():
  >     return "world"
  > ```
Snippet captured (2 lines). Now type your question.
You: What does this function return?
Agent: It returns the string "world".
```

> **Windows tip:** Pasting multiline code directly into the terminal sends each line as a separate message, triggering a Gemini call for each line (rate limit errors). Always save the code to a file and use `/snippet <file>` for anything larger than 2-3 lines.

**Conversation memory example:**

```
You: How does authentication work?
Agent: The auth module uses JWT tokens validated in auth.py...

You: What about error handling in that?   ← follow-up works
Agent: Based on our discussion of the JWT flow, errors are
       raised as AuthError (401/403) which...

You: /clear-history               ← start fresh without restarting
Conversation history cleared.

You: /exit
```

---

### `agent index <path>`

Indexes a repository. Creates or replaces the local ChromaDB collection.

```bash
agent index .
agent index "/path/to/project"
agent index . --verbose        # show detailed timing logs
agent index . --changed        # only re-index files that changed since last run
```

`--changed` compares each file's SHA-256 hash against the manifest saved during the previous index. Only new or modified files are re-embedded; deleted files have their chunks removed. Falls back to a full index if no previous index exists.

Always prints file count, chunk count, and elapsed time on completion.

---

### `agent ask "<question>"`

Single-shot question. Retrieves relevant chunks and sends them to the LLM.

```bash
agent ask "How is the config loaded at startup?"
agent ask "What errors can the API return?" --top-k 10
agent ask "Summarise the architecture" --stream
```

Every answer includes a **Sources** table showing which files and lines were used as context.

---

### `agent search "<query>"`

Searches the indexed repository and returns the most relevant code chunks — no LLM call.

Uses hybrid search (vector similarity + BM25 keyword) by default. This catches exact function/variable name matches that pure semantic search misses.

```bash
agent search "authentication middleware"
agent search "getUserById"              # exact symbol name — BM25 helps here
agent search "database schema" --top-k 10
agent search "error handling" --semantic-only   # pure vector only
```

Each result shows: file path, line numbers, relevance score (0.000–1.000), and a code excerpt.

---

### `agent patch "<instruction>"`

Retrieves relevant chunks, asks the LLM to generate a unified diff, and optionally applies it.

```bash
agent patch "Add type hints to all functions in auth.py"
agent patch "Refactor the login flow to use async/await" --dry-run    # preview only
agent patch --discard-backup src/auth.py    # delete the .bak file after applying
```

You will be asked to confirm before any file is modified. Backups are created at `<file>.bak` before every modification.

---

### `agent config set <key> <value>`

Writes a setting to `~/.agent/config.toml`.

```bash
agent config set provider gemini
agent config set api_key YOUR_KEY
agent config set top_k 8
agent config set stream true
```

---

### `agent config show`

Displays the current effective configuration. The API key is masked (last 4 chars visible).

```bash
agent config show
```

---

### `agent list`

Lists all indexed collections with their repository path, chunk count, and index timestamp.

```bash
agent list
```

---

### `agent purge <path>`

Permanently deletes the indexed collection for a repository.

```bash
agent purge .
agent purge "/path/to/old-project"
```

---

### `agent watch [path]`

Watches a directory for file changes and automatically re-indexes modified files — no manual `agent index . --changed` needed.

Uses incremental indexing under the hood: only changed, new, or deleted files are re-embedded. A short debounce window (default 2 s) batches rapid saves (e.g. auto-save storms) into a single re-index pass.

```bash
agent watch .                   # watch current directory
agent watch /path/to/project    # watch a specific path
agent watch . --verbose         # log every file event
agent watch . --debounce 5      # wait 5 s of quiet before re-indexing
```

Press **Ctrl+C** to stop watching.

---

## Practical workflow for a developer

```bash
# 1. Install once (ever)
pipx install context_rag_cli
agent config set provider gemini
agent config set model gemini-1.5-flash
agent config set api_key YOUR_KEY

# 2. Go to any project
cd ~/projects/my-django-app

# 3. Index it (once per project, re-run after big changes)
agent index .

# 4. Start a chat session and explore
agent chat

You: What does this project do?
You: How is the database connection managed?
You: Where is input validation handled?
You: /search rate limiting
You: /top-k 10
You: How does the caching layer work?
You: /exit

# 5. Re-index after making changes (fast incremental update)
agent index . --changed   # only re-embeds modified files

# 6. Or keep the index always fresh automatically
agent watch .             # stays running; re-indexes on every save

# 7. Generate code from the terminal
agent patch "Add request ID logging to all API endpoints" --dry-run
```

---

## Token savings

This agent saves 97–99%+ of LLM tokens compared to sending your full codebase. See [TOKEN_SAVINGS.md](TOKEN_SAVINGS.md) for the full math and a script to audit your own codebase.

Quick numbers for a typical medium project (120,000 tokens):

| Approach | Tokens/query | GPT-4o cost |
|---|---|---|
| Full repo | 120,000 | $0.60 |
| This agent (top-k=5) | ~2,300 | $0.012 |
| **Saving** | **97.9%** | **50x cheaper** |

---

## Architecture

The project is a single Python package (`agent/`) with one module per component:

```
agent/
├── cli.py              # Typer CLI — all commands including chat REPL
├── config.py           # ConfigManager — reads ~/.agent/config.toml + AGENT_* env vars
├── chunking.py         # ChunkingEngine — token-aware sliding window, Chunk dataclass
├── embedding.py        # EmbeddingModel — wraps all-MiniLM-L6-v2 via sentence-transformers
├── vector_store.py     # VectorStore — ChromaDB adapter, atomic replace_collection()
├── search.py           # SemanticSearch — orchestrates embed → query → return ScoredChunks
├── prompt_compiler.py  # PromptCompiler — assembles system + context + query into a Prompt
├── llm_client.py       # LLMClient + GeminiProvider / OpenAIProvider / AnthropicProvider
├── tool_executor.py    # ToolExecutor — validate/apply unified diffs, .bak backups
└── errors.py           # All typed error classes (AgentError hierarchy)
```

All data is stored in `~/.agent/` — the agent never writes to your project directory:

```
~/.agent/
├── config.toml          # your settings
├── chroma/              # ChromaDB vector store (all indexed collections)
└── logs/
    └── agent-YYYY-MM-DD.log
```

### Key design decisions

- **Privacy by architecture** — the LLM only sees `PromptCompiler`-assembled text. No raw files are ever sent.
- **Zero GPU** — `EmbeddingModel` sets `device="cpu"` explicitly. No CUDA required.
- **Atomic re-indexing** — `VectorStore.replace_collection()` writes to a temp collection first, then renames. No partial state is observable during re-index.
- **Provider agnosticism** — `LLMProvider` is an abstract base class. `GeminiProvider`, `OpenAIProvider`, and `AnthropicProvider` are concrete implementations selected by `build_provider(config)`.
- **Hybrid search** — `HybridSearch` combines ChromaDB vector similarity with BM25 keyword scoring via Reciprocal Rank Fusion (RRF). Vector search catches semantic similarity; BM25 catches exact symbol names. RRF combines both ranked lists without needing to tune score weights.
- **Cross-encoder reranking** — after retrieving candidates via hybrid search, a `CrossEncoderReranker` (default: `ms-marco-MiniLM-L-6-v2`, ~22 MB, CPU-only) jointly encodes the query and each candidate passage to produce precise relevance scores. Candidates are re-ordered by cross-encoder score before the top-k are sent to the LLM. Pass `--no-rerank` to skip this step for faster responses.
- **Single retry on timeout** — `LLMClient` retries exactly once (30s × 2) before raising `TimeoutError`. All other errors propagate immediately.
- **Explicit confirmation for patches** — `agent patch` always prompts yes/no before modifying files. `--dry-run` skips both the prompt and any file writes.
- **Chat REPL loads components once** — `agent chat` loads the embedding model once at startup and reuses it for every question in the session, eliminating per-command startup latency.
- **Conversation memory** — `agent chat` accumulates up to N prior turns (default 6) and injects them as real `user`/`assistant` message pairs in the prompt, enabling natural follow-up questions. Reset anytime with `/clear-history` or set `--history 0` for stateless mode.

---

## Development setup

```bash
# Clone and install in editable mode with dev dependencies
git clone <repo-url>
cd project_cli
pip install -e ".[dev]"
```

Dev dependencies include `pytest`, `hypothesis`, and `pytest-mock`.

### Running tests

```bash
# All unit tests
python -m pytest tests/unit/

# A specific test file
python -m pytest tests/unit/test_chunking.py -v

# Integration tests (requires a configured API key and network)
python -m pytest tests/integration/ -m integration
```

### Test structure

```
tests/
├── unit/
│   ├── test_chunking.py        # ChunkingEngine + Chunk dataclass
│   ├── test_config.py          # ConfigManager (load, set, show, validate)
│   ├── test_embedding.py       # EmbeddingModel shape/dtype/determinism
│   ├── test_prompt_compiler.py # PromptCompiler compile/render
│   ├── test_search.py          # SemanticSearch delegation and guards
│   ├── test_tool_executor.py   # ToolExecutor validate/apply/rollback
│   └── test_vector_store.py    # VectorStore CRUD and atomicity
├── property/                   # Hypothesis property-based tests (optional)
└── integration/                # End-to-end tests (marked pytest.mark.integration)
```

229 unit tests, all passing. No internet connection required for unit tests.

### Adding a new LLM provider

1. Open `agent/llm_client.py`
2. Add a new class inheriting from `LLMProvider` and implement `complete(prompt, stream, timeout)`
3. Map HTTP errors using the shared `_raise_for_status()` helper
4. Register the new class in the `build_provider()` factory dict

---

## Supported file types

By default the agent indexes: `.py` `.js` `.ts` `.jsx` `.tsx` `.java` `.c` `.cpp` `.h` `.go` `.rs` `.rb` `.php` `.swift` `.kt` `.scala` `.sh` `.bash` `.md` `.txt` `.yaml` `.yml` `.json` `.toml` `.html` `.css`

To add or restrict extensions, see [CONFIGURATION.md](CONFIGURATION.md).

---

## What gets excluded from indexing

The agent automatically skips directories that should never be indexed:

**Always excluded** (hardcoded, no configuration needed):

| Category | Directories skipped |
|---|---|
| Version control | `.git`, `.hg`, `.svn` |
| Python | `__pycache__`, `.pytest_cache`, `.mypy_cache`, `.tox`, `.venv`, `venv`, `*.egg-info` |
| JavaScript / Node | `node_modules`, `.next`, `.nuxt`, `.turbo` |
| Build artefacts | `dist`, `build`, `out`, `target`, `bin`, `obj` |
| IDE | `.idea`, `.vscode`, `.vs` |

**`.gitignore` respected** — if a `.gitignore` file exists at the root of the indexed directory, its patterns are applied automatically. Files and directories matched by `.gitignore` are skipped. This uses the `pathspec` library with `gitwildmatch` semantics (the same engine git uses).

To disable `.gitignore` parsing, pass `respect_gitignore=False` when constructing `ChunkingEngine` directly (no CLI flag needed for most use cases).

---

## Privacy guarantee

- Embeddings are generated locally on CPU — no data sent to HuggingFace at runtime (model is cached after first download)
- ChromaDB stores all collections on your local disk at `~/.agent/chroma/`
- Only the retrieved code chunks (not full files) are included in the prompt sent to the LLM
- A log of all outbound network requests is written to `~/.agent/logs/agent-YYYY-MM-DD.log`
- The agent never writes any files to your project directory

---

## Further reading

| Document | What it covers |
|---|---|
| [CONFIGURATION.md](CONFIGURATION.md) | All settings, env vars, system prompt customisation |
| [TOKEN_SAVINGS.md](TOKEN_SAVINGS.md) | Token audit math + script to audit your own codebase |
| [ARCHITECTURE_DX.md](ARCHITECTURE_DX.md) | Deployment options, DX analysis, production considerations |

---

## Release process

1. Bump `version` in `pyproject.toml`
2. `git commit -m "chore: release vX.Y.Z"`
3. `git tag vX.Y.Z`
4. `git push && git push --tags`
5. Monitor the `publish.yml` workflow on GitHub Actions — it builds and publishes to PyPI automatically

---

## Troubleshooting

**`No indexed collection found`** — run `agent index .` from inside your project directory first.

**`Got unexpected extra argument`** — your path has spaces. Wrap it in quotes: `agent index "C:\My Projects\app"`

**Path separator on Windows** — use forward slashes or quotes: `agent index "D:\My Projects\app"`

**Gemini / LLM timeout** — the model took too long. Try fewer chunks: `agent ask "..." --top-k 2`. The agent retries once automatically (30s per attempt, 60s total max) before giving up.

**Noisy httpx logs on startup (first run only)** — on the very first `agent index` run the embedding model is downloaded from HuggingFace (~90 MB). After that, the agent sets `TRANSFORMERS_OFFLINE=1` automatically so subsequent commands skip the network check entirely and start instantly. If you need to force a model update, set `TRANSFORMERS_OFFLINE=0` in your shell before running.

**Re-index after large changes** — if answers seem stale or wrong, re-run `agent index .` to refresh the collection.

**Multiline paste breaks into separate questions (Windows)** — pasting multiline code directly into the Windows terminal sends each line as a separate Enter, triggering a Gemini API call per line (causes 429 rate limit errors). Solution: save the code to a file first, then use `/snippet` in `agent chat`:
```
# Save your code to a file (e.g. snippet.py), then in agent chat:
You: /snippet snippet.py What does this class do?
```
