Metadata-Version: 2.4
Name: codesurf
Version: 0.1.0
Summary: Smart codebase context for LLMs. Query-aware, dependency-aware, token-budgeted.
Author: Norbert
License-Expression: MIT
License-File: LICENSE
Keywords: ai,ast,claude,code,codebase,context,cursor,gpt,llm,rag
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: full
Requires-Dist: tiktoken; extra == 'full'
Description-Content-Type: text/markdown

# codesurf

Smart codebase context for LLMs.

**Give your LLM the right code, not all the code.**

Most tools dump your entire repository into a single file and hope the LLM figures it out. codesurf does the opposite -- it analyzes your codebase, understands which files matter for your specific question, and produces a compact, structured context that fits within a token budget. The result: faster responses, lower cost, and better answers from any LLM.

````
$ codesurf . --query "how does authentication work"

[codesurf] 8/47 files, 12,340 tokens (12% of budget), 2 compressed

# Project: myapp
> fastapi project, 42 python, 5 typescript, 47 files, branch: main
> Context generated by codesurf | query: "how does authentication work"
> Token budget: ~12,340/100,000 (12%)

## Structure
main.py
src/
  auth/
    jwt.py
    oauth.py
  models/
    user.py
  ...

## Dependency graph
src/auth/jwt.py -> src/models/user.py
src/auth/oauth.py -> src/auth/jwt.py
src/api/routes.py -> src/auth/jwt.py, src/models/user.py

## Files

### src/auth/jwt.py (relevance: 0.95, 234 tokens)
```python
"""JWT token handling."""
from src.models.user import User

def create_token(user: User) -> str:
    ...
def verify_token(token: str) -> bool:
    ...
```

### src/models/user.py (relevance: 0.82, 156 tokens)
[COMPRESSED - signatures only]
```python
class User:
    """Represents an application user."""
    def __init__(self, name: str, email: str, role: UserRole): ...
    def is_admin(self) -> bool: ...
```
````

## The problem

LLM context windows are limited. A typical project has hundreds of files but your question only touches a handful of them. Existing tools either dump everything (wasting tokens and confusing the model) or require you to manually pick files (tedious and error-prone).

codesurf solves this by being query-aware: you tell it what you're working on, and it figures out which files are relevant, which dependencies need to be included, and how to compress everything to fit your token budget.

## How it compares

**repomix** (Node.js, 10k+ GitHub stars) concatenates your entire repo into one file. It has a `--compress` flag but compression is global -- it doesn't know what you're asking about, so it can't prioritize. No dependency awareness. Requires Node.js.

**code2prompt** fills a template with file contents. No intelligence, no dependency graph, no ranking. You get the same output regardless of what you're trying to do.

**aider repo-map** builds a structural map of your repo for its own internal use. It's locked inside aider -- you can't export it, pipe it to another LLM, or use it as a library.

**codesurf** is different in four ways:
1. It is **query-aware** -- `--query "authentication"` returns only auth-related files
2. It has a **dependency graph** -- knows that `jwt.py` imports `user.py` and includes both
3. It does **smart compression** -- full code for important files, signatures only for context files
4. It respects a **token budget** -- never exceeds the limit, automatically decides what to compress or drop

## Install

```
pip install codesurf
```

For accurate token counting (recommended), install with tiktoken:

```
pip install "codesurf[full]"
```

That's it. No Node.js, no Rust toolchain, no Docker. Python 3.9+ and stdlib.

## Usage

### Basic -- full project context

Scan the current directory. All source files are considered, ranked by structural importance, and compressed to fit the default budget of 100k tokens:

```
codesurf .
```

### Query-aware -- only relevant files

Ask a question. codesurf extracts keywords, matches them against file paths, function names, class names, and docstrings, then uses the dependency graph to pull in related files:

```
codesurf . --query "how does payment processing work"
codesurf . -q "database migration logic"
codesurf . -q "why is the login endpoint slow"
```

### Focus mode -- specific directory + dependencies

Point at a directory or file. Everything inside gets full relevance, its direct dependencies get high relevance, and second-degree dependencies get moderate relevance. Everything else is excluded:

```
codesurf . --focus src/auth/
codesurf . --focus src/auth/jwt.py
```

### Combine query + focus

Narrow down to a directory and then rank within it:

```
codesurf . --focus src/api/ --query "error handling" --max-tokens 50000
```

### Pipe to an LLM

codesurf outputs to stdout by default, so you can pipe it directly into any CLI-based LLM tool:

```
codesurf . -q "auth bug" | llm "Find and fix the authentication bug"
codesurf . -q "test failures" | claude "Why are these tests failing?"
```

### Save to file

```
codesurf . -q "auth" -o context.md
```

### Copy to clipboard

```
codesurf . -q "auth" --copy
```

## How it works -- the pipeline

codesurf runs a five-stage pipeline on every invocation. Each stage feeds into the next:

### Stage 1: Scan

`FileScanner` walks the project tree using `pathlib.Path.rglob`. It:

- Reads `.gitignore` and `.codesurfignore` with support for negation patterns (`!important.py`), doublestar (`**/generated/`), and directory-only rules
- Always ignores: `__pycache__`, `node_modules`, `.git`, `venv`, `dist`, `build`, binary files, files over 500KB
- Detects language from extension (`.py` -> Python, `.js/.jsx` -> JavaScript, `.ts/.tsx` -> TypeScript)
- Detects framework by examining imports and config files (FastAPI, Django, Flask, Next.js, React)
- Reads git branch via `git branch --show-current`
- Counts tokens for each file (tiktoken when available, char/4 estimate otherwise)
- Skips symlinks to prevent duplicate scanning

Each file becomes a `FileInfo` dataclass with path, language, size, token count, and full content.
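
A minimal sketch of the scanning rules above, for illustration only. `ScannedFile`, `scan`, and the constants are hypothetical names, not codesurf's actual classes; `.gitignore` handling, binary detection, framework detection, and the git branch lookup are left out, and the token count uses the char/4 estimate.

```python
from dataclasses import dataclass
from pathlib import Path

# Values taken from the list above; the names themselves are assumptions.
ALWAYS_IGNORED = {"__pycache__", "node_modules", ".git", "venv", "dist", "build"}
MAX_FILE_BYTES = 500 * 1024
LANGUAGES = {
    ".py": "python",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
}


@dataclass
class ScannedFile:  # hypothetical stand-in for FileInfo
    path: str
    language: str
    size: int
    tokens: int
    content: str


def scan(root: str) -> list[ScannedFile]:
    """Walk the tree and keep recognized source files that pass the filters."""
    root_path = Path(root)
    found = []
    for path in sorted(root_path.rglob("*")):
        rel = path.relative_to(root_path)
        # Skip anything located inside an always-ignored directory.
        if any(part in ALWAYS_IGNORED for part in rel.parts):
            continue
        # Skip symlinked entries and anything that is not a regular file.
        if path.is_symlink() or not path.is_file():
            continue
        language = LANGUAGES.get(path.suffix)
        if language is None:
            continue
        size = path.stat().st_size
        if size > MAX_FILE_BYTES:
            continue
        content = path.read_text(encoding="utf-8", errors="replace")
        tokens = len(content) // 4  # char/4 estimate
        found.append(ScannedFile(rel.as_posix(), language, size, tokens, content))
    return found
```

Calling `scan(".")` on a project returns one record per recognized source file, with paths relative to the root.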

### Stage 2: Parse

`PythonParser` uses Python's built-in `ast` module to analyze every `.py` file. It extracts:

- **Imports**: `import x`, `from x import y`, relative imports (`from .module import func`). Each import is classified as local (exists in the project), stdlib, or third-party. Only local imports become dependency edges.
- **Exports**: function signatures with full type annotations (`def authenticate(username: str, password: str) -> User`), class definitions with base classes and method names, module-level constants (ALL_CAPS only).
- **Docstrings**: module-level docstring via `ast.get_docstring()`.
- **Dependency edges**: for each local import, a `DependencyEdge` linking source file to target file with imported symbols.
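
As a rough illustration of the import extraction described above, the sketch below collects imports with `ast` and classifies them. `extract_imports` and `classify` are hypothetical helpers, not codesurf's API, and the classification rule (relative imports and known top-level project names count as local) is an assumption.

```python
import ast
import sys


def extract_imports(source: str) -> list[tuple[str, int, list[str]]]:
    """Return (module, relative_level, imported_names) for each import."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.append((alias.name, 0, []))
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"; level counts the dots.
            names = [alias.name for alias in node.names]
            found.append((node.module or "", node.level, names))
    return found


def classify(module: str, level: int, local_modules: set[str]) -> str:
    """Label an import as local, stdlib, or third-party (assumed rule)."""
    top = module.split(".")[0]
    if level > 0 or top in local_modules:
        return "local"
    # sys.stdlib_module_names exists on Python 3.10+; on 3.9 this falls back
    # to an empty set, so stdlib modules would be labeled third-party there.
    if top in getattr(sys, "stdlib_module_names", frozenset()):
        return "stdlib"
    return "third-party"
```

Only imports labeled `local` would go on to become dependency edges.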

Import resolution handles multiple project layouts -- it tries the direct dotted path, `__init__.py` packages, and common src-layout prefixes (`src/`, `lib/`, `app/`).
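
One way to picture that lookup order is the sketch below. `resolve_import` and `SRC_PREFIXES` are hypothetical names, and the order in which the prefixes are tried is an assumption.

```python
from pathlib import PurePosixPath
from typing import Optional

# "" stands for the direct dotted path; the rest are common src-layout prefixes.
SRC_PREFIXES = ("", "src", "lib", "app")


def resolve_import(module: str, project_files: set[str]) -> Optional[str]:
    """Map a dotted module name to a project-relative file path, or None."""
    if not module:
        return None
    parts = module.split(".")
    for prefix in SRC_PREFIXES:
        base = PurePosixPath(prefix, *parts)  # empty prefix is ignored
        # Try "pkg/mod.py" first, then the package form "pkg/mod/__init__.py".
        for candidate in (base.with_suffix(".py"), base / "__init__.py"):
            if str(candidate) in project_files:
                return str(candidate)
    return None
```

A module that resolves to no file in the project is treated as non-local and produces no dependency edge.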

Files that raise `SyntaxError` are skipped with a logged warning -- one broken file never crashes the whole scan.

JavaScript and TypeScript files are included in output but without AST analysis in v0.1. Tree-sitter support is planned.

### Stage 3: Rank

`RelevanceRanker` scores every file from 0.0 to 1.0 using four weighted signals:

**Path match (30% weight):** Do query keywords appear in the file path? An exact directory match (`auth` == `src/auth/`) scores 1.0. A substring match scores 0.5. Bidirectional matching means "authentication" matches the `auth/` directory.
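
A sketch of how such a path score could be computed. `path_match_score` is a hypothetical function; treating the file name (without extension) like a directory name, and scoring a substring match in either direction as 0.5, are assumptions.

```python
from pathlib import PurePosixPath


def path_match_score(keywords: list[str], file_path: str) -> float:
    """Score 1.0 for an exact component match, 0.5 for a substring match."""
    path = PurePosixPath(file_path.lower())
    # Directory names plus the file name without its extension.
    components = list(path.parent.parts) + [path.stem]
    best = 0.0
    for keyword in (k.lower() for k in keywords if k):
        for component in components:
            if keyword == component:
                return 1.0
            # Bidirectional: "auth" in "authentication", or the reverse.
            if keyword in component or component in keyword:
                best = max(best, 0.5)
    return best
```

With this rule, `["auth"]` scores 1.0 against `src/auth/jwt.py`, while `["authentication"]` scores 0.5 because `auth` is a substring of the keyword.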

**Symbol match (30% weight):** Do query keywords appear in exported function names, class names, or docstrings? A match in a function/class name scores 1.0. A match in a docstring scores 0.5.

**Dependency proximity (20% weight):** How close is this file to matched files in the dependency graph? Distances are computed by BFS from all directly matched files: a direct dependency scores 1.0, two hops 0.5, three hops 0.25. Traversal depth is configurable via `--depth` (default: 2).
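
The halving-per-hop rule can be sketched as a multi-source BFS. `proximity_scores` is a hypothetical function; treating edges as undirected and leaving the matched files themselves out of the result are assumptions.

```python
from collections import deque


def proximity_scores(edges, matched, max_depth=2):
    """Return {file: score} for files within max_depth hops of a matched file."""
    # Build an undirected adjacency map (assumption: direction is ignored).
    neighbors = {}
    for source, target in edges:
        neighbors.setdefault(source, set()).add(target)
        neighbors.setdefault(target, set()).add(source)

    hops = {path: 0 for path in matched}
    queue = deque(matched)
    while queue:
        current = queue.popleft()
        if hops[current] >= max_depth:
            continue  # do not expand beyond the configured depth
        for nxt in neighbors.get(current, ()):
            if nxt not in hops:
                hops[nxt] = hops[current] + 1
                queue.append(nxt)

    # 1 hop -> 1.0, 2 hops -> 0.5, 3 hops -> 0.25
    return {path: 0.5 ** (h - 1) for path, h in hops.items() if h > 0}
```

Starting the queue with every matched file at once means each file gets its distance to the nearest match.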

**File importance (20% weight):** Structural heuristics. `models.py`, `schemas.py` score high (0.8) because they define the domain. Entry points (`main.py`, `app.py`) score 0.7. `__init__.py` scores low (0.3). Test files score low (0.2) unless the query mentions "test". Files imported by many others (high in-degree) get a bonus. Smaller files get a slight bonus (easier to include in full).
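
Assuming the four signals are combined as a plain weighted sum (the exact formula is not spelled out here), the final score could look like this. `WEIGHTS` and `combine` are hypothetical names.

```python
# Weights from the four signals above; they sum to 1.0.
WEIGHTS = {"path": 0.30, "symbol": 0.30, "proximity": 0.20, "importance": 0.20}


def combine(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores, clamped to the 0.0-1.0 range."""
    total = sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)
    return round(min(max(total, 0.0), 1.0), 4)
```

For example, a file with an exact path match (1.0), a docstring-only symbol match (0.5), no graph proximity, and importance 0.8 would score 0.30 + 0.15 + 0.00 + 0.16 = 0.61 under this assumption.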

**Focus mode** overrides this: focused files get 1.0, their dependencies get scored by BFS distance, everything else gets 0.0.

**`--no-deps` mode** disables the dependency proximity signal entirely -- files are ranked purely on direct keyword matches and importance heuristics.

### Stage 4: Compress

`SmartCompressor` fits the ranked files into the token budget. It processes files in rank order:

1. **Full content phase** (up to 80% of budget): top-ranked files are included with their complete source code.
2. **If a file doesn't fit in full, compress it first** instead of skipping -- preserving its presence in the context.
3. **Signature phase** (remaining budget): lower-ranked files get compressed to signatures, docstrings, and class structure only.
4. **Drop phase**: if even compressed form doesn't fit, the file is excluded.
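
The phases above amount to a greedy pass over the ranked list. The sketch below is one possible reading, not the actual `SmartCompressor`: `Candidate` and `pack` are hypothetical, and `budget` is taken as whatever token budget is available for file content.

```python
from dataclasses import dataclass


@dataclass
class Candidate:  # hypothetical; not codesurf's FileInfo
    path: str
    full_tokens: int
    compressed_tokens: int


def pack(ranked, budget, full_share=0.8):
    """Decide full / compressed / dropped for files sorted best-first."""
    full_limit = int(budget * full_share)
    used = 0
    decisions = []
    for item in ranked:
        if used + item.full_tokens <= full_limit:
            # Full content phase: fits within the share reserved for full files.
            decisions.append((item.path, "full"))
            used += item.full_tokens
        elif used + item.compressed_tokens <= budget:
            # Does not fit in full: keep it in compressed form instead of skipping.
            decisions.append((item.path, "compressed"))
            used += item.compressed_tokens
        else:
            # Drop phase: even the compressed form does not fit.
            decisions.append((item.path, "dropped"))
    return decisions, used
```

Because the pass is greedy in rank order, a lower-ranked small file can still be included after a larger file has been dropped.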

**Python compression** uses AST to extract:
- Function/method signatures with full type annotations
- First line of each docstring
- Class definitions with base classes
- Nested class/method structure preserved
- Body replaced with `...`
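
A simplified sketch of signature-only compression using `ast` (requires Python 3.9+ for `ast.unparse`). `compress_to_signatures` is a hypothetical function; it drops decorators and keeps docstrings for classes only, which is narrower than the behavior listed above.

```python
import ast


def compress_to_signatures(source: str) -> str:
    """Keep class/function signatures and class docstring first lines."""
    lines = []

    def visit(nodes, indent):
        pad = "    " * indent
        for node in nodes:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
                returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
                args = ast.unparse(node.args)
                # The body is replaced with "..." on the same line.
                lines.append(f"{pad}{keyword} {node.name}({args}){returns}: ...")
            elif isinstance(node, ast.ClassDef):
                bases = ", ".join(ast.unparse(base) for base in node.bases)
                header = f"class {node.name}({bases}):" if bases else f"class {node.name}:"
                lines.append(pad + header)
                before = len(lines)
                doc = ast.get_docstring(node)
                if doc:
                    lines.append(f'{pad}    """{doc.splitlines()[0]}"""')
                visit(node.body, indent + 1)  # preserves nested structure
                if len(lines) == before:
                    lines.append(f"{pad}    ...")  # keep empty classes valid

    visit(ast.parse(source).body, 0)
    return "\n".join(lines)
```

The output stays valid Python, so it can still be placed in a language-tagged code block.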

**Non-Python compression** keeps the first 50 and last 20 lines with a `[N lines truncated]` marker in between.
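
The head/tail truncation is simple enough to sketch directly. `truncate_middle` is a hypothetical name, and leaving short files untouched is an assumption.

```python
def truncate_middle(text: str, head: int = 50, tail: int = 20) -> str:
    """Keep the first `head` and last `tail` lines, marking the gap."""
    lines = text.splitlines()
    if len(lines) <= head + tail:
        return text  # nothing to cut
    omitted = len(lines) - head - tail
    marker = f"[{omitted} lines truncated]"
    # Slice from an explicit index so tail=0 keeps no trailing lines.
    return "\n".join(lines[:head] + [marker] + lines[len(lines) - tail:])
```

A 100-line file becomes 71 lines: 50 from the top, the marker, and 20 from the bottom.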

Three modes available via `--compression`:
- `smart` (default): full content for top files, signatures for the rest
- `signatures`: compress everything
- `none`: full content only, skip files that don't fit

35% of the token budget is reserved for overhead (project header, file tree, dependency graph, markdown formatting). The formatter enforces the hard budget limit by trimming files from the tail until the total output fits.

### Stage 5: Format

`OutputFormatter` produces markdown optimized for LLM consumption:

- **Project header**: name, framework, language breakdown, git branch, token budget usage
- **Structure**: compact indented tree (max 3 levels deep, annotated with file counts)
- **Dependency graph**: simple `file.py -> dep1.py, dep2.py` format, filtered to only show edges between included files
- **File sections**: each file gets a `###` header with relevance score and token count, language-tagged code block, `[COMPRESSED]` marker when applicable

Token budget line in the header reflects the actual final output size (computed after all trimming).
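
To illustrate the dependency graph section, the sketch below renders edges in the `file.py -> dep1.py, dep2.py` format and filters out edges that touch files not in the output. `format_dependency_graph` is a hypothetical function, and the alphabetical ordering is an assumption.

```python
def format_dependency_graph(edges, included):
    """Render 'source -> dep1, dep2' lines for edges between included files."""
    grouped = {}
    for source, target in edges:
        # Keep an edge only when both ends made it into the output.
        if source in included and target in included:
            grouped.setdefault(source, []).append(target)
    return "\n".join(
        f"{source} -> {', '.join(sorted(grouped[source]))}"
        for source in sorted(grouped)
    )
```

Filtering keeps the graph consistent with the file sections, so the model is not pointed at files it cannot see.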

## Python API

```python
from codesurf import generate_context

# Basic
ctx = generate_context(".")

# Query-aware with custom budget
ctx = generate_context(".", query="auth", max_tokens=50_000)

# Focus mode, no dependency inclusion
ctx = generate_context(
    ".",
    focus=["src/auth/"],
    include_deps=False,
    depth=1,
)

# All signatures, no tree/deps in output
ctx = generate_context(
    ".",
    compression="signatures",
    show_tree=False,
    show_deps=False,
)

# Use the result
print(ctx.content)              # full markdown output
print(ctx.token_count)          # actual token count
print(ctx.files_included)       # how many files made the cut
print(ctx.files_total)          # total files in project
print(ctx.files_compressed)     # how many got signature treatment
```

The `generate_context()` function accepts all the same options as the CLI. It returns a `ContextOutput` dataclass.

## All CLI options

```
codesurf [path] [options]

Positional:
  path                    Project root directory (default: current directory)

Query and focus:
  -q, --query TEXT        Relevance query -- keywords are matched against file
                          paths, function/class names, and docstrings
  --focus PATH            Focus on a specific file or directory. Files inside
                          get score 1.0, their dependencies get high scores,
                          everything else is excluded. Repeatable.

Budget and compression:
  --max-tokens N          Token budget for the entire output (default: 100000)
  --compression MODE      How to compress files that don't fit in full:
                            smart      - full content for top files, signatures
                                         for the rest (default)
                            signatures - compress everything to signatures
                            none       - full content only, skip what doesn't fit

Dependency control:
  --no-deps               Disable dependency-based ranking. Files are scored
                          only on direct keyword matches and importance.
  --depth N               How many hops in the dependency graph to traverse
                          from matched files (default: 2)

Output:
  -o, --output FILE       Write to file. Suppresses stdout.
  --stdout                Force stdout output even when using -o
  --copy                  Copy to clipboard (requires pyperclip)

Other:
  --version               Show version and exit
  --help                  Show help and exit
```

## Architecture

```
codesurf/
  __init__.py       Public API: generate_context()
  cli.py            argparse CLI, entry point
  scanner.py        FileScanner: tree walk, .gitignore, framework detection
  parser.py         PythonParser: AST-based import/export/docstring extraction
  ranker.py         RelevanceRanker: multi-signal scoring + BFS dependency proximity
  compressor.py     SmartCompressor: budget-aware full/signature/drop decisions
  formatter.py      OutputFormatter: markdown generation with hard budget enforcement
  tokens.py         Token counting (tiktoken with char/4 fallback)
  models.py         Dataclasses: FileInfo, ProjectInfo, ContextConfig, ContextOutput
```

Zero required dependencies. The optional dependencies are `tiktoken` for accurate token counting and `pyperclip` for `--copy`. Without `tiktoken`, codesurf estimates tokens as `len(text) // 4`, which is reasonable for code but not exact.
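
A sketch of the fallback pattern, assuming the `cl100k_base` encoding (which encoding codesurf actually uses is not stated here). `estimate_tokens` and `count_tokens` are hypothetical names, not the contents of `tokens.py`.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about four characters per token."""
    return len(text) // 4


def count_tokens(text: str) -> int:
    """Use tiktoken when it is importable and usable, else the estimate."""
    try:
        import tiktoken  # optional dependency

        # Assumed encoding; loading it can fail, e.g. without network access.
        encoding = tiktoken.get_encoding("cl100k_base")
    except Exception:
        return estimate_tokens(text)
    # disallowed_special=() treats special-token text as ordinary text.
    return len(encoding.encode(text, disallowed_special=()))
```

Importing inside the function keeps `tiktoken` truly optional: the module loads fine without it and only the counting path changes.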

## Design decisions

**Why not embeddings?** Embedding-based search would give better semantic matching (finding `stripe_handler.py` when you ask about "payments") but adds a heavy dependency (an embedding model or API calls). v0.1 uses keyword matching deliberately -- it's fast, deterministic, private, and works offline. Embedding support is planned as an opt-in enhancement.

**Why AST and not regex?** Python's `ast` module is the gold standard for understanding Python source code. It handles every edge case -- decorators, nested classes, conditional imports, multiline signatures, f-strings in defaults -- because it's the same parser CPython uses. Regex-based extraction is fragile and misses real-world code patterns.

**Why markdown and not XML?** LLMs understand markdown natively. XML wastes tokens on closing tags. Markdown code blocks with language tags give LLMs syntax awareness for free.

**Why reserve 35% for overhead?** The project tree, dependency graph, file headers, and markdown formatting can consume significant tokens, especially on large projects. A fixed 35% reserve means the compressor never over-allocates file content at the expense of structural context.

**Why skip symlinks?** Symlinks in codebases often point to vendored dependencies or build artifacts, or create circular references. Following them would produce duplicate content and potentially infinite loops. Real source files are almost always regular files.

## Limitations

This is v0.1. Known limitations:

- **Python AST only.** JavaScript and TypeScript files are included in output but without dependency analysis or smart compression. Their "compressed" form is head/tail truncation. Tree-sitter-based parsing for JS/TS is the top priority for v0.2.
- **Keyword matching only.** Ranking uses keyword matching against paths, symbols, and docstrings. If a relevant file doesn't contain any query keywords in its name, exports, or docstring, it can be missed. Embedding-based semantic search is planned.
- **No incremental caching.** Every invocation re-scans the entire project from disk. For large codebases (10k+ files) this can take a few seconds. A watch mode with filesystem event caching is planned.
- **Single-root only.** Monorepos with multiple independent packages are scanned as one flat tree. Workspace-aware scanning is planned.

## Contributing

Contributions welcome. The project has 87 tests and enforces ruff linting:

```
git clone https://github.com/Farfive/codesurf.git
cd codesurf
pip install -e ".[dev]"
pytest tests/ -v
ruff check codesurf/
```

The architecture is modular -- each stage of the pipeline is a single class in its own file. Adding a new language parser means implementing the same interface as `PythonParser` and registering it in `__init__.py`.

## License

MIT
