Metadata-Version: 2.4
Name: robinctx
Version: 0.1.0
Summary: Distill any code repo into a compact, secret-redacted LLM context pack — and fit it to a token budget.
Project-URL: Homepage, https://github.com/kp-dubbs/robinctx
Project-URL: Issues, https://github.com/kp-dubbs/robinctx/issues
Project-URL: Changelog, https://github.com/kp-dubbs/robinctx/blob/main/CHANGELOG.md
Author-email: Richard Dorr <richardpdorr@gmail.com>
License: MIT
License-File: LICENSE
Keywords: agents,claude,codebase,context,documentation,llm,prompt,repository,summarization
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Documentation
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: mypy>=1.8; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: mcp
Requires-Dist: mcp>=1.0; extra == 'mcp'
Provides-Extra: tokens
Requires-Dist: tiktoken>=0.5; extra == 'tokens'
Description-Content-Type: text/markdown

# robinctx 🐦

[![CI](https://github.com/kp-dubbs/robinctx/actions/workflows/ci.yml/badge.svg)](https://github.com/kp-dubbs/robinctx/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/robinctx)](https://pypi.org/project/robinctx/)
[![Python](https://img.shields.io/pypi/pyversions/robinctx)](https://pypi.org/project/robinctx/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

**Distill any code repo into a compact, secret-redacted LLM context pack — then fit it to a token budget.**

Every LLM works better when it knows your repo's purpose, stack, conventions, and API surface.
robinctx extracts exactly that — heuristically, locally, with **zero dependencies** — and packages
it as a single Markdown document (plus a machine-readable JSON sidecar) sized for a prompt.
The sidekick that preps the briefing so the hero can fight crime.

## 30-second quickstart

```bash
# no install needed:
uvx robinctx distill .
# or
pipx run robinctx distill .

# then build a task-focused prompt from the pack:
uvx robinctx pack myrepo_context.json --task "add rate limiting to the API" --budget 8000
# or do both at once:
uvx robinctx distill . --pack --task "add rate limiting to the API"
```

Runs anywhere Python 3.9+ runs — including locked-down environments — because the core uses
**only the standard library**. Git and [gitleaks](https://github.com/gitleaks/gitleaks) are used
when present, never required.

## The disclaimer philosophy

Every artifact robinctx generates starts with this block, on purpose:

> ⚠️ **AUTO-GENERATED — USE AT YOUR OWN RISK**
> This context pack was produced by robinctx by heuristic analysis.
> It may contain errors, omissions, or — despite redaction — sensitive data.
> Review before sharing outside your trust boundary, and verify any claims
> (especially refactor suggestions) against the actual source.

Heuristics are honest about being heuristics. The disclaimer carries provenance (tool version,
timestamp, repo commit SHA, scan settings) and redaction counts, so anyone downstream — human or
LLM — knows exactly what they're holding and how much to trust it.

## What gets captured

| Section | How | Notes |
| --- | --- | --- |
| Overview & docs | README / ARCHITECTURE / CLAUDE.md excerpts | redacted, capped |
| Tech stack | manifests (package.json, pyproject, go.mod, Cargo.toml, Gemfile, …) | framework inference from deps |
| Conventions | statistical style inference (indentation, quotes, naming, semicolons, type-hint ratio) | from up to 60 source samples |
| Layout | rendered directory tree + entry points | depth/size capped |
| API surface | **Python: ast** (precise, incl. async/decorators/methods) • JS/TS/Go/Rust/Ruby/Java: regex (best effort) | public symbols only |
| Git intel | branch, recent commits, churn hotspots, contributor count | optional, if git present |
| TODO/FIXME markers | comment scan | redacted, capped at 300 |
| Refactor signals | large files, long functions, churn×size hotspots, TODO clusters, missing tests | *heuristic — verify before acting* |
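
The Python row above relies on the standard-library `ast` module. The sketch below shows what that kind of extraction can look like. It is an illustration under stated assumptions, not robinctx's actual code: `public_symbols` and `_describe` are hypothetical helpers, and "public" is assumed to mean "name does not start with an underscore".

```python
import ast

DEFS = (ast.FunctionDef, ast.AsyncFunctionDef)


def _describe(node, prefix=""):
    """Format a function or class node as a short signature-like string."""
    if isinstance(node, ast.ClassDef):
        return f"class {node.name}"
    kind = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    args = ", ".join(a.arg for a in node.args.args)
    decorators = "".join(f"@{ast.unparse(d)} " for d in node.decorator_list)
    return f"{decorators}{kind} {prefix}{node.name}({args})"


def public_symbols(source):
    """List public top-level functions and classes, plus public methods.

    Assumption: a leading underscore marks a symbol as private.
    """
    found = []
    for node in ast.parse(source).body:
        if not isinstance(node, DEFS + (ast.ClassDef,)):
            continue
        if node.name.startswith("_"):
            continue
        found.append(_describe(node))
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, DEFS) and not item.name.startswith("_"):
                    found.append(_describe(item, prefix=f"{node.name}."))
    return found


SAMPLE = '''
class Client:
    def __init__(self, url): ...
    @staticmethod
    def ping(): ...
    async def fetch(self, path): ...

def _helper(): ...

async def main(argv): ...
'''

symbols = public_symbols(SAMPLE)
for line in symbols:
    print(line)
```

Because the parser works on the syntax tree rather than on text, async functions, decorators, and methods come out correctly without any pattern tuning, which is the precision gap between the Python row and the regex-based languages.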

## Security model

The output of this tool is destined to be pasted into LLM prompts and shared. Three independent
layers stand between your secrets and that output:

1. **File exclusion** — credential-like files (`.env*`, `*.pem`, `secrets.*`, `id_rsa`,
   `.ssh/`, `.aws/`, …) are never read **and never listed** (names alone can leak).
   Inside a git repo, files are enumerated via `git ls-files --exclude-standard`, so anything
   `.gitignore`'d — where local secrets usually live — is never touched. A
   [`.robinctxignore`](#robinctxignore) file adds your own exclusions.
2. **Inline redaction** — every embedded excerpt is scrubbed for known token formats (AWS,
   GitHub, Slack, Google, OpenAI/Anthropic-style, JWTs, private-key blocks, credentialed URLs),
   secret-keyed assignments (`password = …`), and **high-entropy values** (>4.5 bits/char, ≥20
   chars, assigned to a variable — hex digests and `shaNNN-` SRI hashes are exempt).
3. **Output scanning** — after generation, the artifacts themselves are scanned with
   **gitleaks** (or trufflehog) if installed, falling back to the built-in detectors with a
   notice. Findings print as `file:line [rule]`, the output is **quarantined**
   (renamed `*.quarantined`), and the run exits **3** — CI-friendly.
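
The high-entropy rule in layer 2 can be sketched as follows. The thresholds (more than 4.5 bits/char, at least 20 characters, assigned to a variable, hex digests exempt) come from the list above. Everything else is an assumption made for illustration: the `ASSIGNMENT` regex, the function names, and the `[REDACTED]` placeholder are hypothetical and may differ from what robinctx does internally.

```python
import math
import re
from collections import Counter

MIN_LENGTH = 20           # from the README: >= 20 chars
MIN_BITS_PER_CHAR = 4.5   # from the README: > 4.5 bits/char

# Hypothetical pattern for `name = "value"` or `name: 'value'`.
ASSIGNMENT = re.compile(
    r"""(?P<head>\b\w+\s*[:=]\s*)(?P<quote>["'])(?P<value>[^"']+)(?P=quote)"""
)
HEX_DIGEST = re.compile(r"^[0-9a-fA-F]+$")


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def looks_secret(value):
    """Apply the length gate, the hex-digest exemption, then the entropy gate."""
    if len(value) < MIN_LENGTH or HEX_DIGEST.match(value):
        return False
    return shannon_entropy(value) > MIN_BITS_PER_CHAR


def redact_line(line):
    """Replace high-entropy assigned values with a placeholder."""
    def replace(match):
        if not looks_secret(match.group("value")):
            return match.group(0)
        quote = match.group("quote")
        return f"{match.group('head')}{quote}[REDACTED]{quote}"

    return ASSIGNMENT.sub(replace, line)


print(redact_line('api_key = "aB3xK9mQ2wZ7pL5vR8tY1cN6dF4g"'))
print(redact_line('greeting = "hello world"'))
```

The first line is rewritten because its 28 distinct characters give about 4.8 bits/char; the second is left alone because it fails the length gate.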

Flags: `--no-secret-scan` opts out entirely; `--strict` also fails on built-in-scanner findings
(recommended in CI). Exit codes: `0` ok • `1` error • `2` usage • `3` leaks found.
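
A CI step can branch on those exit codes. The wrapper below is a minimal sketch: `run_distill` and `describe_exit` are hypothetical helper names, and the command line uses only flags documented in this README.

```python
import subprocess

# Exit codes as documented above.
EXIT_MEANINGS = {0: "ok", 1: "error", 2: "usage", 3: "leaks found"}


def describe_exit(code):
    """Translate a robinctx exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_distill(repo="."):
    """Run `robinctx distill <repo> --strict` and return its exit code."""
    completed = subprocess.run(["robinctx", "distill", repo, "--strict"])
    return completed.returncode
```

A pipeline script could call `run_distill()`, log `describe_exit(code)`, and pass the code through with `sys.exit(code)` so that a `3` fails the job.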

Found a leak that survived all three layers? That's a vulnerability — see [SECURITY.md](SECURITY.md).

## Library API

`pip install robinctx` and build on the same engine (fully typed, `py.typed` shipped):

```python
from robinctx import distill, pack, to_markdown

context = distill("path/to/repo")        # dict — the JSON-sidecar structure
print(context["style"], context["frameworks"])

markdown = to_markdown(context)          # the .md artifact, disclaimer included

result = pack(context, task="refactor the auth module", budget=8000)
print(result.prompt)                     # budget-fitted prompt
print(result.sections, result.est_tokens)
```

`distill()` only reads from the filesystem and writes nothing; the CLI owns file output
and scanning. The sidecar dict carries `schema_version` with a
[documented compatibility contract](docs/schema.md).

## CLI reference

```text
robinctx distill <repo> [-o NAME] [--max-file-kb N]
                         [--format md|json|both|claude-md|agents-md|cursorrules]
                         [--no-secret-scan] [--strict]
                         [--pack --task "..." [--mode M] [--budget N] [--sections ...]]
robinctx pack <context.json> [--task "..."] [--mode task|onboard|refactor]
                              [--budget N] [--sections overview,style,api,...]
                              [--since REF] [-o FILE]
robinctx update <context.json> [--strict] [--no-secret-scan]
robinctx serve  <context.json>            # requires robinctx[mcp]
```

Pack modes prioritize differently when trimming to budget: `task` leads with conventions and
relevant APIs, `onboard` with overview and layout, `refactor` with signals and git hotspots.
With a `--task`, API entries / TODOs / refactor signals are relevance-ranked so the most useful
detail survives trimming. `--since <ref>` prepends a redacted "Recent Changes" section (git
log + diff stat) — useful for LLMs working on actively evolving repos.
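
One way to picture mode-dependent trimming is a greedy pass over sections in priority order. This is a sketch of the general idea, not robinctx's algorithm. The priority lists are guesses based on the description above, the section names `layout`, `signals`, and `git` are assumed (only `overview`, `style`, and `api` appear in the CLI reference), and the token estimate uses the len/4 rule mentioned under Limitations.

```python
# Assumed priority orders, inferred from the prose description of each mode.
MODE_PRIORITY = {
    "task": ["style", "api", "overview", "layout"],
    "onboard": ["overview", "layout", "style", "api"],
    "refactor": ["signals", "git", "api", "overview"],
}


def estimate_tokens(text):
    """Rough token estimate: one token per four characters."""
    return len(text) // 4


def fit_to_budget(sections, mode, budget):
    """Keep sections in priority order, skipping any that would exceed budget.

    Returns (kept section names, estimated tokens used).
    """
    kept, used = [], 0
    for name in MODE_PRIORITY[mode]:
        body = sections.get(name)
        if body is None:
            continue
        cost = estimate_tokens(body)
        if used + cost > budget:
            continue
        kept.append(name)
        used += cost
    return kept, used


sections = {
    "overview": "o" * 400,  # ~100 tokens
    "style": "s" * 200,     # ~50 tokens
    "api": "a" * 800,       # ~200 tokens
    "layout": "l" * 400,    # ~100 tokens
}
print(fit_to_budget(sections, "task", 260))
print(fit_to_budget(sections, "onboard", 260))
```

With the same 260-token budget, `task` keeps `style` and `api` while `onboard` keeps `overview`, `layout`, and `style`, which illustrates how the mode changes what survives.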

**Agent files:** `--format claude-md` emits a ready-to-commit `CLAUDE.md` (likewise
`agents-md` → `AGENTS.md`, `cursorrules` → `.cursorrules`) — a condensed, imperative version of
the pack for coding agents that re-read it on every task. The secret-scan gate applies to these
too.

**Staying fresh:** `robinctx update ctx.json` is a no-op when the repo hasn't changed since the
recorded commit SHA, and re-distills when it has — cheap enough for a pre-commit hook or CI step.
See [docs/recipes.md](docs/recipes.md) for ready-made GitHub Action and pre-commit configs.
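
The freshness check can be approximated by comparing the recorded commit SHA with the repository's current `HEAD`. The sketch below assumes only that `git rev-parse HEAD` is available; `current_commit` and `needs_redistill` are hypothetical names, and since this README does not say where the sidecar stores the SHA, the recorded value is passed in as a plain argument.

```python
import subprocess


def current_commit(repo="."):
    """Return HEAD's commit SHA, or None if git or the repo is unavailable."""
    try:
        done = subprocess.run(
            ["git", "-C", repo, "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return done.stdout.strip()


def needs_redistill(recorded_sha, head_sha):
    """Re-distill unless both SHAs are known and identical."""
    if not recorded_sha or not head_sha:
        return True
    return recorded_sha != head_sha
```

Treating an unknown SHA as "stale" is a deliberate choice in this sketch: when in doubt, regenerating is safer than serving an outdated pack.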

### .robinctxignore

Drop a `.robinctxignore` (or `.repoctxignore`) file at the repo root to exclude more files.
Patterns use a gitignore-flavored subset: `fnmatch` wildcards, `dir/` for directories, and a
leading `/` to anchor to the root. `!` negation and git-style `**` are **not** supported, and
`*` matches across `/`.
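
The pattern rules can be illustrated with Python's `fnmatch` module, whose `*` already matches across `/`. This is a sketch, not the tool's matcher: `matches` is a hypothetical function, and it assumes that unanchored file patterns are tried against both the full relative path and the basename, and that unanchored `dir/` patterns match a directory at any depth.

```python
from fnmatch import fnmatchcase


def matches(pattern, rel_path):
    """Test one ignore pattern against a '/'-separated path relative to the root."""
    anchored = pattern.startswith("/")
    body = pattern.strip("/")
    parts = rel_path.split("/")

    if pattern.endswith("/"):
        dirs = parts[:-1]
        if anchored:
            # Anchored directory: compare against prefixes starting at the root.
            prefixes = ["/".join(dirs[:i]) for i in range(1, len(dirs) + 1)]
            return any(fnmatchcase(p, body) for p in prefixes)
        # Assumption: an unanchored directory matches at any depth.
        return any(fnmatchcase(d, body) for d in dirs)

    if anchored:
        return fnmatchcase(rel_path, body)
    # Assumption: unanchored patterns may match the full path or the basename.
    return fnmatchcase(rel_path, body) or fnmatchcase(parts[-1], body)


print(matches("*.log", "logs/app.log"))           # * crosses '/'
print(matches("/setup.cfg", "sub/setup.cfg"))     # anchored to root only
print(matches("build/", "src/build/out.o"))       # directory at any depth
print(matches("docs/*.md", "docs/api/intro.md"))  # * crosses '/'
```

Note the last case: under git's own rules `docs/*.md` would not reach into `docs/api/`, but with `fnmatch` semantics it does, so patterns written for `.gitignore` can exclude more here than they would in git.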

## Limitations (read this)

- **Non-Python extraction is regex-based.** It catches conventional declarations and misses
  clever ones; interfaces may be labeled `class`. Python uses `ast` and is precise.
- **Refactor signals are heuristics** — line counts, churn, TODO density. They're prompts for
  investigation, not findings. The output says so.
- **Redaction is pattern-based.** A password that looks like an English word in prose will not
  be caught. The entropy detector can't see secrets shorter than 23 characters: the Shannon
  entropy of a string is bounded by log2 of its length, and log2(22) ≈ 4.46 falls below the
  4.5 bits/char threshold. It may also, on rare occasions, flag random-looking identifiers.
  Run with gitleaks installed; review output before sharing.
- **Token counts are estimates** (len/4) unless you install `robinctx[tokens]`.
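
The length floor for the entropy detector follows directly from the 4.5 bits/char threshold quoted in the security model: a string of length n has at most n distinct characters, so its per-character entropy cannot exceed log2(n). A short check of that arithmetic (`min_detectable_length` is a hypothetical helper name):

```python
import math

THRESHOLD_BITS_PER_CHAR = 4.5  # from the security model section


def min_detectable_length(threshold):
    """Smallest length n whose maximum possible entropy, log2(n), exceeds threshold."""
    n = 1
    while math.log2(n) <= threshold:
        n += 1
    return n


floor = min_detectable_length(THRESHOLD_BITS_PER_CHAR)
print(floor)
print(round(math.log2(floor - 1), 3), round(math.log2(floor), 3))
```

So a 20- to 22-character value passes the length gate but can never clear the entropy gate, even when every character is unique.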

## Extras

| Install | Adds |
| --- | --- |
| `pip install robinctx` | everything above, stdlib-only |
| `pip install robinctx[tokens]` | exact token counts via tiktoken |
| `pip install robinctx[mcp]` | `robinctx serve` — MCP server exposing the pack as queryable tools |

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). Security-relevant changes require tests, no exceptions.

## License

[MIT](LICENSE)
