Metadata-Version: 2.4
Name: victor-codegraph
Version: 0.1.1
Summary: Code->CPG chunker: tree-sitter symbol + relation extraction, size-capped chunks, ProximaRecord projection. Shared by Victor, ProximaDB SDK, and AnvaiOps.
Author-email: Vijaykumar Singh <singhvjd@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/vjsingh1984/victor
Project-URL: Repository, https://github.com/vjsingh1984/victor
Keywords: code-graph,cpg,chunking,tree-sitter,ast,embeddings,rag
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: treesitter
Requires-Dist: tree-sitter>=0.23; extra == "treesitter"
Requires-Dist: tree-sitter-language-pack>=1.0; extra == "treesitter"
Provides-Extra: contracts
Requires-Dist: victor-contracts<1.0,>=0.7.0; extra == "contracts"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: ruff>=0.5; extra == "dev"
Requires-Dist: tree-sitter>=0.23; extra == "dev"
Requires-Dist: tree-sitter-language-pack>=1.0; extra == "dev"

# victor-codegraph

Shared **code → Code-Property-Graph chunker**: tree-sitter symbol + relation extraction,
size-capped embeddable chunks, and a `ProximaRecord` projection. One chunker, three
consumers — Victor (owner), the ProximaDB SDK (`[codegraph]` extra), and AnvaiOps (SaaS
code-graph vertical).

> Design: ProximaDB `ADR-029` (authoritative) · Victor `ADR-014` (owner/donor) ·
> AnvaiOps `ADR-0018` (consumer). This package is the **TD-CG1** scaffold.

## Why

The same tree-sitter code→symbol+relation chunker existed twice (ProximaDB SDK `code.py`
and Victor `victor-coding`) and was about to be written a third time in AnvaiOps. This
package is the single neutral home. It merges the best of both donors and fixes their two
gaps:

- **Size-capping** — ProximaDB's `code.py` emitted one chunk per symbol with *no* size
  bound (a huge function became a huge chunk). Here, oversized symbols are body-split with
  overlap (LlamaIndex `CodeSplitter` discipline). See `sizing.py`.
- **Real JS/TS** — the donor JS/TS parser was a stub returning no symbols. Here JS/TS get a
  real tree-sitter extractor (functions, classes, methods, `const … = () =>`, imports).

## Install

Not yet published to PyPI — use an **editable install** from the monorepo for now. Consumers
(Victor, the ProximaDB SDK, AnvaiOps) reference it editable until the first `victor-codegraph-v*`
release is cut.

```bash
# dev: editable, with tree-sitter grammars + test tooling
make -C victor-codegraph dev          # = pip install -e ../victor-contracts && pip install -e ".[dev]"

# minimal: Python-only (stdlib ast) path, zero native deps
pip install -e ./victor-codegraph

# once published:
#   pip install victor-codegraph                 # Python path
#   pip install "victor-codegraph[treesitter]"   # + multi-language grammars
```

### Releasing

CI: `.github/workflows/ci-codegraph.yml` runs the suite (editable install, grammars on) for every
PR touching `victor-codegraph/**`. Publishing: push a tag `victor-codegraph-v0.1.0` to trigger
`.github/workflows/release-codegraph.yml`, which builds and publishes via **PyPI Trusted Publishing**
(OIDC — no API token). Configure the publisher once on PyPI (owner `vjsingh1984`, repo `victor`,
workflow `release-codegraph.yml`, environments `pypi` / `testpypi`); see the header of that workflow.

## Use

```python
from victor_codegraph import chunk, parse, to_proxima_records, ChunkConfig

# Size-capped, embeddable chunks:
chunks = chunk(source, file_path="app/service.py", config=ChunkConfig(max_chunk_tokens=512))

# Symbols + relations:
parsed = parse(source, file_path="app/service.py")

# Project to the ProximaDB substrate-keystone record shape (one symbol = row+node+vector):
records = to_proxima_records(parsed, repo_graph_id="myrepo", branch_id="main",
                             embedder=my_embed_fn)  # embedder optional
```

## Design principles (the "best posture" this encodes)

1. Chunk at **symbol** granularity (not statement, not fixed-size).
2. **AST-aligned and size-capped** — never split mid-statement, never exceed the budget.
3. Extract **relations** (CALLS/EXTENDS/CONTAINS/…) and project to a CPG.
4. **Deterministic IDs + content hash** → idempotent incremental re-index.
5. **Graceful fallback chain**: python-ast → tree-sitter → sliding-window.
6. Token budget **matched to the embedding model** (BGE-small 384-d ≈ 512 tokens).

## Status

`0.1.0` — TD-CG1 scaffold. Python (stdlib `ast`) is the primary, fully-offline path.
Multi-language extraction is best-effort via tree-sitter; deeper per-language relation
extraction (the donor parsers' Rust/Go/Java specifics) lands incrementally.
