Metadata-Version: 2.4
Name: contextfit
Version: 0.1.0
Summary: Token-native knowledge base for LLM scale
Project-URL: Homepage, https://github.com/ContextFit/cf
Project-URL: Documentation, https://github.com/ContextFit/cf#readme
Project-URL: Repository, https://github.com/ContextFit/cf
Author-email: Christophe Ponsart <cponsart@gmail.com>
License-Expression: MIT
Keywords: graph,knowledge-base,llm,rag,tokens
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: datasketch>=1.6.0
Requires-Dist: networkx>=3.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pyroaring>=0.4.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: zstandard>=0.21.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: rust
Description-Content-Type: text/markdown

# ContextFit

**A token-native knowledge base designed for LLM scale.**

ContextFit keeps everything—storage, indexing, search, relationships, traversal, and commonality detection—in discrete token-ID space until the very last step, when the final retrieved token chunks are decoded (or handed straight to the LLM as `input_ids`).

## Why Token-Native?

- **~2× smaller storage** than raw text, with no repeated tokenization
- **Blazing-fast integer-only operations** (no float embeddings)
- **Hierarchical "geo-map-style" traversal** for multi-hop reasoning
- **Neural-network-like chunk relationships** via token overlap graphs
- **Automatic commonality discovery** without vector spaces
- **Direct LLM injection** — feed `input_ids` directly, no conversion

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                         ContextFit                               │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │   Storage   │  │   Index     │  │        Graph            │  │
│  │             │  │             │  │                         │  │
│  │ Token Arrays│  │ Inverted    │  │ Chunk Relationships     │  │
│  │ Chunk Store │  │ Suffix/FM   │  │ Community Detection     │  │
│  │ Compression │  │ BM25 Tokens │  │ Commonality Mining      │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
│                                                                  │
│  ┌─────────────────────────────┐  ┌─────────────────────────┐   │
│  │        Hierarchy            │  │       Retrieval         │   │
│  │                             │  │                         │   │
│  │ Level 0: Raw Chunks         │  │ Query Tokenization      │   │
│  │ Level 1+: Summary Clusters  │  │ Graph Traversal         │   │
│  │ Geo-Map Navigation          │  │ Direct input_ids Output │   │
│  └─────────────────────────────┘  └─────────────────────────┘   │
│                                                                  │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Semantic IDs (SIDs)                      ││
│  │                                                             ││
│  │  Hierarchical token sequences → generative retrieval        ││
│  │  Similar chunks share prefixes → trie-like navigation       ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
```

## Core Components

### 1. Storage Layer
- Token arrays (uint16/uint32 IDs)
- Memory-mapped files for large corpora
- Delta encoding + Zstd compression
- Chunk metadata headers

### 2. Index Layer
- **Inverted Index**: tokenID → [(chunkID, positions)] using Roaring bitmaps
- **Suffix Array / FM-Index**: Fast exact n-gram search
- **BM25 on Tokens**: TF-IDF scoring with token IDs as terms
- **Binary postings pack**: one compact `postings.bin` instead of JSON-per-token files

### 3. Graph Layer
- Nodes = chunks (or Semantic IDs)
- Edges = token n-gram overlap, Jaccard similarity, co-occurrence
- MinHash + LSH for fast similarity without floats
- Community detection for commonality discovery

### 4. Hierarchy Layer
- Level 0: Raw token chunks (256–1024 tokens each)
- Level 1+: Clustered summaries as token sequences
- GraphRAG-style community summaries
- Integer pointers for zoom navigation
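
One possible shape for the zoom navigation, assuming a simple list-per-level layout with integer child pointers (illustrative values, not the actual data model):

```python
hierarchy = {
    0: [  # Level 0: raw token chunks
        {"tokens": [464, 3797, 7731]},
        {"tokens": [464, 5679, 10837]},
        {"tokens": [9906, 1917, 2268]},
    ],
    1: [  # Level 1: summary clusters, themselves token sequences
        {"tokens": [12, 34, 56], "children": [0, 1]},
        {"tokens": [78, 90, 11], "children": [2]},
    ],
}


def zoom_in(level: int, node: int) -> list[dict]:
    """Follow integer pointers one level down (geo-map style)."""
    children = hierarchy[level][node].get("children", [])
    return [hierarchy[level - 1][c] for c in children]
```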

### 5. Retrieval Layer
- Tokenize query → search indexes → traverse graph → collect token IDs
- Feed directly as `input_ids` to any LLM
- No detokenization until final generation
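
The whole pipeline, reduced to a toy sketch with a hypothetical six-word vocabulary (the real system tokenizes with tiktoken and scores with BM25 plus graph traversal):

```python
VOCAB = {"what": 0, "is": 1, "contextfit": 2, "a": 3, "token": 4, "kb": 5}
CHUNKS = {  # chunk_id -> token IDs (never stored as text)
    0: [2, 1, 3, 4, 5],
    1: [0, 1, 3, 4],
}


def retrieve(query: str, top_k: int = 1) -> list[int]:
    q_ids = [VOCAB[w] for w in query.lower().split() if w in VOCAB]
    # Score chunks by token overlap (stand-in for BM25 + traversal)
    scored = sorted(CHUNKS, key=lambda c: -len(set(q_ids) & set(CHUNKS[c])))
    input_ids: list[int] = []
    for cid in scored[:top_k]:
        input_ids.extend(CHUNKS[cid])  # concatenate token IDs directly
    return input_ids  # feed to the LLM; no detokenization needed


print(retrieve("what is contextfit"))  # → [2, 1, 3, 4, 5]
```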

### 6. Semantic IDs
- Assign each chunk a short hierarchical SID token sequence
- Similar chunks share prefixes via MinHash-band residual buckets
- Resolve generated/predicted SID prefixes through a trie with prefix backoff
- Retrieval mode: `--method sid` or hybrid SID + BM25
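
A minimal sketch of prefix resolution with backoff, assuming SIDs are short integer tuples (illustrative values):

```python
from collections import defaultdict

SIDS = {  # chunk_id -> hierarchical SID; similar chunks share prefixes
    0: (7, 2, 1),
    1: (7, 2, 4),
    2: (3, 9, 0),
}

by_prefix: dict[tuple, set[int]] = defaultdict(set)
for cid, sid in SIDS.items():
    for i in range(1, len(sid) + 1):
        by_prefix[sid[:i]].add(cid)


def resolve(prefix: tuple) -> set[int]:
    """Resolve a (possibly over-specified) SID prefix with backoff:
    if nothing matches, drop the last token and retry."""
    while prefix:
        if prefix in by_prefix:
            return by_prefix[prefix]
        prefix = prefix[:-1]
    return set()


assert resolve((7, 2)) == {0, 1}     # shared prefix -> both chunks
assert resolve((7, 2, 9)) == {0, 1}  # no exact match -> backoff to (7, 2)
```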

### 7. SID Generator
- Predicts SID prefixes from query tokens without detokenizing
- Combines BM25 candidate chunks, MinHash similarity, and LSH neighbors
- Candidate chunks vote for hierarchical SID prefixes
- Returns generated SID predictions plus resolved chunk IDs
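
The voting step could be sketched as follows (`predict_prefixes` is a hypothetical name; the candidate list stands in for BM25/LSH output):

```python
from collections import Counter

SIDS = {0: (7, 2, 1), 1: (7, 2, 4), 2: (3, 9, 0)}


def predict_prefixes(candidates: list[int], depth: int = 2, top_n: int = 1):
    """Each candidate chunk votes for every SID prefix it lives under."""
    votes = Counter()
    for cid in candidates:
        for i in range(1, depth + 1):
            votes[SIDS[cid][:i]] += 1
    return [prefix for prefix, _ in votes.most_common(top_n)]


# Two candidates under (7, 2) outvote the lone candidate under (3, 9)
print(predict_prefixes([0, 1, 2], top_n=2))  # → [(7,), (7, 2)]
```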

### 8. Learned SID Generator
- Trains a sparse token→SID associative model from stored chunks
- Uses beam search over valid SID prefixes
- No neural dependency yet; still token-native and deterministic
- CLI: `contextfit ingest ./docs --train-sid-generator`

## Getting Started

```bash
# Install dependencies
pip install -e .

# Ingest a knowledge base
contextfit ingest ./documents --tokenizer tiktoken

# Query
contextfit query "What is ContextFit?"

# Query through Semantic IDs
contextfit query "async retrieval" --method sid

# Agent-friendly machine-readable output
contextfit query "What is ContextFit?" --method hybrid --json
contextfit stats --json

# Run a deterministic sample benchmark
python examples/benchmark_sample_corpus.py --docs-per-topic 100 --json

# Run needle-in-a-haystack benchmark
python examples/benchmark_needle_haystack.py --needles 20 --distractors 200 --top-k 5 --json

# Ingest and train the learned SID generator
contextfit ingest ./documents --train-sid-generator
```

For installing on a MacBook/OpenClaw node, see [`docs/MACBOOK_CLI_DEPLOY.md`](docs/MACBOOK_CLI_DEPLOY.md).

For OpenClaw integration, including the `contextfit_search` tool and `contextfit` context engine plugin, see [`docs/OPENCLAW_INTEGRATION.md`](docs/OPENCLAW_INTEGRATION.md).

`--json` is intended for OpenClaw/agent use. Query JSON includes `input_ids`, retrieved chunk metadata, SID predictions, semantic IDs, and decoded previews.

## Current Storage Layout

```text
contextfit_kb/
  chunks/
    chunks.bin        # zstd-compressed token-array records
    index.json        # chunk_id → byte offset/length
  inverted/
    meta.json         # corpus/index metadata
    postings.bin      # compact binary token → roaring bitmap + positions pack
  sid/
    semantic_ids.json
    learned_sid_generator.json
```

The inverted index is now saved as a single binary postings pack by default. Legacy JSON-per-token indexes still load for compatibility.

## Project Status

🚧 **Early Development** — Architecture phase

## References

- TERAG: Token-Efficient GraphRAG (3–11% token reduction)
- Semantic IDs / Generative Retrieval
- GraphRAG community detection
- Letta's token-space learning

## License

MIT
