Phase 10 — RAG: Persistent Document Knowledge Base

Goal: Let Gemma search a user-curated knowledge base of documents that persists across sessions. Complements Phase 7 (one-shot file questions) with durable, indexed storage — the user ingests documents once, and Gemma can retrieve relevant passages from them in any future session.

1. Architecture Overview

A dedicated ChromaDB collection (collective.documents) holds chunked, embedded document passages. A search_documents tool lets Gemma query it semantically — same bus pattern as search_memory.

User ingest (CLI or UI)
  → DocumentService.ingest_file(path)
      → extract text (PDF / plain text / markdown)
      → chunk into overlapping passages
      → embed each chunk with nomic-embed-text (search_document: prefix)
      → store in ChromaDB collection: collective.documents

Gemma tool call: search_documents(query="…")
  → DocumentService.search(query)
      → embed query with nomic-embed-text (search_query: prefix)
      → ChromaDB similarity search
      → return top-k passages with source filename + page
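
The two embedding steps in the flow above differ only in the task prefix that nomic-embed-text expects. A minimal sketch of that asymmetry follows; the helper name is hypothetical, and the real DocumentService may apply the prefix elsewhere.

```python
# Task prefixes used by nomic-embed-text. The helper name below is
# hypothetical and not part of the planned DocumentService API.
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "


def prefixed_for_embedding(text: str, *, is_query: bool) -> str:
    """Return the text with the task prefix for the embedding model."""
    prefix = QUERY_PREFIX if is_query else DOCUMENT_PREFIX
    return prefix + text.strip()
```

Chunks would be passed with is_query=False at ingest time and the user's query with is_query=True at search time, so both sides land in the same embedding space.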

Relationship to existing services

| Service | Collection | Content | Lifetime |
|---|---|---|---|
| MemoryService | local_memory | Q&A conversation engrams | Auto-written every turn |
| DocumentService | collective.documents | Chunked document passages | Written on ingest only |

Both share the same ChromaDB path (.chroma/) and the same embedding model (nomic-embed-text), but use separate collections so searches don't cross-contaminate.

2. New Files

| File | Purpose |
|---|---|
| src/local/utils/file_extract.py | Shared text extraction: PDF, plain text, markdown. Extracted from attachment_bar.py so both AttachmentBar and DocumentService can reuse it without a UI dependency. |
| src/local/services/document_service.py | ChromaDB wrapper for the document collection: ingest, chunk, embed, search. |
| src/local/tools/search_documents_tool.py | Bus tool: receives tool.request.search_documents, calls DocumentService, publishes result. |
| scripts/ingest.py | CLI: python scripts/ingest.py file.pdf [file2.txt …]. Runs without the full stack. |
| config/documents.yaml | chunk_size, chunk_overlap, collection name, n_results, embed_model. |

Modified files

| File | Change |
|---|---|
| src/local/protocol/subjects.py | Add TOOL_REQUEST_SEARCH_DOCUMENTS, TOOL_RESULT_SEARCH_DOCUMENTS, and TOOL_ACTIVITY_SEARCH_DOCUMENTS |
| src/local/ui/attachment_bar.py | Import extract_text from file_extract.py instead of duplicating logic |
| run_local.py | Create DocumentService; start SearchDocumentsTool daemon thread |
| src/local/ui/main_window.py | Add TOOL_ACTIVITY_SEARCH_DOCUMENTS to _TOOL_ACTIVITY_SUBJECTS |

3. Chunking Strategy

Fixed-size character chunks with overlap. Character-based (not token-based) to avoid a tokenizer dependency.

| Parameter | Default | Rationale |
|---|---|---|
| chunk_size | 1500 chars | Roughly 350-400 tokens for typical English prose (about 4 characters per token); fits comfortably in nomic-embed-text's 8192-token context |
| chunk_overlap | 200 chars | Keeps a sentence or idea that straddles a chunk boundary intact in at least one chunk |
| n_results | 5 | Top-5 passages returned to Gemma; configurable |
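
The fixed-size, overlapping strategy can be sketched as a sliding window over the character string. This is a minimal illustration under the defaults listed above; the function name chunk_text is an assumption, not the planned DocumentService internals.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap.

    Hypothetical helper: each chunk starts (chunk_size - chunk_overlap)
    characters after the previous one, so adjacent chunks share
    chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Requiring chunk_overlap to be strictly smaller than chunk_size guarantees the window always advances, so the loop terminates.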

Each chunk is stored with metadata:

{
  "source_file": "attention_is_all_you_need.pdf",
  "chunk_index": 3,
  "page": 2,            # PDF only; omitted for text files
  "ingested_at": 1717430000.0,
  "type": "document"
}

4. file_extract.py — Shared Extraction

Currently attachment_bar.py has private _extract_pdf_text() and _process_file() functions. These will be moved to src/local/utils/file_extract.py as public functions, and attachment_bar.py will import from there. This is the only UI file touched.

# src/local/utils/file_extract.py

TEXT_EXTS = {".txt", ".md", ".py", ".js", ".ts", ".yaml", ".json", ".csv"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def extract_text(path: str) -> str:
    """Return plain text from a PDF or text file. Raises on unsupported types."""
    ...

def extract_for_attachment(path: str) -> dict:
    """Return {type, name, data} dict for AttachmentBar — images as base64, text as str."""
    ...
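
One possible shape for extract_text, dispatching on file extension. This is a sketch, not the code currently in attachment_bar.py: the use of pypdf for PDFs and the UTF-8 decoding with replacement characters are both assumptions.

```python
from pathlib import Path

TEXT_EXTS = {".txt", ".md", ".py", ".js", ".ts", ".yaml", ".json", ".csv"}


def extract_text(path: str) -> str:
    """Return plain text from a PDF or text file. Raises ValueError otherwise."""
    p = Path(path)
    ext = p.suffix.lower()
    if ext in TEXT_EXTS:
        # Assumption: treat text files as UTF-8 and replace undecodable bytes.
        return p.read_text(encoding="utf-8", errors="replace")
    if ext == ".pdf":
        # Assumption: pypdf is the PDF backend; imported lazily so that
        # text-only callers do not need it installed.
        from pypdf import PdfReader
        reader = PdfReader(str(p))
        return "\n\n".join((page.extract_text() or "") for page in reader.pages)
    raise ValueError(f"Unsupported file type: {ext or '(no extension)'}")
```

Keeping the PDF import inside the branch means the CLI and the unit tests can exercise the text path without a PDF library present.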

5. DocumentService API

# src/local/services/document_service.py

class DocumentService:
    def __init__(self, chroma_path=None, collection_name=None, embed_model=None): ...

    def ingest_file(self, path: str) -> int:
        """Chunk, embed, and store file. Returns number of chunks written."""
        ...

    def ingest_text(self, text: str, source_name: str) -> int:
        """Ingest already-extracted text (for programmatic use). Returns chunk count."""
        ...

    def search(self, query: str, n: int | None = None) -> list[dict]:
        """Return top-n chunks by similarity: [{content, source_file, chunk_index, page, score}]"""
        ...

    def list_sources(self) -> list[str]:
        """Return unique source filenames in the collection."""
        ...

    def delete_source(self, source_file: str) -> int:
        """Delete all chunks for a given source file. Returns count deleted."""
        ...
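
Deduplication: Chunk IDs are deterministic, sha256(source_file + chunk_index), so re-ingesting the same file upserts existing chunks instead of accumulating duplicates. If the new version of a file produces fewer chunks than before, the old trailing chunks remain until delete_source() is called.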
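
A minimal sketch of the deterministic ID. The ":" separator is an assumption added here: plain concatenation would make ("a1", 0) and ("a", 10) hash the same string, while a separator keeps the two fields distinct.

```python
import hashlib


def chunk_id(source_file: str, chunk_index: int) -> str:
    """Return a stable hex ID for a (source_file, chunk_index) pair."""
    # Assumption: ":" separates the fields. chunk_index is all digits,
    # so the last ":" always marks where the index begins.
    key = f"{source_file}:{chunk_index}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on the filename and position, passing the same IDs to an upsert overwrites the stored chunk rather than adding a second copy.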

6. search_documents Tool Schema

{
  "type": "function",
  "function": {
    "name": "search_documents",
    "description": "Searches the user's personal document knowledge base for relevant passages. ...",
    "parameters": {
      "type": "object",
      "properties": {
        "query": {
          "type": "string",
          "description": "What to look for in the knowledge base."
        }
      },
      "required": ["query"]
    }
  }
}
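
The schema above can be produced from two strings, which keeps the wording editable outside the code. A hedged sketch follows: the builder name is hypothetical, and collapsing whitespace is an assumption so that multi-line text from a config file becomes a single line.

```python
def build_search_documents_schema(description: str, param_query: str) -> dict:
    """Assemble the search_documents tool schema (hypothetical helper)."""

    def one_line(s: str) -> str:
        # Collapse newlines and repeated spaces into single spaces.
        return " ".join(s.split())

    return {
        "type": "function",
        "function": {
            "name": "search_documents",
            "description": one_line(description),
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": one_line(param_query),
                    }
                },
                "required": ["query"],
            },
        },
    }
```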

Trigger conditions live in the tool description (same principle as Phase 8/9). The description will say:

"Searches the user's personal document knowledge base for relevant passages. Call this tool when the user asks about content from documents they have added to their knowledge base, or when a question is likely answered by their stored documents rather than general knowledge or the web."

7. Tool Result Format

[Knowledge base results for "transformer positional encoding"]

1. attention_is_all_you_need.pdf (p.4, chunk 8)
   Since our model contains no recurrence and no convolution, in order for the model
   to make use of the order of the sequence, we must inject some information about
   the relative or absolute position of the tokens in the sequence…

2. transformer_survey.pdf (p.12, chunk 31)
   Positional encodings can be fixed (sinusoidal, as in the original Transformer)
   or learned. Both approaches yield similar results in practice…
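
A sketch of a formatter that renders search results in the layout shown above. The function name and the wording of the empty-result message are assumptions; the result dicts use the keys listed in the DocumentService.search docstring.

```python
def format_results(query: str, results: list[dict]) -> str:
    """Render search results as the plain-text block returned to the model."""
    if not results:
        # Hypothetical wording for the empty-knowledge-base case.
        return f'No passages matching "{query}" were found in the knowledge base.'

    lines = [f'[Knowledge base results for "{query}"]']
    for rank, r in enumerate(results, start=1):
        location = []
        if r.get("page") is not None:  # page exists for PDFs only
            location.append(f"p.{r['page']}")
        location.append(f"chunk {r['chunk_index']}")
        lines.append("")
        lines.append(f"{rank}. {r['source_file']} ({', '.join(location)})")
        # Indent the passage under its heading line.
        for line in r["content"].splitlines():
            lines.append(f"   {line}")
    return "\n".join(lines)
```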

8. CLI Ingest Script

# scripts/ingest.py
# Usage: python scripts/ingest.py file.pdf [file2.txt ...]
# Runs without the full stack (no ZMQ, no Ollama chat endpoint).

import sys
from pathlib import Path
sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

from local.services.document_service import DocumentService

def main():
    paths = sys.argv[1:]
    if not paths:
        print("Usage: python scripts/ingest.py file.pdf [file2.txt ...]")
        sys.exit(1)
    svc = DocumentService()
    for path in paths:
        n = svc.ingest_file(path)
        print(f"  {path}: {n} chunks ingested")

if __name__ == "__main__":
    main()

9. run_local.py Changes

# Create DocumentService alongside MemoryService (both are ChromaDB; init in main thread)
from local.services.document_service import DocumentService
shared_documents = DocumentService()

# Start tool (after existing tools, before generator)
threading.Thread(
    target=_start_search_documents,
    args=(shared_documents,),
    daemon=True, name="search_documents"
).start()

10. config/documents.yaml

collection: "collective.documents"
chroma_path: ".chroma"
embed_model: "nomic-embed-text"
chunk_size: 1500
chunk_overlap: 200
n_results: 5

description: |
  Searches the user's personal document knowledge base for relevant passages.
  Call this tool when the user asks about content from documents they have added,
  or when a question is likely answered by stored documents rather than general
  knowledge or the web. Do not call this for general research questions — use
  search_papers or web_search for those.

param_query: |
  What to look for in the knowledge base. Be specific — use terms likely to
  appear in the source documents.
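
A hedged sketch of how the numeric settings might be validated once the YAML has been parsed into a dict (for example by yaml.safe_load). The dataclass and function names are assumptions; the description and param_query keys are ignored here because they belong to the tool schema rather than to chunking.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class DocumentsConfig:
    """Hypothetical typed view of config/documents.yaml."""
    collection: str = "collective.documents"
    chroma_path: str = ".chroma"
    embed_model: str = "nomic-embed-text"
    chunk_size: int = 1500
    chunk_overlap: int = 200
    n_results: int = 5

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.n_results <= 0:
            raise ValueError("n_results must be positive")


def config_from_mapping(raw: dict) -> DocumentsConfig:
    """Build a config from a parsed mapping, ignoring unrelated keys."""
    known = {f.name for f in fields(DocumentsConfig)}
    return DocumentsConfig(**{k: v for k, v in raw.items() if k in known})
```

Rejecting chunk_overlap >= chunk_size at load time surfaces a misconfiguration before any document is ingested.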

11. Testing Plan

Unit tests — tests/test_document_service.py

| Test | What it checks |
|---|---|
| test_chunk_splits_text | Text longer than chunk_size produces multiple chunks |
| test_chunk_overlap | Adjacent chunks share chunk_overlap chars |
| test_chunk_short_text | Text shorter than chunk_size produces exactly 1 chunk |
| test_ingest_text_returns_chunk_count | ingest_text() returns correct count |
| test_search_returns_results | After ingesting, search() finds relevant passage |
| test_deterministic_ids_no_duplicates | Re-ingesting same text doesn't double chunk count |
| test_list_sources | list_sources() returns ingested filenames |
| test_delete_source | delete_source() removes all chunks for a file |
| test_metadata_stored | source_file, chunk_index present in search result metadata |

Unit tests — tests/test_search_documents_tool.py

| Test | What it checks |
|---|---|
| test_announce_schema | Publishes tool.schema with name=search_documents |
| test_handle_request_publishes_result_and_activity | Both subjects published on tool call |
| test_empty_kb_returns_informative_message | Graceful result when collection is empty |

Story — tests/stories/s12_rag.yaml

12. Build Order

  1. Create src/local/utils/file_extract.py — move extraction logic from attachment_bar
  2. Update attachment_bar.py to import from file_extract
  3. Write src/local/services/document_service.py
  4. Add subjects to protocol/subjects.py
  5. Write src/local/tools/search_documents_tool.py
  6. Wire run_local.py
  7. Add TOOL_ACTIVITY_SEARCH_DOCUMENTS to main_window.py
  8. Write config/documents.yaml
  9. Write scripts/ingest.py
  10. Write unit tests and story
  11. Run full test suite

ChromaDB init order: DocumentService must be created in the main thread alongside MemoryService, because ChromaDB's PersistentClient is not thread-safe under concurrent construction on the same path. Both services share .chroma/ but use separate collections, so a single PersistentClient instance could be shared. For simplicity, DocumentService creates its own client (same pattern as MemoryService).

No UI changes beyond main_window.py: SearchDocumentsWindow is out of scope for Phase 10. The ToolWindow reactive spawn (existing infrastructure) handles the activity log automatically. A future phase can add a DocumentsWindow with list_sources / delete_source / drag-drop ingest.