A dedicated ChromaDB collection (collective.documents) holds chunked, embedded document passages.
A search_documents tool lets Gemma query it semantically — same bus pattern as search_memory.

    User ingest (CLI or UI)
      → DocumentService.ingest_file(path)
          → extract text (PDF / plain text / markdown)
          → chunk into overlapping passages
          → embed each chunk with nomic-embed-text
          → store in ChromaDB collection: collective.documents

    Gemma tool call: search_documents(query="…")
      → DocumentService.search(query)
          → embed query with nomic-embed-text (search_query: prefix)
          → ChromaDB similarity search
          → return top-k passages with source filename + page

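
The search flow above embeds the query with a `search_query:` prefix. nomic-embed-text is trained with task prefixes, and its documentation pairs `search_query:` for queries with `search_document:` for stored passages. The following is a minimal sketch of that pairing; the helper names are hypothetical, and whether the ingest path applies the document prefix is an assumption rather than something this plan states.

```python
# Hypothetical helpers: names are illustrative and not part of the design above.
DOCUMENT_PREFIX = "search_document: "  # assumed for the ingest path
QUERY_PREFIX = "search_query: "        # named in the search flow


def prepare_chunk(text: str) -> str:
    """Return chunk text in the form it would be sent to the embedding model."""
    return DOCUMENT_PREFIX + text.strip()


def prepare_query(query: str) -> str:
    """Return query text in the form it would be sent to the embedding model."""
    return QUERY_PREFIX + query.strip()
```

Only the prefixed string is embedded; the stored passage text would presumably remain unprefixed so that results read cleanly.
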
| Service | Collection | Content | Lifetime |
|---|---|---|---|
| MemoryService | local_memory | Q&A conversation engrams | Auto-written every turn |
| DocumentService | collective.documents | Chunked document passages | Written on ingest only |
Both share the same ChromaDB path (.chroma/) and the same embedding model (nomic-embed-text),
but use separate collections so searches don't cross-contaminate.
| File | Purpose |
|---|---|
| src/local/utils/file_extract.py | Shared text extraction for PDF, plain text, and markdown. Extracted from attachment_bar.py so both AttachmentBar and DocumentService can reuse it without a UI dependency. |
| src/local/services/document_service.py | ChromaDB wrapper for the document collection: ingest, chunk, embed, search. |
| src/local/tools/search_documents_tool.py | Bus tool: receives tool.request.search_documents, calls DocumentService, publishes result. |
| scripts/ingest.py | CLI: python scripts/ingest.py file.pdf [file2.txt …]. Runs without the full stack. |
| config/documents.yaml | chunk_size, chunk_overlap, collection name, n_results, embed_model. |
| File | Change |
|---|---|
| src/local/protocol/subjects.py | Add TOOL_REQUEST/RESULT/ACTIVITY_SEARCH_DOCUMENTS |
| src/local/ui/attachment_bar.py | Import extract_text from file_extract.py instead of duplicating logic |
| run_local.py | Create DocumentService; start SearchDocumentsTool daemon thread |
| src/local/ui/main_window.py | Add TOOL_ACTIVITY_SEARCH_DOCUMENTS to _TOOL_ACTIVITY_SUBJECTS |
Fixed-size character chunks with overlap. Character-based (not token-based) to avoid a tokenizer dependency.
| Parameter | Default | Rationale |
|---|---|---|
| chunk_size | 1500 chars | Roughly 375 to 500 tokens for typical English prose (about 3 to 4 characters per token); fits comfortably in nomic-embed-text's 8192-token context |
| chunk_overlap | 200 chars | Preserves context when a chunk boundary falls mid-sentence or mid-idea |
| n_results | 5 | Top-5 passages returned to Gemma; configurable |
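
A minimal sketch of the fixed-size, character-based chunking described above, using the table's defaults. The function name `chunk_text` and its validation rules are assumptions for illustration; the plan does not specify how DocumentService implements chunking internally.

```python
def chunk_text(text: str, chunk_size: int = 1500, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

With the defaults, a 3000-character text yields windows starting at 0, 1300, and 2600, and each adjacent pair shares 200 characters.
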
Each chunk is stored with metadata:

    {
        "source_file": "attention_is_all_you_need.pdf",
        "chunk_index": 3,
        "page": 2,                    # PDF only; omitted for text files
        "ingested_at": 1717430000.0,
        "type": "document"
    }

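
As a sketch of how the metadata above might be assembled, the helper below omits the `page` key for non-PDF sources instead of storing a null. The function name `build_chunk_metadata` is hypothetical; the field names come from the example above.

```python
import time
from typing import Optional


def build_chunk_metadata(
    source_file: str,
    chunk_index: int,
    page: Optional[int] = None,
    ingested_at: Optional[float] = None,
) -> dict:
    """Build the per-chunk metadata dict; 'page' is present only when known."""
    metadata = {
        "source_file": source_file,
        "chunk_index": chunk_index,
        "ingested_at": time.time() if ingested_at is None else ingested_at,
        "type": "document",
    }
    if page is not None:
        metadata["page"] = page  # PDF only; text files carry no page number
    return metadata
```

Leaving the key out keeps every stored value a plain scalar, which is the simplest shape for a vector store's metadata filters.
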
Currently attachment_bar.py has private _extract_pdf_text() and _process_file() functions.
These will be moved to src/local/utils/file_extract.py as public functions, and attachment_bar.py
will import from there. This is the only UI file touched.

    # src/local/utils/file_extract.py
    TEXT_EXTS = {".txt", ".md", ".py", ".js", ".ts", ".yaml", ".json", ".csv"}
    IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

    def extract_text(path: str) -> str:
        """Return plain text from a PDF or text file. Raises on unsupported types."""
        ...

    def extract_for_attachment(path: str) -> dict:
        """Return {type, name, data} dict for AttachmentBar: images as base64, text as str."""
        ...

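
One possible body for `extract_text`, shown only as a sketch. The plan does not say which PDF library the existing attachment_bar.py code uses, so the `pypdf` import here is an assumption, as are the choice of `ValueError` for unsupported types and the lenient UTF-8 decoding.

```python
from pathlib import Path

TEXT_EXTS = {".txt", ".md", ".py", ".js", ".ts", ".yaml", ".json", ".csv"}


def extract_text(path: str) -> str:
    """Return plain text from a PDF or text file. Raises on unsupported types."""
    p = Path(path)
    ext = p.suffix.lower()
    if ext in TEXT_EXTS:
        # Replace undecodable bytes rather than failing on odd encodings.
        return p.read_text(encoding="utf-8", errors="replace")
    if ext == ".pdf":
        # Assumption: pypdf is the PDF backend; imported lazily so that
        # text-only use does not require it to be installed.
        from pypdf import PdfReader
        reader = PdfReader(str(p))
        return "\n\n".join((page.extract_text() or "") for page in reader.pages)
    raise ValueError(f"Unsupported file type: {ext or p.name}")
```

A version that feeds DocumentService would likely need to return text per page as well, since chunk metadata records a page number for PDFs.
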

    # src/local/services/document_service.py
    class DocumentService:
        def __init__(self, chroma_path=None, collection_name=None, embed_model=None): ...

        def ingest_file(self, path: str) -> int:
            """Chunk, embed, and store file. Returns number of chunks written."""
            ...

        def ingest_text(self, text: str, source_name: str) -> int:
            """Ingest already-extracted text (for programmatic use). Returns chunk count."""
            ...

        def search(self, query: str, n: int | None = None) -> list[dict]:
            """Return top-n chunks by similarity: [{content, source_file, chunk_index, page, score}]"""
            ...

        def list_sources(self) -> list[str]:
            """Return unique source filenames in the collection."""
            ...

        def delete_source(self, source_file: str) -> int:
            """Delete all chunks for a given source file. Returns count deleted."""
            ...

Chunk IDs are deterministic, computed as sha256(source_file + chunk_index), so
re-ingesting the same file is a safe upsert rather than an accumulation of duplicates.
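
A minimal sketch of the ID derivation described above. The function name `chunk_id` and the `:` separator between the two fields are assumptions; the separator is added here so that, for example, ("a1", 2) and ("a", 12) cannot produce the same key.

```python
import hashlib


def chunk_id(source_file: str, chunk_index: int) -> str:
    """Derive a stable ID from the source filename and chunk position."""
    key = f"{source_file}:{chunk_index}"  # ':' separator is an assumption
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```

Because the ID depends only on filename and position, writing the same chunk twice targets the same record, which is what makes an upsert idempotent.
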

    {
      "type": "function",
      "function": {
        "name": "search_documents",
        "description": "Searches the user's personal document knowledge base for relevant passages. ...",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {
              "type": "string",
              "description": "What to look for in the knowledge base."
            }
          },
          "required": ["query"]
        }
      }
    }

Trigger conditions live in the tool description (same principle as Phase 8/9). The description will say:
"Searches the user's personal document knowledge base for relevant passages. Call this tool when the user asks about content from documents they have added to their knowledge base, or when a question is likely answered by their stored documents rather than general knowledge or the web."

    [Knowledge base results for "transformer positional encoding"]

    1. attention_is_all_you_need.pdf (p.4, chunk 8)
       Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence…

    2. transformer_survey.pdf (p.12, chunk 31)
       Positional encodings can be fixed (sinusoidal, as in the original Transformer) or learned. Both approaches yield similar results in practice…

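
The result text above could be produced from the dicts returned by `DocumentService.search()` roughly as follows. This is a sketch: the function name `format_results` and the wording of the empty-collection message are assumptions, while the dict keys match the `search()` docstring.

```python
def format_results(query: str, results: list[dict]) -> str:
    """Render search hits as the numbered text block returned to the model."""
    header = f'[Knowledge base results for "{query}"]'
    if not results:
        # Hypothetical wording for the empty-knowledge-base case.
        return f"{header}\nNo matching passages found. The knowledge base may be empty."

    lines = [header]
    for i, hit in enumerate(results, start=1):
        location = f"chunk {hit['chunk_index']}"
        if hit.get("page") is not None:
            location = f"p.{hit['page']}, {location}"  # page exists for PDFs only
        lines.append(f"{i}. {hit['source_file']} ({location})")
        lines.append(hit["content"].strip())
    return "\n".join(lines)
```

Returning a readable message for an empty collection, rather than an empty string, gives the model something concrete to relay to the user.
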

    # scripts/ingest.py
    # Usage: python scripts/ingest.py file.pdf [file2.txt ...]
    # Runs without the full stack (no ZMQ, no Ollama chat endpoint).
    import sys
    from pathlib import Path

    # Make the src/ package importable when the script is run directly.
    sys.path.insert(0, str(Path(__file__).parent.parent / "src"))

    from local.services.document_service import DocumentService


    def main():
        paths = sys.argv[1:]
        if not paths:
            print("Usage: python scripts/ingest.py file.pdf [file2.txt ...]")
            sys.exit(1)
        svc = DocumentService()
        for path in paths:
            n = svc.ingest_file(path)
            print(f"  {path}: {n} chunks ingested")


    if __name__ == "__main__":
        main()


    # Create DocumentService alongside MemoryService (both are ChromaDB; init in main thread)
    from local.services.document_service import DocumentService
    shared_documents = DocumentService()

    # Start tool (after existing tools, before generator)
    threading.Thread(
        target=_start_search_documents,
        args=(shared_documents,),
        daemon=True,
        name="search_documents",
    ).start()


    collection: "collective.documents"
    chroma_path: ".chroma"
    embed_model: "nomic-embed-text"
    chunk_size: 1500
    chunk_overlap: 200
    n_results: 5

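
A sketch of how these settings might be represented and validated once the YAML has been parsed into a dict (for example with a YAML loader, which is left out here to keep the example standard-library only). The `DocumentConfig` class and `config_from_mapping` function are hypothetical names; the defaults mirror the values above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentConfig:
    collection: str = "collective.documents"
    chroma_path: str = ".chroma"
    embed_model: str = "nomic-embed-text"
    chunk_size: int = 1500
    chunk_overlap: int = 200
    n_results: int = 5

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        # An overlap >= chunk_size would stop the chunk window from advancing.
        if not 0 <= self.chunk_overlap < self.chunk_size:
            raise ValueError("chunk_overlap must be in [0, chunk_size)")
        if self.n_results <= 0:
            raise ValueError("n_results must be positive")


def config_from_mapping(raw: Optional[dict]) -> DocumentConfig:
    """Overlay parsed config values on the defaults; unknown keys raise TypeError."""
    return DocumentConfig(**(raw or {}))
```

Validating at load time surfaces a bad `chunk_overlap` before any file is ingested, rather than partway through chunking.
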

    description: |
      Searches the user's personal document knowledge base for relevant passages.
      Call this tool when the user asks about content from documents they have added,
      or when a question is likely answered by stored documents rather than general
      knowledge or the web. Do not call this for general research questions; use
      search_papers or web_search for those.
    param_query: |
      What to look for in the knowledge base. Be specific: use terms likely to
      appear in the source documents.

| Test | What it checks |
|---|---|
| test_chunk_splits_text | Text longer than chunk_size produces multiple chunks |
| test_chunk_overlap | Adjacent chunks share chunk_overlap chars |
| test_chunk_short_text | Text shorter than chunk_size produces exactly 1 chunk |
| test_ingest_text_returns_chunk_count | ingest_text() returns correct count |
| test_search_returns_results | After ingesting, search() finds relevant passage |
| test_deterministic_ids_no_duplicates | Re-ingesting same text doesn't double chunk count |
| test_list_sources | list_sources() returns ingested filenames |
| test_delete_source | delete_source() removes all chunks for a file |
| test_metadata_stored | source_file, chunk_index present in search result metadata |
| Test | What it checks |
|---|---|
| test_announce_schema | Publishes tool.schema with name=search_documents |
| test_handle_request_publishes_result_and_activity | Both subjects published on tool call |
| test_empty_kb_returns_informative_message | Graceful result when collection is empty |

- src/local/utils/file_extract.py: move extraction logic from attachment_bar
- Update attachment_bar.py to import from file_extract
- src/local/services/document_service.py
- protocol/subjects.py
- src/local/tools/search_documents_tool.py
- run_local.py
- Add TOOL_ACTIVITY_SEARCH_DOCUMENTS to main_window.py
- config/documents.yaml
- scripts/ingest.py

MemoryService and DocumentService both store data under .chroma/ but use separate collections, so a single PersistentClient instance
could be shared. For simplicity, DocumentService creates its own client (same pattern as MemoryService).