Phase 12 — Multi-Collection RAG Library (rev 2 — post Codex review)

Extends the single-topic document library into a named, multi-collection store. Each collection has its own description used in the tool schema so Gemma can route queries to the right collection. The UI gains two-level navigation (collections → sources) and CRUD for collections.

Codex Review — Issues Addressed

Issue 1 resolved: Chunk ID includes collection. Old: sha256(source_file::chunk_index). New: sha256(collection::source_file::chunk_index). The same filename in two collections gets distinct IDs — no upsert collision.
Issue 2 resolved: No backward compatibility — clean slate. User re-ingests all files into named collections. No migration code, no legacy fallback paths. All ChromaDB reads assume every chunk has a collection metadata field.
Issue 3 resolved: move_source is delete+upsert, not a metadata update. ChromaDB update() replaces metadata entirely and cannot change a chunk's ID, and the ID now embeds the collection name (Issue 1). Instead: get chunks (with documents + embeddings), upsert with new IDs + merged metadata, delete old IDs. Embeddings are reused, so no Ollama call is needed.
Issue 4 resolved: documents.yaml is authoritative for collection definitions. list_collections() reads from config, not from Chroma metadata. Empty collections (no chunks yet) still appear. Chunk counts come from Chroma as a secondary query.
Issue 5 resolved: Tool layer gets explicit collection param parsing and tests. _handle_request extracts collection arg; passes to DocumentService.search(). Tests added in tests/test_search_library_tool.py.
Issue 6 resolved: scripts/ingest.py gets --collection flag. All commands updated: --list shows per-collection breakdown, --delete requires --collection.

Key Design Decisions

Metadata field, not separate ChromaDB collections. Each chunk gets a collection metadata field (e.g. "mba"). The single ChromaDB collection (collective.documents) is retained. Moving a source = upsert new chunks with the updated collection, then delete the old chunks. Embeddings are reused from Chroma, so nothing is re-embedded.
Chunk ID includes collection name. sha256(f"{collection}::{source_file}::{chunk_index}"). Finance.pdf in "mba" and Finance.pdf in "econ" are fully independent rows with no ID collision.
documents.yaml is authoritative for collection definitions. list_collections() reads the collections list from config, not from Chroma. Chroma is queried only for chunk/source counts. Empty collections appear in the UI immediately. Renaming or deleting a collection updates both the config and the Chroma metadata field; because chunk IDs include the collection name, a rename of name (as opposed to display_name) also has to re-key the chunks the same way move_source does.
Single tool, collection enum parameter when >1 collection exists. One collection → schema identical to current (description = collection description, no enum). Multiple collections → collection string enum added; the parameter's description lists each enum value alongside its collection description.
Clean slate — no migration. User clears the library and re-ingests into named collections. No legacy chunk handling anywhere.
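The chunk ID rule above can be sketched in a few lines. This is a minimal illustration, assuming UTF-8 encoding and an untruncated hex digest; neither detail is stated in the plan.

```python
import hashlib


def chunk_id(collection: str, source_file: str, chunk_index: int) -> str:
    """Return a deterministic ID that is unique per (collection, source, index)."""
    key = f"{collection}::{source_file}::{chunk_index}"
    # Assumption: UTF-8 bytes and the full 64-character hex digest.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Same file name and chunk index in two collections give two different IDs.
mba_id = chunk_id("mba", "Finance.pdf", 0)
econ_id = chunk_id("econ", "Finance.pdf", 0)
```

Because "::" is the separator, a collection or file name that itself contains "::" could produce an ambiguous key. Restricting collection names to simple slugs such as "mba" avoids that.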

Config Schema Change

Current config/documents.yaml:

collection: collective.documents
topic: "MBA Textbooks"
chunk_size: 1500
…

New config/documents.yaml:

collection: collective.documents   # ChromaDB collection name — unchanged
chunk_size: 1500
chunk_overlap: 200
n_results: 5
embed_model: nomic-embed-text
chroma_path: .chroma
collections:
  - name: mba
    display_name: "MBA Textbooks"
    description: "MBA textbooks covering strategy, finance, marketing, and operations"

The old topic key is removed. If collections is absent, the tool falls back to a generic description with no enum parameter (same as today with no topic).
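A rough sketch of how list_collections() might combine the config with Chroma counts. It assumes the YAML has already been parsed into a dict, and count_chunks is a hypothetical callable standing in for the Chroma count query. Falling back to name when display_name is missing is also an assumption.

```python
from typing import Callable


def list_collections(config: dict, count_chunks: Callable[[str], int]) -> list[dict]:
    """Build the collection list from config; Chroma only supplies the counts."""
    rows = []
    # A missing or empty "collections" key yields an empty list.
    for entry in config.get("collections") or []:
        name = entry["name"]
        rows.append({
            "name": name,
            "display_name": entry.get("display_name", name),  # assumed fallback
            "description": entry.get("description", ""),
            "chunk_count": count_chunks(name),
        })
    return rows


# Example: "econ" is defined in config but has no chunks yet.
config = {
    "collections": [
        {"name": "mba", "display_name": "MBA Textbooks", "description": "MBA textbooks"},
        {"name": "econ", "display_name": "Economics Research", "description": "Macro and micro theory"},
    ]
}
counts = {"mba": 1249}
rows = list_collections(config, lambda name: counts.get(name, 0))
```

Since the loop runs over the config rather than over Chroma metadata, the empty "econ" collection still appears, with a chunk_count of 0.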

UI Layout

Collections view (default):

library                                  + Collection   Refresh
────────────────────────────────────────────────────────
Collection            Sources   Description
────────────────────────────────────────────────────────
MBA Textbooks         6         MBA textbooks covering…   ✎ 🗑
Economics Research    2         Macro and micro theory…   ✎ 🗑
────────────────────────────────────────────────────────

Sources view (drill-down on collection row click):

← MBA Textbooks                                + Files   + Folder   Refresh
────────────────────────────────────────────────────────────────────────
Source                        Chunks   Move to…
────────────────────────────────────────────────────────────────────────
Finance.pdf                   1249     [Economics ▾]   🗑
MergersAndAcquisitions.pdf    2277     [Economics ▾]   🗑
────────────────────────────────────────────────────────────────────────
[Description: MBA textbooks covering strategy…]   [Save]

Implementation Plan

12a · DocumentService

Collection-aware data layer

Change | Detail
_chunk_id(collection, source, idx) | Add collection to hash key. Old signature was (source, idx); update all callers.
ingest_file(path, collection, …) | Pass collection through to _ingest_pdf and _upsert_chunks. Stored in chunk metadata.
ingest_text(text, source, collection, …) | Same: collection is a required parameter.
search(query, collection=None, n=None) | collection=None → where={"type":"document"} (all). collection=X → $and filter on type + collection.
list_sources(collection=None) | Returns list[str] of unique source_file values. If collection given, filter by metadata; else all.
list_sources_detail(collection=None) | Returns list[dict] with source_file + chunk_count. Used by UI for table display.
list_collections() | Reads collections list from documents.yaml config. Returns list[dict] with name, display_name, description + chunk_count from Chroma.
delete_source(source, collection) | Filter by source_file AND collection, delete matching IDs.
move_source(source, from_col, to_col) | Get chunks (docs + embeddings + metas) → new IDs with to_col → upsert → delete old IDs. No Ollama call.
delete_collection(name) | Delete all chunks where collection == name. Then remove from documents.yaml.
count(collection=None) | Count chunks matching type=document (+ optional collection filter).
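Several rows above share the same filter logic: always match type=document, then optionally narrow by collection and source. One way to centralise it is sketched below. build_where is a hypothetical helper name, and the metadata keys are the ones named in the table.

```python
from typing import Optional


def build_where(collection: Optional[str] = None,
                source_file: Optional[str] = None) -> dict:
    """Build a Chroma-style where filter for document chunks."""
    clauses = [{"type": "document"}]
    if collection is not None:
        clauses.append({"collection": collection})
    if source_file is not None:
        clauses.append({"source_file": source_file})
    # A single condition is passed bare; $and is used only for two or more.
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}
```

search(), count(), list_sources() and delete_source() could then all pass build_where(...) as the where argument instead of assembling the dict inline.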

move_source implementation: result = coll.get(where={…}, include=["documents","embeddings","metadatas"]) → build new_ids with new collection → coll.upsert(ids=new_ids, documents=…, embeddings=…, metadatas=new_metas) → coll.delete(ids=old_ids). Metadata merge is simply {**old_meta, "collection": to_col}.

12b · SearchLibraryTool

Dynamic schema + collection routing

Change | Detail
_build_schema() | Reads collections list from documents.yaml. Zero/one collection → no enum, description = collection description (or generic). Two+ collections → add collection enum parameter whose description lists each enum value with its collection description.
_handle_request() | Extract args["collection"] (optional). Pass to _search().
_search(query, collection=None) | Pass collection to docs.search(). Result header shows collection name if provided.

# Single collection (no enum)
{
  "name": "search_library",
  "description": "Search MBA Textbooks: MBA textbooks covering strategy, finance…",
  "parameters": { "query": {"type": "string"} }
}

# Multiple collections
{
  "name": "search_library",
  "description": "Search the document library. Choose the collection that best matches your query.",
  "parameters": {
    "query": {"type": "string"},
    "collection": {
      "type": "string",
      "enum": ["mba", "econ"],
      "description": "mba: MBA textbooks (strategy, finance, ops); econ: macro/micro research papers"
    }
  }
}
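
The two schema shapes above could be generated from the collections list roughly as follows. This sketch mirrors the simplified schema layout shown in this section; the generic description string for the zero-collection case is an assumed wording.

```python
GENERIC_DESCRIPTION = "Search the document library."  # assumed wording


def build_schema(collections: list[dict]) -> dict:
    """Return the search_library schema for zero, one, or many collections."""
    params = {"query": {"type": "string"}}
    if not collections:
        description = GENERIC_DESCRIPTION
    elif len(collections) == 1:
        only = collections[0]
        description = f"Search {only['display_name']}: {only['description']}"
    else:
        description = ("Search the document library. "
                       "Choose the collection that best matches your query.")
        params["collection"] = {
            "type": "string",
            "enum": [c["name"] for c in collections],
            # Enum values cannot carry their own descriptions, so list them here.
            "description": "; ".join(
                f"{c['name']}: {c['description']}" for c in collections
            ),
        }
    return {"name": "search_library", "description": description, "parameters": params}
```

Keeping the per-collection text inside the parameter description is what lets the model route a query: it sees every enum value next to a one-line summary of what that collection holds.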

12c · DocumentsWindow

Two-level navigation

12d · scripts/ingest.py

Collection-aware CLI

Change | Detail
--collection NAME | Required for ingest; optional for --list (show all if omitted); required for --delete.
--list | With no --collection: show all collections from config with source/chunk counts. With --collection: show sources in that collection.
--delete FILE --collection NAME | Delete named source from named collection.
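
The flag rules in the table can be enforced with argparse along these lines. This is a sketch under assumptions: the positional paths argument and the function names are illustrative, and the real script's existing options are not shown.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest.py")
    # Assumption: files or folders to ingest are given as positional arguments.
    parser.add_argument("paths", nargs="*", help="files or folders to ingest")
    parser.add_argument("--collection", metavar="NAME", help="target collection")
    parser.add_argument("--list", action="store_true", help="list collections or sources")
    parser.add_argument("--delete", metavar="FILE", help="delete a source")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.list:
        return args  # --collection is optional when listing
    if args.collection is None:
        action = "--delete" if args.delete else "ingest"
        # parser.error prints usage to stderr and exits with status 2.
        parser.error(f"{action} requires --collection NAME")
    return args
```

For example, parse_args(["--delete", "Finance.pdf"]) exits with a usage error, while parse_args(["--list"]) succeeds with collection set to None.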

Files Changed

File | Change
config/documents.yaml | Remove topic; add collections list
src/local/services/document_service.py | collection in chunk ID; collection param throughout; list_collections, list_sources_detail, move_source, delete_collection
src/local/tools/search_library_tool.py | Dynamic schema; collection arg extraction; result labeling
src/local/ui/documents_window.py | Full rewrite: QStackedWidget, collections view, sources view, move/rename/delete
scripts/ingest.py | --collection flag; collection-aware --list and --delete
tests/test_document_service.py | Tests for collection CRUD, move_source, same filename in two collections, count per collection
tests/test_search_library_tool.py | New file: schema generation (1 vs N collections), collection arg routing, result labeling

Out of Scope