LoCAL2 — Plan 2: Multimodal, Tools, RAG

Plan 1 (Phases 1–6) established the core: LLM-native tool calling, episodic memory, pairwise evaluation, and per-participant observability UI. Plan 2 extends Gemma's capabilities with file attachments, new tools, and a persistent knowledge base.

PhaseTitleStatus
7File attachments + multimodal▶ NEXT
8Date/time + location tools▶ NEXT
9Semantic Scholar toolplanned
10RAG — persistent document knowledge baseplanned
11Documentationplanned

Phase 7 — File Attachments + Multimodal

Goal: Let the user attach files to any query. Images go directly to Gemma's vision input. PDFs and text files are extracted and injected as context. Gemma sees everything natively — no prompt preprocessing.

Supported file types

TypeExtensionsHandling
Image.jpg .jpeg .png .gif .webpBase64-encoded → images field in Ollama message
PDF.pdfText extracted via pypdf → prepended to user message
Text.txt .md .py .js .ts .yaml .json .csvRead directly → prepended to user message
OtherError chip: "unsupported format"

UI changes — MainWindow

Attachment button

A paperclip button () sits to the left of the query input. Clicking it opens a QFileDialog. Drag-and-drop onto the input area also works — dragEnterEvent / dropEvent on the input container.

Attachment chips

Attached files appear as small chips above the query input. Each chip shows the filename and an ✕ to remove it. Multiple files are allowed. Chips are cleared after the query is sent.

┌─────────────────────────────────────────────────┐
│ 📎 diagram.png ✕   📎 notes.pdf ✕               │  ← attachment chips
├─────────────────────────────────────────────────┤
│ What does this architecture diagram show?  Send │  ← query input
└─────────────────────────────────────────────────┘

Response card

The existing StreamingResponseWidget shows a small attachment summary line: "[attached: diagram.png, notes.pdf]" below the query badge.

Attachment processing

Processing happens in the UI thread at send time (files are small, extraction is fast). No background worker needed unless PDFs are very large.

def _process_attachment(path: str) -> dict:
    ext = Path(path).suffix.lower()
    if ext in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        data = base64.b64encode(Path(path).read_bytes()).decode()
        return {"type": "image", "name": Path(path).name, "data": data}
    elif ext == ".pdf":
        text = _extract_pdf_text(path)   # pypdf
        return {"type": "text", "name": Path(path).name, "data": text}
    elif ext in {".txt", ".md", ".py", ".js", ...}:
        text = Path(path).read_text(errors="replace")
        return {"type": "text", "name": Path(path).name, "data": text}
    else:
        return {"type": "error", "name": Path(path).name}

Bus protocol

query.received payload gains an attachments field:

{
  "query": "What does this architecture diagram show?",
  "session_id": "...",
  "query_id": "...",
  "attachments": [
    {"type": "image", "name": "diagram.png", "data": "<base64>"},
    {"type": "text",  "name": "notes.pdf",   "data": "extracted text..."}
  ]
}

GeneratorAgent — _build_messages

When attachments are present, the user message is built differently:

History: Attachments are NOT stored in conversation history. Only the text query is appended to the session. This keeps the context window clean for follow-up turns and avoids re-sending large base64 blobs.

Config — generator.yaml

max_attachment_chars: 8000   # truncation limit per text attachment

New dependency

pypdf>=4.0

New files

FilePurpose
src/local/ui/attachment_bar.pyChip strip widget + file processing logic

Modified files

FileChange
src/local/ui/main_window.pyAdd paperclip button, drag-drop, wire AttachmentBar, include attachments in query payload
src/local/agents/generator_agent.py_build_messages reads attachments from envelope payload
config/generator.yamlAdd max_attachment_chars
requirements.txtAdd pypdf

Build order

  1. Add pypdf to requirements, install
  2. Build AttachmentBar widget (chips, file picker, drag-drop)
  3. Wire AttachmentBar into MainWindow input area
  4. Extend query.received payload with attachments
  5. Update _build_messages to handle image and text attachments
  6. Smoke test: attach an image, ask "what do you see?"
  7. Smoke test: attach a PDF, ask a question about it
  8. Story S9: multimodal acceptance test

Phase 8 — Date/Time + Location Tools

Goal: Give Gemma reliable grounding about when and where it's operating. Two small tools, no external APIs, no state.

DateTimeTool

Returns current local date, time, timezone, and day of week. Single system call.

Tool name: get_datetime
Schema: no parameters required
Result: "Tuesday 2026-06-03 09:17:42 PDT (UTC-7)"

No config file needed. Follows the existing tool pattern: announces schema on startup, responds to tool.request.get_datetime.

LocationTool

Reads from config/location.yaml and returns a structured location string. No live geolocation — user sets their own context.

# config/location.yaml
city: "Cupertino"
state: "California"
country: "United States"
timezone: "America/Los_Angeles"
coordinates: "37.3230° N, 122.0322° W"   # optional, for distance queries
Tool name: get_location
Result: "Cupertino, California, United States (America/Los_Angeles, 37.3230° N, 122.0322° W)"

New files

FilePurpose
src/local/tools/datetime_tool.pyDateTimeTool
src/local/tools/location_tool.pyLocationTool
config/location.yamlUser location config

Phase 9 — Semantic Scholar Tool

Goal: Let Gemma search academic literature. Semantic Scholar's public API returns ranked papers with titles, authors, abstracts, and citation counts — far more useful than general web search for research queries.

Tool schema

Tool name: search_papers
Parameters:
  query  (string, required) — research topic or keywords
  limit  (integer, optional, default 5) — max papers to return

API

Uses the Semantic Scholar Graph API (https://api.semanticscholar.org/graph/v1/paper/search). Free, no API key required for basic use (rate limited to 100 req/5min). Optional key for higher limits, stored in .env.

Result format

[2026-06-03] Papers: "transformer attention mechanisms"

1. Attention Is All You Need (2017) — Vaswani et al.
   Citations: 98,432  |  https://semanticscholar.org/paper/...
   Transformers dispense with recurrence entirely, relying instead on...

2. ...

Config — semantic_scholar.yaml

max_results: 5
timeout: 15
fields: "title,authors,year,abstract,citationCount,url"

Phase 10 — RAG: Persistent Document Knowledge Base

Goal: Let Gemma search a user-curated knowledge base of documents that persists across sessions. Complements Phase 7 (one-shot file questions) with durable, indexed storage.

Design

A separate ChromaDB collection (collective.documents) holds chunked, embedded document passages. A search_documents tool lets Gemma query it semantically, same pattern as search_memory.

Ingestion

Two paths:

Chunking strategy

Fixed-size chunks (512 tokens) with 64-token overlap. Each chunk stored with metadata: source_file, chunk_index, page (for PDFs), ingested_at.

Tool schema

Tool name: search_documents
Parameters:
  query  (string, required) — what to look for in the knowledge base

New files

FilePurpose
src/local/tools/search_documents_tool.pyRAG search tool
src/local/services/document_service.pyChromaDB collection wrapper + chunking
scripts/ingest.pyCLI ingestion script
config/documents.yamlChunk size, overlap, collection name

Phase 11 — Documentation

Goal: Document the system for contributors and future-self. Deferred until Plan 2 features stabilize.

Build Order Summary

PhaseDeliverableKey dependency
7File attachments, image + text, AttachmentBar UIpypdf
8DateTimeTool, LocationTool, location.yaml
9SemanticScholarTool, search_papers schemahttpx (already present)
10DocumentService, search_documents tool, ingest scriptPhase 7 (PDF extraction reuse)
11Architecture + developer docsStable feature surface
Phase 10 (RAG) deliberately reuses the PDF extraction logic from Phase 7's AttachmentBar. Build 7 before 10.