Phase 7 — File Attachments + Multimodal

Goal: Let the user attach files to any query. Images go directly to Gemma's vision input via the Ollama images field. PDFs and text files are extracted to plain text and prepended to the query as context. Gemma receives the image bytes and the extracted text directly, with no intermediate summarization step.

Supported File Types

Type  | Extensions                            | Handling
Image | .jpg .jpeg .png .gif .webp            | Base64-encode bytes → images field in Ollama user message
PDF   | .pdf                                  | Text extracted via pypdf → prepended to user message content
Text  | .txt .md .py .js .ts .yaml .json .csv | Read directly → prepended to user message content
Other | (any other extension)                 | Red error chip: "unsupported format"; excluded from payload

New File: src/local/ui/attachment_bar.py

Widget layout

┌──────────────────────────────────────────────────────────────┐
│  ⌁   📎 diagram.png ✕   📎 notes.pdf ✕                      │
└──────────────────────────────────────────────────────────────┘

The AttachmentBar is a QWidget containing a paperclip button on the left followed by a horizontal flow of chips. It is hidden when no files are attached and becomes visible as soon as the first file is added.

Public API

Method                      | Description
add_files(paths: list[str]) | Process and add files; called by button picker and drag-drop handler
attachments() → list[dict]  | Returns current [{type, name, data}, …]; only valid attachments included
clear()                     | Remove all chips and reset internal list; called after send
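
The contract in the table above can be sketched without any Qt code. What follows is a minimal, hypothetical model of the bar's internal list, not the widget itself: the class name AttachmentModel, the injected process callable, and the is_visible helper are illustrative assumptions, and fake_process only stands in for the real _process_file.

```python
from pathlib import Path
from typing import Callable


class AttachmentModel:
    """Qt-free stand-in for AttachmentBar's state (hypothetical name)."""

    def __init__(self, process: Callable[[str], dict]) -> None:
        self._process = process
        self._items: list[dict] = []

    def add_files(self, paths: list[str]) -> None:
        # Every path becomes a chip, including unsupported ones (error chips).
        for path in paths:
            self._items.append(self._process(path))

    def attachments(self) -> list[dict]:
        # Error entries stay visible as chips but never reach the payload.
        return [dict(item) for item in self._items if item["type"] != "error"]

    def clear(self) -> None:
        self._items.clear()

    def is_visible(self) -> bool:
        # Assumption: the bar shows whenever at least one chip exists.
        return bool(self._items)


def fake_process(path: str) -> dict:
    """Illustrative processor: treats .txt as text, everything else as an error."""
    name = Path(path).name
    if name.endswith(".txt"):
        return {"type": "text", "name": name, "data": "stub contents"}
    return {"type": "error", "name": name}


bar = AttachmentModel(fake_process)
bar.add_files(["notes.txt", "archive.zip"])
valid_names = [a["name"] for a in bar.attachments()]
```

Keeping the list logic separate from the widget in this way would also make the "only valid attachments included" rule testable without a running QApplication.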

Chip behaviour

File processing

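import base64
from pathlib import Path


def _process_file(path: str) -> dict:
    """Classify a file by extension and return an attachment dict."""
    p = Path(path)
    ext = p.suffix.lower()
    name = p.name
    if ext in {".jpg", ".jpeg", ".png", ".gif", ".webp"}:
        # Images: raw bytes, base64-encoded for the Ollama images field
        data = base64.b64encode(p.read_bytes()).decode()
        return {"type": "image", "name": name, "data": data}
    elif ext == ".pdf":
        text = _extract_pdf_text(path)      # pypdf PdfReader
        return {"type": "text", "name": name, "data": text}
    elif ext in {".txt", ".md", ".py", ".js", ".ts", ".yaml", ".json", ".csv"}:
        # Decode as UTF-8 explicitly; undecodable bytes become U+FFFD
        text = p.read_text(encoding="utf-8", errors="replace")
        return {"type": "text", "name": name, "data": text}
    else:
        # Unsupported extension: rendered as a red error chip, never sent
        return {"type": "error", "name": name}


def _extract_pdf_text(path: str) -> str:
    """Concatenate the extracted text of every page, one page per line group."""
    from pypdf import PdfReader  # imported lazily so pypdf loads only for PDFs
    reader = PdfReader(path)
    # extract_text() can yield nothing for image-only pages; fall back to ""
    return "\n".join(page.extract_text() or "" for page in reader.pages)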

Drag-and-drop

The parent input_container in MainWindow sets setAcceptDrops(True) and overrides dragEnterEvent / dropEvent. Dropped files are forwarded to self._attachment_bar.add_files(paths). The AttachmentBar itself does not handle drag events — it only receives already-resolved paths.

Changes: src/local/ui/main_window.py

Input area layout

┌──────────────────────────────────────────────────────────────────┐
│  📎 diagram.png ✕   📎 notes.pdf ✕                              │  ← AttachmentBar
│                                                        (hidden   │
│                                                        when empty)│
├──────────────────────────────────────────────────────────────────┤
│  Type a query and press Enter…                             Send  │  ← input row
└──────────────────────────────────────────────────────────────────┘

The paperclip button lives inside AttachmentBar (leftmost element). The query QLineEdit and Send button remain unchanged in their own row below.

_build_conversation_page() changes

_send_query() changes

def _send_query(self) -> None:
    query = self._query_input.text().strip()
    if not query:
        return
    query_id = str(uuid.uuid4())
    attachments = self._attachment_bar.attachments()
    envelope = MessageEnvelope.create(
        message_type="query",
        subject=QUERY_RECEIVED,
        sender_id="ui",
        payload={
            "query": query,
            "session_id": self._session_id,
            "query_id": query_id,
            "attachments": attachments,        # [] when none
        },
        correlation_id=query_id,
        metadata={"session_id": self._session_id},
    )
    self._publisher.publish(envelope)
    self._query_input.clear()
    self._attachment_bar.clear()

Response card attachment summary

When attachments are present at send time, a dim summary line is shown in the StreamingResponseWidget below the query timestamp badge:

[attached: diagram.png, notes.pdf]

This requires passing attachment_names: list[str] when creating the widget. The widget adds a QLabel (objectName attachmentSummary) that is hidden when the list is empty. BusLogger does not need to change — the names are known in the UI at send time and passed directly.
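
As a rough illustration of the summary line described above, the text could be produced by a small pure function. The helper name format_attachment_summary is hypothetical; only the output format and the attachment_names parameter come from this plan.

```python
def format_attachment_summary(attachment_names: list[str]) -> str:
    """Return the dim summary line, or an empty string when nothing was attached."""
    if not attachment_names:
        return ""  # caller hides the attachmentSummary label in this case
    return f"[attached: {', '.join(attachment_names)}]"


summary = format_attachment_summary(["diagram.png", "notes.pdf"])
```

The widget could then set the label text from this value and show the label only when the string is non-empty.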

Changes: src/local/agents/generator_agent.py

_handle_query()

Extract attachments from the envelope payload and pass them to _build_messages:

attachments = payload.get("attachments") or []
messages = self._build_messages(query, session_id, attachments)

_build_messages(query, session_id, attachments)

def _build_messages(
    self, query: str, session_id: str | None, attachments: list[dict] | None = None
) -> list[dict]:
    history = self._conv.get_history(session_id)
    messages: list[dict] = []
    if self._system_prompt:
        messages.append({"role": "system", "content": self._system_prompt})
    messages.extend(history)

    # Build user message
    content_parts = []
    image_b64s = []
    max_chars = get_config("generator").get("max_attachment_chars", 8000)

    for att in (attachments or []):
        if att.get("type") == "text":
            text = (att.get("data") or "")[:max_chars]
            content_parts.append(f'[Attached: {att["name"]}]\n{text}')
        elif att.get("type") == "image" and att.get("data"):
            # Skip images with no payload rather than sending an empty string
            image_b64s.append(att["data"])

    content_parts.append(query)
    user_msg: dict = {"role": "user", "content": "\n\n".join(content_parts)}
    if image_b64s:
        user_msg["images"] = image_b64s

    messages.append(user_msg)
    return messages

History invariant: Attachments are NOT stored in conversation history. Only the bare text query (without injected file context) is appended to the session after generation. This keeps the context window clean across turns and avoids re-sending large base64 blobs.
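
The invariant can be illustrated with a small sketch. The record_turn helper and the in-memory history list are hypothetical; the real session store behind self._conv is not shown in this plan, so its API is not assumed here.

```python
def record_turn(history: list[dict], query: str, answer: str) -> None:
    """Append one completed turn, storing only the bare query text."""
    history.append({"role": "user", "content": query})
    history.append({"role": "assistant", "content": answer})


# Message as sent to the model for this turn (file context + image attached).
sent_user_msg = {
    "role": "user",
    "content": "[Attached: notes.txt]\nfile body\n\nWhat does the file say?",
    "images": ["aGVsbG8="],  # placeholder base64, not a real image
}

# What is kept for later turns: no file context, no images field.
history: list[dict] = []
record_turn(history, "What does the file say?", "It says ...")
```

On the next turn, _build_messages would extend the message list with this history, so the earlier file text and image data are not resent.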

Changes: config/generator.yaml

max_attachment_chars: 8000   # truncation limit per text attachment

New Dependency: requirements.txt

pypdf>=4.0

Story S9 — Multimodal Acceptance

Story file: tests/stories/s9_multimodal.yaml

Turn | Input | Assertion | Mode
1 | Attach a .txt file containing a known unique phrase; query: "What does the attached file say?" | Answer contains the unique phrase from the file | Automated (fixture file)
2 | Attach a .png screenshot; query: "Describe what you see in the image" | Answer is non-empty and does not start with an error marker | Manual smoke test (vision assertion not deterministic)

The automated turn injects a known fixture file at test time, so no live Ollama call is needed: the test can mock GeneratorAgent._generate to verify that _build_messages produces the correct content string and that the query.received envelope carries the expected attachments list.
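
One way the automated turn might look is sketched below. StubGeneratorAgent, handle_query, and run_turn_one are hypothetical names; the stub mirrors only the user-message logic from _build_messages above (no system prompt or history), and the fixture phrase is made up for illustration.

```python
from unittest.mock import patch


class StubGeneratorAgent:
    """Hypothetical stand-in that mirrors only the user-message logic above."""

    def __init__(self, max_chars: int = 8000) -> None:
        self._max_chars = max_chars

    def _build_messages(self, query, session_id=None, attachments=None):
        parts, images = [], []
        for att in attachments or []:
            if att.get("type") == "text":
                text = (att.get("data") or "")[: self._max_chars]
                parts.append(f'[Attached: {att["name"]}]\n{text}')
            elif att.get("type") == "image":
                images.append(att.get("data", ""))
        parts.append(query)
        msg = {"role": "user", "content": "\n\n".join(parts)}
        if images:
            msg["images"] = images
        return [msg]

    def _generate(self, messages):
        raise RuntimeError("a live model call is not expected in this test")

    def handle_query(self, payload: dict):
        attachments = payload.get("attachments") or []
        messages = self._build_messages(
            payload["query"], payload.get("session_id"), attachments
        )
        return self._generate(messages)


def run_turn_one(unique_phrase: str) -> list[dict]:
    """Drive turn 1 with _generate mocked; return the messages it received."""
    payload = {
        "query": "What does the attached file say?",
        "session_id": "s9",
        "attachments": [
            {"type": "text", "name": "fixture.txt", "data": unique_phrase}
        ],
    }
    agent = StubGeneratorAgent()
    with patch.object(StubGeneratorAgent, "_generate", return_value="stub") as fake:
        agent.handle_query(payload)
    # The mock records the message list that would have gone to the model.
    return fake.call_args.args[0]


messages = run_turn_one("violet-otter-4417")
```

Asserting on the captured messages rather than on the model's answer keeps the check deterministic, which is the reason the plan limits automation to the text-file turn.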

Build Order

  1. Add pypdf>=4.0 to requirements.txt; run pip install pypdf
  2. Build src/local/ui/attachment_bar.py — chips, paperclip button, file processing
  3. Wire AttachmentBar into MainWindow._build_conversation_page(); add drag-drop
  4. Update _send_query() to include attachments in payload and call clear()
  5. Add attachment summary line to StreamingResponseWidget
  6. Update GeneratorAgent._build_messages() to handle text and image attachments
  7. Add max_attachment_chars: 8000 to config/generator.yaml
  8. Write Story S9 and tests; run full suite
  9. Manual smoke test: attach image → "what do you see?"; attach PDF → ask question about it

Files Changed / Added

File | Change
src/local/ui/attachment_bar.py | NEW: chip strip, paperclip button, file processing
src/local/ui/main_window.py | Wire AttachmentBar, drag-drop, attachments in send payload, response card summary
src/local/agents/generator_agent.py | _build_messages handles text prepend + image list
config/generator.yaml | Add max_attachment_chars
requirements.txt | Add pypdf>=4.0
tests/stories/s9_multimodal.yaml | NEW: acceptance story