# Log Fix Design Note

**Investigation date:** 2026-05-10  
**Investigator:** A1 task  
**Scope:** "logs vanish for completed/recycled workflows" bug

---

## Root Cause

When a user opens logs for a workflow node whose pod has been garbage-collected by Kubernetes (common for terminal-phase workflows: Succeeded / Failed / Error), the following chain fires:

1. `_load_logs` (app.py:460) calls `stream_workflow_logs` passing the Argo node ID as `pod_name` (e.g. `"my-wf-abc12"`).
2. `stream_workflow_logs` (client.py:169) sends:
   ```
   GET /api/v1/workflows/{ns}/{name}/log
       ?podName={node_id}
       &logOptions.follow=false
       &logOptions.container=main
   ```
3. Argo resolves `podName` by querying the **live Kubernetes pod** of that name. For a terminal-phase workflow whose pod has been GC'd, the pod is gone → Argo returns an **empty NDJSON stream (0 bytes, 0 lines)**, HTTP 200.
4. `_load_logs` counts `received == 0` → displays `"No log lines returned by the server."` and stops.
5. The actual logs **are available** as a `main.log` artifact in Argo's artifact storage at `/artifacts/{ns}/{name}/{node_id}/main.log`, but no fallback to `get_artifact_content` is attempted.
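
Steps 3 and 4 can be illustrated with a minimal sketch of how an empty NDJSON body leads to `received == 0`. The helper name `collect_log_lines` and the `{"result": {"content": ...}}` event shape are assumptions made for illustration; they are not taken from `app.py` or `client.py`.

```python
import json
from typing import Iterable, List


def collect_log_lines(ndjson_lines: Iterable[str]) -> List[str]:
    """Extract log content from NDJSON lines, skipping blank lines."""
    contents = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        event = json.loads(line)
        # Assumed event shape: {"result": {"content": "<log line>"}}
        contents.append(event.get("result", {}).get("content", ""))
    return contents


# A GC'd pod yields an HTTP 200 response with an empty body: zero lines.
received = len(collect_log_lines([]))
message = "No log lines returned by the server." if received == 0 else ""
```

An HTTP 200 with zero lines is indistinguishable from a pod that simply logged nothing, which is why the phase check and artifact fallback below are needed.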

**Secondary latent risk:** `stream_workflow_logs` sets `timeout=None` when the caller passes `follow=True` (client.py:196). Today `_load_logs` never passes `follow=True` (uses the default `False`), so the timeout is always 60 s. However, `LogView._follow` defaults to `True` (logs.py:66) — if this reactive is ever wired to the API call, terminal workflows would hang forever with no timeout.

**Phase is never consulted.** None of the four `_load_logs` call sites read `workflow.status.phase` before deciding how to fetch or whether a pod is likely alive.

---

## Call Sites

| File : Line | Caller | pod_name passed? | follow arg | timeout in effect | Phase checked? | Needs change? |
|---|---|---|---|---|---|---|
| `client.py:148` | `get_workflow_logs` (definition) | optional | hardcoded `false` | 60.0 s | N/A | **No** (defined but never called from `app.py`) |
| `client.py:169` | `stream_workflow_logs` (definition) | optional | param default `False` | `None` if follow else `60.0` | N/A | **Yes — A2** (add hard-cap on follow=True timeout) |
| `app.py:479` | `_load_logs` body → `stream_workflow_logs` | forwarded | omitted (False) | 60.0 s | No | **Yes — A2, A3** (phase check + artifact fallback) |
| `app.py:936` | `_on_node_selected` → `_load_logs` | `event.node_id` (Argo node ID) | — | 60.0 s | No | **Yes — A2, A3** (pod may be GC'd) |
| `app.py:974` | `action_view_logs` (VIEW_WORKFLOWS) → `_load_logs` | None | — | 60.0 s | No | **Yes — A2** (whole-workflow log; pod=None so less likely to be empty, but phase still unread) |
| `app.py:1008` | `action_view_logs` (VIEW_DETAIL) → `_load_logs` | `self._selected_pod` (Argo node ID) | — | 60.0 s | No | **Yes — A2, A3** |
| `app.py:1925` | `_on_log_refresh` → `_load_logs` | `self._selected_pod` | — | 60.0 s | No | **Yes — A2, A3** |

---

## Fix Strategy

### A2: Phase-aware follow mode

**File to change: `src/aw8s/app.py:460-507` (`_load_logs`)**

- Read the workflow phase before streaming:
  ```python
  # `or {}` also covers a workflow whose "status" key is present but None
  status = (self._selected_workflow_data or {}).get("status") or {}
  phase = status.get("phase", "")
  terminal = phase in {"Succeeded", "Failed", "Error"}
  ```
- Pass `follow=False` explicitly when `terminal` is True. This is already the default, but stating it makes the intent visible and documented.
- For Running/Pending workflows, `follow=True` may be enabled in a future step; this change only guarantees that terminal workflows are never followed.
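
The phase check can also be factored into small pure helpers so it can be unit-tested apart from `_load_logs`. This is a sketch only: `workflow_phase`, `is_terminal`, and `effective_follow` are hypothetical names, not existing functions in `app.py`.

```python
from typing import Any, Mapping, Optional

TERMINAL_PHASES = frozenset({"Succeeded", "Failed", "Error"})


def workflow_phase(workflow_data: Optional[Mapping[str, Any]]) -> str:
    """Return status.phase, or "" when the workflow or its status is missing."""
    status = (workflow_data or {}).get("status") or {}
    return status.get("phase") or ""


def is_terminal(workflow_data: Optional[Mapping[str, Any]]) -> bool:
    """True when the workflow has reached Succeeded, Failed, or Error."""
    return workflow_phase(workflow_data) in TERMINAL_PHASES


def effective_follow(requested: bool, workflow_data: Optional[Mapping[str, Any]]) -> bool:
    """Never follow a terminal workflow, whatever the caller asked for."""
    return requested and not is_terminal(workflow_data)
```

An unknown or missing phase is treated as non-terminal here, so the live stream is still attempted first in that case.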

**File to change: `src/aw8s/api/client.py:169-207` (`stream_workflow_logs`)**

- Replace `timeout=None if follow else 60.0` (line 196) with a capped value:
  ```python
  timeout=follow_timeout if follow else 60.0
  ```
  where `follow_timeout` is a new parameter defaulting to `300.0` (5 min hard cap).
- This prevents any future caller from accidentally passing `follow=True` on a terminal workflow and hanging forever.
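
A minimal sketch of the timeout selection, assuming it is pulled into a helper (the name `resolve_timeout` and the two constants are hypothetical). Whether the resulting value bounds the total request duration or only the idle time between reads depends on the HTTP client in use, which this note does not specify.

```python
from typing import Optional

DEFAULT_TIMEOUT = 60.0          # existing timeout for follow=False
DEFAULT_FOLLOW_TIMEOUT = 300.0  # proposed cap for follow=True (5 min)


def resolve_timeout(
    follow: bool, follow_timeout: Optional[float] = DEFAULT_FOLLOW_TIMEOUT
) -> float:
    """Pick the request timeout; never return None, so no call is unbounded."""
    if not follow:
        return DEFAULT_TIMEOUT
    if follow_timeout is None or follow_timeout <= 0:
        # Reject values that would reintroduce the unbounded behavior.
        raise ValueError("follow_timeout must be a positive number of seconds")
    return float(follow_timeout)
```

Rejecting `None` outright is one possible design choice; it makes the old `timeout=None` behavior impossible to reintroduce by accident.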

No changes are needed to `get_workflow_logs` (client.py:148). It is correct but unused: nothing in `app.py` calls it. A2 may optionally remove or document it.

### A3: main.log artifact fallback

**File to change: `src/aw8s/app.py:460-507` (`_load_logs`)**

In the `else` branch (when `received == 0`), add a fallback for terminal-phase nodes that had a `pod_name`:

```python
if terminal and pod_name:
    try:
        raw = await self.client.get_artifact_content(
            self.config.namespace, name, pod_name, "main.log"
        )
        text = raw.decode("utf-8", errors="replace")
        if text.strip():
            log_view.set_logs(title + " [artifact]", text)
            return
    except Exception:
        pass  # fall through to "No log lines" message
```

`get_artifact_content` (client.py:247-268) signature:
```python
async def get_artifact_content(
    self, namespace, name, node_id, artifact_name,
    direction="output", max_bytes=65536
) -> bytes
```
URL resolved: `/artifacts/{namespace}/{name}/{node_id}/main.log`  
Timeout: `None` inside, so the call has no time limit; the read stops once `max_bytes` have been received. The size bound keeps this acceptable for a log preview.
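
The fallback can be exercised without a cluster by isolating it in a coroutine and passing a stub client. The sketch below assumes only the `get_artifact_content` signature quoted above; `FakeClient` and `artifact_log_fallback` are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Dict, Optional, Tuple


class FakeClient:
    """Stand-in for the real API client; implements only what the fallback needs."""

    def __init__(self, artifacts: Dict[Tuple[str, str, str, str], bytes]):
        self._artifacts = artifacts

    async def get_artifact_content(
        self, namespace, name, node_id, artifact_name,
        direction="output", max_bytes=65536,
    ) -> bytes:
        key = (namespace, name, node_id, artifact_name)
        if key not in self._artifacts:
            raise LookupError("artifact not found")  # stands in for a 404
        return self._artifacts[key][:max_bytes]


async def artifact_log_fallback(
    client, namespace: str, name: str, pod_name: Optional[str], terminal: bool
) -> Optional[str]:
    """Return artifact log text, or None if the fallback does not apply or fails."""
    if not (terminal and pod_name):
        return None
    try:
        raw = await client.get_artifact_content(namespace, name, pod_name, "main.log")
    except Exception:
        return None  # caller falls through to the "No log lines" message
    text = raw.decode("utf-8", errors="replace")
    return text if text.strip() else None


client = FakeClient({("argo", "my-wf", "my-wf-1", "main.log"): b"hello\n"})
found = asyncio.run(artifact_log_fallback(client, "argo", "my-wf", "my-wf-1", True))
missing = asyncio.run(artifact_log_fallback(client, "argo", "my-wf", "gone", True))
```

Returning `None` for every "no usable artifact" case keeps the caller to a single branch: show the artifact text if present, otherwise keep the existing message.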

**Note:** `get_artifact_url` (client.py:211-224) has a dead variable `prefix` (computed but never used in the return expression). A3 does not need to fix this; it is harmless.

### A4: LogView UX changes

**File to change: `src/aw8s/views/logs.py`**

- When logs are loaded from an artifact (not the live stream), update the header to show an `[artifact]` indicator instead of `[green]●[/]`.
- Add a new reactive `_from_artifact: reactive[bool] = reactive(False)` to track this state.
- When `_from_artifact` is True, disable the follow toggle `action_toggle_follow` (follow is meaningless for static artifact content).
- The `_update_header` method (logs.py:132-139) should reflect the artifact state.
- Add a subtle status line or toast when artifact fallback is used: `"[dim]Logs loaded from artifact storage (pod GC'd)[/]"`.
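
The header and toggle rules above can be sketched as plain state logic, independent of the UI framework. `LogHeaderState`, the `[dim]○[/]` non-follow indicator, and the escaped tag are assumptions for illustration. In particular, if the header is rendered as Rich-style markup, a literal `[artifact]` would likely be parsed as a style tag, so the sketch escapes the opening bracket.

```python
from dataclasses import dataclass

# Escaped so markup rendering shows the literal text "[artifact]" (assumption:
# the header string is rendered as Rich-style markup).
ARTIFACT_TAG = r"\[artifact]"


@dataclass
class LogHeaderState:
    title: str
    follow: bool = True
    from_artifact: bool = False

    def toggle_follow(self) -> bool:
        """Flip follow unless logs came from an artifact; return True if it changed."""
        if self.from_artifact:
            return False  # follow is meaningless for static artifact content
        self.follow = not self.follow
        return True

    def header(self) -> str:
        """Build the header text for the current state."""
        if self.from_artifact:
            return f"{self.title} {ARTIFACT_TAG}"
        # "[dim]○[/]" is a hypothetical indicator for the non-follow state.
        indicator = "[green]●[/]" if self.follow else "[dim]○[/]"
        return f"{self.title} {indicator}"
```

In `logs.py` the same rules would live in `action_toggle_follow` and `_update_header`; the dataclass only makes them testable in isolation.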

---

## Open Questions

1. **Is `self._selected_workflow_data` populated at all 4 call sites?**  
   Confirmed present for VIEW_DETAIL (used at app.py:982). For `_on_node_selected` (line 936) and `_on_log_refresh` (line 1925), the detail view must already be open, so it should be set. Verify that the VIEW_WORKFLOWS log path (line 974, no detail view opened) still has `_selected_workflow_data` populated; if not, fetch the phase separately.

2. **Is the Argo node ID the same as the K8s pod name?**  
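   Argo node IDs for Pod nodes (e.g. `my-wf-abc12-1234567890`) often match the K8s pod name, but this is not guaranteed: depending on the Argo version and its pod naming scheme, the pod name may also embed the template name. A3's artifact fallback uses this same ID as `node_id` for artifact lookup, which is correct (Argo uses the node ID in artifact paths, not the pod name). The live log request, which passes the node ID as `podName`, is where a mismatch would matter.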

3. **`max_bytes=65536` (64 KB) for artifact content — sufficient?**  
   Large workflows may produce logs exceeding 64 KB. A3 implementors should decide whether to increase `max_bytes` or add a "log truncated" indicator.

4. **Does `main.log` artifact always exist for terminal-phase workflows?**  
   Only if log archiving is enabled for the workflow (Argo's `archiveLogs` setting) and an artifact repository is configured; the executor type alone does not guarantee it. A3 must handle the 404 case gracefully (already covered by the `except Exception: pass` pattern above).

5. **Container-set nodes:** `main.log` is the artifact name for the `main` container. For non-main containers or containerSet steps, the artifact name may differ. The `container` parameter of `_load_logs` should inform the artifact name in A3.

6. **`follow_timeout` default value:** 300 s (5 min) is proposed for A2. Should this be configurable via `Config` or hardcoded? Implementors decide.

7. **Should `get_workflow_logs` (client.py:148) be removed?** It is defined but never called from anywhere in `app.py`. A2 may choose to remove it or leave it for future use.
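
For open question 3, one option is a "log truncated" indicator. The sketch below assumes `get_artifact_content` returns at most `max_bytes` bytes and treats a full-length result as possibly truncated; `decode_artifact_log` is a hypothetical helper, and the marker text is illustrative.

```python
from typing import Tuple


def decode_artifact_log(raw: bytes, max_bytes: int = 65536) -> Tuple[str, bool]:
    """Decode artifact bytes and flag content that may have been cut off.

    Heuristic: a payload that fills max_bytes exactly is reported as truncated,
    which can be a false positive for a log of exactly that size.
    """
    truncated = len(raw) >= max_bytes
    text = raw[:max_bytes].decode("utf-8", errors="replace")
    if truncated:
        text = text.rstrip("\n") + f"\n[log truncated at {max_bytes} bytes]\n"
    return text, truncated
```

Because the cut can land in the middle of a multi-byte character, `errors="replace"` matters here: the last character may decode to U+FFFD instead of raising.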
