I'll review the code changes against the observation data to verify correctness and integration.

---

### expert_build/cli.py (derive-review-repair subparser + dispatch)
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Subparser args (`--model`, `--rounds`, `--max-derive-rounds`, `--timeout`, `--domain`) match what `cmd_derive_review_repair` and `_stage_derive` consume. Dispatch table correctly routes to `_lazy("pipeline", "cmd_derive_review_repair")`. Consistent with existing subcommand patterns.
---

### expert_build/pipeline.py:_run_convergence_loop
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Two issues. (1) **Round labels are relative, not absolute when resuming.** The label is `f"cycle {cycle}/{rounds}"` where `cycle` is 1-based within the `rounds` parameter. When `cmd_pipeline` passes `remaining_rounds = args.rounds - start_cycle + 1`, a resume from cycle 3 of 5 produces labels "cycle 1/3", "cycle 2/3", "cycle 3/3" instead of the old "cycle 3/5", "cycle 4/5", "cycle 5/5". This is a user-facing regression in stderr output and in stage metadata (`round_label` flows into `_stage_derive`, `_stage_review`, `_stage_repair`, `_stage_deduplicate`). (2) **No banners for stages 4-7.** The old pipeline code called `_banner(4, ...)` through `_banner(7, ...)` per cycle. The extracted loop doesn't emit banners, and the `on_cycle` callback doesn't either. For the standalone command this is fine (it has its own header), but for `cmd_pipeline` it's a loss of stage-level progress output that other stages (1, 2, 3, 8) still emit.
---

### expert_build/pipeline.py:cmd_derive_review_repair
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Guard checks (model availability, db existence) are correct. Summary printing covers all fields from `_run_convergence_loop`'s return dict. Minor note: `getattr(args, "rounds", 3)` is redundant since argparse always sets `rounds`, but it's defensive and harmless.
---

### expert_build/pipeline.py:cmd_pipeline (refactored convergence section)
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: PARTIAL
INTEGRATION: WIRED
REASONING: **Stages 4-7 are never marked "running".** The old code called `_mark_stage(state, N, "running")` before each stage. The `_pipeline_on_cycle` callback only marks stages as "completed" at cycle end. This means `state["current_stage"]` (set only on "running" status per `_mark_stage` line 74) is never updated to 4/5/6/7 during the convergence loop. While resume still works via `current_cycle`, any external tooling reading the state file (e.g. `cmd_status`) won't see which stage is actively running within a cycle — it'll appear stuck at whatever stage was last "running" (stage 3). Also, all four stages are marked "completed" simultaneously at cycle end, even if repair was skipped (no invalids) — the old code had `_mark_stage(state, 6, "completed", skipped=True)` for that case; the new code marks it "completed" without the `skipped` flag, losing that distinction.
---

### tests/test_pipeline.py:TestConvergenceLoop
VERDICT: PASS
CORRECTNESS: VALID
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: COVERED
INTEGRATION: WIRED
REASONING: Tests cover the key behaviors: convergence condition, max rounds without convergence, repair skip when no invalids, accumulation across cycles, and the `on_cycle` callback. Mock targets are correct (patching `expert_build.pipeline._stage_*`).
---

### tests/test_pipeline.py:TestCmdDeriveReviewRepair
VERDICT: CONCERN
CORRECTNESS: QUESTIONABLE
SPEC_COMPLIANCE: N/A
ISSUE_COMPLIANCE: N/A
BELIEF_COMPLIANCE: N/A
TEST_COVERAGE: PARTIAL
INTEGRATION: WIRED
REASONING: `test_model_not_available_exits` patches `expert_build.llm.check_model_available` but `pipeline.py` imports it with `from .llm import check_model_available`, binding the name in the `pipeline` module namespace. The mock doesn't reach the call site. The test passes by coincidence: `check_model_available("nonexistent")` returns `False` from the real function since "nonexistent" isn't in `MODEL_COMMANDS`. Same issue in `test_prints_summary` — the mock sets `return_value=True` on the wrong target. If the test runner doesn't have the "claude" CLI installed, the real `check_model_available("claude")` returns `False` and the test would fail with `SystemExit`. The correct patch target is `expert_build.pipeline.check_model_available`. This is likely a pre-existing pattern issue, but it's fragile.
---

## Summary of findings

1. **Round label regression on resume** (CONCERN): When the pipeline resumes at cycle N, stderr labels show relative numbering ("cycle 1/K") instead of absolute ("cycle N/M"). Minor UX issue.

2. **Missing "running" state transitions for stages 4-7** (CONCERN): External state consumers won't see which stage is actively running within the convergence loop. The `skipped=True` flag for repair when no invalids exist is also lost.

3. **No banners for stages 4-7 in pipeline mode** (CONCERN): Loss of stage-level progress output that the user previously saw.

4. **Wrong mock patch target** (CONCERN): `check_model_available` is patched at the definition site, not the usage site. Tests pass coincidentally but are fragile.

None of these are blocking — the core logic (convergence loop extraction, standalone command) is correct and well-tested. The concerns are about resume fidelity and output regressions.

---

### SELF_REVIEW
LIMITATIONS: Could not verify `cmd_status` or other state consumers to assess the impact of losing "running" state transitions. Could not see the full existing test suite to confirm whether the mock patch target issue is pre-existing. Could not verify whether any integration tests exercise the resume path with the new code.
---

### FEATURE_REQUESTS
- Show `cmd_status` and any other consumers of the pipeline state file, since refactoring state transitions can break downstream readers
- Flag mock patch targets that don't match the import style at the call site (a common Python testing pitfall)
---
