Metadata-Version: 2.4
Name: specpilot-ai
Version: 1.0.0
Summary: AI-powered Specification-Driven Development framework: turns a vague idea into needs, spec, plan, and code via a 4-stage agent pipeline
Author-email: Mohamed <malif78p@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/malif78/specpilot
Project-URL: Repository, https://github.com/malif78/specpilot.git
Project-URL: Documentation, https://github.com/malif78/specpilot#readme
Project-URL: Issues, https://github.com/malif78/specpilot/issues
Keywords: ai,agents,specification,development,claude,anthropic,sdd
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: anthropic>=0.40.0
Requires-Dist: PyYAML>=6.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: isort>=5.0; extra == "dev"
Requires-Dist: flake8>=6.0; extra == "dev"
Dynamic: license-file

# specpilot — AI-Powered Specification-Driven Development

An AI agent pipeline that turns a vague idea into a needs document, a formal
specification, an implementation plan, and working code — automatically, using Claude.

Built from scratch as a fully readable, minimal Python implementation so every concept
is explicit and understandable.  Useful as a standalone tool and as a reference for
understanding how agentic frameworks (BMAD, SPECKIT) work under the hood.

---

## Table of Contents

1. [What Is SDD?](#1-what-is-sdd)
2. [Architecture Overview](#2-architecture-overview)
3. [Core Concepts Explained](#3-core-concepts-explained)
   - [ProjectContext — the shared blackboard](#31-projectcontext--the-shared-blackboard)
   - [Skill — a Python function exposed as an LLM tool](#32-skill--a-python-function-exposed-as-an-llm-tool)
   - [Agent — role + system prompt + tool-use loop](#33-agent--role--system-prompt--tool-use-loop)
   - [Orchestrator — the stage-machine router](#34-orchestrator--the-stage-machine-router)
4. [Step-by-Step: What Happens During a Run](#4-step-by-step-what-happens-during-a-run)
5. [File Reference](#5-file-reference)
6. [Setup and Installation](#6-setup-and-installation)
7. [Running the Tests](#7-running-the-tests)
   - [Quick smoke test](#71-quick-smoke-test)
   - [Full automated end-to-end test](#72-full-automated-end-to-end-test)
   - [Interactive CLI](#73-interactive-cli)
   - [Reading the workspace output](#74-reading-the-workspace-output)
8. [What Is Missing: Gap Analysis vs BMAD and SPECKIT](#8-what-is-missing-gap-analysis-vs-bmad-and-speckit)

---

## 1. What Is SDD?

**Specification-Driven Development** is a discipline where a formal specification
document is produced *before* any implementation begins, and the entire project
(planning, coding, testing) is traceable back to that spec.

An **SDD AI framework** automates that process using AI agents:

```
User's vague idea
       |
       v
  [DISCOVERY]  <-- AI asks clarifying questions until the need is precise
       |
       v
 [SPECIFICATION] <-- AI formalises the need into a structured spec document
       |
       v
   [PLANNING]  <-- AI breaks the spec into an ordered implementation plan
       |
       v
[IMPLEMENTATION] <-- AI guides the developer task-by-task through the build
       |
       v
  Documented, working software
```

Each stage produces a **markdown artifact** saved to disk (`needs.md`, `spec.md`,
`plan.md`, `impl_notes.md`).  These artifacts are the single source of truth — later
stages always read from them rather than relying on conversation memory.

---

## 2. Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────────┐
│         main.py  /  tests/simple_test.py  /  tests/test_run.py           │
│                          (CLI entry points)                              │
└──────────────────────────────┬───────────────────────────────────────────┘
                               │ creates & wires
                               v
┌──────────────────────────────────────────────────────────────────────────┐
│                           Orchestrator                                   │
│                                                                          │
│   Stage:  DISCOVERY → SPECIFICATION → PLANNING → IMPLEMENTATION → DONE  │
│                                                                          │
│   Routing rule: send user input to agents[current_stage]                │
│   Transition:   when agent sets context.stage_advance_requested = True  │
└────────┬──────────────────────────────────────────────────────┬──────────┘
         │ reads/writes                                          │ reads/writes
         v                                                       v
┌─────────────────────┐                             ┌───────────────────────┐
│   ProjectContext    │                             │      Agent (x4)       │
│   (shared state)    │◄────────────────────────────│                       │
│                     │                             │  name, role           │
│  raw_need           │  injected into every        │  system_prompt        │
│  clarified_need     │  LLM call as context        │  skills  (list)       │
│  spec_document      │  summary                    │  conversation_history │
│  plan_document      │                             │                       │
│  impl_notes         │                             │  run(user_msg) ──┐    │
│  stage              │                             │                  │    │
│  stage_advance_*    │                             │  _tool_use_loop()│    │
│  workspace_dir      │                             └──────────────────┼────┘
└─────────────────────┘                                                │
                                                                       │ calls
                                                     ┌─────────────────v────┐
                                                     │   Anthropic API      │
                                                     │   (Claude)           │
                                                     │                      │
                                                     │  stop_reason:        │
                                                     │  "end_turn" → text   │
                                                     │  "tool_use" → loop   │
                                                     └──────────────────────┘
                                                          │          │
                                               tool_use   │          │ result
                                               block      v          │
                                              ┌───────────────────┐  │
                                              │   Skill.execute() │──┘
                                              │                   │
                                              │  write_document() │ → workspace/*.md
                                              │  write_code_file()│ → workspace/*.py etc.
                                              │  advance_stage()  │ → context flag
                                              └───────────────────┘
```

### Key design principle

> **Agents do not talk to each other.  They communicate through the shared
> `ProjectContext`.**

The Elicitation agent writes `needs.md` and sets `context.clarified_need`.
The Specification agent reads that field (injected into its system prompt) —
it never "calls" the Elicitation agent.  This loose coupling means any agent
can be replaced independently.

---

## 3. Core Concepts Explained

### 3.1 ProjectContext — the shared blackboard

**File:** `framework/core/context.py`

```
ProjectContext
├── raw_need           str   — user's first, unpolished sentence
├── clarified_need     str   — refined summary after elicitation
├── spec_document      str   — formal spec written by SpecificationAgent
├── plan_document      str   — task breakdown written by PlanningAgent
├── impl_notes         str   — progress log written by ImplementationAgent
├── stage              Stage — current stage (enum)
├── stage_advance_requested  bool — flipped by advance_stage skill
├── stage_advance_summary    str  — what the agent accomplished
└── workspace_dir      str   — where *.md files are saved ("workspace/")
```

Every agent receives a text summary of the context injected into its system
prompt on every LLM call:

```python
# agent.py
def _system_prompt_with_context(self) -> str:
    return (
        self._base_system_prompt
        + "\n\n## Current Project Context\n"
        + self.context.summary_for_agents()
    )
```

This means even if the same agent is called many turns later, it always has
an up-to-date view of what previous stages produced — without any explicit
handoff message.
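
As a rough illustration, the context object could be sketched as a dataclass like the one below. The field names come from the listing above; the `Stage` string values and the output format of `summary_for_agents()` are assumptions made for this sketch, not the actual contents of `context.py`.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    # String values are assumed for this sketch.
    DISCOVERY = "discovery"
    SPECIFICATION = "specification"
    PLANNING = "planning"
    IMPLEMENTATION = "implementation"
    DONE = "done"


@dataclass
class ProjectContext:
    raw_need: str = ""
    clarified_need: str = ""
    spec_document: str = ""
    plan_document: str = ""
    impl_notes: str = ""
    stage: Stage = Stage.DISCOVERY
    stage_advance_requested: bool = False
    stage_advance_summary: str = ""
    workspace_dir: str = "workspace"

    def summary_for_agents(self) -> str:
        """Render the non-empty fields as plain text (hypothetical format)."""
        lines = [f"Stage: {self.stage.value}"]
        labelled = [
            ("Raw need", self.raw_need),
            ("Clarified need", self.clarified_need),
            ("Spec", self.spec_document),
            ("Plan", self.plan_document),
            ("Implementation notes", self.impl_notes),
        ]
        for label, value in labelled:
            if value:  # skip artifacts that earlier stages have not produced yet
                lines.append(f"{label}: {value}")
        return "\n".join(lines)


ctx = ProjectContext(raw_need="a todo list app")
ctx.clarified_need = "Personal todo app, local JSON storage, Python CLI"
print(ctx.summary_for_agents())
```

Because the summary is rebuilt on every call, an agent sees new artifacts as soon as an earlier stage has written them.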

**Why a shared blackboard instead of message passing?**

Message passing (agent A sends a message to agent B) creates tight coupling and
requires a shared message bus.  A blackboard is simpler: every agent reads and
writes the same object.  This is the pattern used by BMAD's document dependency
chain and SPECKIT's SPEC.md / PLAN.md / TASKS.md artifacts.

---

### 3.2 Skill — a Python function exposed as an LLM tool

**File:** `framework/core/skill.py`

A `Skill` wraps a plain Python function and exposes it to Claude as a tool
definition (JSON Schema).

```python
@dataclass
class Skill:
    name: str          # "write_document"
    description: str   # shown to the LLM to help it decide when to call it
    parameters: dict   # JSON Schema of the function's arguments
    execute: Callable  # the actual Python function
```

Converting a skill to the Anthropic API format is one method:

```python
def to_tool_schema(self) -> dict:
    return {
        "name": self.name,
        "description": self.description,
        "input_schema": self.parameters,   # Anthropic's required field name
    }
```

**The three skills in this framework:**

| Skill | Who has it | What it does | Side effect |
|-------|-----------|-------------|------------|
| `write_document` | All agents | Saves a markdown file to `workspace/` | Updates the matching context field (e.g. `context.spec_document`) |
| `write_code_file` | ImplementationAgent | Writes any source file (`.py`, `.toml`, …) to `workspace/` so code can be run | Appends path to `context.code_files` |
| `advance_stage` | All agents | Signals that the stage is complete | Sets `context.stage_advance_requested = True` |

Agents call skills by *requesting* them in the LLM response — they never call
`skill.execute()` directly.  The agent's tool-use loop dispatches the call.
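
One way to build such a skill is a small factory that closes over the shared context. The sketch below repeats the `Skill` dataclass so it runs on its own; `MiniContext`, `make_advance_stage_skill`, and the description text are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    description: str
    parameters: dict
    execute: Callable

    def to_tool_schema(self) -> dict:
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": self.parameters,
        }


@dataclass
class MiniContext:
    # Stand-in for ProjectContext with only the fields this skill touches.
    stage_advance_requested: bool = False
    stage_advance_summary: str = ""


def make_advance_stage_skill(context: MiniContext) -> Skill:
    """Build the skill as a closure so execute() can reach the shared context."""

    def advance_stage(summary: str) -> str:
        context.stage_advance_requested = True
        context.stage_advance_summary = summary
        return "Stage advance requested"

    return Skill(
        name="advance_stage",
        description="Signal that the current stage is complete.",
        parameters={
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
        execute=advance_stage,
    )


ctx = MiniContext()
skill = make_advance_stage_skill(ctx)
# The tool-use loop would call execute() with the arguments Claude supplied.
result = skill.execute(**{"summary": "Clarified a personal todo app"})
print(result, ctx.stage_advance_requested)
```

The closure keeps the skill's signature limited to the arguments the LLM supplies, while the context stays out of the tool schema.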

---

### 3.3 Agent — role + system prompt + tool-use loop

**File:** `framework/core/agent.py`

Each agent is an instance of the `Agent` class with:
- A **name** and **role** (e.g. "Elicitor / Product Analyst")
- A **system prompt** defining its expertise and instructions
- A **list of skills** it can invoke
- Its own **conversation history** (messages within this stage only)

#### The tool-use loop

This is the heart of the framework.  When an agent calls the LLM, Claude may
respond with text (done) or with a `tool_use` block (it wants to run a skill).

```
Agent.run(user_message)
  │
  ├─ append message to conversation_history
  │
  └─ _tool_use_loop()
       │
       ├─ call Anthropic API with:
       │    - system prompt (base + context summary)
       │    - full conversation history
       │    - tools = [skill.to_tool_schema() for skill in self.skills]
       │
       ├─ if stop_reason == "tool_use":
       │    for each tool_use block in response:
       │      result = _dispatch_skill(block.name, block.input)
       │    append assistant turn (with tool_use blocks) to messages
       │    append user turn (with tool_result blocks) to messages
       │    └─ LOOP AGAIN (Claude gets the tool results and continues)
       │
       └─ if stop_reason == "end_turn":
            extract text from content blocks
            return text
```

In a single turn, Claude may call multiple skills before returning text.
For example the Specification agent calls `write_document` then `advance_stage`
in the same response — the loop handles both before returning.
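
The loop can be exercised without network access. In the sketch below, `FakeClient` is a hypothetical stand-in that replays canned responses; a real call would go through `client.messages.create(...)` with a model, a system prompt, and tool schemas rather than the bare skill names passed here. The message shapes (assistant turn with `tool_use` blocks, user turn with `tool_result` blocks) follow the diagram above.

```python
from types import SimpleNamespace


def text_block(text):
    return SimpleNamespace(type="text", text=text)


def tool_block(block_id, name, args):
    return SimpleNamespace(type="tool_use", id=block_id, name=name, input=args)


class FakeClient:
    """Hypothetical stand-in that replays canned responses in order."""

    def __init__(self, responses):
        self._responses = list(responses)

    def create(self, messages, tools):
        return self._responses.pop(0)


def tool_use_loop(client, messages, skills):
    """Call the model until it stops asking for tools, then return its text."""
    while True:
        response = client.create(messages=messages, tools=list(skills))
        if response.stop_reason != "tool_use":
            # This sketch treats every other stop reason as a final answer.
            return "".join(b.text for b in response.content if b.type == "text")
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = skills[block.name](**block.input)
                results.append(
                    {"type": "tool_result", "tool_use_id": block.id, "content": output}
                )
        # The assistant turn carries the tool_use blocks; the user turn answers them.
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": results})


saved = []


def write_document(filename, content):
    saved.append(filename)
    return f"Saved to workspace/{filename}"


def advance_stage(summary):
    return "Stage advance requested"


skills = {"write_document": write_document, "advance_stage": advance_stage}
client = FakeClient([
    SimpleNamespace(stop_reason="tool_use", content=[
        tool_block("toolu_01", "write_document", {"filename": "spec.md", "content": "# Spec"}),
        tool_block("toolu_02", "advance_stage", {"summary": "Spec written"}),
    ]),
    SimpleNamespace(stop_reason="end_turn", content=[text_block("spec.md saved.")]),
])
messages = [{"role": "user", "content": "Go ahead and write the spec"}]
reply = tool_use_loop(client, messages, skills)
print(reply, saved)
```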

After `_tool_use_loop` returns, `run()` checks:

```python
if self.context.stage_advance_requested:
    self._stage_complete = True
    self.context.stage_advance_requested = False   # consumed
```

This is how stage transitions work: the skill writes to the context, the agent
reads from it, the orchestrator reads from the agent.

---

### 3.4 Orchestrator — the stage-machine router

**File:** `framework/core/orchestrator.py`

The orchestrator holds a `dict[Stage, Agent]` and a reference to the shared
`ProjectContext`.  Its job is purely routing:

```python
def process(self, user_input: str) -> tuple[str, bool]:
    agent = self.agents[self.context.stage]   # pick agent for current stage
    response = agent.run(user_input)           # delegate

    stage_changed = False
    if agent.stage_complete:
        agent.reset_stage_complete()
        stage_changed = self._advance_stage()  # move context.stage forward

    return response, stage_changed
```

`_advance_stage` walks a fixed ordered list:

```
DISCOVERY → SPECIFICATION → PLANNING → IMPLEMENTATION → DONE
```

When `DONE` is reached, `orchestrator.is_done()` returns `True` and the REPL exits.
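
A minimal sketch of the ordered-list walk, assuming the stage order shown above. `next_stage` is a hypothetical helper; the real `_advance_stage` updates `context.stage` and reports whether the stage changed.

```python
from enum import Enum


class Stage(Enum):
    DISCOVERY = "discovery"
    SPECIFICATION = "specification"
    PLANNING = "planning"
    IMPLEMENTATION = "implementation"
    DONE = "done"


# Enum members iterate in definition order, which is the pipeline order here.
STAGE_ORDER = list(Stage)


def next_stage(current: Stage) -> Stage:
    """Return the stage after `current`; DONE is terminal and maps to itself."""
    index = STAGE_ORDER.index(current)
    return STAGE_ORDER[min(index + 1, len(STAGE_ORDER) - 1)]


stage = Stage.DISCOVERY
visited = [stage]
while stage is not Stage.DONE:
    stage = next_stage(stage)
    visited.append(stage)
print(" -> ".join(s.name for s in visited))
```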

**Why a linear state machine?**

It maps directly onto SDD's sequential workflow.  Each stage has a single clear
purpose and must complete before the next begins.  This is intentional for a
learning framework — real frameworks like LangGraph use directed graphs that
allow loops and parallel branches (see Section 8).

---

## 4. Step-by-Step: What Happens During a Run

This traces every event for a single "I want to build a todo list app" input.

### Step 1 — Bootstrapping (`main.py` or `test_run.py`)

```python
client = anthropic.Anthropic(api_key=...)
context = ProjectContext(workspace_dir="workspace")

agents = {
    Stage.DISCOVERY:       make_elicitation_agent(context, client),
    Stage.SPECIFICATION:   make_specification_agent(context, client),
    Stage.PLANNING:        make_planning_agent(context, client),
    Stage.IMPLEMENTATION:  make_implementation_agent(context, client),
}

orchestrator = Orchestrator(context=context, agents=agents)
```

All four agents share the **same** `context` object and the **same** `client`.
Nothing is called yet.

### Step 2 — First user message enters

```
User: "I want to build a todo list app"
```

`orchestrator.process("I want to build a todo list app")` is called.
`context.stage == Stage.DISCOVERY`, so the **ElicitationAgent** receives the message.

### Step 3 — ElicitationAgent calls the LLM

`agent.run()` → `_tool_use_loop()` → Anthropic API call with:
- System prompt: `"You are a senior product analyst..."` + context summary
- Messages: `[{"role": "user", "content": "I want to build..."}]`
- Tools: `[write_document schema, advance_stage schema]`

Claude responds with `stop_reason == "end_turn"` and a text question:
> "Great idea! Who are the main users and what are the 3 core features?"

No skill was called.  The loop exits immediately.  The text is returned to the REPL.

### Step 4 — Several more turns

The user answers the clarifying questions.  Each turn:
1. User input → `orchestrator.process()` → `agent.run()`
2. LLM responds with another question (no tool call yet)
3. Text printed to user

### Step 5 — Elicitation agent decides it has enough information

After the user confirms the scope, Claude's response includes tool_use blocks:

```json
[
  {
    "type": "tool_use",
    "id": "toolu_01",
    "name": "write_document",
    "input": {
      "filename": "needs.md",
      "content": "# Project Needs\n...",
      "doc_type": "needs"
    }
  },
  {
    "type": "tool_use",
    "id": "toolu_02",
    "name": "advance_stage",
    "input": { "summary": "Clarified a personal todo app with 3 features" }
  }
]
```

`stop_reason == "tool_use"`.  The loop:

1. Calls `write_document(filename="needs.md", content="...", doc_type="needs")`
   → saves file to `workspace/needs.md`
   → sets `context.clarified_need = content`
   → returns `"Saved to workspace/needs.md"`

2. Calls `advance_stage(summary="...")`
   → sets `context.stage_advance_requested = True`
   → returns `"Stage advance requested"`

3. Appends both tool results to messages and calls the LLM again.

4. LLM returns a text confirmation: "needs.md saved, moving to Specification..."
   `stop_reason == "end_turn"` → loop exits, text returned.
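
The first call in the list above could be implemented roughly as follows. This is a sketch under stated assumptions: only the `needs` to `clarified_need` mapping is confirmed by the walkthrough, the other entries in `DOC_TYPE_TO_FIELD` are guesses, and passing `context` as the first argument is a simplification (the framework builds the skill through a factory).

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace

# Assumed mapping from doc_type to context field; only "needs" is confirmed above.
DOC_TYPE_TO_FIELD = {
    "needs": "clarified_need",
    "spec": "spec_document",
    "plan": "plan_document",
    "impl_notes": "impl_notes",
}


def write_document(context, filename: str, content: str, doc_type: str) -> str:
    """Save the document under the workspace and mirror it into the context."""
    workspace = Path(context.workspace_dir)
    workspace.mkdir(parents=True, exist_ok=True)
    path = workspace / filename
    path.write_text(content, encoding="utf-8")
    field = DOC_TYPE_TO_FIELD.get(doc_type)
    if field is not None:
        setattr(context, field, content)
    return f"Saved to {path.as_posix()}"


with tempfile.TemporaryDirectory() as tmp:
    ctx = SimpleNamespace(workspace_dir=tmp, clarified_need="")
    message = write_document(ctx, "needs.md", "# Project Needs\n", "needs")
    on_disk = (Path(tmp) / "needs.md").read_text(encoding="utf-8")
    print(message)
```

The returned string is what ends up in the `tool_result` block, so Claude can confirm the save in its next turn.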

### Step 6 — Orchestrator advances stage

Back in `orchestrator.process()`:

```python
if agent.stage_complete:          # True, because advance_stage was called
    agent.reset_stage_complete()
    stage_changed = self._advance_stage()   # context.stage = SPECIFICATION
```

`stage_changed = True` is returned to the REPL, which prints the banner for the new stage.

### Step 7 — SpecificationAgent takes over

Next user message → `orchestrator.process()` → now routes to `SpecificationAgent`.

The agent's system prompt says "read the clarified need from Project Context".
The context summary injected into the system prompt includes:

```
Clarified need: Personal todo app with add/complete/delete tasks, local JSON storage, Python CLI
```

The SpecificationAgent writes `spec.md` with FR-01…FR-N sections, calls
`advance_stage` → orchestrator moves to PLANNING.

### Step 8 — Planning and Implementation follow the same pattern

Each agent reads previous artifacts from the context summary, produces its own
document, and calls `advance_stage` to hand off.

### Step 9 — DONE

When `ImplementationAgent` calls `advance_stage`, the orchestrator sets
`context.stage = Stage.DONE`.  `orchestrator.is_done()` returns `True`.
The REPL prints the completion summary and exits.

**`workspace/` now contains:**
```
needs.md       — clarified requirements
spec.md        — formal functional + non-functional requirements
plan.md        — phased implementation plan with tasks
impl_notes.md  — what was built, design decisions, remaining work
```

---

## 5. File Reference

```
specpilot/
│
├── main.py                    Entry point for interactive CLI
├── config.py                  API key + model + workspace dir settings
├── requirements.txt           anthropic>=0.40.0
├── tests/
│   ├── simple_test.py         Fast smoke test (~2 min, word-count tool)
│   ├── test_run.py            Automated end-to-end test (~5 min)
│   └── persist_test.py        Session save/load round-trip test
├── docs/
│   ├── ROADMAP.md             14 missing features with design sketches
│   ├── INSTALL.md             Installation guide
│   ├── USAGE.md               Usage guide
│   ├── DISTRIBUTION.md        Distribution procedure
│   ├── QUICK_REFERENCE.md     30-minute publishing checklist
│   └── GETTING_OTHERS_TO_USE.md
│
├── framework/
│   ├── core/
│   │   ├── context.py         ProjectContext dataclass + Stage enum
│   │   ├── skill.py           Skill dataclass (wraps Python fn as LLM tool)
│   │   ├── agent.py           Agent base class with tool-use loop
│   │   └── orchestrator.py    Stage-machine router
│   │
│   ├── skills/
│   │   ├── document_writer.py  write_document skill factory
│   │   └── advance_stage.py    advance_stage skill factory
│   │
│   └── agents/
│       ├── elicitation.py      Stage 1: ElicitationAgent
│       ├── specification.py    Stage 2: SpecificationAgent
│       ├── planning.py         Stage 3: PlanningAgent
│       └── implementation.py   Stage 4: ImplementationAgent
│
└── workspace/                 Generated documents land here
    ├── needs.md
    ├── spec.md
    ├── plan.md
    └── impl_notes.md
```

---

## 6. Setup and Installation

### Prerequisites

- Python 3.9 or later
- An Anthropic API key (`sk-ant-...`)

### Install

```powershell
pip install specpilot-ai
```

Or install from source:

```powershell
git clone https://github.com/malif78/specpilot.git
cd specpilot
pip install -e .
```

### API key — one-time setup

Create a `.env` file in the project root (it is git-ignored and never committed):

```
ANTHROPIC_API_KEY=sk-ant-...
```

`config.py` loads this file automatically on every run — no need to set an
environment variable in each terminal session.  A real environment variable
always takes precedence over `.env` if both are set.
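
The precedence rule can be sketched with a small loader. How `config.py` actually reads the file is not shown here, so `load_dotenv_file` and the `DEMO_*` variable names are hypothetical; the point is only that `os.environ.setdefault` never overwrites a variable that is already set.

```python
import os
import tempfile
from pathlib import Path


def load_dotenv_file(path: Path) -> None:
    """Copy KEY=VALUE pairs into os.environ without overriding existing variables."""
    if not path.exists():
        return
    for raw_line in path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        # setdefault keeps a real environment variable if one is already set.
        os.environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "DEMO_FROM_FILE=from-file\nDEMO_ALREADY_SET=from-file\n", encoding="utf-8"
    )
    os.environ["DEMO_ALREADY_SET"] = "from-shell"
    os.environ.pop("DEMO_FROM_FILE", None)
    load_dotenv_file(env_file)

print(os.environ["DEMO_FROM_FILE"], os.environ["DEMO_ALREADY_SET"])
```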

---

## 7. Running the Tests

### 7.1 Quick smoke test

`simple_test.py` runs a focused demo on a word-count CLI tool — all four stages
in roughly 2 minutes (~8 API calls).  Good for a fast sanity check.

```powershell
python tests/simple_test.py
```

**Expected workspace output:**

```
workspace_simple/
  needs.md          spec.md
  plan.md           impl_notes.md
  wc_tool.py        pyproject.toml   ← actual runnable code
```

Run the generated tool immediately after:

```powershell
python workspace_simple\wc_tool.py README.md
```

### 7.2 Full automated end-to-end test

`test_run.py` drives all four stages using a longer scripted conversation
about a "personal expense tracker CLI" (~5 minutes, ~20 API calls).

```powershell
python tests/test_run.py
```

**What it does:**

| Phase | Scripted messages sent | Expected agent behaviour |
|-------|----------------------|--------------------------|
| Discovery (3 turns) | Describes app, answers clarifying questions | Asks 1-2 questions per turn, writes `needs.md`, advances |
| Specification (1 turn) | "Go ahead and write the spec" | Writes `spec.md` with FR-01…, advances |
| Planning (1 turn) | Confirms stdlib + argparse stack | Writes `plan.md` with phases + tasks, advances |
| Implementation (3 turns) | "Start Phase 1", "next phase", "done" | Writes source files + `impl_notes.md`, advances |

**Expected output (final lines):**

```
Final stage   : done
Has spec doc  : yes
Has plan doc  : yes
Has impl notes: yes

Workspace files:
  impl_notes.md  (≈3 KB)
  needs.md       (≈1 KB)
  plan.md        (≈6 KB)
  spec.md        (≈4 KB)
```

**To re-run cleanly** (fresh workspace):

```powershell
Remove-Item workspace\*.md
python tests/test_run.py
```

### 7.3 Interactive CLI

`main.py` runs the real conversational REPL.  Type your own application idea.

```powershell
python main.py
```

**Example session:**

```
------------------------------------------------------------
  SDD Framework  —  Specification-Driven Development
------------------------------------------------------------

Type your idea below.  'quit' or Ctrl-C to exit.

------------------------------------------------------------
  DISCOVERY      — Understanding your need
------------------------------------------------------------

[Elicitor · Product Analyst] Welcome! Tell me about your idea...

You > I want to build a recipe manager web app
...
```

Type `quit` to exit at any point.

### 7.4 Reading the workspace output

After a run, inspect the generated documents:

```powershell
# List all generated files with sizes
Get-ChildItem workspace\

# Read the spec
Get-Content workspace\spec.md

# Read the plan
Get-Content workspace\plan.md
```

**What good output looks like:**

- `needs.md` — 300-600 words, mentions: problem, users, 3-5 MVP features, constraints
- `spec.md` — structured with `FR-01`…`FR-N` numbered requirements, NFRs, out-of-scope
- `plan.md` — 3-6 phases, each phase has checkboxed tasks naming specific files/modules
- `impl_notes.md` — records what was built, design decisions, remaining phases
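
Some of these expectations can be checked mechanically. The sketch below is not part of the framework; `check_spec` and `count_plan_tasks` are hypothetical helpers that look for contiguous `FR-NN` identifiers and markdown checkbox items.

```python
import re

FR_ID = re.compile(r"\bFR-(\d{2,})\b")
CHECKBOX = re.compile(r"^\s*[-*] \[[ xX]\] ", re.MULTILINE)


def check_spec(text: str) -> list:
    """Return a list of problems found in a spec.md body (empty means it passed)."""
    problems = []
    numbers = sorted({int(n) for n in FR_ID.findall(text)})
    if not numbers:
        problems.append("no FR-NN requirements found")
    elif numbers != list(range(1, len(numbers) + 1)):
        problems.append("FR numbering has gaps or does not start at FR-01")
    return problems


def count_plan_tasks(text: str) -> int:
    """Count markdown checkbox items such as '- [ ] create cli.py'."""
    return len(CHECKBOX.findall(text))


spec = "## Functional requirements\n- FR-01 Add a task\n- FR-02 Complete a task\n"
plan = "## Phase 1\n- [ ] create storage.py\n- [x] create cli.py\n"
print(check_spec(spec), count_plan_tasks(plan))
```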

---

## 8. What Is Missing: Gap Analysis vs BMAD and SPECKIT

This framework is a learning skeleton.  Below is an honest comparison with two
production-grade SDD frameworks and a full list of gaps.

### 8.1 BMAD-METHOD

BMAD (Breakthrough Method for Agile AI-Driven Development) is an open-source
framework that orchestrates 12+ specialized AI agents through a full agile
workflow, with IDE integration for Claude Code, Cursor, and VSCode.

| BMAD feature | Our framework | Gap |
|---|---|---|
| **12+ specialized agents** (Analyst, PM, Architect, Scrum Master, QA, Dev, PO…) | 4 hardcoded agents | Only 4 agents with fixed roles; no configurable personas |
| **Adaptive complexity** — same workflow scales from a bug fix to an enterprise platform | Single fixed 4-stage pipeline | No way to skip stages, add stages, or loop back |
| **Cross-agent delegation** — agents can hand off sub-tasks to other agents | None | Agents are isolated; no inter-agent messaging |
| **Quality gates and checklists** between stages | None | No formal accept/reject between stages; an agent can advance prematurely |
| **BMad Builder** — users build and share custom agents | Agents are Python classes only | No plugin system; adding an agent requires code changes |
| **Agile artifacts** — user stories, sprint backlog, acceptance criteria | Only 4 markdown docs | No user story format, no backlog, no sprint concept |
| **Session persistence** — resume a project across sessions | Implemented via `workspace/.session.json` (see Section 8.5) | Closed: context and agent histories are saved after every agent turn |
| **Multiple LLM support** | Claude only | No model routing or fallback |

### 8.2 SPECKIT (GitHub Spec Kit)

SPECKIT treats specifications as executable, first-class artifacts.  Its key
innovations are **context discovery hooks** (probing the codebase before planning)
and **validation hooks** (checking artifacts after each stage).

| SPECKIT feature | Our framework | Gap |
|---|---|---|
| **7-phase workflow** (Constitution → Specification → Clarification → Planning → Task Breakdown → Implementation → Validation) | 4 phases | Missing: Constitution (project governance), Task Breakdown (granular task list), Validation (post-implementation checks) |
| **Context discovery hooks** — agents read existing code/APIs/conventions before planning | None | Agents have no awareness of an existing codebase; they hallucinate file names and APIs |
| **Validation hooks** — post-phase checks verify artifacts (do the files exist? do the tests pass?) | None | No verification that what was planned actually got built |
| **SPEC.md → PLAN.md → TASKS.md pipeline** — each artifact is a typed, structured document | Freeform markdown | Documents have no enforced schema; a misbehaving agent could produce garbage |
| **Agent-agnostic** — works with any AI assistant (Claude, Copilot, Gemini, Cursor…) | Claude only | Hard dependency on Anthropic SDK |
| **Customization presets and extensions** | None | No configuration file; all customization requires Python code changes |
| **Task tracking** — TASKS.md with explicit done/not-done state | impl_notes.md is prose | No machine-readable task state; cannot resume mid-plan |

### 8.3 Full Gap List

The following features exist in production frameworks.  Most are absent here;
rows marked *implemented* have since been closed (see Section 8.5).  They are
roughly ordered from highest to lowest impact.

#### Persistence and Memory

| Gap | Description | How to add it |
|-----|-------------|---------------|
| **Session resumption** (*implemented*) | Killing the process used to lose all context | `ProjectContext` is serialized to `workspace/.session.json` after every agent turn and can be reloaded on startup |
| **No cross-session memory** | Agents forget previous projects | Add a vector store (ChromaDB, FAISS) indexed by project; inject relevant past decisions into system prompts |
| **Agent conversation history** (*implemented*) | Each agent's conversation history used to reset per run | `agent.conversation_history` is saved in the session file alongside the context |

#### Orchestration

| Gap | Description | How to add it |
|-----|-------------|---------------|
| **Linear only** | Stages go forward only; no loops, no branches | Replace the list-based state machine with a directed graph (LangGraph pattern); add loop-back edges for "needs more clarification" |
| **No parallel agents** | Agents run sequentially | Use `asyncio` + `asyncio.gather` to run independent agents concurrently (e.g., Architect and QA reviewing the spec simultaneously) |
| **No agent delegation** | An agent cannot spawn a sub-agent | Add a `delegate_to(agent_name, task)` skill that calls another agent as a sub-task |
| **No human-in-the-loop gates** | Stages advance automatically when an agent says so | Add a formal approval step — pause, show the user the artifact, require explicit "approve" or "request changes" |
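
The parallel-agents row can be illustrated with `asyncio.gather`. In this sketch `review` is a hypothetical coroutine whose `asyncio.sleep` stands in for an awaited LLM call; nothing like it exists in the framework today.

```python
import asyncio


async def review(agent_name: str, spec: str, delay: float) -> str:
    """Hypothetical reviewer; the sleep stands in for an awaited LLM call."""
    await asyncio.sleep(delay)
    return f"{agent_name}: reviewed {len(spec.split())} words"


async def review_in_parallel(spec: str) -> list:
    # gather() runs both coroutines concurrently and preserves argument order.
    return await asyncio.gather(
        review("Architect", spec, 0.02),
        review("QA", spec, 0.01),
    )


results = asyncio.run(review_in_parallel("FR-01 add a task"))
print(results)
```

Concurrent agents that write to the shared `ProjectContext` would also need a rule for who may update which field, which this sketch does not address.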

#### Context and Grounding

| Gap | Description | How to add it |
|-----|-------------|---------------|
| **No codebase discovery** | Agents don't know the existing project structure | Add a `discover_context` skill that runs `git ls-files`, reads key files, and injects findings into the planning stage |
| **No web/doc search** | Agents can't look up libraries, APIs, or standards | Add a `web_search` skill backed by a search API |
| **No RAG** | No retrieval of relevant past decisions or docs | Add vector-search over the workspace documents so later agents can query earlier artifacts semantically |

#### Output Quality

| Gap | Description | How to add it |
|-----|-------------|---------------|
| **No artifact schema validation** | Agents can produce malformed documents | Define JSON schemas for each document type; parse the LLM output and retry if validation fails |
| **No retry / fallback logic** | Any API error or bad output crashes the run | Wrap `_tool_use_loop` in exponential backoff; add an output validator that triggers a re-prompt on failure |
| **No output evaluation** | No way to score whether the spec is complete | Add an Evaluator agent that scores each artifact against a rubric and returns a pass/fail with feedback |
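
The retry row could look something like this. `call_with_backoff` is a hypothetical helper, and the `sleep` parameter exists only so the example can record delays instead of waiting; a real wrapper would list the SDK's transient error types in `retry_on`.

```python
import random
import time


def call_with_backoff(fn, retries=4, base_delay=0.5, retry_on=(Exception,), sleep=time.sleep):
    """Call fn(); on a listed exception wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


attempts = {"count": 0}


def flaky_call():
    # Fails twice, then succeeds, to imitate a transient API error.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []
outcome = call_with_backoff(flaky_call, retry_on=(ConnectionError,), sleep=delays.append)
print(outcome, len(delays))
```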

#### Developer Experience

| Gap | Description | How to add it |
|-----|-------------|---------------|
| **No streaming** | Responses appear all at once (blocking) | Use `client.messages.stream()` and print tokens as they arrive |
| **No async** | Everything is synchronous; UI freezes during LLM calls | Rewrite `_tool_use_loop` with `asyncio`; switch to the SDK's `AsyncAnthropic` client and `await client.messages.create(...)` |
| **No observability** | No tracing, token counts, or cost tracking | Log every LLM call with timestamp, tokens in/out, cost; integrate with LangSmith or a custom logger |
| **No prompt versioning** | System prompts are hardcoded strings | Move prompts to YAML/TOML files; version them in git; A/B test variants |
| **Hardcoded agents** | Adding a new agent requires Python code | Define agents in a config file (YAML); the framework loads them dynamically |
| **No tool library** | Only 3 skills available (`write_document`, `write_code_file`, `advance_stage`) | Add: `run_code`, `read_file`, `search_web`, `run_tests`, `create_github_issue`, `send_email`, … |

### 8.4 Summary Table

```
Feature                        Our Framework   SPECKIT   BMAD
--------------------------------------------------------------
Core SDD workflow                   Y            Y        Y
Multi-stage artifacts               Y            Y        Y
Tool use (skills)                   Y            Y        Y
Session persistence                 Y            Y        Y
Codebase discovery hooks            N            Y        N
Post-stage validation               N            Y        N
Non-linear orchestration            N            N        Y
12+ specialized agents              N            N        Y
Human-in-the-loop gates             N            Y        Y
Long-term memory / RAG              N            N        Y
Parallel agent execution            N            N        N
Streaming responses                 N            Y        Y
Artifact schema validation          N            Y        N
Retry / fallback logic              N            Y        N
Observability / tracing             N            N        Y
Configurable agents (no code)       N            Y        Y
Multi-LLM support                   N            Y        N
```

### 8.5 Session Persistence (Implemented)

Session persistence has been implemented.  Several of the remaining gaps build
on it: evaluation pipelines and long-term memory both need state that survives
a process restart.

#### How it works

```
workspace/
  .session.json        ← written atomically after every agent turn
  needs.md
  spec.md
  plan.md
  impl_notes.md
```

The session file stores two things:

1. **Context snapshot** — all `ProjectContext` fields serialized as JSON.
   The `stage` enum is stored as its string value (`"planning"`).
   Transient flags (`stage_advance_requested`) are always reset to `False`.

2. **Agent conversation histories** — each agent's per-stage message list,
   keyed by stage name.  This is what allows an agent to resume mid-conversation
   without re-asking questions the user has already answered.

```json
{
  "version": 1,
  "saved_at": "2026-05-27T13:47:55",
  "context": {
    "raw_need": "a note-taking CLI app",
    "clarified_need": "...",
    "spec_document": "...",
    "stage": "planning",
    "workspace_dir": "workspace"
  },
  "agent_histories": {
    "discovery":       [{"role": "user", "content": "..."}, ...],
    "specification":   [...],
    "planning":        [...],
    "implementation":  [...]
  }
}
```
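
A snapshot of this shape could be produced along the following lines. `MiniContext` and `snapshot` are hypothetical names, and storing the transient flag as an explicit `False` (rather than omitting it) is an assumption of this sketch.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from enum import Enum


class Stage(Enum):
    DISCOVERY = "discovery"
    PLANNING = "planning"


@dataclass
class MiniContext:
    # Reduced stand-in for ProjectContext.
    raw_need: str = ""
    stage: Stage = Stage.DISCOVERY
    stage_advance_requested: bool = False
    workspace_dir: str = "workspace"


def snapshot(context: MiniContext, agent_histories: dict) -> dict:
    """Build a JSON-serializable session payload from the live objects."""
    data = asdict(context)
    data["stage"] = context.stage.value      # enum stored as its string value
    data["stage_advance_requested"] = False  # transient flag is never persisted as True
    return {
        "version": 1,
        "saved_at": datetime.now().isoformat(timespec="seconds"),
        "context": data,
        "agent_histories": agent_histories,
    }


live = MiniContext(raw_need="a note-taking CLI app", stage=Stage.PLANNING,
                   stage_advance_requested=True)
payload = snapshot(live, {"discovery": [{"role": "user", "content": "hi"}]})
decoded = json.loads(json.dumps(payload))
print(decoded["context"]["stage"], decoded["context"]["stage_advance_requested"])
```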

#### Key design decisions

| Decision | Reason |
|----------|--------|
| **In-place context restore** (`restore_from_dict`) | All agents hold a reference to the same context object.  Replacing it with a new one would leave agents pointing at stale data. |
| **Atomic write** (temp file → `os.replace`) | A crash mid-save never produces a corrupt session file — the old file remains intact until the new one is fully written. |
| **Transient flags not persisted** | `stage_advance_requested` is an in-flight signal, not state.  Persisting it could cause the stage to advance twice on resume. |
| **Session deleted on DONE** | A completed project should start fresh next time.  The workspace documents (`spec.md`, etc.) are the durable artifacts — the session file is scaffolding. |
| **`session_metadata()` fast-read** | The resume prompt reads only the small metadata header, not the full document content, so the prompt appears instantly even for large sessions. |
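
The first two decisions can be sketched together. `save_atomically` and `restore_in_place` are hypothetical names (the framework's own method is `restore_from_dict`), and the sketch skips converting the stored stage string back into the `Stage` enum.

```python
import json
import os
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_atomically(path: Path, payload: dict) -> None:
    """Write to a temp file in the same directory, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_name, path)  # the old file stays intact until this swap
    except BaseException:
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def restore_in_place(context, data: dict) -> None:
    """Mutate the existing object so every agent's reference stays valid."""
    for key, value in data.items():
        setattr(context, key, value)


with tempfile.TemporaryDirectory() as tmp:
    session = Path(tmp) / ".session.json"
    save_atomically(session, {"context": {"raw_need": "a note-taking CLI app"}})
    loaded = json.loads(session.read_text(encoding="utf-8"))
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != ".session.json"]

shared = SimpleNamespace(raw_need="")
agent_view = shared  # an agent holding a reference to the same context object
restore_in_place(shared, loaded["context"])
print(agent_view.raw_need, leftovers)
```

Creating the temp file next to the target keeps both paths on one filesystem, which is what lets `os.replace` swap them in a single step.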

#### Resume flow in `main.py`

```
python main.py
  │
  ├─ build_orchestrator()          — fresh context + agents (all blank)
  │
  ├─ _maybe_resume()
  │     ├─ session_metadata()      — fast-read: stage, saved_at, raw_need preview
  │     ├─ print resume prompt
  │     └─ if Y: orchestrator.load_session()
  │               ├─ context.restore_from_dict()  — fills all context fields
  │               └─ agent.conversation_history = saved_history  (per stage)
  │
  └─ run_repl(resumed=True/False)
        ├─ if fresh: send opening message → ElicitationAgent greets user
        └─ if resumed: skip opening message → user types next message directly
```

#### Running the persistence test

```powershell
python tests/persist_test.py
```

This test verifies the full round-trip without running the complete pipeline:
- Sends 2 turns to the elicitation agent (makes 2 real API calls)
- Asserts the session file was written correctly
- Builds a brand-new orchestrator (simulating a restart)
- Loads the session and asserts every field and history message matches
- Verifies `session_metadata()` fast-read
- Verifies `delete_session()` removes the file

Several of the remaining gaps (memory, RAG, validation) build on top of persistence.
