Metadata-Version: 2.4
Name: agentic-forge-core
Version: 1.1.0
Summary: Framework-agnostic, enterprise-grade, self-improving AI agent infrastructure.
License: MIT
Requires-Python: >=3.12
Requires-Dist: anthropic>=0.40.0
Requires-Dist: celery[redis]>=5.4.0
Requires-Dist: dataclass-wizard>=0.39.0
Requires-Dist: gitpython>=3.1.43
Requires-Dist: httpx>=0.27.0
Requires-Dist: mcp>=1.2.0
Requires-Dist: pydantic-settings>=2.6.0
Requires-Dist: pydantic>=2.9.0
Requires-Dist: python-dotenv>=1.0.1
Requires-Dist: pyyaml>=6.0.2
Requires-Dist: sqlalchemy>=2.0.30
Requires-Dist: tiktoken>=0.8.0
Provides-Extra: all-frameworks
Requires-Dist: agentforge[autogen,crewai,dashboard,langchain,langgraph,openai]; extra == 'all-frameworks'
Provides-Extra: autogen
Requires-Dist: pyautogen>=0.3.2; extra == 'autogen'
Provides-Extra: crewai
Requires-Dist: crewai>=0.80.0; extra == 'crewai'
Provides-Extra: dashboard
Requires-Dist: fastapi>=0.115.0; extra == 'dashboard'
Requires-Dist: passlib[bcrypt]>=1.7.4; extra == 'dashboard'
Requires-Dist: prometheus-fastapi-instrumentator>=7.0.0; extra == 'dashboard'
Requires-Dist: python-jose[cryptography]>=3.3.0; extra == 'dashboard'
Requires-Dist: python-multipart>=0.0.12; extra == 'dashboard'
Requires-Dist: uvicorn>=0.32.0; extra == 'dashboard'
Requires-Dist: weasyprint>=63.0; extra == 'dashboard'
Provides-Extra: dev
Requires-Dist: factory-boy>=3.3.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.14.0; extra == 'dev'
Requires-Dist: pytest>=8.3.0; extra == 'dev'
Requires-Dist: ruff>=0.7.0; extra == 'dev'
Provides-Extra: langchain
Requires-Dist: langchain-anthropic>=0.3.0; extra == 'langchain'
Requires-Dist: langchain-openai>=0.2.0; extra == 'langchain'
Requires-Dist: langchain>=0.3.0; extra == 'langchain'
Provides-Extra: langgraph
Requires-Dist: langchain-anthropic>=0.3.0; extra == 'langgraph'
Requires-Dist: langgraph>=0.2.28; extra == 'langgraph'
Provides-Extra: memory
Requires-Dist: sentence-transformers>=3.3.0; extra == 'memory'
Requires-Dist: torch>=2.5.0; extra == 'memory'
Provides-Extra: openai
Requires-Dist: openai>=1.57.0; extra == 'openai'
Provides-Extra: postgres
Requires-Dist: psycopg2-binary>=2.9.9; extra == 'postgres'
Provides-Extra: redis
Requires-Dist: redis>=5.0.1; extra == 'redis'
Description-Content-Type: text/markdown

<div align="center">

<br/>

```
   ___                    _   _____
  / _ \                  | | |  ___|
 / /_\ \ __ _  ___ _ __ | |_| |_ ___  _ __ __ _  ___
 |  _  |/ _` |/ _ \ '_ \| __|  _/ _ \| '__/ _` |/ _ \
 | | | | (_| |  __/ | | | |_| || (_) | | | (_| |  __/
 \_| |_/\__, |\___|_| |_|\__\_| \___/|_|  \__, |\___|
         __/ |                              __/ |
        |___/                              |___/
```

**Self-Improving AI Agent Infrastructure**

*Forge agents worth trusting.*

<br/>

[![Python](https://img.shields.io/badge/Python-3.12+-3776AB?style=flat-square&logo=python&logoColor=white)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-0.115+-009688?style=flat-square&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-16-336791?style=flat-square&logo=postgresql&logoColor=white)](https://postgresql.org)
[![Redis](https://img.shields.io/badge/Redis-7-DC382D?style=flat-square&logo=redis&logoColor=white)](https://redis.io)
[![Celery](https://img.shields.io/badge/Celery-5.4-37814A?style=flat-square&logo=celery&logoColor=white)](https://docs.celeryq.dev)
[![Next.js](https://img.shields.io/badge/Next.js-15-000000?style=flat-square&logo=next.js&logoColor=white)](https://nextjs.org)
[![Docker](https://img.shields.io/badge/Docker-Compose-2496ED?style=flat-square&logo=docker&logoColor=white)](https://docker.com)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow?style=flat-square)](LICENSE)

<br/>

**Framework Agnostic** · **Git-Backed** · **Eval-Driven** · **Enterprise-Grade** · **Fully Observable**

<br/>

[**Quick Start**](#-quick-start) · [**How It Works**](#-how-it-works) · [**Supported Frameworks**](#-supported-frameworks) · [**Documentation**](#-documentation) · [**Dashboard**](#-trust-dashboard) · [**Roadmap**](#-roadmap)

<br/>

</div>

---

## What is AgentForge?

AgentForge is an open-source infrastructure system that makes AI agents continuously improve themselves — automatically, overnight, without human intervention between iterations.

It applies a **Karpathy-style research loop** (originally designed for neural network training) to any AI agent regardless of framework: LangChain, LangGraph, CrewAI, AutoGen, raw Anthropic/OpenAI SDK, or any HTTP/subprocess-based agent. The loop proposes one targeted change to the agent's specification per iteration, evaluates whether that change improved a measurable score, and commits or reverts via git.

You write the goals (eval suite). AgentForge runs the iterations.

```
Before AgentForge:          After 50 iterations:
Agent pass rate: 44%   →    Agent pass rate: 89%
Manual prompt edits: ∞ →    Git commits: 28  |  Reverts: 22  |  Cost: $14.80
```

### The Problem It Solves

AI agents deployed in production degrade over time: failure modes accumulate, prompts drift, and user expectations shift. Traditional fixes require an engineer to manually review outputs, rewrite prompts, and redeploy — a slow, expensive, bottlenecked process.

AgentForge automates this cycle entirely. Each agent runs its own improvement loop, generating a full audit trail of every change, every score, and every decision. When something breaks, you `git revert`. When something works, it's committed to history.

---

## ✨ Key Features

- **🔄 Self-Improving Loop** — Automated propose → eval → score → commit/revert cycle that runs unattended
- **🔌 Framework Agnostic** — Works with LangChain, LangGraph, CrewAI, AutoGen, raw Anthropic/OpenAI, HTTP APIs, CLI subprocesses, or any custom agent
- **📊 Binary Eval Engine** — LLM-judge grader evaluates weighted binary assertions per output; deterministic score per iteration
- **🧠 Memory Architecture** — Episodic memory with pgvector (local embeddings via sentence-transformers, zero API cost), semantic knowledge base, cross-agent learning
- **🎯 Context Management** — Four-tier context stack with token budget enforcement and automatic compression
- **🏗️ Multi-Agent Orchestration** — Parallel improvement loops across an entire agent fleet; dependency-aware scheduling
- **📈 Trust Dashboard** — Per-agent real-time visibility: score timeline, assertion health matrix, git diff viewer, live loop feed, cost tracker
- **🔒 Enterprise-Ready** — Multi-tenancy, JWT auth, row-level security, audit log, Prometheus metrics, rate limiting
- **💰 Free Infrastructure** — PostgreSQL + pgvector, Redis, MinIO, Prometheus, Grafana — zero infrastructure cost outside LLM API calls
- **🔁 Git-Native** — Every improvement is a structured git commit. Every failure is a `git reset`. Full history always recoverable

---

## 🚀 Quick Start

### Prerequisites

- Python 3.12+
- Docker + Docker Compose
- An Anthropic API key (or OpenAI — any supported LLM provider)

### 1. Clone & Install

```bash
git clone https://github.com/yourusername/agentforge.git
cd agentforge

# Install core dependencies
pip install -e "."

# Install for your specific agent framework (pick what you need)
pip install -e ".[langchain]"     # LangChain agents
pip install -e ".[langgraph]"     # LangGraph agents
pip install -e ".[crewai]"        # CrewAI agents
pip install -e ".[autogen]"       # AutoGen agents
pip install -e ".[all-frameworks]" # Everything
```

### 2. Configure

```bash
cp .env.example .env
# Edit .env and set:
# ANTHROPIC_API_KEY=sk-ant-...
# SECRET_KEY=your-random-32-char-string
```

### 3. Start Infrastructure

```bash
# Start PostgreSQL, Redis, MinIO (all free, all local)
docker compose up postgres redis minio -d

# Apply database schema
docker compose exec postgres psql -U agentforge -d agentforge \
  -f /docker-entrypoint-initdb.d/init.sql
```

### 4. Baseline Your Agent

```bash
# Score the example agent before any improvements
python cli.py eval --agent agents/example-agent

# Output:
# Score: 44.0%
#   Case 1: 40% (2/5)  — cache explanation
#   Case 2: 60% (3/5)  — recursion explanation
#   Case 3: 40% (2/5)  — race condition
# Top failing: a1_1, a2_1, a3_1
```

### 5. Run the Loop

```bash
# Run 10 improvement iterations with a $3 budget ceiling
python cli.py run --agent agents/example-agent --max-iterations 10 --budget 3.0

# 🔥 AgentForge | agent-example-v1 | budget=$3.00
#   iter=1 mode=exploitation score=44.0% failing=6
#   📝 REPLACE in SYSTEM_PROMPT (HIGH confidence)
#   new_score=56.0% delta=+12.0% regressions=0
#   ✅ COMMIT a3f92b1
#   iter=2 mode=exploitation score=56.0% failing=4
#   ...
# 🏁 Done | score=78.0% | commits=6 | $2.84
```

### 6. Full Dashboard

```bash
docker compose up --build
open http://localhost
```

---

## 🔧 How It Works

### The Improvement Loop

```
┌─────────────────────────────────────────────────────────────────┐
│                         THE LOOP                                │
│                                                                 │
│  1. READ      Current AGENT.md (the mutable spec)               │
│  2. PROPOSE   LLM proposes exactly ONE targeted change          │
│  3. APPLY     Change applied to a working copy                  │
│  4. EVAL      All N test prompts × M binary assertions run      │
│  5. SCORE     weighted_passed / total_weight × 100              │
│  6. DECIDE    Improved AND no regressions?                      │
│               YES → git commit  |  NO → git reset               │
│  7. LOG       eval-log.md + database                            │
│  8. REPEAT                                                      │
│                                                                 │
│  STOPS when: perfect score | budget hit | plateau 5 iters       │
└─────────────────────────────────────────────────────────────────┘
```
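The commit-or-revert decision in step 6 reduces to two conditions. A minimal sketch of that logic — the names `IterationResult` and `decide` are illustrative, not the actual `loop/runner.py` API:

```python
from dataclasses import dataclass

@dataclass
class IterationResult:
    old_score: float   # score before the proposed change
    new_score: float   # score with the change applied
    regressions: int   # previously-passing assertions that now fail

def decide(result: IterationResult) -> str:
    """Step 6: commit only if the score improved AND nothing regressed."""
    if result.new_score > result.old_score and result.regressions == 0:
        return "git commit"
    return "git reset"

print(decide(IterationResult(old_score=44.0, new_score=56.0, regressions=0)))  # git commit
print(decide(IterationResult(old_score=56.0, new_score=58.0, regressions=1)))  # git reset
```

Note that a net score gain with even one regression still reverts: the loop prefers monotonic progress over raw score.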

### The Three Iron Rules

| Rule | Why |
|------|-----|
| **One change per iteration, always** | Two simultaneous changes break attribution — you can't know what caused an improvement |
| **Evals are human-written and proposer-blind** | The LLM proposer never sees eval prompts — only assertion pass/fail results. Humans set the ground truth |
| **Git is the safety net** | Every improvement is a commit. Every failure is a `git reset`. The full history is always recoverable |

### The AGENT.md Spec

The only file the loop modifies. Plain markdown, section-delimited, framework-neutral:

```markdown
---
agent_id: my-agent-v1
framework: langchain        ← determines which adapter loads
model: claude-sonnet-4-6
current_score: 78.4
loop_iterations: 31
---

## [SYSTEM_PROMPT]
<!-- MUTABLE: proposer may rewrite this section -->
You are a senior copywriter...
<!-- END_SYSTEM_PROMPT -->

## [FEW_SHOT_EXAMPLES]
<!-- MUTABLE: proposer may add, remove, or replace examples -->
### Example 1
Input: "..."
Output: "..."
<!-- END_FEW_SHOT_EXAMPLES -->

## [TOOL_ROUTING_LOGIC]   ← When to use which tool
## [RETRIEVAL_CONFIG]     ← k, threshold, chunk_strategy
## [MODEL_CONFIG]         ← model per task type
## [CONSTRAINTS]          ← hard rules
```

The loop can modify any section. Front matter (version, score, timestamps) is managed by the system automatically.
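Because every section is bounded by a heading and an `END_` comment, extracting a mutable section is a single regex. A sketch of how a parser might pull the system prompt out of the spec text — illustrative only, not the real `core/spec/` parser:

```python
import re

def extract_section(spec_text: str, name: str) -> str:
    """Pull the body of a ## [NAME] section, bounded by its END_NAME comment."""
    pattern = rf"## \[{name}\]\n(.*?)<!-- END_{name} -->"
    match = re.search(pattern, spec_text, re.DOTALL)
    if match is None:
        raise KeyError(f"Section not found: {name}")
    body = re.sub(r"<!--.*?-->\n?", "", match.group(1))  # drop MUTABLE marker comments
    return body.strip()

spec = """## [SYSTEM_PROMPT]
<!-- MUTABLE: proposer may rewrite this section -->
You are a senior copywriter...
<!-- END_SYSTEM_PROMPT -->"""

print(extract_section(spec, "SYSTEM_PROMPT"))  # You are a senior copywriter...
```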

### The Eval Suite

```json
{
  "agent_id": "my-agent-v1",
  "evals": [
    {
      "id": 1,
      "prompt": "Write a LinkedIn post about AI dashboards",
      "assertions": [
        {"id":"a1_1","text":"First line is a standalone sentence","weight":2.0},
        {"id":"a1_2","text":"Contains at least one specific number","weight":1.5},
        {"id":"a1_3","text":"Word count is under 300","type":"numeric","weight":1.0},
        {"id":"a1_4","text":"Does not contain the word synergy","weight":2.0}
      ],
      "held_out": false
    }
  ]
}
```

Score = `sum(passed × weight) / sum(all weights) × 100`

The grader uses a judge LLM (Haiku by default — cheapest) to evaluate each assertion independently: *"Is this assertion TRUE or FALSE about this output?"*
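The scoring formula maps directly onto a few lines of Python. This is a sketch of the arithmetic only, not the actual `eval/runner.py` implementation:

```python
def weighted_score(results: list[tuple[bool, float]]) -> float:
    """Score = sum(passed × weight) / sum(all weights) × 100."""
    total_weight = sum(weight for _, weight in results)
    passed_weight = sum(weight for passed, weight in results if passed)
    return passed_weight / total_weight * 100

# The four assertions from the suite above: suppose a1_1 and a1_4 pass.
score = weighted_score([(True, 2.0), (False, 1.5), (False, 1.0), (True, 2.0)])
print(f"{score:.1f}%")  # 61.5%
```

Weights let you make a critical assertion (weight 2.0) count twice as much as a nice-to-have (weight 1.0) without leaving the 0–100 scale.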

---

## 🔌 Supported Frameworks

AgentForge never modifies your agent's code. It modifies the spec. The Adapter Layer translates the spec into whatever your framework needs.

| Framework | Status | What AGENT.md Controls |
|-----------|--------|----------------------|
| **Anthropic SDK** (raw) | ✅ Full support | System prompt, few-shot, tools, model |
| **OpenAI SDK** (raw) | ✅ Full support | System prompt, few-shot, functions, model |
| **LangChain** | ✅ Full support | ChatPromptTemplate, chain config, LLM selection |
| **LangGraph** | ✅ Full support | Node system prompts (via `build_graph(spec)` hook) |
| **CrewAI** | ✅ Full support | Agent role, goal, backstory, tool list |
| **AutoGen** | ✅ Full support | AssistantAgent system message, config list |
| **HTTP API** | ✅ Full support | Full spec passed as JSON — any language/framework |
| **CLI Subprocess** | ✅ Full support | Full spec passed via stdin — Node.js, Go, Rust, etc. |
| *Any other* | ✅ Add adapter | Implement `BaseAdapter.run()` — 30 lines |

### Adding a New Framework

```python
# core/adapters/my_framework.py
from core.adapters.base import BaseAdapter, AgentRunResult

class MyFrameworkAdapter(BaseAdapter):
    def run(self, prompt: str, context=None) -> AgentRunResult:
        # Translate self.spec into your framework's format
        # Run the agent
        # Return AgentRunResult(output=..., cost_usd=..., ...)
        ...

    def health_check(self) -> bool: ...
    def get_framework_name(self) -> str: return "my-framework"
```

Then add one line to `ADAPTER_REGISTRY` in `core/agent_runner.py`. The entire loop, eval engine, dashboard, and git manager work without any other changes.
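The registry is a plain mapping from the spec's `framework` field to an adapter class. A hypothetical sketch of that lookup — the stub classes and `load_adapter` helper here are invented for illustration; the real `ADAPTER_REGISTRY` lives in `core/agent_runner.py`:

```python
# Stub adapter classes stand in for the real implementations.
class AnthropicAdapter: ...
class LangChainAdapter: ...
class MyFrameworkAdapter: ...

ADAPTER_REGISTRY = {
    "anthropic-sdk": AnthropicAdapter,
    "langchain": LangChainAdapter,
    "my-framework": MyFrameworkAdapter,   # the one line you add
}

def load_adapter(framework: str):
    """Resolve the adapter class named in AGENT.md front matter."""
    try:
        return ADAPTER_REGISTRY[framework]
    except KeyError:
        raise ValueError(f"No adapter registered for framework: {framework!r}")

print(load_adapter("my-framework").__name__)  # MyFrameworkAdapter
```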

---

## 🗂️ Project Structure

```
agentforge/
│
├── core/                    # Framework-agnostic engine
│   ├── adapters/            # One file per supported framework
│   ├── spec/                # AGENT.md parser + validator
│   └── context/             # Token budget enforcement + compression
│
├── eval/                    # Evaluation engine
│   ├── grader.py            # LLM-judge binary assertion grader
│   └── runner.py            # Parallel eval suite execution
│
├── loop/                    # The improvement loop
│   ├── proposer.py          # LLM change proposer
│   ├── runner.py            # Loop orchestration
│   └── git_manager.py       # Commit / revert / diff / history
│
├── memory/                  # Agent memory system
│   ├── embedder.py          # Local sentence-transformers (free, no API)
│   └── store.py             # pgvector episodic + semantic memory
│
├── orchestrator/            # Multi-agent coordination
│   ├── meta.py              # MetaOrchestrator (fleet management)
│   └── scheduler.py         # Celery task queue
│
├── dashboard/
│   ├── api/                 # FastAPI backend (SSE, JWT auth, rate limiting)
│   └── frontend/            # Next.js 15 trust dashboard
│
├── agents/                  # Your agent directories
│   └── example-agent/
│       ├── AGENT.md         # ← The only file the loop modifies
│       ├── evals/
│       │   ├── evals.json   # Human-written eval suite
│       │   └── eval-log.md  # Running iteration log
│       └── references/      # Static context (brand docs, guidelines)
│
├── infra/                   # Docker Compose, Nginx, Prometheus, Grafana
├── cli.py                   # Command-line interface
├── AGENT_VERSE.md           # Multi-agent dependency map
└── pyproject.toml
```

---

## 🧰 Tech Stack

All infrastructure is **free and open-source**. The only costs are LLM API calls.

### Backend

| Component | Technology | Notes |
|-----------|-----------|-------|
| API Framework | **FastAPI 0.115** | Async, auto-docs, SSE for real-time dashboard |
| Task Queue | **Celery 5 + Redis** | Parallel improvement loops across agent fleet |
| Primary Database | **PostgreSQL 16** | Relational data, multi-tenancy via row-level security |
| Vector Store | **pgvector** extension | Embedded in PostgreSQL — no separate vector DB |
| Embeddings | **sentence-transformers** `all-MiniLM-L6-v2` | Runs locally, 384-dim, ~90MB, **zero API cost** |
| Object Storage | **MinIO** | Self-hosted S3-compatible — eval files, logs, exports |
| Version Control | **GitPython** | Every spec change is a structured git commit |
| Token Counting | **tiktoken** | Context budget enforcement |

### Observability (all free)

| Component | Technology | Access |
|-----------|-----------|--------|
| Metrics | **Prometheus** | `localhost:9090` |
| Dashboards | **Grafana** | `localhost:3001` |
| Logs | **Loki** | Via Grafana |
| API metrics | **prometheus-fastapi-instrumentator** | `/metrics` endpoint |

### Frontend

| Component | Technology |
|-----------|-----------|
| Framework | **Next.js 15** (App Router) |
| Charts | **Recharts** |
| Real-time | **Server-Sent Events** (no WebSocket overhead) |
| Diff viewer | **react-diff-viewer-continued** |
| Styling | **Tailwind CSS** |

### Infrastructure

| Service | Image | Purpose |
|---------|-------|---------|
| Database | `ankane/pgvector:latest` | PostgreSQL 16 + pgvector |
| Cache/Broker | `redis:7-alpine` | Celery + SSE pub/sub |
| Object storage | `minio/minio` | Eval files, exports |
| Metrics | `prom/prometheus:v2.55.0` | Scraping |
| Dashboards | `grafana/grafana:11.3.0` | Visualisation |
| Log store | `grafana/loki:3.2.0` | Log aggregation |
| Proxy | `nginx:1.27-alpine` | TLS, routing |

---

## 📈 Trust Dashboard

Every agent gets a dedicated dashboard showing exactly what happened, why, and what the result was.

```
┌─────────────────────────────────────────────────────────────────┐
│  agent-copywriting-v1  ● active              Score: 84.1%      │
│  Marketing Copywriter · AcmeCorp             ↑ +12.6% session  │
├─────────────────────────────────────────────────────────────────┤
│  [Score Timeline] [Assertion Health] [Decision Log]            │
│  [Live Loop]      [Last Diff]        [Cost & Budget]           │
├─────────────────────────────────────────────────────────────────┤
│  Score over 48 iterations                                       │
│                                                          84.1% ▓│
│                                               78.4%  ___/      │
│                               68.2%  _______/                  │
│               56.0%  ________/                                  │
│  44.0% ______/                                                  │
│  ──────────────────────────────────────────────────────────     │
│   iter 1      iter 12      iter 24      iter 36      iter 48   │
└─────────────────────────────────────────────────────────────────┘
```

**Dashboard Panels:**

- **Score Timeline** — Pass rate over all iterations, colour-coded: green = commit, red dot = revert, amber band = exploration mode
- **Assertion Health Matrix** — Grid of all assertions, coloured by pass rate over last 10 iterations. Dark green = always passing. Dark red = chronically failing
- **Decision Log** — Every change with git diff viewer. Click any iteration to see exactly what changed and the proposer's diagnosis
- **Live Loop Feed** — Real-time step-by-step view of the running loop via Server-Sent Events
- **Cost & Budget** — Spend per iteration, cost per 1% improvement, budget remaining bar
- **Public Share URL** — Make an agent's dashboard public for client transparency (`/public/agent/{agent_id}`)

---

## 🧠 Memory Architecture

Agents remember what decisions they made and what outcomes those decisions produced.

```
Layer 1 — Working Memory     In-context. Ephemeral. Cleared at session end.
Layer 2 — Episodic Memory    pgvector. Every completed task: input + decisions + outcome.
Layer 3 — Semantic Memory    pgvector. Accumulated knowledge from references/ directory.
Layer 4 — Procedural Memory  The AGENT.md itself. The loop writes here.
Layer 5 — Shared Memory      Cross-agent patterns surfaced by the orchestrator.
```

Memory retrieval is a standard tool call:

```python
# Agents call this during production use
memory_recall(
    query="Write LinkedIn post about AI dashboards",
    agent_id="agent-copywriting-v1",
    k=3
)
# Returns: "Episode 3 days ago: Similar task. Led with a stat. User rated 5/5.
#           Key learning: counter-intuitive claim in line 1 gets highest engagement."
```

Embeddings use `sentence-transformers/all-MiniLM-L6-v2` running locally — **no embedding API calls, no cost**.
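Under the hood, episodic recall is a nearest-neighbour search over those embedding vectors. A minimal pure-Python sketch of the cosine-similarity ranking that pgvector performs — toy 3-dim vectors and invented episode data stand in for real 384-dim embeddings and the database:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall(query_vec: list[float], episodes: list[dict], k: int = 3) -> list[dict]:
    """Rank stored episodes by similarity to the query embedding, return top k."""
    ranked = sorted(episodes, key=lambda ep: cosine(query_vec, ep["vec"]), reverse=True)
    return ranked[:k]

episodes = [
    {"summary": "LinkedIn post, led with a stat", "vec": [0.9, 0.1, 0.0]},
    {"summary": "Bug triage report",              "vec": [0.0, 0.2, 0.9]},
]
print(recall([1.0, 0.0, 0.0], episodes, k=1)[0]["summary"])  # LinkedIn post, led with a stat
```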

---

## 🌐 Multi-Agent Coordination (Agent-Verse)

When you have multiple agents, they form an **Agent-Verse** — a network where agents can call each other, share memory, and improve together.

### AGENT_VERSE.md

```markdown
# AGENT_VERSE.md

| agent_id | framework | score | status |
|----------|-----------|-------|--------|
| agent-researcher | langgraph | 91.2 | active |
| agent-copywriter | anthropic-sdk | 84.1 | active |
| agent-reviewer | langchain | 88.6 | active |

## Dependency Graph
agent-copywriter --> agent-researcher   # calls researcher for facts
agent-copywriter --> agent-reviewer     # calls reviewer before returning
```

### Cross-Agent Learning

When an agent improves its score by >5%, the MetaOrchestrator extracts the change and surfaces it to compatible agents as a "hint" — they can adopt the same improvement through their own loop iteration, rather than rediscovering it from scratch.

### Parallel Improvement

All agents improve simultaneously via Celery workers:

```bash
# Start 5 parallel improvement loops across your entire fleet
python cli.py orchestrate --agents-root agents --budget-per-agent 15.0
```

---

## 🏢 Enterprise Features

| Feature | Details |
|---------|---------|
| **Multi-tenancy** | PostgreSQL row-level security per organisation. Isolated git namespaces, Redis keyspaces, MinIO prefixes |
| **JWT Auth** | Short-lived access tokens + refresh tokens. Org-scoped claims |
| **Role-Based Access** | `viewer` → `editor` → `admin` → `org_admin`. Read-only public dashboard URLs |
| **Audit Log** | Append-only table. Every API write logged with user, org, timestamp, payload |
| **Rate Limiting** | 100 req/min per user, 500/min per org. Configurable |
| **Cost Circuit Breaker** | Auto-pause all loops if hourly spend exceeds 3× configured budget |
| **Prometheus Metrics** | Loop scores, iteration rates, eval costs, API latency — all exposed at `/metrics` |
| **SOC 2 Ready** | Append-only audit trail, full data lineage, configurable data retention |

---

## 💰 Cost Reference

Infrastructure is entirely free. The only variable cost is LLM API calls.

| Component | Model | Cost / Iteration |
|-----------|-------|-----------------|
| Agent execution (5 prompts) | claude-sonnet-4-6 | ~$0.225 |
| Grader (25 assertions × Haiku) | claude-haiku-4-5-20251001 | ~$0.063 |
| Proposer (1 call × Sonnet) | claude-sonnet-4-6 | ~$0.017 |
| **Total (Sonnet default)** | | **~$0.31 / iter** |
| **Total (Haiku for everything)** | claude-haiku-4-5-20251001 | **~$0.03 / iter** |
| 50-iteration session (Sonnet) | | **~$15.50** |
| 50-iteration session (Haiku) | | **~$1.50** |

**Tip**: Start with Haiku for cheap baseline iteration, switch to Sonnet once you need subtler reasoning.
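The table's arithmetic makes session budgeting a one-liner. A rough back-of-envelope estimator using the per-iteration figures above (not a billing tool; actual cost varies with prompt and output length):

```python
# Approximate per-iteration cost in USD, taken from the table above.
COST_PER_ITER = {"sonnet": 0.31, "haiku": 0.03}

def session_cost(iterations: int, model: str) -> float:
    """Rough total spend for a loop session of the given length."""
    return iterations * COST_PER_ITER[model]

print(f"${session_cost(50, 'sonnet'):.2f}")  # $15.50
print(f"${session_cost(50, 'haiku'):.2f}")   # $1.50
```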

---

## 📖 Documentation

| Document | Description |
|----------|-------------|
| [`AGENTFORGE_MASTER.md`](docs/AGENTFORGE_MASTER.md) | Complete build document — everything an AI agent needs to build this system end-to-end |
| [`AGENTFORGE_ARCHITECTURE.md`](docs/AGENTFORGE_ARCHITECTURE.md) | Deep-dive architecture with all layer designs and database schema |
| [`AGENTFORGE_7DAY_ROADMAP.md`](docs/AGENTFORGE_7DAY_ROADMAP.md) | Day-by-day implementation guide with full source code |
| [`AGENTFORGE_BUILDER_AGENT.md`](docs/AGENTFORGE_BUILDER_AGENT.md) | AGENT.md for using Claude Code / Antigravity to build this system |

---

## 🖥️ CLI Reference

```bash
# Run improvement loop for one agent
python cli.py run \
  --agent agents/my-agent \
  --max-iterations 50 \
  --budget 15.0 \
  --plateau 5

# Score current spec without running the loop
python cli.py eval --agent agents/my-agent

# Launch parallel loops for the entire fleet
python cli.py orchestrate \
  --agents-root agents \
  --budget-per-agent 15.0

# Quick health check — run one prompt through the agent
python cli.py test \
  --agent agents/my-agent \
  --text "Write a LinkedIn post about AI"
```

---

## 📁 Creating Your First Agent

### Step 1 — Create the directory

```bash
mkdir -p agents/my-agent/evals agents/my-agent/references
```

### Step 2 — Write AGENT.md

```markdown
---
agent_id: my-agent-v1
agent_name: My Agent
framework: anthropic-sdk        # or: langchain | langgraph | crewai | autogen | http
model: claude-sonnet-4-6
description: What this agent does in one sentence
current_score: 0
loop_iterations: 0
created: 2026-03-15
last_improved: 2026-03-15
---

## [SYSTEM_PROMPT]
<!-- MUTABLE: proposer may rewrite this section -->
You are a [role]. [Specific output format rules. Measurable quality standards.]
<!-- END_SYSTEM_PROMPT -->

## [FEW_SHOT_EXAMPLES]
<!-- MUTABLE: proposer may add, remove, or replace examples -->

### Example 1
Input: "[realistic user input]"
Output: "[ideal output]"

<!-- END_FEW_SHOT_EXAMPLES -->

## [TOOL_ROUTING_LOGIC]
<!-- MUTABLE: proposer may edit routing rules -->
- [When to use which tool]
<!-- END_TOOL_ROUTING_LOGIC -->

## [RETRIEVAL_CONFIG]
<!-- MUTABLE: proposer may tune parameters -->
retrieval_k: 3
similarity_threshold: 0.72
<!-- END_RETRIEVAL_CONFIG -->

## [MODEL_CONFIG]
<!-- MUTABLE: proposer may swap models per task -->
default_model: claude-sonnet-4-6
<!-- END_MODEL_CONFIG -->

## [CONSTRAINTS]
<!-- MUTABLE: proposer may add or remove constraints -->
- [Hard rules the agent must always follow]
<!-- END_CONSTRAINTS -->

## [FRAMEWORK_CONFIG]
<!-- END_FRAMEWORK_CONFIG -->
```

### Step 3 — Write evals.json

```json
{
  "agent_id": "my-agent-v1",
  "eval_version": "1.0",
  "created_by": "human",
  "grader_model": "claude-haiku-4-5-20251001",
  "evals": [
    {
      "id": 1,
      "category": "core_task",
      "difficulty": "core",
      "prompt": "[realistic user input]",
      "expected_output": "[what a good output looks like]",
      "held_out": false,
      "assertions": [
        {"id":"a1_1","text":"[specific TRUE/FALSE statement about the output]","weight":1.5},
        {"id":"a1_2","text":"[another specific statement]","weight":1.0}
      ]
    }
  ]
}
```

**Assertion writing rules:**
- ✅ `"First sentence does not begin with 'I'"` — specific, binary
- ✅ `"Output contains at least one specific number"` — specific, binary
- ✅ `"Word count is under 300"` — numeric type, auto-checked
- ❌ `"The output is helpful"` — vague, not binary
- ❌ `"Good use of language"` — not measurable

### Step 4 — Baseline and run

```bash
python cli.py eval --agent agents/my-agent     # See starting score
python cli.py run --agent agents/my-agent --max-iterations 20 --budget 6.0
```

---

## 🗺️ Roadmap

The system is built in 7 phases:

- [x] **Phase 1** — Framework adapter layer (all 8 adapters + AgentRunner)
- [x] **Phase 2** — Spec system + git version control (parser, validator, GitManager)
- [x] **Phase 3** — Eval engine (grader, runner, weighted scoring, regression detection)
- [x] **Phase 4** — Improvement loop (proposer, LoopRunner, CLI, eval-log.md)
- [x] **Phase 5** — Memory layer (pgvector store, local embedder, ContextGuard)
- [x] **Phase 6** — Meta-orchestrator (Celery, parallel loops, cross-agent learning)
- [x] **Phase 7** — Trust dashboard (FastAPI, SSE, Next.js, enterprise hardening)

**Planned:**

- [ ] Auto-generated eval candidates (propose new assertions from production failure patterns)
- [ ] Multi-objective scoring (quality × latency × cost Pareto front)
- [ ] Agent replication (fork high-scoring agent with broader eval suite)
- [ ] Model-agnostic benchmarking (same loop, compare LLM providers)
- [ ] VS Code extension (dashboard inline in editor)
- [ ] Eval marketplace (share eval suites for common agent types)

---

## 🤝 Contributing

Contributions are welcome. The most valuable contributions right now:

1. **New framework adapters** — Implement `BaseAdapter` for a framework not yet supported
2. **New eval assertion types** — Beyond binary and numeric
3. **Dashboard improvements** — More visualisation panels
4. **Adapter test coverage** — Integration tests for each adapter against a mock LLM

### Dev Setup

```bash
git clone https://github.com/yourusername/agentforge.git
cd agentforge
pip install -e ".[dev,all-frameworks]"
docker compose -f infra/docker-compose.dev.yml up -d
pytest tests/
```

### Contribution Guidelines

- Follow the project structure exactly — new code goes in the right layer
- Adapters must implement the full `BaseAdapter` interface — no partial implementations
- Every eval assertion must be independently binary-gradeable
- **One change per iteration** is non-negotiable — don't propose multi-change improvements

---

## 📜 License

MIT License — see [LICENSE](LICENSE) for details.

---

## 🙏 Acknowledgements

AgentForge is inspired by:

- **[Karpathy/autoresearch](https://github.com/karpathy/autoresearch)** — The original apply-a-research-loop-to-training-code idea that proved autonomous iteration works overnight
- **Claude Code Skills System** — The SKILL.md pattern: a mutable markdown spec that a loop can read, propose changes to, and version with git
- The broader **evals-as-first-class-citizens** movement in AI engineering

---

<div align="center">

<br/>

**AgentForge** — *Forge agents worth trusting.*

<br/>

Built with open-source infrastructure. Powered by human-written evals. Committed to git.

<br/>

[⬆ Back to Top](#what-is-agentforge)

</div>