| A1 |
Deep Research demo
Multi-agent workflow exercising the full stack. Best learning tool and marketing asset. Build in demos/deep_research/. Shows off memory, tool calling, crew orchestration, and observability in one end-to-end example. |
S | |
| A2 |
Token streaming
Every competitor has it. Add async stream() path through ModelClient → AgentRunner → CLI output. Biggest visible gap vs LangGraph/CrewAI. Affects perceived responsiveness more than any other single change. |
M | |
| A3 |
Arize Phoenix integration
src/nexus/observability/phoenix.py. Configure NexusTracer OTLP endpoint to point at Phoenix. Add docker-compose.phoenix.yml. Optional dep nexus-ai[phoenix]. Gives a polished visual for demos and screenshots. |
S | |
| A4 |
Railway one-click deploy
One-click deploy button for quick demos.
⚠ Deferred: Railway doesn't support Dapr sidecars. Revisit after Helm chart ships; target "Deploy to Azure Container Apps" (azd up) instead. |
— | |
| A5 |
Better error messages
Add hint: str field to every NexusError subclass with a human-readable suggested fix. Huge DX improvement: first-time users who hit an error immediately know what to do without reading docs. |
S |
| B1 |
Pre-built tool library
src/nexus/tools/library/: web_search, http_request, file_read, file_write, sql_query, send_email, calculator. Batteries-included like CrewAI. Removes the "write your own tools" barrier for new users. |
M | |
| B2 |
LangChain tool adapter
src/nexus/tools/adapters/langchain.py. Wrap any LangChain tool as a Nexus RegisteredTool. Unlocks 1,000+ community tools instantly with zero rewriting. Duck-typed wrapping with automatic schema extraction. |
S | |
| B3 |
REST API server mode
nexus serve agent.py. FastAPI app exposing /run, /stream, /memory, /health. Any language can call Nexus agents over HTTP. Critical for polyglot teams and microservice architectures. |
M | |
| B4 |
OpenAI-compatible API endpoint
/v1/chat/completions that routes through AgentRunner. Any OpenAI SDK client works with zero code changes. Massive adoption driver — existing OpenAI integrations just change the base URL. |
M | |
| B5 |
Jupyter notebook integration
nexus.jupyter module. await runner.run(...) works in notebooks. Rich HTML display for ExecutionResult. Data scientists are key users — meeting them in their environment lowers adoption friction significantly. |
S |
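
The duck-typed wrapping described in B2 might look roughly like the sketch below. It relies on the name, description, args, and invoke members that LangChain tools expose; the RegisteredTool shape shown here is an assumption (the real Nexus class may differ), and FakeSearchTool is a stand-in so the example runs without LangChain installed.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RegisteredTool:  # hypothetical shape; the real Nexus class may differ
    name: str
    description: str
    parameters: dict[str, Any]
    func: Callable[..., Any]


def from_langchain(tool: Any) -> RegisteredTool:
    """Wrap any object that looks like a LangChain tool (duck typing)."""
    for attr in ("name", "description", "invoke"):
        if not hasattr(tool, attr):
            raise TypeError(f"not a LangChain-style tool: missing {attr!r}")
    # LangChain tools expose their argument schema via `.args`; fall back to
    # an empty schema for objects that do not.
    schema = dict(getattr(tool, "args", {}) or {})
    return RegisteredTool(
        name=tool.name,
        description=tool.description,
        parameters=schema,
        func=lambda **kwargs: tool.invoke(kwargs),
    )


class FakeSearchTool:
    """Stand-in with the same surface as a LangChain tool, for the example."""

    name = "web_search"
    description = "Search the web."
    args = {"query": {"type": "string"}}

    def invoke(self, tool_input: dict[str, Any]) -> str:
        return f"results for {tool_input['query']}"


wrapped = from_langchain(FakeSearchTool())
```

Because the wrapper checks attributes rather than importing a base class, LangChain can stay an optional dependency.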
| C1 |
Human-in-the-loop UI
Web interface for agents paused at status=WAITING_FOR_HUMAN. FastAPI SSE + vanilla HTML/JS. The pause/resume logic is already in AgentRunner — this just surfaces it as a UI so non-engineers can respond to agent questions without touching the CLI.
|
M | |
| C2 |
Time-travel debugging
nexus replay <session-id> — step through any past agent run from the EventLog. See exactly what the agent saw at each step: which messages were in context, what tools returned, which memory was retrieved. Unique vs all competitors. EventLog already built in Phase 9. |
M | |
| C3 |
Execution trace viewer
Live web UI showing active graph node, memory reads/writes, tool calls, and accumulating cost as an agent runs. Connects to the existing OTEL + EventLog infrastructure. Makes agent execution feel legible rather than a black box — high demo value.
|
L | |
| C4 |
Webhook trigger
POST /agents/{id}/run to trigger agents from external events — Zapier, GitHub webhooks, Slack slash commands, etc. Builds on the B3 REST server. Unlocks no-code automation workflows without any additional infrastructure. |
M | |
| C5 |
Scheduled agents
nexus schedule agent.py --cron "0 9 * * *" via Dapr Jobs API. Enables daily report agents, monitoring agents, data pipeline agents, and periodic cleanup tasks without external schedulers. Schedules use cron syntax; Dapr handles reliability. |
M | |
| C6 |
Visual agent builder
ReactFlow drag-and-drop graph builder that generates nexus_graph.yaml. Biggest single adoption unlock: opens the framework to non-engineers and makes agent design visual rather than code-first. Needs 4–6 sub-sessions to build properly. |
XL |
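
To make the replay idea in C2 concrete, the sketch below rebuilds what the agent could see after each recorded event. The Event and Snapshot shapes and the three event kinds are assumptions for illustration; the actual EventLog schema is not shown in this roadmap.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Event:  # hypothetical record shape
    step: int
    kind: str  # "message", "tool_result", or "memory_read"
    payload: Any


@dataclass
class Snapshot:
    step: int
    messages: list[Any] = field(default_factory=list)
    tool_results: list[Any] = field(default_factory=list)
    memories: list[Any] = field(default_factory=list)


def replay(events: list[Event]) -> Iterator[Snapshot]:
    """Yield the accumulated agent view after each event, in step order."""
    state = Snapshot(step=0)
    buckets = {
        "message": state.messages,
        "tool_result": state.tool_results,
        "memory_read": state.memories,
    }
    for event in sorted(events, key=lambda e: e.step):
        buckets[event.kind].append(event.payload)
        # Copy the lists so earlier snapshots are not mutated by later steps.
        yield Snapshot(event.step, list(state.messages),
                       list(state.tool_results), list(state.memories))
```

A CLI front end could then page through the yielded snapshots one at a time, which is the "step through" behaviour the item describes.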
| D1 |
Google Gemini client
src/nexus/core/models/gemini.py. Third most-used LLM provider after Anthropic and OpenAI. Implement the ModelClient ABC using the Google Generative AI SDK. Widens the addressable user base without touching any other code. |
S | |
| D2 |
Ollama client
src/nexus/core/models/ollama.py. Local model support via Ollama's OpenAI-compatible REST API. Zero API cost, full privacy, works air-gapped. Privacy-conscious enterprises and self-hosters are a distinct market segment. |
S | |
| D3 |
Agent handoffs
OpenAI Swarm-style mid-conversation hand-off: one agent explicitly passes control to another, transferring context and memory. Extends the existing Crew patterns. Useful for support escalation workflows (triage → specialist → closer).
|
S | |
| D4 |
Streaming eval assertions
Eval assertions that work on streamed output: first-token latency, time-to-completion, streaming quality (does the answer degrade when streamed vs buffered). Covers the A2 streaming path — without this, streaming is untested by the eval framework.
|
M | |
| D5 |
Grafana dashboard template
Pre-built JSON dashboard for Nexus Prometheus metrics — one-click import into any Grafana instance. Shows token usage, cost, latency, error rate, active agents. Teams with existing Grafana get value immediately without building dashboards from scratch.
|
S | |
| D6 |
Agent state snapshots
Export and import full agent state as JSON — all memory, message history, metadata. Enables forensic debugging of production incidents, migration between environments, and sharing exact agent states for reproducible bug reports.
|
S | |
| D7 |
Cost alerts
Webhook, email, or Slack notification when agent spend crosses a configured threshold. CostTracker and Dapr pub/sub already exist — this is mostly wiring a notification sink to the existing budget events. Prevents surprise bills in production.
|
S | |
| D8 |
Prompt playground CLI
nexus playground — interactive CLI to test prompts before committing to an agent definition. Shows tokens, cost, and latency for each test run. Bridges the gap between experimentation and production; keeps iteration fast without running a full agent loop. |
S | |
| D9 |
Memory inspector UI
Visual browser for an agent's episodic and semantic memory. Edit facts, adjust trust scores, clear poisoned memories, inspect provenance chains. Lets operators understand and correct what an agent has learned without writing code or SQL queries.
|
M | |
| D10 |
Eval dashboard
Web UI for eval suite history — pass rate trends over time, regression alerts when a prompt version drops, side-by-side comparison of prompt versions. Makes quality regression visible at a glance rather than buried in CI logs.
|
M | |
| D11 |
Vector DB adapters
Pinecone, Weaviate, and Qdrant as alternative embedding stores to pgvector. Implement a VectorStore ABC and drop-in adapters. Enterprises with existing vector DB investments can use Nexus memory without running PostgreSQL. |
M | |
| D12 |
A2A protocol
Agent-to-Agent discovery: Nexus agents callable from LangGraph, CrewAI, and AutoGen agents and vice versa. Standards-compliant implementation per the A2A spec (in ADR-010, not yet built). Removes ecosystem lock-in and enables multi-framework agent pipelines.
|
L | |
| D13 |
Agent template gallery
nexus hub pull research-agent — community-shared agent definitions with one command. Needs a registry backend (GitHub-based initially). Drives adoption the same way CrewAI's marketplace does — reduces time from "install" to "running useful agent" to minutes. |
L | |
| D14 |
Multi-tenancy
Isolate state, memory, and billing per tenant using namespace scoping in the Dapr state layer. Required before any SaaS offering. Each tenant's agents cannot read each other's memory or state. Enables both self-hosted multi-team deployments and hosted tiers.
|
M | |
| D15 |
Nexus Cloud SaaS
Hosted platform with Stripe usage-based billing. CostTracker and EventLog are already built, so the billing infrastructure is roughly 80% done. Add tenant management, a hosted Dapr cluster, and a web dashboard. Converts open-source traction into sustainable revenue.
|
XL |
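
For D11, the VectorStore ABC could be as small as two methods, with each backend supplying its own adapter. The method names and signatures below are assumptions, and InMemoryVectorStore is a toy brute-force adapter included only so the sketch runs without any external service.

```python
import math
from abc import ABC, abstractmethod


class VectorStore(ABC):
    """Interface each backend adapter would implement (method names assumed)."""

    @abstractmethod
    def upsert(self, key: str, vector: list[float]) -> None: ...

    @abstractmethod
    def search(self, query: list[float], k: int) -> list[tuple[str, float]]: ...


class InMemoryVectorStore(VectorStore):
    """Toy adapter using brute-force cosine similarity; handy for tests."""

    def __init__(self) -> None:
        self._rows: dict[str, list[float]] = {}

    def upsert(self, key: str, vector: list[float]) -> None:
        self._rows[key] = vector

    def search(self, query: list[float], k: int) -> list[tuple[str, float]]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.hypot(*a) * math.hypot(*b)
            return dot / norm if norm else 0.0

        scored = [(key, cosine(query, vec)) for key, vec in self._rows.items()]
        # Highest similarity first, truncated to the k best matches.
        return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A Pinecone, Weaviate, or Qdrant adapter would subclass the same ABC and translate these two calls into the vendor's client API.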
| E1 |
Agent debate & ensemble reasoning
Multiple agents with different temperatures or system prompts independently answer the same question, then converge on a final answer via voting or synthesis. Research shows 10–40% quality improvement on hard reasoning tasks. Unique vs all major competitors.
|
M | |
| E2 |
Uncertainty quantification
Agents maintain explicit confidence scores over beliefs and actions. High uncertainty → pause and ask human. Medium uncertainty → proceed but log prominently. Low uncertainty → execute autonomously. Makes the human_node intelligent rather than rule-based. No competitor has this.
|
M | |
| E3 |
Long-horizon planning with re-planning
LATS / Tree-of-Thoughts planning layer above ReAct. Agent plans upfront, executes, detects plan failure, and replans. Unlocks genuinely multi-step tasks (multi-hour, multi-day) that current ReAct loop cannot handle due to context drift and local optima.
|
L | |
| E4 |
Market-based task allocation
Extends basic agent handoffs (D3). Supervisor posts a task with a budget; agents bid based on their capability assessment; cheapest capable agent wins. Novel mechanism for dynamic multi-agent systems where agents self-select work rather than being assigned.
|
M | |
| E5 |
Artifact-centric collaboration
Agents collaborate around shared structured artifacts — documents, schemas, codebases — rather than passing text strings. Each agent reads and modifies the same versioned artifact. More natural for knowledge work tasks; produces a concrete output rather than a conversation transcript.
|
M |
| F1 |
Agent self-improvement via Reflexion
Agents reflect on task failures and update their own system prompts and procedural memory. DSPy-style prompt optimization guided by eval feedback. Agents measurably improve on their specific domain over time. Long-term differentiator that no competitor has shipped. Extends ProceduralMemory.
|
L | |
| F2 |
Persistent user modeling
Per-user profile stored separately from episodic memory: expertise level, communication preferences, past decisions, domain context. Agent adapts its language, depth, and suggestions to each individual. The difference between a generic tool and a personal assistant that knows you.
|
M | |
| F3 |
Hierarchical memory architecture
Fourth memory layer: schematic memory that abstracts patterns from millions of episodes. Retrieval hierarchy: schema → semantic → episodic. Prevents quality degradation at scale by keeping frequently-needed patterns at the top of the hierarchy. Needed for enterprise-scale deployments.
|
L | |
| F4 |
Causal reasoning layer
Agents reason about why something happened and what would have happened with a different action. Pearl's do-calculus applied to agent execution traces. Foundation for genuine learning from mistakes rather than pattern matching. Research-grade — target 2026+.
|
XL |
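
One possible reading of the F3 retrieval hierarchy (schema → semantic → episodic) is a fall-through search that consults the most abstract layer first and only descends when it still needs results. The sketch below assumes each layer is a plain callable returning strings; the function name and the limit parameter are hypothetical.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]


def hierarchical_retrieve(query: str, layers: list[tuple[str, Retriever]],
                          limit: int = 3) -> list[tuple[str, str]]:
    """Query layers in priority order, stopping once `limit` hits are found."""
    hits: list[tuple[str, str]] = []
    for name, retrieve in layers:
        for item in retrieve(query):
            hits.append((name, item))
            if len(hits) == limit:
                return hits
    return hits


def keyword_layer(items: list[str]) -> Retriever:
    """Toy retriever: case-insensitive substring match over a fixed list."""
    return lambda q: [i for i in items if q.lower() in i.lower()]


layers = [
    ("schema", keyword_layer(["Refund requests usually need an order ID"])),
    ("semantic", keyword_layer(["Refund window is 30 days"])),
    ("episodic", keyword_layer(["User asked about refund status", "User said hello"])),
]
```

With limit=2, a query for "refund" is answered entirely from the schema and semantic layers, so the (potentially very large) episodic store is never touched.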
| G1 |
Adversarial red-teaming framework
src/nexus/evaluation/red_team.py. A dedicated attacker agent systematically finds inputs that break the main agent — goal hijacking, memory poisoning attempts, tool abuse. Automated adversarial prompt generation. Extends existing eval + safety infrastructure. Highest priority of Phase G; no competitor has shipped this. |
M | |
| G2 |
Multi-modal agent actions
Vision model integration plus computer use: agents that can see a screen, click UI elements, and type. Enables agents to operate any software without an API — legacy systems, web UIs, desktop apps. Massive addressable market but high implementation effort. Target 2026+.
|
XL | |
| G3 |
Formal policy verification
TLA+/model checking to prove agent invariants as formal guarantees: "this agent will never call delete_database without human approval." Required for regulated industries (finance, healthcare, legal) that need verifiable compliance. Research-grade — target 2027+.
|
XL |
| H1 |
Document processing tools
src/nexus/tools/library/: PDF, Word, and Excel parsing tools added to the existing B1 tool library. In high demand for enterprise document workflows such as contract analysis, report generation, and data extraction. Most enterprise AI use cases involve documents. |
M | |
| H2 |
Code analysis tools
AST parsing, linting, and static analysis tools for code agents. Enables software engineering workflows: code review agents, refactoring agents, bug-finding agents. Python and TypeScript support first. Complements the existing sandbox code execution.
|
M | |
| H3 |
Cohere client
src/nexus/core/models/cohere.py. Popular in enterprise RAG deployments, especially in regulated industries. Implement the ModelClient ABC using Cohere's SDK. Also feeds into H4 (embedding service improvements) since Cohere has strong embedding models. |
S | |
| H4 |
Embedding service improvements
Support multiple embedding providers: OpenAI, Cohere, and local models via Ollama. Make the embedding provider configurable per memory type (e.g., fast local embeddings for working memory, high-quality OpenAI embeddings for semantic memory). Currently hard-coded to one provider.
|
M | |
| H5 |
TypeScript / JS SDK
@nexus-ai/client npm package — a thin TypeScript client wrapping the B3/B4 REST API. Frontend integrations, Next.js apps, and Node.js agents can call Nexus without Python. Opens the framework to the much larger JS/TS developer community. |
L | |
| H6 |
Plugin system
Community extensions via lifecycle hooks — pre-LLM, post-tool, pre-memory-write, etc. — without requiring PRs to core. Enables domain-specific extensions (HIPAA compliance plugin, SOC2 audit plugin) to be distributed as separate packages. Reduces core complexity while growing ecosystem.
|
M | |
| H7 |
Agent versioning
Version and rollback agent definitions (system prompt + tools + config). A/B test prompt versions in production with automatic quality scoring. Track which version is running where. Required for production teams that iterate on prompts while maintaining stability for existing users.
|
M | |
| H8 |
RAG pipeline template
Complete retrieval-augmented generation as a first-class template in demos/rag/. Document ingestion, chunking, embedding, retrieval, and answer generation in one working example using the full pgvector stack. RAG is the highest-demand agent use case; a complete template removes a major adoption barrier. |
M | |
| H9 |
Real-LLM integration tests
Optional CI job that tests against actual Claude and GPT-4 APIs, gated by the RUN_REAL_LLM_TESTS=true environment variable. Catches regressions that mocked tests miss, such as streaming edge cases, tool call format changes, and rate limit handling. Off by default so CI stays fast and free. |
S |
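
The lifecycle hooks in H6 could be exposed through a small registry that plugins decorate into. This is a sketch under assumptions: HookRegistry, its on and run methods, the pre_memory_write event name, and the redaction plugin are all hypothetical, chosen to mirror the "pre-memory-write" hook and the compliance plugins mentioned in the item.

```python
from collections import defaultdict
from typing import Any, Callable

Hook = Callable[[dict[str, Any]], dict[str, Any]]


class HookRegistry:
    """Maps lifecycle event names to an ordered list of plugin callbacks."""

    def __init__(self) -> None:
        self._hooks: defaultdict[str, list[Hook]] = defaultdict(list)

    def on(self, event: str) -> Callable[[Hook], Hook]:
        """Decorator that registers a callback for `event`."""
        def register(func: Hook) -> Hook:
            self._hooks[event].append(func)
            return func
        return register

    def run(self, event: str, payload: dict[str, Any]) -> dict[str, Any]:
        """Pass the payload through each hook in registration order."""
        for hook in self._hooks[event]:
            payload = hook(payload)
        return payload


hooks = HookRegistry()


@hooks.on("pre_memory_write")
def redact_id_number(payload: dict[str, Any]) -> dict[str, Any]:
    # Example compliance-style plugin: mask a sensitive value before storage.
    return {**payload, "text": payload["text"].replace("123-45-6789", "[REDACTED]")}
```

Core code would call hooks.run("pre_memory_write", payload) at the matching point in the agent loop; events with no registered hooks return the payload unchanged, so plugins stay strictly opt-in.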
| I1 |
nexus doctor CLI command
Check: API keys set, Dapr sidecar reachable, PostgreSQL connectable, pgvector extension present, Redis reachable, optional extras installed. Prints a ✓/✗ checklist, with a --json flag for attaching the output to bug reports. Single highest-impact DX addition: setup failure is the #1 new-user dropout cause. |
XS | |
| I2 |
Secrets scanning in CI
gitleaks or trufflehog GitHub Action on every push and PR. Prevents accidental API key commits. 10-minute setup, zero ongoing maintenance. One of the easiest security wins available. |
XS | |
| I3 |
pip-audit in CI
Catches known CVEs in pinned transitive dependencies; httpx, pydantic, dapr, and asyncpg have all had security advisories. Should block merges on high-severity findings. Trivial to add as a CI step. |
XS | |
| I4 |
Health check endpoints
/healthz (liveness — always 200) and /readyz (readiness — 503 if Dapr/DB unreachable). Hard requirement for Kubernetes liveness and readiness probes. Without these, Kubernetes cannot safely roll deployments. |
XS | |
| I5 |
GitHub issue + PR templates
Three issue templates: Bug Report (steps to reproduce, expected vs actual, versions), Feature Request (problem statement, proposed solution), Question (redirect to Discussions). PR template checklist: description, tests added, docs updated, ruff passes, mypy passes, CHANGELOG entry.
|
XS | |
| I6 |
CONTRIBUTING.md
Dev setup, how to run tests, how to run Dapr locally, PR process, conventional commit format, how to add a new tool, how to add a new model client. Referenced from the H52 spec but may not yet exist. First stop for any would-be contributor.
|
XS | |
| I7 |
CSP headers on web UI
Content-Security-Policy, X-Frame-Options, X-Content-Type-Options middleware on FastAPI. Without CSP, the memory inspector UI is XSS-vulnerable if stored memory content contains script tags — a realistic attack vector for a memory poisoning scenario. |
XS | |
| I8 |
SECURITY.md
How to report vulnerabilities (email, not public issue), response SLA, disclosure policy. GitHub shows a security advisory banner without this file. Required for enterprise security reviews and any organization with a formal procurement process.
|
XS | |
| I9 |
Error catalog in docs
Every NexusError code listed with: what it means, what caused it, how to fix it. The hint field from A5 already carries the fix text; surface those hints as a searchable docs page. When users Google a Nexus error code, this should be the first result. |
S | |
| I10 |
PyPI trusted publisher migration
Migrate release workflow from API token to OIDC (GitHub Actions). Add actions/attest-build-provenance for artifact attestation. Long-lived tokens can be stolen; OIDC issues short-lived credentials per workflow run. Required for supply chain security compliance. |
S | |
| I11 |
Graceful shutdown
nexus serve must handle SIGTERM: stop accepting requests, wait for in-flight agent runs up to a configurable timeout, flush the event log, then exit. Without this, agent runs are killed mid-execution on every Kubernetes rolling deployment. |
S | |
| I12 |
Rate limiting on REST API
slowapi adds per-IP or per-API-key rate limiting to FastAPI in 20 lines. Without it, a misconfigured client looping on retries can exhaust an entire LLM API quota in minutes — a common on-call incident for teams new to production agents. |
S | |
| I13 |
SBOM generation
CycloneDX or SPDX format via syft. Required for US federal procurement (Executive Order 14028) and enterprise security reviews. No agentic AI framework ships an SBOM, so this is an instant differentiator for regulated industries. |
S | |
| I14 |
Request ID / correlation ID propagation
When FastAPI receives X-Request-ID or traceparent headers from an upstream service, bind them into the structlog context for the agent run. Without this, distributed traces are broken at the Nexus boundary and production debugging becomes painful. |
S | |
| I15 |
Retry + run timeout configuration
Global retry config for LLM calls (max_retries, backoff_factor, retry_on status codes) under NexusConfig.model.retry (currently scattered per client). Global wall-clock agent run timeout via asyncio.wait_for to prevent 6-hour runaway agents. |
S | |
| I16 |
Cross-Python-version CI matrix
CI should test Python 3.11, 3.12, and 3.13. Many enterprise teams are still on 3.11, so it is worth confirming compatibility before they hit problems in production. The matrix also catches type annotation and stdlib changes that break unexpectedly.
|
S | |
| I17 |
Performance benchmark suite
pytest-benchmark for hot paths: embedding service, retriever, agent runner loop overhead vs raw API call, RAG hybrid search. Use github-action-benchmark to track regressions over time. Without benchmarks, a slow plugin hook or retrieval change is invisible until someone complains in production. |
M | |
| I18 |
Tool result caching
Idempotent tool calls (same name + same arguments) cached within a session or globally. Configured via cache_ttl_seconds on ToolParameter. Expensive tools like web scraping and SQL queries should not re-run on identical inputs in the same session. |
S | |
| I19 |
nexus export / nexus import
Export agent memory + event log + config as a portable .nexus bundle. Import on another machine. Needed for: environment migration, sharing reproducible runs for debugging, and backup. The EventLog and Dapr state are already structured for this. |
M | |
| I20 |
WebSocket endpoint
REST API has SSE streaming but not WebSocket. Real-time dashboards, mobile apps, and many frontend frameworks prefer WebSocket over SSE. Uses FastAPI's built-in WebSocket support (inherited from Starlette), so zero new required runtime deps. Complements existing SSE without replacing it. |
M | |
| I21 |
Mutation testing baseline
mutmut or cosmic-ray. Standard tests can pass even when a critical condition is inverted — mutation testing finds these gaps. No agent framework ships this. Real differentiator for quality-conscious enterprise adopters. |
M | |
| I22 |
Property-based testing expansion
Hypothesis is already in dev deps but only used for working memory. Expand to RRF fusion math, trust scoring formulas, cost calculations, and token counting: all numerical logic that breaks on edge cases, which property tests find reliably.
|
S | |
| I23 |
Changelog automation (git-cliff)
git-cliff generates CHANGELOG.md automatically from conventional commits, which this project already uses. Without automation, CHANGELOG drifts or never gets updated. Enterprise teams evaluate changelogs before deciding to upgrade. |
XS | |
| I24 |
Dependabot / Renovate
Automated dependency update PRs. One config file at .github/dependabot.yml. Without this, deps drift silently: CVEs accumulate and teams running old versions hit bugs that were already fixed months ago. |
XS | |
| I25 |
Architecture diagram (C4 / Mermaid)
C4 context + container diagram in the docs site. PLAN.md has ASCII art but no shareable diagram. The most-shared asset during technical evaluations — teams screenshot this for architecture review boards. Mermaid renders in GitHub and MkDocs with no extra tooling.
|
S |
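
The retry and timeout settings in I15 could be combined roughly as follows. The field names max_retries, backoff_factor, and retry_on and the use of asyncio.wait_for come from the item; the default values, the doubling backoff schedule, HttpError, and both helper functions are assumptions for this sketch.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryConfig:
    max_retries: int = 3
    backoff_factor: float = 0.5  # seconds; doubled on each retry (assumed)
    retry_on: tuple[int, ...] = (429, 500, 502, 503)  # assumed defaults


class HttpError(Exception):  # hypothetical stand-in for a client error
    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


async def call_with_retry(call: Callable[[], Awaitable[T]], cfg: RetryConfig) -> T:
    """Retry `call` on configured status codes with exponential backoff."""
    for attempt in range(cfg.max_retries + 1):
        try:
            return await call()
        except HttpError as err:
            # Give up on non-retryable codes or when retries are exhausted.
            if err.status not in cfg.retry_on or attempt == cfg.max_retries:
                raise
            await asyncio.sleep(cfg.backoff_factor * 2 ** attempt)
    raise AssertionError("unreachable")


async def run_with_timeout(run: Awaitable[T], timeout_s: float) -> T:
    """Cancel the agent run if it exceeds the wall-clock budget."""
    return await asyncio.wait_for(run, timeout=timeout_s)
```

Centralising both behaviours behind one config object is what lets individual model clients drop their own retry loops, which is the consolidation the item asks for.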