Zaxy Operational Runbook

Architecture Overview

Zaxy is an event-sourced temporal knowledge graph fabric for AI agent memory. It consists of three layers:

  1. Eventloom (bottom): Immutable append-only JSONL logs with SHA-256 hash chains.
  2. Neo4j (core): Bi-temporal knowledge graph with entity/relationship validity windows.
  3. Pathlight (optional top layer): Observability, tracing, and debugging dashboard.
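
The hash-chain idea behind the Eventloom layer can be sketched in a few lines. This is an illustration only, not Zaxy's actual record format: the field names (seq, prev_hash, body, hash), the all-zero genesis value, and the helper functions are assumptions made for the example.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def chain_hash(prev_hash: str, body: dict) -> str:
    """Hash the previous record's hash together with a canonical JSON body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + canonical).encode("utf-8")).hexdigest()


def append_event(lines: list, body: dict) -> None:
    """Append one JSONL record whose hash covers the previous record's hash."""
    prev_hash = json.loads(lines[-1])["hash"] if lines else GENESIS
    record = {"seq": len(lines), "prev_hash": prev_hash, "body": body}
    record["hash"] = chain_hash(prev_hash, body)
    lines.append(json.dumps(record, sort_keys=True))


def first_broken_seq(lines: list) -> Optional[int]:
    """Return the seq of the first record that fails verification, else None."""
    prev_hash = GENESIS
    for line in lines:
        record = json.loads(line)
        expected = chain_hash(prev_hash, record["body"])
        if record["prev_hash"] != prev_hash or record["hash"] != expected:
            return record["seq"]
        prev_hash = record["hash"]
    return None


log = []
append_event(log, {"event_type": "goal.created", "actor": "user"})
append_event(log, {"event_type": "goal.updated", "actor": "user"})
print(first_broken_seq(log))  # None while the chain is intact
```

Because each hash covers the previous one, editing any earlier record invalidates that record and every record after it, which is why a modified log is detectable without a separate index.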

Quick Start

# Start infrastructure
docker compose up -d neo4j

# Install Zaxy
pip install -e ".[dev]"

# Verify connectivity
python -m zaxy status

# Run tests
pytest

# Start MCP server
python -m zaxy serve

# Check local onboarding prerequisites
python -m zaxy doctor

# Emit machine-readable setup diagnostics
python -m zaxy doctor --json

# Start MCP over SSE for daemon mode
python -m zaxy serve --transport sse --port 8080

Daily Operations

Health Checks

# Check all services
python -m zaxy status

# Or manually:
curl http://localhost:7474  # Neo4j HTTP

# Only when PATHLIGHT_ENABLED=true:
curl http://localhost:4100/health  # Pathlight collector
curl http://localhost:3100  # Pathlight dashboard

Event Log Inspection

# Replay a session
python -m zaxy replay .eventloom/work.jsonl

# Replay from a specific point
python -m zaxy replay .eventloom/work.jsonl --from-seq 42

# Export as JSON
python -m zaxy replay .eventloom/work.jsonl --json

# Write a standalone HTML viewer for one log or an Eventloom directory
python -m zaxy viewer .eventloom --output eventloom-viewer.html

# Rebuild Neo4j projection after extractor changes
python -m zaxy reproject .eventloom/default.jsonl --session-id default

# Audit identity and citation safety before compacting
python -m zaxy compact .eventloom/work.jsonl --audit

# Export a machine-readable audit report
python -m zaxy compact .eventloom/work.jsonl --audit --json

# Store a source-backed medoid projection without rewriting the log
python -m zaxy compact .eventloom/work.jsonl --projection-output .eventloom/work.compaction.json

# Store a bounded exemplar projection for high-spread clusters
python -m zaxy compact .eventloom/work.jsonl --projection-output .eventloom/work.compaction.json --strategy exemplar --max-records 5

# Rewrite compaction appends compaction.completed to the output log.
# Audit and projection-only modes leave the source log unchanged.

# Projections under the Eventloom directory are auto-discovered
python - <<'PY'
from zaxy import MemoryFabric

fabric = MemoryFabric(eventloom_path=".eventloom")
PY

# Explicit projection paths are still supported for artifacts stored elsewhere
python - <<'PY'
from zaxy import MemoryFabric

fabric = MemoryFabric(projection_paths=["/secure/projections/work.compaction.json"])
PY

# Compact old logs
python -m zaxy compact .eventloom/work.jsonl --snapshot-every 10000

Memory Queries (via MCP)

When the MCP server is running, any MCP client can:

{
  "tool": "memory_append",
  "arguments": {
    "event_type": "goal.created",
    "actor": "user",
    "payload": {"title": "Ship MVP"}
  }
}

{
  "tool": "memory_query",
  "arguments": {
    "query": "What are our goals?",
    "temporal_filter": "2024-06-01T00:00:00Z",
    "limit": 5
  }
}
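
The temporal_filter argument above reads as an "as-of" timestamp. A minimal sketch of how such a filter could be evaluated against validity windows follows. It assumes a half-open window (valid_from inclusive, valid_to exclusive) and that a missing valid_to means "still valid"; the Fact class and helper names are hypothetical and not part of the Zaxy API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    name: str
    valid_from: datetime
    valid_to: Optional[datetime]  # None means the fact is still valid


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def valid_at(fact: Fact, as_of: datetime) -> bool:
    """Half-open window check: valid_from <= as_of < valid_to."""
    return fact.valid_from <= as_of and (fact.valid_to is None or as_of < fact.valid_to)


facts = [
    Fact("Ship MVP", parse_ts("2024-05-01T00:00:00Z"), None),
    Fact("Old goal", parse_ts("2024-01-01T00:00:00Z"), parse_ts("2024-03-01T00:00:00Z")),
]
as_of = parse_ts("2024-06-01T00:00:00Z")
print([f.name for f in facts if valid_at(f, as_of)])  # ['Ship MVP']
```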

Backup & Recovery

Critical Data

| Data | Location | Backup Priority |
|------|----------|-----------------|
| Eventloom logs | .eventloom/*.jsonl | Critical: immutable source of truth |
| Neo4j database | Docker volume neo4j_data | High: can be rebuilt from Eventloom |
| Pathlight traces | Pathlight deployment volume, if enabled | Medium: observability only |

Backup Procedures

scripts/backup.sh \
  --root . \
  --output-dir /backups/zaxy \
  --name "zaxy-$(date -u +%Y%m%dT%H%M%SZ)"

This archives .eventloom/ and non-secret operational docs/config templates, excludes secrets/ and .certs/, and writes a .sha256 manifest next to the archive. Neo4j can be rebuilt from Eventloom; take a separate Neo4j dump only when fast point-in-time restore matters more than minimizing backup surface.

Recovery Procedures

scripts/restore.sh \
  --archive /backups/zaxy/zaxy-20260506T120000Z.tar.gz \
  --manifest /backups/zaxy/zaxy-20260506T120000Z.sha256 \
  --target /srv/zaxy

Restore validates the checksum before extraction and refuses to overwrite an existing target .eventloom/ unless --force is provided.
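
For readers who want to check an archive by hand, the two safety rules above (checksum before extraction, no overwrite without --force) can be sketched as follows. This is not the code in scripts/restore.sh; it assumes the manifest uses the common sha256sum layout of "<hex digest>  <file name>", and all function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large archives need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches_manifest(archive: Path, manifest: Path) -> bool:
    """Compare the archive digest with the first field of the manifest.

    Assumes a sha256sum-style line: "<hex digest>  <file name>".
    """
    fields = manifest.read_text(encoding="utf-8").split()
    return bool(fields) and fields[0].lower() == sha256_of(archive)


def safe_to_restore(target: Path, force: bool = False) -> bool:
    """Refuse to overwrite an existing .eventloom/ unless force is set."""
    return force or not (target / ".eventloom").exists()
```

Checking the digest first means a truncated or tampered archive is rejected before any file in the target directory is touched.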

Monitoring & Alerting

Key Metrics

| Metric | Target | Alert If |
|--------|--------|----------|
| Event append latency | <50ms | >100ms |
| Graph upsert latency | <100ms | >200ms |
| Hybrid query latency | <200ms | >500ms |
| Event log size | <10GB | >50GB |
| Neo4j disk usage | <80% | >90% |
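
The latency rows above leave a band between the target and the alert threshold. One way to treat that band when wiring up a check is sketched below. The metric keys and the "watch" state are assumptions made for this example; only the numbers come from the table.

```python
# Targets and alert thresholds for the three latency metrics, in milliseconds.
LATENCY_THRESHOLDS_MS = {
    "event_append": (50, 100),
    "graph_upsert": (100, 200),
    "hybrid_query": (200, 500),
}


def classify(metric: str, observed_ms: float) -> str:
    """Return 'ok' under target, 'alert' above the alert threshold, else 'watch'."""
    target, alert = LATENCY_THRESHOLDS_MS[metric]
    if observed_ms > alert:
        return "alert"
    if observed_ms < target:
        return "ok"
    return "watch"


print(classify("event_append", 75))  # watch
```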

Neo4j Monitoring

// Check database state
CALL dbms.database.state("neo4j") YIELD status;

// Check index status
SHOW INDEXES;

// Check constraint status
SHOW CONSTRAINTS;

// Entity count
MATCH (e:Entity) RETURN count(e) AS entities;

// Provenance backbone count
MATCH (:Session)-[:HAS_EVENT]->(:Event) RETURN count(*) AS projected_events;

// Relationship count
MATCH ()-[r:RELATES]->() RETURN count(r) AS relations;

// Typed relationship label sample
MATCH p=()-[:CALLS_SYMBOL|DEFINES_SYMBOL|PROJECTED_LLM_PACKET]->() RETURN p LIMIT 25;

// Temporal validity check — entities without valid_to
MATCH (e:Entity) WHERE e.valid_to IS NULL RETURN count(e) AS active_entities;

Log Rotation

Eventloom logs grow indefinitely. Set up rotation:

scripts/rotate-logs.sh \
  --log .eventloom/default.jsonl \
  --archive-dir .eventloom/archive \
  --name "default-$(date -u +%Y%m%dT%H%M%SZ)"

Rotation copies the active JSONL file into an archive, writes a checksum manifest, then truncates the active file only after archive creation succeeds. Verify rotated logs with zaxy replay .eventloom/archive/<name>.jsonl.
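
The copy, checksum, then truncate ordering described above can be sketched in Python. This is an illustration of the ordering only, not the contents of scripts/rotate-logs.sh; the function name and the manifest layout are assumptions, and the sketch does not guard against a writer appending during rotation.

```python
import hashlib
import shutil
from pathlib import Path


def rotate_log(log_path: Path, archive_dir: Path, name: str) -> Path:
    """Copy the log to the archive, write a checksum, then truncate the source."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = archive_dir / f"{name}.jsonl"
    shutil.copy2(log_path, archived)

    # Confirm the copy is byte-identical before touching the active file.
    source_digest = hashlib.sha256(log_path.read_bytes()).hexdigest()
    copy_digest = hashlib.sha256(archived.read_bytes()).hexdigest()
    if source_digest != copy_digest:
        raise RuntimeError("archive copy does not match the active log")

    manifest = archive_dir / f"{name}.sha256"
    manifest.write_text(f"{copy_digest}  {archived.name}\n", encoding="utf-8")

    # Truncate only after both the archive and the manifest exist.
    log_path.write_bytes(b"")
    return archived
```

If any step before the truncation raises, the active log is left untouched, so a failed rotation cannot lose events.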

Troubleshooting

"Agent is hallucinating / using stale context"

  1. Check Eventloom: Verify the event was actually recorded.
   python -m zaxy replay .eventloom/work.jsonl --from-seq N
  2. Check graph temporal validity:
   MATCH (e:Entity {name: "X"})
   RETURN e.valid_from, e.valid_to, e.entity_type
  3. Pathlight trace (if enabled): Inspect the query/result metadata and operation timing.

"Hash chain verification failed"

  1. Identify the broken event:
   from zaxy.event import EventLog
   log = EventLog(".eventloom/work.jsonl")
   report = log.verify()
   print(f"Broken at seq: {report.broken_at_seq}")
  2. If tampered: Restore from backup. Eventloom logs are append-only and should never be modified.
  3. If corrupted disk: Check filesystem integrity (fsck, SMART tests).

"Neo4j connection refused"

  1. Check container status:
   docker compose ps neo4j
   docker compose logs neo4j
  2. Check memory: Neo4j needs at least 2GB heap.
   docker compose exec neo4j neo4j-admin server memory-recommendation  # Neo4j 5.x; on 4.x use: neo4j-admin memrec
  3. Check ports:
   netstat -tlnp | grep 7687
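
On hosts without netstat, the same port check can be done from Python with only the standard library. The helper name is illustrative; 7687 is the Bolt port from the default NEO4J_URI and 7474 is the Neo4j HTTP port used in the health checks.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for port in (7687, 7474):
        print(port, "open" if port_open("localhost", port) else "closed")
```

A closed port here with a running container usually points at a port mapping or bind-address problem rather than at Neo4j itself.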

Performance Degradation

  1. Query slow? Check Neo4j query plan:
   PROFILE MATCH (e:Entity {name: "X"}) RETURN e;
  2. Event append slow? Check disk I/O:
   iostat -x 1
  3. Graph upsert slow? Check for missing indexes:
   SHOW INDEXES;

Scaling Considerations

Current

Future Scale-Out

Security

Encryption

Access Control

// Neo4j: Create read-only user for agents
CREATE USER agent_reader SET PASSWORD 'secure_password';
GRANT ROLE reader TO agent_reader;

Secrets Management

Use Docker secrets or *_FILE environment variables for sensitive settings. Direct environment variables take precedence over file-backed values.

# Local production scaffold
./scripts/setup.sh --production

# Starts Neo4j and Zaxy with Docker secrets from ./secrets/
./scripts/generate-certs.sh .certs
docker compose -f docker-compose.prod.yml up -d

Production mode rejects NEO4J_PASSWORD=testpassword. When using the generated custom CA, set NEO4J_URI=bolt://... with NEO4J_CA_CERT so the Neo4j driver enables encryption and trusts the mounted CA.

For external secret managers such as Vault or AWS Secrets Manager, write values to mounted files and set:

NEO4J_PASSWORD_FILE=/run/secrets/neo4j_password
MCP_ADMIN_TOKEN_FILE=/run/secrets/mcp_admin_token
PATHLIGHT_ACCESS_TOKEN_FILE=/run/secrets/pathlight_access_token
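
The precedence rule stated above (a direct variable wins over its *_FILE counterpart) can be sketched as a small resolver. This is not Zaxy's configuration loader; the function name is hypothetical, and treating an empty direct value as unset is an assumption of this example.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_secret(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return NAME if set, otherwise the contents of the file named by NAME_FILE."""
    direct = env.get(name)
    if direct:
        return direct  # direct variables win over file-backed values
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # Secret files commonly end with a newline; strip it.
        return Path(file_path).read_text(encoding="utf-8").strip()
    return None
```

For example, resolve_secret("NEO4J_PASSWORD") would read /run/secrets/neo4j_password only when NEO4J_PASSWORD itself is not set.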

Maintenance Windows

Weekly

Monthly

Quarterly

Go-Live Gate

Before promoting a build, run:

zaxy doctor --release-smoke
scripts/release-check.sh --root .

The release smoke check verifies the package version, changelog entry, publish workflow, and PyPI Trusted Publishing posture. The release gate runs ruff, mypy, the full coverage-gated pytest suite, Python artifact build/metadata validation, public site/documentation validation, and deployment validation. A release is not ready until all six gates pass, the production .env points at TLS-enabled Neo4j, remote MCP/SSE bearer auth is configured, and secret files are not world-readable.
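
The last condition above, that secret files are not world-readable, is easy to check by hand on a POSIX host. A minimal sketch, assuming secrets live as regular files in a single directory such as ./secrets/; the function names are illustrative and not part of the release scripts.

```python
import stat
from pathlib import Path


def world_readable(path: Path) -> bool:
    """Return True if the 'other' read bit is set on the file's mode."""
    return bool(path.stat().st_mode & stat.S_IROTH)


def exposed_secrets(secrets_dir: Path) -> list:
    """List regular files under secrets_dir that any local user can read."""
    return sorted(p for p in secrets_dir.iterdir() if p.is_file() and world_readable(p))
```

An empty list from exposed_secrets(Path("secrets")) means no file in that directory grants read access to other users; a mode such as 0600 satisfies the check.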

Prometheus Alerts

groups:
  - name: zaxy-degraded-mode
    rules:
      - alert: ZaxyGraphFallbacks
        expr: increase(zaxy_degraded_operations_total{reason=~"graph_.*"}[10m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Zaxy graph degradation detected
      - alert: ZaxyEmbeddingFallbacks
        expr: increase(zaxy_degraded_operations_total{reason="embedding_provider_unavailable"}[10m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Zaxy embedding provider unavailable
      - alert: ZaxyRerankerFallbacks
        expr: increase(zaxy_degraded_operations_total{reason="reranker_unavailable"}[10m]) > 0
        for: 10m
        labels:
          severity: info
        annotations:
          summary: Zaxy reranker degraded to MMR

Incident Response

Severity Levels

| Level | Example | Response Time |
|-------|---------|---------------|
| P0 | Data loss, all agents down | Immediate |
| P1 | Query failures, single agent down | <1 hour |
| P2 | Performance degradation | <4 hours |
| P3 | Observability gaps | <24 hours |

P0: Data Loss

  1. Stop all writes immediately
  2. Restore from most recent backup
  3. Replay Eventloom from last known good state
  4. Verify graph consistency
  5. Post-mortem within 24 hours

Escalation

Reference

Environment Variables

| Variable | Default | Purpose |
|----------|---------|---------|
| NEO4J_URI | bolt://localhost:7687 | Neo4j Bolt URI |
| NEO4J_USER | neo4j | Neo4j username |
| NEO4J_PASSWORD | testpassword | Neo4j password |
| NEO4J_PASSWORD_FILE | unset | File containing Neo4j password |
| NEO4J_CA_CERT | unset | CA certificate path for encrypted custom-CA Bolt connections |
| NEO4J_TRUST_ALL | false | Trust all Neo4j certs; development only |
| NEO4J_AUTO_START | true | Auto-start a local Docker Neo4j container for localhost development MCP startup |
| NEO4J_AUTO_START_IMAGE | neo4j:5.26-community | Docker image used by local Neo4j auto-start |
| NEO4J_AUTO_START_CONTAINER | zaxy-neo4j | Container name used by local Neo4j auto-start |
| PATHLIGHT_URL | http://localhost:4100 | Pathlight collector |
| PATHLIGHT_ENABLED | false | Enable Pathlight client and health check |
| PATHLIGHT_ACCESS_TOKEN | unset | Optional Pathlight token |
| PATHLIGHT_ACCESS_TOKEN_FILE | unset | File containing optional Pathlight token |
| TRACE_RAW_QUERIES | false | Include raw query text in traces |
| EVENTLOOM_PATH | .eventloom | Event log directory |
| EVENTLOOM_THREAD | default | Default session/log name |
| ZAXY_DOMAIN | unset | Stable project/domain label used by generated MCP configs |
| ZAXY_ENV | development | Runtime environment; production enables stricter config validation |
| MCP_ADMIN_TOKEN | unset | Optional token for replay/invalidate tools |
| MCP_ADMIN_TOKEN_FILE | unset | File containing optional admin token |
| MCP_REMOTE_AUTH_TOKEN | unset | Bearer token required for remote MCP/SSE requests when configured |
| MCP_REMOTE_AUTH_TOKEN_FILE | unset | File containing remote MCP/SSE bearer token |
| MCP_REMOTE_SESSION_HEADER | x-zaxy-session-id | HTTP header that scopes remote MCP/SSE requests to a session |
| MCP_OIDC_ISSUER | unset | OIDC issuer for remote MCP/SSE JWT validation |
| MCP_OIDC_AUDIENCE | unset | Expected JWT audience for remote MCP/SSE |
| MCP_OIDC_JWKS_URL | unset | JWKS URL for remote MCP/SSE JWT signatures |
| MCP_OIDC_REQUIRED_SCOPE | zaxy:mcp | Required OAuth scope for remote MCP/SSE |
| MCP_OIDC_SESSION_CLAIM | zaxy_session | JWT claim containing the Zaxy session/tenant ID |
| MCP_OIDC_CLIENT_SECRET_FILE | unset | Optional OIDC client secret file for future introspection flows |
| MCP_RATE_LIMIT_ENABLED | true | Enable session-scoped remote MCP/SSE request rate limiting |
| MCP_RATE_LIMIT_REQUESTS | 120 | Maximum remote MCP/SSE requests per window |
| MCP_RATE_LIMIT_WINDOW_SECONDS | 60 | Remote MCP/SSE rate-limit window, in seconds |
| MCP_AUDIT_ENABLED | false | Export remote MCP/SSE request audit JSONL |
| MCP_AUDIT_PATH | .eventloom/remote_audit.jsonl | Remote MCP/SSE request audit JSONL path |
| QUERY_DEFAULT_LIMIT | 10 | Default query result limit |
| CONTEXT_VERBATIM_ENABLED | true | Include exact Eventloom source recall in assembled context |
| CONTEXT_VERBATIM_SLOTS | 1 | Assembled context slots reserved for verbatim source recall |
| EMBEDDING_ENABLED | true | Generate embeddings for vector search |
| EMBEDDING_PROVIDER | hash | Embedding provider: hash, openai, local-http, or sentence-transformers |
| EMBEDDING_DIMENSION | 1536 | Vector dimension; must match the Neo4j vector index |
| OPENAI_API_KEY | unset | OpenAI API key for hosted embeddings |
| OPENAI_API_KEY_FILE | unset | File containing OpenAI API key |
| OPENAI_EMBEDDING_MODEL | text-embedding-3-small | OpenAI embedding model |
| OPENAI_BASE_URL | https://api.openai.com/v1 | OpenAI-compatible API base URL |
| EMBEDDING_HTTP_URL | unset | Local HTTP embedding endpoint for local-http |
| EMBEDDING_HTTP_MODEL | unset | Optional local HTTP embedding model name |
| EMBEDDING_HTTP_API_KEY | unset | Optional bearer token for local HTTP embeddings |
| EMBEDDING_SENTENCE_TRANSFORMER_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Local model used by sentence-transformers; install zaxy-memory[local-embeddings] |
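
The defaults for MCP_RATE_LIMIT_REQUESTS (120) and MCP_RATE_LIMIT_WINDOW_SECONDS (60) describe a per-session request budget. The source does not say which windowing algorithm Zaxy uses, so the sketch below assumes a simple fixed window purely to illustrate how the two settings interact; the class and its clock parameter are hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    """Allow at most max_requests per session within each window."""

    def __init__(
        self,
        max_requests: int = 120,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows: Dict[str, Tuple[float, int]] = {}  # session -> (start, count)

    def allow(self, session_id: str) -> bool:
        """Count one request for the session and report whether it is allowed."""
        now = self.clock()
        start, count = self._windows.get(session_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # window expired; start a fresh one
        if count >= self.max_requests:
            self._windows[session_id] = (start, count)
            return False
        self._windows[session_id] = (start, count + 1)
        return True
```

Keying the counters by session ID means one busy session exhausts only its own budget and does not throttle other sessions.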

CLI Commands

zaxy serve          # Start MCP stdio server
zaxy serve --transport sse --port 8080  # Start MCP SSE server bound to localhost
zaxy ide-config claude-desktop --eventloom-path .eventloom  # Print first-run MCP config
zaxy local-profile --output .env.local  # Write offline retrieval profile
zaxy local-profile --check  # Validate deterministic local retrieval providers
zaxy init-session . --session-id zaxy-default  # Append workspace genesis profile event
zaxy index-codebase . --session-id zaxy-default  # Append codebase file, symbol, import, dependency, call, and coverage events
zaxy memory status --eventloom-path .eventloom  # Inspect Eventloom sessions, latest hashes, and integrity
zaxy memory log --eventloom-path .eventloom --limit 20  # Show recent Eventloom events
zaxy memory diff --eventloom-path .eventloom --session-id zaxy-default --from-seq 10 --to-seq 20  # Show added events in a sequence range
zaxy replay PATH    # Replay Eventloom log
zaxy compact PATH --audit  # Audit compaction safety without rewriting the log
zaxy compact PATH   # Compact log + create snapshot
zaxy status         # Check service health
scripts/backup.sh --root . --output-dir backups
scripts/restore.sh --archive backups/zaxy.tar.gz --manifest backups/zaxy.sha256 --target restored
scripts/rotate-logs.sh --log .eventloom/default.jsonl
scripts/validate-deployment.sh --root .
scripts/build-dist.sh --root .
scripts/validate-docs.sh --root .
scripts/release-check.sh --root .

MCP Tools

| Tool | Purpose |
|------|---------|
| memory_append | Write event to log + graph |
| memory_query | Hybrid retrieval from graph |
| memory_replay | Replay session events; requires admin_token if configured |
| memory_invalidate | Soft-delete (bi-temporal); requires admin_token if configured |

---

Last updated: 2026-05-05