/goal Build observational-memory-bench as the benchmark and optimization harness for Observational Memory. Read observational_memory_bench_design_and_implementation_plan.md first. Implement iteratively: create repo/package skeleton; add ombench CLI; define scenario/event/fact/result schemas; add isolated sandbox runner that never touches real user OM data; add OM CLI memory adapter plus no-memory/full-context baselines; create a mini suite of 12 synthetic coding-memory scenarios covering observe, reflect, startup context, recall/search, cross-agent transfer, cluster sync, namespace isolation, stale-state handling, and safety canaries; implement deterministic scorers for atomic fact precision/recall, hallucination, stale facts, retrieval Hit@k/MRR, startup coverage/budget, secret leakage, and cluster convergence; generate JSONL results and Markdown reports with OM version, model profiles, costs/latencies where available, and artifact paths. Add docs for fairness, dataset schema, scoring, model sweeps, future DSPy/GEPA optimization, and fine-tuning data export. Do not implement full DSPy/GEPA or fine-tuning in the first pass; create clean extension seams. Use temp HOME/XDG dirs and fixture repos. Add tests for schemas, scorers, sandbox isolation, OM adapter behavior, and mini scenario execution. Run pytest and ruff. Stop only if a change would require destructive access to real memories, proprietary agent automation that cannot be made optional, or public release of private trace data.
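
For orientation only, here is a minimal Python sketch of two pieces named above: the temp HOME/XDG sandbox and a few of the deterministic scorers (atomic fact precision/recall, Hit@k, MRR). Every function name here (`sandbox_env`, `fact_precision_recall`, `hit_at_k`, `reciprocal_rank`, `mean_reciprocal_rank`) is hypothetical and does not come from the design document. The sketch also assumes that the OM CLI resolves its storage from `HOME` or the standard XDG variables, and that facts and retrieved items can be compared as normalized string IDs. Verify both assumptions against the plan before relying on them.

```python
from __future__ import annotations

import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Sequence

# Standard XDG Base Directory variables, plus HOME.
_SANDBOX_DIRS = {
    "HOME": "home",
    "XDG_CONFIG_HOME": "config",
    "XDG_DATA_HOME": "data",
    "XDG_STATE_HOME": "state",
    "XDG_CACHE_HOME": "cache",
}


@contextmanager
def sandbox_env() -> Iterator[dict[str, str]]:
    """Yield an env mapping whose HOME/XDG dirs live under a temp directory.

    The mapping is meant to be passed as ``env=`` to subprocess calls; the
    parent process environment is never modified. ASSUMPTION: the tool under
    test honors these variables.
    """
    with tempfile.TemporaryDirectory(prefix="ombench-") as root:
        env = dict(os.environ)
        for var, subdir in _SANDBOX_DIRS.items():
            path = Path(root) / subdir
            path.mkdir(parents=True, exist_ok=True)
            env[var] = str(path)
        yield env  # the temp tree is deleted when the block exits


def fact_precision_recall(
    predicted: set[str], expected: set[str]
) -> tuple[float, float]:
    """Exact-match precision and recall over atomic fact IDs.

    Convention (an assumption): an empty set scores 0.0 rather than raising.
    """
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def hit_at_k(ranked: Sequence[str], relevant: set[str], k: int) -> float:
    """Return 1.0 if any of the top-k ranked items is relevant, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(
    queries: Sequence[tuple[Sequence[str], set[str]]]
) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
```

A runner could wrap each scenario in `with sandbox_env() as env:` and pass `env=env` to `subprocess.run`, so that a misconfigured adapter writes into the temp tree rather than real user data. Keeping the scorers as pure functions over sets and ranked lists makes them trivially deterministic and easy to cover with pytest.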
