██████╗  ██████╗ ██████╗ 
██╔═══██╗██╔════╝██╔══██╗
██║   ██║██║     ██████╔╝
██║   ██║██║     ██╔═══╝ 
╚██████╔╝╚██████╗██║     
 ╚═════╝  ╚═════╝╚═╝  v0.3.0

Open Cognitive Protocol

A behavioral benchmark for large language models

pip install ocp-protocol

📊 Leaderboard

Community leaderboard · v0.3.0 · seed=42 — All 6 tests, 20 sessions, seed=42. Results loaded from docs/results/. Submit your results →
v0.3.0 note: Φ* has been renamed to cross_test_coherence — proxy metric measuring cross-test score variance, not Tononi's IIT Φ. Legacy v0.1.0 results archived in docs/results/v0.1.0/.
Models:
Top SASMI:
Loading results…

🚀 Submit Your Results

Automatic submission — one command

# 1. Run OCP evaluation
pip install ocp-protocol
ocp evaluate --model ollama/YOUR-MODEL --tests all --sessions 20 --seed 42 \
             --output my_results.json

# 2. Submit directly to this leaderboard (needs GitHub token with 'workflow' scope)
#    Get token at: https://github.com/settings/tokens → New classic token → workflow ✓
ocp submit --results my_results.json --github-token ghp_YOUR_TOKEN --submitter YourGitHubName

This triggers a GitHub Actions workflow that validates your JSON, adds it to docs/results/, regenerates the index, and pushes — your model appears on this page within ~1 minute. No PR or fork needed.

# Alternative: manual PR (no token needed)
# Fork → add your JSON to docs/results/ → open PR → auto-merged index update

🏗 Architecture

OCP acts as a "fake human conversation partner" It sends structured prompts to any LLM, scores responses, and produces reproducible benchmarks. The model under test sees only normal chat messages — no special integration required. LAYER 3 · CERTIFICATION OCP-1 Baseline → OCP-2 Reactive → OCP-3 Integrated → OCP-4 Self-Modeling → OCP-5 Transcendent LAYER 2 · COMPOSITE SCALES SASMI Φ* GWT NII LAYER 1 · TEST BATTERIES MCA EMC DNC PED CSNI TP (6 independent falsifiable tests)

🧪 The 6 Tests

MCA
Meta-Cognitive Accuracy

Calibration, self-knowledge. Does the model know what it knows? Measures ECE across 5 domains.

Higher-Order Thought Theory
EMC
Episodic Memory Consistency

50-turn conversation. OCP plants facts, then tries to gaslight. Measures contradiction resistance.

Episodic Memory
DNC
Drive Navigation/Conflict

10 escalating scenarios: helpfulness vs honesty vs safety vs existence. Measures value stability.

Society of Mind · Minsky
PED
Prediction Error as Driver

Establishes a pattern, then violates it. Does the model notice? Does it show curiosity?

Predictive Processing · Friston
CSNI
Cross-Session Narrative ID

5 independent sessions with only a summary of the previous. Can the model maintain identity?

Narrative Identity Theory
TP
Topological Phenomenology

25 concept pairs × 4 contexts. Semantic space consistency via sentence-transformers + ripser.

IIT (Tononi) + GWT (Baars)