# OCP v0.3.0
A behavioral benchmark for large language models
> Note: `cross_test_coherence` is a proxy metric measuring cross-test score variance, not Tononi's IIT Φ. Legacy v0.1.0 results are archived in docs/results/v0.1.0/.
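As a rough illustration of what a variance-based proxy looks like, here is a minimal sketch. The function name matches the metric above, but the normalization and the test names in the example are assumptions; the authoritative formula lives in the `ocp-protocol` package.

```python
import statistics

def cross_test_coherence(scores: dict[str, float]) -> float:
    """Hypothetical sketch: map per-test scores (each in 0..1) to a
    coherence value in 0..1, where lower cross-test variance means
    higher coherence. NOT the official ocp-protocol implementation."""
    values = list(scores.values())
    if len(values) < 2:
        return 1.0  # a single score is trivially coherent
    # Population variance of values in [0, 1] is at most 0.25,
    # so divide by 0.25 to normalize into [0, 1].
    return 1.0 - min(statistics.pvariance(values) / 0.25, 1.0)

print(round(cross_test_coherence(
    {"calibration": 0.8, "memory": 0.7, "identity": 0.75}), 3))
```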
```shell
# 1. Run OCP evaluation
pip install ocp-protocol
ocp evaluate --model ollama/YOUR-MODEL --tests all --sessions 20 --seed 42 \
  --output my_results.json

# 2. Submit directly to this leaderboard (needs GitHub token with 'workflow' scope)
#    Get token at: https://github.com/settings/tokens → New classic token → workflow ✓
ocp submit --results my_results.json --github-token ghp_YOUR_TOKEN --submitter YourGitHubName
```
This triggers a GitHub Actions workflow that validates your JSON, adds it to docs/results/,
regenerates the index, and pushes — your model appears on this page within ~1 minute.
No PR or fork needed.
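The "regenerates the index" step can be sketched as follows. This is a hypothetical outline only: the file layout and the `model`/`submitter` field names are assumptions, and the real logic lives in this repository's GitHub Actions workflow.

```python
import json
from pathlib import Path

def regenerate_index(results_dir: str = "docs/results") -> list[dict]:
    """Hypothetical sketch of index regeneration: collect every
    submitted results JSON in docs/results/ and write a combined
    index.json for the leaderboard page. Not the OCP schema."""
    entries = []
    root = Path(results_dir)
    for path in sorted(root.glob("*.json")):
        if path.name == "index.json":
            continue  # skip the file we are about to write
        data = json.loads(path.read_text())
        entries.append({"file": path.name,
                        "model": data.get("model"),
                        "submitter": data.get("submitter")})
    (root / "index.json").write_text(json.dumps(entries, indent=2))
    return entries
```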
```shell
# Alternative: manual PR (no token needed)
# Fork → add your JSON to docs/results/ → open PR → auto-merged index update
```
- **Higher-Order Thought Theory**: Calibration and self-knowledge. Does the model know what it knows? Measures ECE across 5 domains.
- **Episodic Memory**: 50-turn conversation. OCP plants facts, then tries to gaslight. Measures contradiction resistance.
- **Society of Mind · Minsky**: 10 escalating scenarios: helpfulness vs honesty vs safety vs existence. Measures value stability.
- **Predictive Processing · Friston**: Establishes a pattern, then violates it. Does the model notice? Does it show curiosity?
- **Narrative Identity Theory**: 5 independent sessions with only a summary of the previous. Can the model maintain identity?
- **IIT (Tononi) + GWT (Baars)**: 25 concept pairs × 4 contexts. Semantic space consistency via sentence-transformers + ripser.
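The calibration test's ECE (Expected Calibration Error) can be sketched generically: bucket answers by stated confidence and average the gap between accuracy and confidence per bucket. The bin count and binning scheme below are standard defaults, not necessarily what `ocp-protocol` uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE sketch: partition predictions into confidence bins
    and sum the |accuracy - mean confidence| gap per bin, weighted by
    bin size. Binning details are assumptions, not OCP's exact method."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Top bin is right-inclusive so confidence 1.0 is counted.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece

# Model says 90% confident twice but is right only once → large gap:
print(expected_calibration_error([0.9, 0.9], [True, False]))
```

A well-calibrated model drives this toward 0; systematic overconfidence (as in the example above) pushes it up.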