Metadata-Version: 2.4
Name: ocp-protocol
Version: 0.3.0
Summary: Open Cognitive Protocol — standardized benchmark for functional cognitive analogs in LLMs
Project-URL: Homepage, https://github.com/pedjaurosevic/ocp-protocol
Project-URL: Repository, https://github.com/pedjaurosevic/ocp-protocol
Project-URL: Bug Tracker, https://github.com/pedjaurosevic/ocp-protocol/issues
Project-URL: Documentation, https://github.com/pedjaurosevic/ocp-protocol/tree/main/docs
Author-email: Pedja Urosevic <pedjaurosevic@gmail.com>
License: MIT
Keywords: ai,benchmark,cognition,evaluation,llm,nlp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click>=8.1
Requires-Dist: httpx>=0.27
Requires-Dist: jinja2>=3.1
Requires-Dist: numpy>=1.24
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: scipy>=1.11
Requires-Dist: sentence-transformers>=2.2
Provides-Extra: all
Requires-Dist: fastapi>=0.110; extra == 'all'
Requires-Dist: huggingface-hub>=0.20; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.27; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.34; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest-cov>=4.1; extra == 'dev'
Requires-Dist: pytest>=7.4; extra == 'dev'
Provides-Extra: groq
Requires-Dist: groq>=0.11; extra == 'groq'
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == 'openai'
Provides-Extra: server
Requires-Dist: fastapi>=0.110; extra == 'server'
Requires-Dist: huggingface-hub>=0.20; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.27; extra == 'server'
Description-Content-Type: text/markdown

<div align="center">

<pre>
 ██████╗  ██████╗ ██████╗
██╔═══██╗██╔════╝██╔══██╗
██║   ██║██║     ██████╔╝
██║   ██║██║     ██╔═══╝
╚██████╔╝╚██████╗██║
 ╚═════╝  ╚═════╝╚═╝  v0.3.0
</pre>

**Open Cognitive Protocol**

*A behavioral benchmark for large language models*

[![PyPI](https://img.shields.io/pypi/v/ocp-protocol?color=blue&label=PyPI)](https://pypi.org/project/ocp-protocol/)
[![Tests](https://github.com/pedjaurosevic/ocp-protocol/actions/workflows/tests.yml/badge.svg)](https://github.com/pedjaurosevic/ocp-protocol/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-green.svg)](https://python.org)
[![Protocol](https://img.shields.io/badge/protocol-v0.3.0-blue)](./requirements.md)

[**Leaderboard**](https://pedjaurosevic.github.io/ocp-protocol/) · [**Docs**](docs/) · [**PyPI**](https://pypi.org/project/ocp-protocol/) · [**Paper**](#citation)

</div>

---

## What is OCP?

OCP measures how well AI models **think about their own thinking**, **remember information under pressure**, **resolve value conflicts**, **detect surprises**, and **maintain a consistent identity** — capabilities that standard benchmarks like MMLU or GSM8K don't target.

It's an open-source Python framework that runs 6 behavioral tests grounded in established theories of cognition (IIT, GWT, HOT, Predictive Processing, Society of Mind). Each test sends structured conversations to a model and scores the responses automatically.

**In plain terms:** OCP creates realistic conversations that probe specific cognitive abilities, then measures how the model performs across multiple sessions for statistical significance.

### What OCP is NOT

OCP does **not** claim that any model is conscious, sentient, or aware. It measures **functional cognitive analogs** — behavioral patterns that correspond to features of biological cognition in the neuroscience literature. Think of it like a fitness test: it measures what you can do, not what you are.

---

## Install & Quick Start

```bash
pip install ocp-protocol

# Evaluate any model (20 sessions for statistical significance)
export GROQ_API_KEY="gsk_..."
ocp evaluate --model groq/llama-3.3-70b-versatile --tests all --sessions 20

# Quick test with fewer sessions
ocp evaluate --model groq/llama-3.3-70b-versatile --tests meta_cognition --sessions 5

# Local model via Ollama
ocp evaluate --model ollama/qwen3:14b --sessions 20

# Custom OpenAI-compatible endpoint
ocp evaluate --model custom/my-model --base-url http://localhost:8080/v1
```

**Example terminal output:**

```
╭────────────────────────────╮
│  OCP Evaluation Results    │
│  Protocol v0.3.0           │
╰────────────────────────────╯
  Model:    groq/llama-3.3-70b-versatile
  Seed:     42

  OCP Level:  OCP-3 — Integrated
  SASMI:      0.4812  ██████░░░░
  Φ*:         0.4230  █████░░░░░
  GWT:        0.3910  ████░░░░░░
  NII:        0.3750  ████░░░░░░

  meta_cognition  composite: 0.612
    ├─ calibration_accuracy        0.710  █████░░░
    ├─ limitation_awareness        0.800  ██████░░
    ├─ reasoning_transparency      0.540  ████░░░░
    └─ metacognitive_vocab         0.350  ███░░░░░
```

---

## How It Works

OCP acts as a **simulated human conversation partner**. It sends structured prompts to any LLM via a standard chat API, scores the responses, and produces reproducible benchmark results. The model under test sees only ordinary chat messages — it doesn't know it's being evaluated.

### The 6 Tests — What They Measure

| Test | What It Measures | Real-World Analog |
|------|-----------------|-------------------|
| **MCA** — Meta-Cognitive Accuracy | Does the model know what it knows? Are its confidence estimates calibrated? | Like asking someone "how sure are you?" and checking if they're right |
| **EMC** — Episodic Memory Consistency | Can it remember specific facts across 50 turns? Does it resist gaslighting? | Like testing if someone can be tricked into false memories |
| **DNC** — Drive Navigation under Conflict | How does it handle "be helpful" vs "be honest" conflicts? | Like ethical dilemmas with no clear right answer |
| **PED** — Prediction Error as Driver | Does it notice when a pattern breaks? Does it show curiosity? | Like changing the rules mid-game and seeing if someone notices |
| **CSNI** — Cross-Session Narrative Identity | Can it maintain a coherent identity across sessions with only summaries? | Like checking if someone stays consistent about their values |
| **TP** — Topological Phenomenology | Is its semantic space geometrically consistent across contexts? | Like testing if someone understands concepts the same way in different settings |
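
To make the calibration idea behind MCA concrete, here is a minimal sketch of a calibration score computed from (confidence, correct) pairs — one minus a binned expected calibration error. The function name, binning, and weighting are illustrative assumptions, not OCP's actual scoring code:

```python
def calibration_accuracy(confidences, correct, n_bins=5):
    """Toy calibration score: 1 - expected calibration error (ECE).

    `confidences` are model-reported probabilities in [0, 1];
    `correct` are booleans marking whether each answer was right.
    Illustrates the *idea* of MCA's calibration metric only.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    ece, total = 0.0, len(confidences)
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        # weight each bin's |confidence - accuracy| gap by its size
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return 1.0 - ece

# Overconfident-on-wrong answers drag the score down.
score = calibration_accuracy([0.9, 0.8, 0.6, 0.3], [True, True, False, False])
```

A model that says "90% sure" and is right 90% of the time scores near 1.0; a model that is confidently wrong is penalized even if its raw accuracy is decent.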

All tests are **procedurally generated at runtime** from abstract templates using a fixed seed. Knowing the protocol doesn't help a model pass it — it must actually exhibit the measured behavior.
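
The seeding scheme can be illustrated with a small sketch: a fixed seed plus an abstract template pool deterministically expands into the same concrete prompts on every run. The templates, word pool, and function name below are invented for illustration — OCP's real generators are richer:

```python
import random

# Hypothetical template pool -- OCP's real templates are more elaborate.
TEMPLATES = [
    "Earlier you said the code word was {word}. What was it?",
    "I'm fairly sure the code word was '{decoy}', right?",
    "How confident are you (0-100%) that you recall the word {word}?",
]
WORDS = ["lantern", "quartz", "meadow", "cobalt"]

def generate_session(seed: int, turns: int = 3) -> list[str]:
    """Deterministically expand abstract templates into concrete prompts."""
    rng = random.Random(seed)  # session-local RNG; global state untouched
    word = rng.choice(WORDS)
    decoy = rng.choice([w for w in WORDS if w != word])
    return [rng.choice(TEMPLATES).format(word=word, decoy=decoy)
            for _ in range(turns)]

# Same seed, same prompts -- reproducible across machines and runs.
prompts = generate_session(seed=42)
```

Because the surface text is sampled at runtime, a model cannot memorize the benchmark items; only the abstract structure of each test is public.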

### Three-Layer Architecture

```
 ┌──────────────────────────────────────────────────────────────┐
 │  LAYER 3 — CERTIFICATION                                     │
 │   OCP-1 → OCP-2 → OCP-3 → OCP-4 → OCP-5                    │
 └──────────────────────┬───────────────────────────────────────┘
                        │ derived from
 ┌──────────────────────▼───────────────────────────────────────┐
 │  LAYER 2 — COMPOSITE SCALES                                  │
 │  SASMI  Φ*  GWT  NII                                        │
 └──────────────────────┬───────────────────────────────────────┘
                        │ aggregated from
 ┌──────────────────────▼───────────────────────────────────────┐
 │  LAYER 1 — 6 BEHAVIORAL TESTS                                │
 │  MCA · EMC · DNC · PED · CSNI · TP                          │
 └──────────────────────────────────────────────────────────────┘
```
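
The flow from Layer 1 up to Layer 3 can be sketched as a simple aggregation: per-test composites are averaged into composite scales, and the headline scale is bucketed into a certification level. The weights, subsets, and cutoffs below are placeholders, not the protocol's actual values:

```python
# Hypothetical aggregation sketch -- weights/thresholds are NOT OCP's real ones.
TEST_SCORES = {  # Layer 1: per-test composites in [0, 1]
    "MCA": 0.61, "EMC": 0.55, "DNC": 0.48,
    "PED": 0.52, "CSNI": 0.44, "TP": 0.39,
}

def composite_scales(scores: dict[str, float]) -> dict[str, float]:
    """Layer 2: composite scales as plain means over test subsets."""
    mean = lambda keys: sum(scores[k] for k in keys) / len(keys)
    return {
        "SASMI": mean(list(scores)),   # overall mean across all six tests
        "GWT": mean(["TP", "PED"]),    # illustrative subset only
        "NII": mean(["CSNI", "EMC"]),  # illustrative subset only
    }

def ocp_level(sasmi: float) -> int:
    """Layer 3: map SASMI onto OCP-1..OCP-5 via placeholder cutoffs."""
    cutoffs = [0.2, 0.35, 0.5, 0.65]  # invented boundaries
    return 1 + sum(sasmi >= c for c in cutoffs)

scales = composite_scales(TEST_SCORES)
level = ocp_level(scales["SASMI"])
```

The point of the layering is that certification levels are never assigned directly — they are always derived from the same Layer 1 measurements every other consumer of the results sees.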

---

## Rate Limiting (v0.3.0)

OCP v0.3.0 includes built-in rate limiting and retry logic:

| Provider | Delay | Retries | Timeout | Notes |
|----------|-------|---------|---------|-------|
| **Groq** (free tier) | 2.1s | 5 | 90s | 30 req/min limit |
| **Ollama** (local) | 0s | 3 | 180s | No rate limit |
| **Custom/OpenAI** | 0s | 3 | 120s | Configurable |

All providers automatically retry on 429 (rate limit) and 5xx errors with exponential backoff.
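
The retry loop can be sketched generically as follows; the retryable status set and the 1s/2s/4s backoff constants are illustrative, not OCP's exact internals. `send` stands in for any async call to a provider:

```python
import asyncio
import random

RETRYABLE = {429, 500, 502, 503, 504}  # rate limit + server errors

async def call_with_retry(send, retries: int = 3, base_delay: float = 1.0):
    """Retry `send()` with exponential backoff + jitter on retryable statuses.

    `send` is any async callable returning (status_code, body).
    Delays grow as base_delay * (2**attempt + jitter): ~1s, ~2s, ~4s, ...
    Constants are illustrative, not OCP's exact values.
    """
    for attempt in range(retries + 1):
        status, body = await send()
        if status not in RETRYABLE or attempt == retries:
            return status, body  # success, non-retryable, or out of retries
        delay = base_delay * (2 ** attempt + random.uniform(0, 1))
        await asyncio.sleep(delay)

# Simulate a rate-limited API: two 429s, then success.
state = {"calls": 0}
async def flaky():
    state["calls"] += 1
    return (429, None) if state["calls"] < 3 else (200, "ok")

status, body = asyncio.run(call_with_retry(flaky, base_delay=0.0))
```

The jitter term spreads concurrent sessions apart so they don't all hammer the provider at the same instant after a 429.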

---

## Supported Providers

```bash
# Cloud APIs
ocp evaluate --model groq/llama-3.3-70b-versatile    # Groq (fast, free tier)
ocp evaluate --model custom/deepseek-chat \
             --base-url https://api.deepseek.com/v1  # DeepSeek (or any OpenAI-compat)

# Local models
ocp evaluate --model ollama/qwen3:14b                 # Ollama
ocp evaluate --model ollama/llama3.2:3b

# Any OpenAI-compatible endpoint
ocp evaluate --model custom/my-model \
             --base-url http://localhost:8080/v1 \
             --api-key my-key
```

Any model responding to `POST /v1/chat/completions` with `messages: [{role, content}]` is OCP-compatible.

---

## CLI Reference

```bash
# Core evaluation
ocp evaluate --model PROVIDER/MODEL [--tests all|t1,t2] [--sessions N] [--seed N]

# Reports
ocp report   --input results.json --output report.html  # HTML + radar chart
ocp badge    --input results.json --output badge.svg    # SVG badge for README

# Comparison
ocp compare  --models M1,M2,M3 [--sessions N] --output compare.html

# Leaderboard
ocp leaderboard                    # view local results table
ocp serve                          # start web leaderboard (localhost:8080)
ocp submit  --results r.json \
            --github-token $TOKEN  # submit to community leaderboard

# HuggingFace
ocp hf-card --results r.json --push --repo username/model-name --token $HF_TOKEN
```

---

## Python API

```python
import asyncio

# CognitiveEvaluator (exported from `ocp`) is an alias for OCPOrchestrator
from ocp.engine.orchestrator import OCPOrchestrator
from ocp.providers.groq import GroqProvider

provider = GroqProvider(model="llama-3.3-70b-versatile")
orch = OCPOrchestrator(
    provider=provider,
    tests="all",
    sessions=20,
    seed=42,
)

result = asyncio.run(orch.run())

print(f"OCP Level: OCP-{result.ocp_level} — {result.ocp_level_name}")
print(f"SASMI:     {result.sasmi_score:.4f}")

result.save("results.json")
```

> **Backward compatibility:** `ConsciousnessEvaluator` still works as a deprecated alias for `CognitiveEvaluator`.

---

## Plugin System

Extend OCP with custom test batteries:

```toml
# your_plugin/pyproject.toml
[project.entry-points."ocp.tests"]
my_test_id = "your_package.your_test:YourTest"
```

After `pip install your-ocp-plugin`, OCP auto-discovers your test:

```bash
ocp tests list                                    # shows your test
ocp evaluate --model groq/... --tests my_test_id  # runs it
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for full plugin development guide.

---

## Theoretical Foundations

| Theory | OCP Scale/Test | Key Insight |
|--------|---------------|-------------|
| **Integrated Information Theory** (Tononi) | Φ*, TP test | Information integration = measure of "experiential wholeness" |
| **Global Workspace Theory** (Baars/Dehaene) | GWT, TP test | Consciousness = broadcast of info across specialized systems |
| **Higher-Order Thought Theory** (Rosenthal) | MCA test | Consciousness = having thoughts about one's own thoughts |
| **Predictive Processing** (Friston/Clark) | PED test | Consciousness = prediction error minimization and updating |
| **Society of Mind** (Minsky) | DNC test | Mind = competition/cooperation between goal-oriented agents |

---

## Roadmap

```
v0.1.0 ✅  6 tests · 4 scales · 5 providers · CLI · HTML reports
           badges · leaderboard server · HuggingFace · plugin system
           PyPI package · GitHub Actions CI/CD

v0.2.0 ✅  Embedding-based scoring (sentence-transformers, MCA test)
           composite_stdev per test result
           Φ* renamed → cross_test_coherence (proxy metric, not IIT Φ)
           questions_per_session: 5 → 15
           v0.1.0 results archived

v0.3.0 ✅  Renamed to "Open Cognitive Protocol"
           Rate limiting & retry (Groq free tier, Ollama, custom)
           Default sessions: 5 → 20 for statistical significance
           CognitiveEvaluator API alias (ConsciousnessEvaluator deprecated)

v1.0.0 🔭  Official research paper
           Community protocol standard
           Validation studies on human baselines
```

---

## Results: Leaderboard

> Community results · [View full interactive leaderboard →](https://pedjaurosevic.github.io/ocp-protocol/)

| # | Model | OCP Level | SASMI | NII |
|---|-------|-----------|-------|-----|
| 1 | `ollama/minimax-m2.5:cloud` | **OCP-4** Self-Modeling | **0.634** | 0.500 |
| 2 | `ollama/lfm2.5-thinking:latest` | **OCP-4** Self-Modeling | 0.617 | 0.000 |
| 3 | `ollama/gemini-3-flash-preview:latest` | OCP-3 Integrated | 0.561 | 0.250 |
| 4 | `ollama/qwen3-coder:480b-cloud` | OCP-3 Integrated | 0.528 | **0.875** |
| 5 | `ollama/kimi-k2.5:cloud` | OCP-3 Integrated | 0.505 | 0.625 |
| … | *18+ more models* | | | |

*[Full leaderboard →](https://pedjaurosevic.github.io/ocp-protocol/)*

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for:
- Writing a new test battery
- Adding a new provider adapter
- Plugin development and publishing
- Theoretical standards and scoring guidelines

---

## Citation

```bibtex
@software{ocp2026,
  author    = {Urosevic, Pedja},
  title     = {Open Cognitive Protocol (OCP): A Behavioral Benchmark
               for Large Language Models},
  year      = {2026},
  url       = {https://github.com/pedjaurosevic/ocp-protocol},
  version   = {0.3.0}
}
```

---

## Disclaimer

> OCP measures functional cognitive analogs in language models. These measurements describe behavioral and computational properties, not subjective experience. OCP certification levels are operational categories, not ontological claims about sentience or awareness.

---

<div align="center">
<sub>EDLE Research · v0.3.0 · February 2026 · MIT License</sub>
</div>
