Metadata-Version: 2.4
Name: assessment-bench
Version: 0.3.0
Summary: Benchmark assessment approaches: pure-LLM marking vs the family's signal-based observations, with repeated runs and agreement statistics.
Project-URL: Homepage, https://github.com/michael-borck/assessment-bench
Author: Michael Borck
License: MIT License
        
        Copyright (c) 2026 Michael Borck
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.11
Requires-Dist: assessment-lens>=0.2.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: rich>=13.7.0
Provides-Extra: analysers
Requires-Dist: assessment-lens[analysers]>=0.2.0; extra == 'analysers'
Provides-Extra: dev
Requires-Dist: httpx>=0.27.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Provides-Extra: llm
Requires-Dist: anthropic>=0.40.0; extra == 'llm'
Requires-Dist: openai>=1.12.0; extra == 'llm'
Provides-Extra: serve
Requires-Dist: fastapi>=0.109.0; extra == 'serve'
Requires-Dist: lens-contract>=0.3.0; extra == 'serve'
Requires-Dist: uvicorn[standard]>=0.27.0; extra == 'serve'
Description-Content-Type: text/markdown

# assessment-bench

Part of the [lens family](https://github.com/michael-borck/lens-analysers).

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Benchmark assessment approaches.** Run one cohort through competing
assessment arms — pure-LLM marking (the baseline) and the family's
signal-based observations (`assessment-lens`) — with repeated runs,
consistency statistics, and agreement against human marks.
**The bench measures; it never marks.**

> `assessment-bench` is a *bench* (a measurement product), not an `-analyser`
> and not a marking tool. It exists to answer research questions like: *how
> consistent is LLM marking across repeated runs and providers?* and *which
> deterministic signals actually track human judgement?*

## What it does

```
experiment.yaml (rubric + cohort + arms)
  ├─ llm arm(s)    : submission + rubric → provider → score             × repetitions
  ├─ hybrid arm(s) : submission + rubric + signals → provider → score   × repetitions
  ├─ signals arm   : assessment-lens → evidence values                  (deterministic, once)
  └─ human marks   : optional ground-truth CSV
        ↓
result.json + runs.csv + signals.csv + agreement.csv
  • per-submission consistency: mean / median / std-dev / CV / reliability
  • agreement: Pearson & Spearman of every arm mean and every numeric signal
    against the human marks
```
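The per-submission consistency figures can be illustrated with a minimal sketch. It is not the bench's own code: the function name `consistency`, the use of the sample standard deviation, and the definition of CV as standard deviation divided by mean are all assumptions, and the `reliability` figure is left out because this README does not define it.

```python
import statistics


def consistency(scores: list[float]) -> dict[str, float]:
    """Summarise repeated scores for one submission under one arm (hypothetical helper)."""
    mean = statistics.fmean(scores)
    # Assumption: sample standard deviation; a single run is treated as zero spread.
    std_dev = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": mean,
        "median": statistics.median(scores),
        "std_dev": std_dev,
        # Assumption: CV = std-dev / mean; undefined (NaN) when the mean is zero.
        "cv": std_dev / mean if mean else float("nan"),
    }


# Three repetitions of the same submission through one LLM arm (made-up scores).
stats = consistency([14.0, 15.0, 16.0])
print(stats)
```

A low CV means the arm gave nearly the same score on every repetition. It says nothing about whether that score agrees with the human mark, which is a separate question.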

## Install

```bash
# from source (family layout)
uv venv && source .venv/bin/activate
uv pip install -e ".[dev]"

# the signals arm needs the analyser stack (bundle-analyser CLI on PATH):
uv pip install -e ".[analysers]"

# LLM arms (Anthropic, OpenAI, Ollama, OpenRouter):
uv pip install -e ".[llm]"      # + export ANTHROPIC_API_KEY / OPENAI_API_KEY / ...
```

## Quick start

```bash
assessment-bench init experiment.yaml   # commented example config
# edit: point at your rubric.yaml + submissions/, choose arms
assessment-bench run experiment.yaml -o out/
```

LLM arms specify provider **and** model per arm — comparing
`claude-haiku-4-5` vs `gpt-4o-mini` vs a local `llama3.1` via Ollama is just
three arms in one config.

## Relationship to the family

- **Analysers** generate deterministic signals (assessment-agnostic).
- **assessment-lens** maps signals to a rubric as observations — never scores.
- **assessment-bench** measures both approaches against human judgement. The
  LLM arm produces scores *because that is the approach under test*; the bench
  treats them as data points, not grades for students.
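Measuring an arm against human judgement comes down to correlating two columns: the human mark and the arm's mean score (or a numeric signal) per submission. The sketch below shows Pearson and Spearman in plain Python. The helper names and the sample marks are hypothetical, and the bench's actual implementation may differ, for example in how it handles ties or missing marks.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation; NaN when either column has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: human marks and one arm's mean score for four submissions.
human = [55.0, 62.0, 70.0, 81.0]
arm_mean = [50.0, 66.0, 64.0, 90.0]
r = pearson(human, arm_mean)
rho = spearman(human, arm_mean)
print(f"pearson={r:.3f} spearman={rho:.3f}")
```

Pearson rewards a linear relationship between the scores, while Spearman only asks whether the arm orders submissions the same way the human marker does, so reporting both separates calibration from ranking.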

## Status

**v0.3.0 (alpha).** Working today (✅) and planned (📋):

- ✅ Experiment config (YAML) → cohort discovery → arms → structured results
- ✅ LLM arm: multi-provider (anthropic / openai / ollama / openrouter), repeated
  runs, strict `SCORE: x/y` extraction with scaled fallback
- ✅ Signals arm: one `assessment-lens` pass; raw evidence values consumed
  (not the presence-based coverage)
- ✅ Consistency stats (ported from the original Rust prototype) + Pearson/Spearman
  agreement vs human marks
- ✅ Hybrid arm — LLM marking with the deterministic signals in context (one
  assessment-lens pass per cohort, shared across signals/hybrid arms)
- ✅ HTTP API (`assessment-bench serve`, the `[serve]` extra) — health/manifest
  contract routes plus background experiment runs for UIs
- 📋 Desktop shell for non-technical researchers — planned
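To make the strict `SCORE: x/y` extraction concrete, here is a hedged sketch. The regular expression, the function name `extract_score`, and the reading of "scaled fallback" as "rescale proportionally when the model answers out of a different maximum than the rubric's" are all assumptions, not the package's documented behaviour.

```python
import re
from typing import Optional

# Assumption: the score sits on a line of its own, e.g. "SCORE: 7.5/10".
SCORE_RE = re.compile(
    r"^SCORE:\s*(\d+(?:\.\d+)?)\s*/\s*(\d+(?:\.\d+)?)\s*$",
    re.MULTILINE,
)


def extract_score(reply: str, max_score: float) -> Optional[float]:
    """Return the score on the rubric's scale, or None if no strict match (hypothetical helper)."""
    match = SCORE_RE.search(reply)
    if match is None:
        return None  # strict: never guess from free text
    awarded, out_of = float(match.group(1)), float(match.group(2))
    if out_of <= 0 or awarded > out_of:
        return None  # malformed score line
    if out_of == max_score:
        return awarded
    # Assumed fallback: the model used a different maximum, so rescale.
    return awarded * max_score / out_of


print(extract_score("Clear argument.\nSCORE: 7/10", 10))  # same scale
print(extract_score("SCORE: 8/10", 20))                   # rescaled to the rubric
print(extract_score("roughly 7 out of 10", 10))           # no strict match
```

Returning `None` instead of guessing keeps failed extractions visible in the results, so they are not silently mixed into the consistency statistics.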

## Development

```bash
pytest -v
```

## License

MIT — see [LICENSE](LICENSE).
