Metadata-Version: 2.4
Name: itemwise
Version: 0.1.0
Summary: LLM-based evaluation of multiple-choice items against item-writing guidelines
Keywords: mcq,multiple-choice,item-writing,llm,educational-measurement
Author: mathbullet
Author-email: mathbullet <mathbullet.compling@gmail.com>
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education :: Testing
Classifier: Typing :: Typed
Requires-Dist: litellm>=1.81.16,!=1.82.7,!=1.82.8
Requires-Dist: pydantic>=2.0
Requires-Dist: tqdm>=4.67.3
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/kikagaku/itemwise
Project-URL: Repository, https://github.com/kikagaku/itemwise
Project-URL: Issues, https://github.com/kikagaku/itemwise/issues
Description-Content-Type: text/markdown

# itemwise

[![CI](https://github.com/kikagaku/itemwise/actions/workflows/ci.yml/badge.svg)](https://github.com/kikagaku/itemwise/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)

LLM-based evaluation of multiple-choice items against item-writing guidelines.

itemwise evaluates the quality of multiple-choice questions (MCQs) against the 43 item-writing rules from Haladyna & Downing (1989), using any LLM provider supported by [litellm](https://docs.litellm.ai/).

## Installation

```bash
pip install git+https://github.com/kikagaku/itemwise.git
```

Or with [uv](https://docs.astral.sh/uv/):

```bash
uv add git+https://github.com/kikagaku/itemwise.git
```

Requires Python 3.12+.

## Quick Start

```python
from itemwise import evaluate

result = evaluate(
    item={
        "stem": "Which of the following is NOT a characteristic of mammals?",
        "options": [
            "They are warm-blooded",
            "They lay eggs",
            "They have hair or fur",
            "They produce milk",
        ],
        "correct": 1,
    },
    model="azure/gpt-5.1-chat",
)

print(result.score)       # e.g. 0.95 (fraction of rules passed)
print(result.violations)  # [RuleResult(rule_id=22, ...)]
```

## Features

- Evaluate MCQs against 41 item-writing rules (43 total, 2 batch-level rules excluded by default)
- Sync and async API (`evaluate`, `async_evaluate`)
- Batch evaluation with tqdm progress bar (`evaluate_batch`, `async_evaluate_batch`)
- Structured JSON output via `response_format` for reliable LLM responses
- Token usage and cost tracking (`UsageInfo`)
- Automatic retry on JSON parse failures
- Any LLM provider supported through litellm
- CLI with flexible parameter passthrough

## Usage

### Library API

```python
from itemwise import evaluate, evaluate_batch, async_evaluate_batch

# Single item evaluation (default: 41 rules)
result = evaluate(item=item, model="azure/gpt-5.1-chat")

# Select specific rules by ID
result = evaluate(item=item, model="azure/gpt-5.1-chat", rules=[22, 28, 37])

# Batch evaluation with progress bar (`items` is a list of item dicts)
results = evaluate_batch(items=items, model="azure/gpt-5.1-chat")

# Disable progress bar
results = evaluate_batch(items=items, model="azure/gpt-5.1-chat", progress=False)

# Async batch evaluation (parallel LLM calls); `await` must run inside a
# coroutine, e.g. one started with asyncio.run()
results = await async_evaluate_batch(items=items, model="azure/gpt-5.1-chat")

# Pass any LLM parameters through to litellm
result = evaluate(item=item, model="azure/gpt-5.1-chat", reasoning_effort="low")
```

### Token Usage and Cost

```python
result = evaluate(item=item, model="azure/gpt-5.1-chat")

print(result.usage.prompt_tokens)      # 304
print(result.usage.completion_tokens)  # 226
print(result.usage.total_tokens)       # 530
print(result.usage.cost)               # 0.00264
```
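
For a batch, per-item usage can be summed into a run total. The sketch below assumes each result exposes a `usage` object with the four fields shown above. `total_usage` is a hypothetical helper, not part of itemwise, and `SimpleNamespace` objects stand in for real results.

```python
from types import SimpleNamespace


def total_usage(results):
    """Sum token counts and cost over an iterable of results.

    Assumes each result has a `usage` attribute with `prompt_tokens`,
    `completion_tokens`, `total_tokens`, and `cost`.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0, "cost": 0.0}
    for result in results:
        for field in totals:
            value = getattr(result.usage, field)
            # Assumption: a field may be None (e.g. unknown pricing); count it as zero.
            totals[field] += value or 0
    return totals


# Stand-in results for illustration; real ones would come from evaluate_batch.
fake_results = [
    SimpleNamespace(usage=SimpleNamespace(
        prompt_tokens=304, completion_tokens=226, total_tokens=530, cost=0.00264)),
    SimpleNamespace(usage=SimpleNamespace(
        prompt_tokens=300, completion_tokens=200, total_tokens=500, cost=0.0025)),
]
print(total_usage(fake_results)["total_tokens"])  # 1030
```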

### CLI

```bash
# Evaluate items from a JSON file
itemwise evaluate questions.json --model azure/gpt-5.1-chat

# Select specific rules
itemwise evaluate questions.json --model azure/gpt-5.1-chat --rules 22,28,37

# Pass LLM parameters
itemwise evaluate questions.json --model azure/gpt-5.1-chat --param reasoning_effort=low

# Show version
itemwise --version
```

Input JSON format:

```json
[
  {
    "stem": "Question text",
    "options": ["Option A", "Option B", "Option C", "Option D"],
    "correct": 0
  }
]
```
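
Before spending tokens, it can help to check an input file against this shape. The following is a minimal sketch using only the standard library. `validate_items` is a hypothetical helper (not part of itemwise), and it assumes `correct` is a 0-based index into `options`, as the Quick Start example suggests.

```python
import json


def validate_items(data):
    """Return a list of problems found in parsed input JSON (empty if none)."""
    if not isinstance(data, list):
        return ["top-level value must be a list of items"]
    problems = []
    for i, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"item {i}: must be an object")
            continue
        stem = item.get("stem")
        options = item.get("options")
        correct = item.get("correct")
        if not isinstance(stem, str) or not stem.strip():
            problems.append(f"item {i}: 'stem' must be a non-empty string")
        if not isinstance(options, list) or not all(isinstance(o, str) for o in options):
            problems.append(f"item {i}: 'options' must be a list of strings")
            continue
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(correct, int) or isinstance(correct, bool):
            problems.append(f"item {i}: 'correct' must be an integer")
        elif not 0 <= correct < len(options):
            problems.append(f"item {i}: 'correct' is out of range for 'options'")
    return problems


raw = '[{"stem": "Question text", "options": ["Option A", "Option B"], "correct": 0}]'
print(validate_items(json.loads(raw)))  # []
```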

### LLM Configuration

All LLM calls go through [litellm](https://docs.litellm.ai/), so model names and parameters follow litellm conventions.

For Azure OpenAI, set the following environment variables:

```bash
export AZURE_API_KEY=your-key
export AZURE_API_BASE=https://your-resource.cognitiveservices.azure.com/
export AZURE_API_VERSION=2024-12-01-preview
```

See the [litellm documentation](https://docs.litellm.ai/docs/providers) for other providers (OpenAI, Anthropic, Google, etc.).
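
A missing variable may only surface as an error on the first LLM call, so a quick pre-flight check can save a failed run. The sketch below covers the three Azure variables listed above; `missing_azure_env` is a hypothetical helper, not part of itemwise or litellm.

```python
import os

REQUIRED_AZURE_VARS = ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION")


def missing_azure_env(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_AZURE_VARS if not env.get(name)]


missing = missing_azure_env()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```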

## Item-Writing Rules

itemwise evaluates MCQs against the 43 rules from Haladyna & Downing (1989), organized into 6 categories:

| Category | Rules | Description |
|---|---|---|
| General (Procedural) | 1-7 | Item format, grammar, readability |
| General (Content) | 8-17 | Educational objectives, vocabulary level, higher-order thinking |
| Stem Construction | 18-23 | Stem clarity, positive wording, central idea placement |
| General Option | 24-35 | Option count, order, homogeneity, length consistency |
| Correct Option | 36-37 | Answer position distribution, uniqueness |
| Distractor | 38-43 | Plausibility, common errors, avoiding humor |

Rules 11 (item independence) and 36 (correct answer position distribution) require cross-item analysis and are excluded from the default evaluation. They can be included explicitly via the `rules` parameter, but results for these two rules have limited accuracy when an item is evaluated in isolation.
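
The category ranges and default exclusions can be written down as a small lookup. This is an illustrative sketch built only from the table above; `CATEGORIES`, `BATCH_LEVEL_RULES`, `ALL_RULES`, `DEFAULT_RULES`, and `category_of` are hypothetical names and do not reflect itemwise's internals.

```python
# Rule ID ranges per category, taken from the table above.
CATEGORIES = {
    "General (Procedural)": range(1, 8),
    "General (Content)": range(8, 18),
    "Stem Construction": range(18, 24),
    "General Option": range(24, 36),
    "Correct Option": range(36, 38),
    "Distractor": range(38, 44),
}

# Rules that need cross-item analysis and are skipped by default.
BATCH_LEVEL_RULES = {11, 36}

ALL_RULES = [rule for ids in CATEGORIES.values() for rule in ids]
DEFAULT_RULES = [rule for rule in ALL_RULES if rule not in BATCH_LEVEL_RULES]


def category_of(rule_id):
    """Return the category name for a rule ID, or raise ValueError."""
    for name, ids in CATEGORIES.items():
        if rule_id in ids:
            return name
    raise ValueError(f"unknown rule ID: {rule_id}")


print(len(ALL_RULES), len(DEFAULT_RULES))  # 43 41
print(category_of(22))  # Stem Construction
```

A list such as `ALL_RULES` could then be passed as `rules=` to include the two batch-level rules explicitly.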

## References

- Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. *Applied Measurement in Education*, 2(1), 37-50.
- Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. *Applied Measurement in Education*, 15(3), 309-333.

## License

MIT
