Metadata-Version: 2.4
Name: itemwise
Version: 0.1.1
Summary: LLM-based evaluation of multiple-choice items against item-writing guidelines
Keywords: mcq,multiple-choice,item-writing,llm,educational-measurement
Author: mathbullet
Author-email: mathbullet <mathbullet.compling@gmail.com>
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Education :: Testing
Classifier: Typing :: Typed
Requires-Dist: litellm>=1.81.16,!=1.82.7,!=1.82.8
Requires-Dist: pydantic>=2.0
Requires-Dist: tqdm>=4.67.3
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/kikagaku/itemwise
Project-URL: Repository, https://github.com/kikagaku/itemwise
Project-URL: Issues, https://github.com/kikagaku/itemwise/issues
Description-Content-Type: text/markdown

# itemwise

[![PyPI](https://img.shields.io/pypi/v/itemwise.svg)](https://pypi.org/project/itemwise/)
[![CI](https://github.com/kikagaku/itemwise/actions/workflows/ci.yml/badge.svg)](https://github.com/kikagaku/itemwise/actions/workflows/ci.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)

Evaluates multiple-choice items with an LLM against the 43 item-writing rules from Haladyna & Downing (1989). Works with any LLM provider supported by [litellm](https://docs.litellm.ai/).

## Installation

```bash
pip install itemwise
```

Requires Python 3.12+.

## Quick Start

```python
from itemwise import evaluate

result = evaluate(
    item={
        "stem": "Which of the following is NOT a characteristic of mammals?",
        "options": [
            "They are warm-blooded",
            "They lay eggs",
            "They have hair or fur",
            "They produce milk",
        ],
        "correct": 1,
    },
    model="azure/gpt-5.1-chat",
)

print(result.score)            # fraction of rules passed
print(result.violations)       # list of RuleResult objects for the failed rules
print(result.usage.cost)       # LLM cost in USD
```

## Usage

```python
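import asyncio

from itemwise import evaluate, evaluate_batch, async_evaluate_batch

# Example inputs, using the same item shape as in Quick Start
item = {
    "stem": "Which planet is closest to the Sun?",
    "options": ["Venus", "Mercury", "Earth", "Mars"],
    "correct": 1,
}
items = [item]

# Evaluate only selected rules
evaluate(item=item, model="azure/gpt-5.1-chat", rules=[22, 28, 37])

# Batch with progress bar (disable via progress=False)
evaluate_batch(items=items, model="azure/gpt-5.1-chat")

# Async / parallel; inside an already running event loop, use `await` instead
asyncio.run(async_evaluate_batch(items=items, model="azure/gpt-5.1-chat"))

# Extra keyword arguments are forwarded to litellm
evaluate(item=item, model="azure/gpt-5.1-chat", reasoning_effort="low")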
```

### CLI

```bash
itemwise evaluate questions.json --model azure/gpt-5.1-chat
itemwise evaluate questions.json --model azure/gpt-5.1-chat --rules 22,28,37 --param reasoning_effort=low
```

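The input file is a JSON array of items. As in the Quick Start example, `correct` is the zero-based index of the correct option: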

```json
[{"stem": "...", "options": ["A", "B", "C", "D"], "correct": 0}]
```
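
A minimal sketch of writing such a file from Python with only the standard library. The helper `validate_item` and the checks it performs are hypothetical and not part of itemwise; it treats `correct` as a zero-based index, matching the Quick Start example.

```python
import json
import tempfile
from pathlib import Path


def validate_item(item: dict) -> None:
    """Hypothetical pre-flight check (not an itemwise function)."""
    stem = item.get("stem")
    if not isinstance(stem, str) or not stem.strip():
        raise ValueError("stem must be a non-empty string")
    options = item.get("options")
    if not isinstance(options, list) or len(options) < 2:
        raise ValueError("options must be a list with at least two entries")
    correct = item.get("correct")
    # bool is a subclass of int, so reject it explicitly
    if isinstance(correct, bool) or not isinstance(correct, int):
        raise ValueError("correct must be an integer index")
    if not 0 <= correct < len(options):
        raise ValueError("correct must be a valid index into options")


items = [
    {
        "stem": "Which of the following is NOT a characteristic of mammals?",
        "options": [
            "They are warm-blooded",
            "They lay eggs",
            "They have hair or fur",
            "They produce milk",
        ],
        "correct": 1,
    }
]

for item in items:
    validate_item(item)

# Write to a temporary directory; any writable path would do
path = Path(tempfile.mkdtemp()) / "questions.json"
path.write_text(json.dumps(items, indent=2), encoding="utf-8")
print(path)
```

The printed path can then be passed to `itemwise evaluate` in place of `questions.json`.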

### LLM Configuration

Model names and parameters follow [litellm](https://docs.litellm.ai/docs/providers) conventions. For Azure OpenAI, set these environment variables:

```bash
export AZURE_API_KEY=...
export AZURE_API_BASE=https://your-resource.cognitiveservices.azure.com/
export AZURE_API_VERSION=2024-12-01-preview
```
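
Before starting a long batch, it can help to check that these variables are set. The following is a small sketch using only the standard library; `missing_env_vars` is a hypothetical helper, not part of itemwise or litellm, and the list of names simply mirrors the three exports above.

```python
import os

# The three variables from the Azure OpenAI example above
REQUIRED_AZURE_VARS = ("AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION")


def missing_env_vars(names=REQUIRED_AZURE_VARS, environ=None) -> list[str]:
    """Return the names that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Azure OpenAI environment looks complete")
```

Other providers use different variable names; see the litellm provider documentation for the ones that apply.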

## Item-Writing Rules

43 rules from Haladyna & Downing (1989) across 6 categories:

| Category | Rules | Description |
|---|---|---|
| General (Procedural) | 1-7 | Format, grammar, readability |
| General (Content) | 8-17 | Objectives, vocabulary, higher-order thinking |
| Stem Construction | 18-23 | Clarity, positive wording |
| General Option | 24-35 | Count, order, homogeneity, length |
| Correct Option | 36-37 | Position distribution, uniqueness |
| Distractor | 38-43 | Plausibility, common errors |

Rules 11 (item independence) and 36 (answer position distribution) require cross-item analysis and are excluded by default. Pass them explicitly via `rules=[11, 36]` to include them.
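
The table and the default exclusions can be expressed as plain Python data, which is handy for building `rules` lists. This is an illustrative sketch only: `CATEGORY_RANGES`, `CROSS_ITEM_RULES`, and `category_of` are hypothetical names, not itemwise exports, and the ranges are copied from the table above.

```python
# Rule number ranges per category, copied from the table above
CATEGORY_RANGES = {
    "General (Procedural)": range(1, 8),
    "General (Content)": range(8, 18),
    "Stem Construction": range(18, 24),
    "General Option": range(24, 36),
    "Correct Option": range(36, 38),
    "Distractor": range(38, 44),
}

# Rules that need cross-item analysis and are excluded by default
CROSS_ITEM_RULES = {11, 36}


def category_of(rule: int) -> str:
    """Look up the category name for a rule number (hypothetical helper)."""
    for name, numbers in CATEGORY_RANGES.items():
        if rule in numbers:
            return name
    raise ValueError(f"unknown rule number: {rule}")


all_rules = list(range(1, 44))
single_item_rules = [r for r in all_rules if r not in CROSS_ITEM_RULES]
stem_rules = list(CATEGORY_RANGES["Stem Construction"])

print(len(all_rules), len(single_item_rules))
print(category_of(22), "|", category_of(28), "|", category_of(37))
```

Assuming `rules` accepts any list of rule numbers, a list such as `stem_rules` could be passed as `rules=stem_rules` to restrict evaluation to one category.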

## References

- Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. *Applied Measurement in Education*, 2(1), 37-50.
- Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. *Applied Measurement in Education*, 15(3), 309-333.

## License

MIT
