Metadata-Version: 2.4
Name: inspect-test-utils
Version: 1.1.1
Summary: Simple models and tasks for integration testing
Project-URL: Repository, https://github.com/METR/inspect-test-utils
Project-URL: Issues, https://github.com/METR/inspect-test-utils/issues
Author-email: METR <rasmus.faber-espensen@metr.org>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Requires-Dist: inspect-ai
Requires-Dist: inspect-scout
Requires-Dist: pytest>=8.0
Provides-Extra: dev
Requires-Dist: basedpyright; extra == 'dev'
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-asyncio; extra == 'dev'
Requires-Dist: pytest-mock; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Requires-Dist: types-aiofiles; extra == 'dev'
Requires-Dist: types-pyyaml; extra == 'dev'
Requires-Dist: types-requests; extra == 'dev'
Description-Content-Type: text/markdown

inspect-test-utils

A small collection of tasks, scorers, and a simple model for use with the Inspect AI framework. It is designed to support integration/acceptance tests, demos, and reproductions by providing:
- Ready-made Tasks that exercise common evaluation patterns (simple generation, numeric closeness, failure injection, and sandbox configuration).
- Scorers for deterministic or parameterized scoring (including hardcoded outputs and a logarithmic closeness score).
- A hardcoded ModelAPI implementation that can deterministically emit tool calls and/or final answers, useful for testing tool-calling flows without hitting external APIs.

Passing arguments
Most tasks and the hardcoded model accept parameters. With the Inspect CLI you can pass them via --task-arg and --model-arg repeatedly:
- Example: make the task generate 3 samples and set a numeric target for guessing:
  inspect eval inspect_test_utils/guess_number \
    --task-arg sample_count=3 \
    --task-arg target=42.7 \
    --model hardcoded --model-arg answer=42.6

What’s included
- Tasks (inspect_test_utils.tasks)
  - say_hello(sample_count=1): Simple task; expects a response that includes "hello".
  - guess_number(sample_count=1, target="42.7"): Uses a logarithmic closeness scorer for numeric answers.
  - hardcoded_score(sample_count=10, hardcoded_score=None, hardcoded_score_by_sample_id_and_epoch=None): Scores are injected from parameters; useful for testing aggregations and edge cases (including NaN).
  - sometimes_fails_setup(sample_count=10, fail_setup_on_epochs=None, failure_rate=0.2): Randomly raises during setup via a failing solver; useful to test retry/resume behavior.
  - sometimes_fails_scoring(sample_count=10, fail_score_on_epochs=None, failure_rate=0.2): Randomly raises during scoring; useful to test scorer error handling.
  - configurable_sandbox(sample_count=1, cpu=0.5, memory="2G", storage="2G", gpu=None, gpu_model=None, allow_internet=False): A task with runtime configurable sandbox. 

- Scorers (inspect_test_utils.scorers)
  - failing_scorer(fail_on_epochs=None, failure_rate=0.2): Raises errors at a controlled rate for selected epochs.
  - closeness_log(): Scores 1.0 for exact equality, otherwise 1/(1+log1p(relative_error)) for numeric strings.
  - hardcoded_scorer(hardcoded_score=None, hardcoded_score_by_sample_id_and_epoch=None): Returns pre-specified Score objects or looks them up by sample id and epoch.

- Model (inspect_test_utils.hardcoded)
  - hardcoded: A ModelAPI that can emit a sequence of tool calls (e.g., bash) for a number of repetitions and then submit a final answer.
    Parameters include:
    - answer: final answer string (default: "done").
    - repetitions: how many tool-call "turns" before submitting.
    - tool_calls: list of tool calls or shell strings (e.g., ["echo hi", "ls -la"]) to simulate; defaults to none.
    - delay: optional delay (seconds) before returning each model output.

## Installation

```bash
pip install inspect-test-utils
```

## License

MIT. See [LICENSE](LICENSE).
