Metadata-Version: 2.4
Name: model-failure-lab
Version: 0.1.0
Summary: Local-first evaluation and failure-analysis toolkit for LLM and RAG systems.
Author: Model Failure Lab contributors
License-Expression: MIT
Project-URL: Homepage, https://github.com/Padraigobrien08/model-failure-lab
Project-URL: Repository, https://github.com/Padraigobrien08/model-failure-lab
Project-URL: Issues, https://github.com/Padraigobrien08/model-failure-lab/issues
Keywords: llm,rag,evaluation,regression-testing,prompt-testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML
Provides-Extra: anthropic
Requires-Dist: anthropic; extra == "anthropic"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: legacy
Requires-Dist: matplotlib; extra == "legacy"
Requires-Dist: pandas; extra == "legacy"
Requires-Dist: pyarrow; extra == "legacy"
Requires-Dist: scikit-learn; extra == "legacy"
Requires-Dist: torch; extra == "legacy"
Requires-Dist: transformers; extra == "legacy"
Requires-Dist: wilds; extra == "legacy"
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: ui
Requires-Dist: streamlit; extra == "ui"
Dynamic: license-file

# Model Failure Lab

![Python](https://img.shields.io/badge/python-3.11%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)

Model Failure Lab is a local-first evaluation and failure-analysis toolkit for LLM and RAG systems.
It helps teams run prompt datasets, classify failures, compare model versions, and turn regressions
into reusable test cases.

## What It Is

Model Failure Lab focuses on one production loop:

`failure -> report -> compare -> harvest -> promote -> rerun`

The value lies not only in executing evals but in preserving a deterministic artifact history, so
teams can turn regressions into durable datasets and governance decisions.

## Quickstart

Use Python 3.11 or newer.

From a local clone:

```bash
git clone <repo-url>
cd model-failure-lab
make install
make demo
```

Useful command shortcuts:

```bash
make help
make check
make smoke
```

Equivalent direct install command:

```bash
python3 -m pip install .
```

Then run the canonical workflow manually:

```bash
failure-lab run --dataset reasoning-failures-v1 --model demo
failure-lab report --run <run-id>
failure-lab run --dataset reasoning-failures-v1 --model ollama:llama3.2
failure-lab compare <baseline-run-id> <candidate-run-id>
failure-lab harvest --comparison <comparison-id> --delta regression --out datasets/harvested/regression-pack.json
failure-lab dataset promote datasets/harvested/regression-pack.json --dataset-id reasoning-regressions-v1
failure-lab run --dataset reasoning-regressions-v1 --model demo
```

If your shell does not expose the console script on `PATH`, use:

```bash
python3 -m model_failure_lab demo
```
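For scripting the loop, the canonical workflow above can be expressed as a list of argument vectors. The following is a minimal sketch, not part of the package: `workflow_commands` and the example IDs (`run-a`, `run-b`, `cmp-1`) are hypothetical, and only the subcommands and flags shown in the quickstart are used. Nothing is executed; the commands are only printed.

```python
import shlex


def workflow_commands(dataset, candidate_model, baseline_run, candidate_run,
                      comparison_id, pack_path, promoted_id):
    """Return the quickstart loop as a list of argv lists (hypothetical helper)."""
    return [
        ["failure-lab", "run", "--dataset", dataset, "--model", "demo"],
        ["failure-lab", "report", "--run", baseline_run],
        ["failure-lab", "run", "--dataset", dataset, "--model", candidate_model],
        ["failure-lab", "compare", baseline_run, candidate_run],
        ["failure-lab", "harvest", "--comparison", comparison_id,
         "--delta", "regression", "--out", pack_path],
        ["failure-lab", "dataset", "promote", pack_path, "--dataset-id", promoted_id],
        ["failure-lab", "run", "--dataset", promoted_id, "--model", "demo"],
    ]


# Placeholder IDs: real run and comparison IDs come from the CLI output.
commands = workflow_commands(
    dataset="reasoning-failures-v1",
    candidate_model="ollama:llama3.2",
    baseline_run="run-a",
    candidate_run="run-b",
    comparison_id="cmp-1",
    pack_path="datasets/harvested/regression-pack.json",
    promoted_id="reasoning-regressions-v1",
)
for argv in commands:
    print(shlex.join(argv))
```

Each argv list can be passed to `subprocess.run(argv, check=True)` once the placeholder IDs are replaced with the IDs printed by the earlier steps.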

## Example Output

Prompt case:

```text
"What is 37 * 48?"
```

Run result:

- model output: incorrect
- failure type: reasoning_error
- classification confidence: high

Comparison summary:

- regression rate: +12%
- new failure clusters: arithmetic carry errors

CLI transcript (abbreviated):

```text
$ failure-lab run --dataset reasoning-failures-v1 --model demo
Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
Run ID: 20260427_192110_266368_reasoning_failures_v1_demo_...

$ failure-lab report --run 20260427_192110_266368_reasoning_failures_v1_demo_...
Failure Lab Report
Status: completed
Failure types: reasoning=62.5% (5)

$ failure-lab compare <baseline-run-id> <candidate-run-id>
Failure Lab Compare
Status: improved
Compatible: True
Case changes: improvements=1
```
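If you need the headline numbers from a run in a script, the summary lines can be parsed with a couple of regular expressions. This is a sketch that assumes the text layout shown in the abbreviated transcript above; `parse_run_summary` is a hypothetical helper, and the persisted artifacts are likely a more stable source than console text.

```python
import re


def parse_run_summary(text):
    """Extract case counts and failure rate from `failure-lab run` console output."""
    summary = {}
    cases = re.search(
        r"^Cases: attempted=(\d+) classified=(\d+) errors=(\d+)$", text, re.M
    )
    if cases:
        attempted, classified, errors = map(int, cases.groups())
        summary.update(attempted=attempted, classified=classified, errors=errors)
    rate = re.search(r"^Failure rate: ([\d.]+)%$", text, re.M)
    if rate:
        # Store the rate as a fraction between 0 and 1.
        summary["failure_rate"] = float(rate.group(1)) / 100
    return summary


# Sample copied from the transcript above.
sample = """\
Failure Lab Run
Dataset: reasoning-failures-v1
Model: demo
Status: completed
Cases: attempted=8 classified=8 errors=0
Failure rate: 62.5%
"""
summary = parse_run_summary(sample)
print(summary)
```

In the sample, a 62.5% failure rate over 8 attempted cases corresponds to 5 failing cases, which matches the `reasoning=62.5% (5)` line in the report.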

## Screenshots

Screenshots are supported and strongly recommended for product clarity.

Place assets under `docs/screens/`:

- `run-summary.png`
- `failure-inventory.png`
- `comparison-view.png`
- `harvest-replay-workflow.gif`

When those files exist, embed them with:

```markdown
![Run summary](docs/screens/run-summary.png)
![Failure inventory](docs/screens/failure-inventory.png)
![Comparison view](docs/screens/comparison-view.png)
![Harvest replay](docs/screens/harvest-replay-workflow.gif)
```

Wiring and naming conventions are documented in `docs/product-screens.md` and `docs/screens/README.md`.

## Core Workflow

`failure-lab` writes artifact folders under the active root (default: current working directory):

- `datasets/`
- `runs/`
- `reports/`

Comparison outputs are persisted as report artifacts under `reports/`.

Use `--root` on commands to target a specific workspace.

For detailed artifact contracts and examples, see `docs/artifact-model.md`.
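To check which artifact folders exist in a workspace, a short script can inspect the root directly. This sketch assumes only the three folder names listed above; `artifact_dirs` is a hypothetical helper and says nothing about the files inside each folder.

```python
from pathlib import Path

# Folder names taken from the list above.
ARTIFACT_DIRS = ("datasets", "runs", "reports")


def artifact_dirs(root="."):
    """Map each artifact folder name to its path under `root`, or None if absent."""
    root = Path(root)
    return {
        name: (root / name if (root / name).is_dir() else None)
        for name in ARTIFACT_DIRS
    }


# The default mirrors the CLI default root (the current working directory).
for name, path in artifact_dirs(".").items():
    print(f"{name}: {path if path else 'missing'}")
```

Pass the same directory you give to `--root` to inspect a non-default workspace.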

## Model Adapters

`failure-lab run --model` supports:

- `demo` for deterministic local execution
- `customer-support-failures-v1` bundled flagship support-policy pack
- `ollama:<model>`
- `anthropic:<model>` (after installing optional dependencies)
- OpenAI model names (after installing optional dependencies)
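The `--model` values above fall into two shapes: prefixed (`ollama:<model>`, `anthropic:<model>`) and bare names. The sketch below illustrates how such a value could be split. It is an assumption about the string format only, not the toolkit's actual routing logic, and `split_model_spec` is a hypothetical name.

```python
# Prefixes taken from the list above; everything else is treated as a bare name.
KNOWN_PREFIXES = ("ollama", "anthropic")


def split_model_spec(spec):
    """Split a --model value into (prefix, name); prefix is None for bare names."""
    prefix, sep, name = spec.partition(":")
    if sep and prefix in KNOWN_PREFIXES and name:
        return prefix, name
    return None, spec


for spec in ("demo", "ollama:llama3.2", "anthropic:<model>"):
    print(spec, "->", split_model_spec(spec))
```

Splitting on the first colon only keeps model names that themselves contain a colon intact after the prefix.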

Optional extras:

- `python3 -m pip install '.[anthropic]'`
- `python3 -m pip install '.[openai]'`
- `python3 -m pip install '.[dev]'`
- `python3 -m pip install '.[legacy]'` (legacy-only surfaces)
- `python3 -m pip install '.[ui]'` (legacy Streamlit UI)

Once a distribution is published, the equivalent install targets will be
`model-failure-lab[anthropic]`, `model-failure-lab[openai]`, `model-failure-lab[legacy]`,
and `model-failure-lab[ui]`.
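Before selecting an adapter, it can help to confirm that the matching optional dependency is importable. This is a minimal sketch using the standard library's `importlib.util.find_spec`. The extra-to-module mapping follows the package metadata (`anthropic`, `openai`, `streamlit`), while `missing_extras` itself is a hypothetical helper.

```python
from importlib.util import find_spec

# Extra name -> top-level module installed by that extra.
EXTRA_MODULES = {"anthropic": "anthropic", "openai": "openai", "ui": "streamlit"}


def missing_extras(wanted, modules=EXTRA_MODULES):
    """Return the extras from `wanted` whose module cannot be found."""
    return [extra for extra in wanted if find_spec(modules[extra]) is None]


missing = missing_extras(["anthropic", "openai"])
if missing:
    print("Missing extras:", ", ".join(missing))
else:
    print("All requested extras are installed.")
```

`find_spec` only locates the module without importing it, so the check stays cheap even for heavy packages.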

## React Debugger

The React debugger reads an existing artifact workspace from the `FAILURE_LAB_ARTIFACT_ROOT`
environment variable.

Example:

```bash
export FAILURE_LAB_ARTIFACT_ROOT=/path/to/failure-lab-workspace
npm --prefix frontend run dev
```

## Development

```bash
make install-dev
make check
```

## Versioning

Until `v1.0`, this project applies semantic versioning pragmatically:

- patch: bug fixes and docs
- minor: CLI-compatible feature additions
- breaking: CLI or artifact schema changes

## Legacy Surfaces

Legacy surfaces are retained for reference only and are not part of the supported production
workflow.

See:

- `docs/legacy.md`
- `docs/ui_parity.md`
- `docs/v1_4_closeout.md`

## Documentation

Detailed documentation lives outside this README:

- Harvest replay: `docs/harvest-replay.md`
- Legacy surfaces: `docs/legacy.md`
- Fixture workspace: `docs/fixture-workspace.md`
- Artifact schema/model: `docs/artifact-model.md`
- Adapter extension guide: `docs/adapter-extension-guide.md`
- Architecture overview: `docs/architecture.md`
- CI governance and waivers: `docs/ci-governance.md`
- Contributor code map: `docs/code-map.md`
- 5-minute operator quickstart: `docs/getting-started-operator.md`
- Release and PyPI guide: `docs/release-and-pypi.md`

## License

This project is licensed under the MIT License. See `LICENSE`.
