Metadata-Version: 2.4
Name: touchstone-platform
Version: 1.0.2
Summary: Touchstone — the AI Adoption Lifecycle platform. AI with receipts: forge it in the Crucible, ship what proves.
Project-URL: Homepage, https://github.com/yadavilli-solutions/touchstone
Project-URL: Documentation, https://github.com/yadavilli-solutions/touchstone/blob/main/architecture.md
Project-URL: Repository, https://github.com/yadavilli-solutions/touchstone
Project-URL: Issues, https://github.com/yadavilli-solutions/touchstone/issues
Project-URL: Changelog, https://github.com/yadavilli-solutions/touchstone/blob/main/docs/WAVE_LOG.md
Project-URL: Migration, https://github.com/yadavilli-solutions/touchstone/blob/main/MIGRATION_TO_TOUCHSTONE.md
Author: Touchstone contributors
License: # Touchstone License (formerly Model Gym License)
        
        **Version 2.0** — effective 2026-05-18.
        Supersedes the Model Gym Educational Use License v1 (2026).
        Platform renamed from Model Gym to Touchstone (Crucible benchmark engine) in v1.0.0.
        
        Copyright © 2026 Yadavilli Solutions. All Rights Reserved.
        
        > **Plain-English summary** (not part of the legal terms):
        > You can read, study, learn from, and run this code for evaluation
        > and non-commercial purposes for free. Production use, customer
        > delivery, hosting it for someone else, or making money with it
        > requires a separate commercial license — see [`COMMERCIAL.md`](COMMERCIAL.md).
        
        ---
        
        ## 1. Definitions
        
        In this License:
        
        - **"Software"** means the contents of this repository, including source
          code, scripts, configuration files, documentation, schemas, prompts,
          example data, generated artifacts, and any updates or derivative
          versions distributed by Yadavilli Solutions.
        - **"Licensor"** means Yadavilli Solutions, the sole copyright holder.
        - **"You"** means the individual or legal entity exercising rights
          under this License.
        - **"Permitted Use"** means use under Section 2 only.
        - **"Commercial Use"** means any use described in Section 3.
        - **"Derivative Work"** means any work that incorporates, modifies,
          translates, or is based on the Software, in whole or in part.
        - **"Production"** means any use that processes real data, real
          customer requests, real model evaluations for record, or any other
          use whose output is relied upon outside a learning or evaluation
          context.
        
        ## 2. Grant — Permitted Use (no fee)
        
        Subject to your compliance with this License, the Licensor grants You
        a worldwide, royalty-free, non-exclusive, non-transferable license to
        use, copy, modify, and create Derivative Works of the Software, solely
        for the following purposes:
        
        1. **Personal learning, study, and research.**
        2. **Teaching, classroom, workshop, conference talk, or demonstration**
           provided the demonstration is not part of a paid engagement.
        3. **Non-commercial evaluation** to assess whether the Software fits
           Your needs prior to entering a commercial license.
        4. **Academic publication** that cites the Software and Licensor.
        5. **Personal or research forks** that preserve this LICENSE file,
           the copyright notice, and any attribution notices.
        
        Permitted Use does **not** include Production, redistribution to third
        parties, or any Commercial Use.
        
        ## 3. Commercial Use — separate license required
        
        A separate paid Commercial License Agreement from the Licensor is
        required before You may use the Software, any Derivative Work, or any
        substantial portion thereof for any of the following:
        
        1. **Production use** by a for-profit entity, government agency, or
           any organization with paying customers, members, or beneficiaries.
        2. **Hosted services or SaaS offerings** that expose the Software's
           functionality (in whole or part) to third parties, whether or not
           payment is involved.
        3. **Customer delivery, consulting, integration, or implementation
           work** that uses the Software or any Derivative Work to fulfill
           a paid engagement.
        4. **Internal production use** by any entity with annual gross revenue
           exceeding USD 5,000,000, regardless of whether the Software's
           functionality is exposed externally.
        5. **Bundling, sub-licensing, white-labeling, or OEM distribution** of
           the Software or any Derivative Work.
        6. **AI-evaluation-as-a-service, benchmarking-as-a-service,
           certification-as-a-service, or compliance-as-a-service offerings**
           that derive material value from the Software's leaderboards,
           metrics, scenarios, scoring, or evidence-bundle output.
        7. **Training, fine-tuning, or evaluation of models intended for
           commercial deployment** where the Software's metrics, scenarios, or
           leaderboards inform the model's release decision.
        
        To obtain a Commercial License, contact: **see [`COMMERCIAL.md`](COMMERCIAL.md)**.
        
        The Licensor reserves the right to negotiate commercial terms on a
        per-engagement basis, including tiered pricing, support obligations,
        field-of-use restrictions, and exclusivity. No commercial right is
        granted by silence, by use, or by any prior course of dealing — only
        by a fully executed Commercial License Agreement.
        
        ## 4. Restrictions
        
        Under both Permitted Use and Commercial Use, You may **not**:
        
        1. Remove or alter any copyright, license, trademark, attribution, or
           author notice in the Software or any Derivative Work.
        2. Use the names "Model Gym", "Yadavilli Solutions", or any related
           trademark, logo, or trade dress, except as required to identify
           the origin of the Software or to comply with Section 5.
        3. Misrepresent the origin of the Software or any Derivative Work.
        4. Distribute the Software or a Derivative Work under any license that
           purports to grant rights broader than those granted here, including
           any OSI-approved open-source license.
        5. Reverse engineer, decompile, or disassemble any binary distribution
           of the Software except to the extent expressly required by
           applicable law.
        6. Use the Software to develop a competing AI-evaluation, agent-
           benchmarking, or certification product or service offered to third
           parties, whether free or paid, without a prior written Commercial
           License Agreement that expressly permits such use.
        7. Use the Software in violation of any applicable export-control,
           sanctions, or anti-bribery law.
        8. Represent any output of the Software (including evaluation scores,
           certifications, evidence bundles, or leaderboard rankings) as
           independently audited, third-party verified, or regulator-approved
           unless You have a separate written attestation from the Licensor.
        
        ## 5. Attribution
        
        Any Permitted Use that publishes, demonstrates, or distributes the
        Software or a Derivative Work must:
        
        1. Preserve this `LICENSE.md` file verbatim.
        2. Preserve the copyright notice in each file that contains one.
        3. Include a clearly visible attribution: "Built on Model Gym
           (https://github.com/yadavilli-solutions/benchmark) © Yadavilli
           Solutions, licensed under the Model Gym License v2."
        
        ## 6. Trademarks
        
        "Model Gym", "Yadavilli Solutions", and any associated logos are
        trademarks of the Licensor. This License does not grant any rights to
        use those trademarks except for the limited, descriptive, attribution
        purpose required by Section 5. All other use requires prior written
        permission.
        
        ## 7. Patent Grant + Termination
        
        The Licensor grants You a non-exclusive, royalty-free patent license
        to make, use, sell, offer for sale, or import the Software solely as
        necessary to exercise the rights granted in Section 2 (Permitted Use)
        or under a separate Commercial License.
        
        If You initiate or knowingly participate in any patent litigation
        (including a cross-claim or counterclaim) alleging that the Software,
        or any contribution incorporated into the Software, constitutes direct
        or contributory patent infringement, all rights granted to You under
        this License terminate as of the date such litigation is filed.
        
        ## 8. Term and Termination
        
        This License is effective until terminated.
        
        The Licensor may terminate this License immediately upon written
        notice if You breach any term. Upon termination:
        
        1. All rights granted to You under this License cease.
        2. You must promptly cease all use, destroy all copies of the
           Software in Your possession or control, and certify such
           destruction to the Licensor on request.
        3. Sections 4 (Restrictions), 6 (Trademarks), 9 (No Warranty), 10
           (Limitation of Liability), 11 (Governing Law), and 12 (Severability)
           survive termination.
        
        A Commercial License Agreement may override the termination terms of
        this License with respect to its own subject matter.
        
        ## 9. No Warranty
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
        EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
        MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND
        NON-INFRINGEMENT.
        
        The Software outputs evaluation scores, compliance citations, safety
        ratings, and certification artifacts that **may not be relied upon for
        regulatory, legal, audit, clinical, or safety decisions** without
        independent verification.
        
        The Licensor makes no representation or warranty that the Software is
        free of defects, error-free, secure, fit for any particular purpose,
        or compliant with any specific regulation in any specific jurisdiction.
        
        ## 10. Limitation of Liability
        
        IN NO EVENT SHALL THE LICENSOR, ITS AFFILIATES, OR ANY CONTRIBUTOR BE
        LIABLE FOR ANY CLAIM, DAMAGES, OR OTHER LIABILITY, WHETHER IN AN
        ACTION OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, ARISING
        FROM, OUT OF, OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
        DEALINGS IN THE SOFTWARE.
        
        WITHOUT LIMITING THE FOREGOING, THE LICENSOR'S AGGREGATE LIABILITY TO
        YOU UNDER OR IN CONNECTION WITH THIS LICENSE (REGARDLESS OF CAUSE)
        SHALL NOT EXCEED THE GREATER OF (a) USD 100, OR (b) THE FEES PAID BY
        YOU TO THE LICENSOR UNDER ANY COMMERCIAL LICENSE IN THE TWELVE (12)
        MONTHS PRECEDING THE CLAIM.
        
        ## 11. Governing Law and Disputes
        
        This License is governed by the laws of the **[JURISDICTION TO BE SET
        BY LICENSOR]**, excluding its conflict-of-laws principles. The parties
        consent to the exclusive jurisdiction of the courts located in that
        jurisdiction for any dispute arising out of or relating to this
        License.
        
        The UN Convention on Contracts for the International Sale of Goods
        does not apply.
        
        ## 12. Severability
        
        If any provision of this License is held to be unenforceable, the
        remaining provisions remain in full force and effect, and the
        unenforceable provision shall be interpreted to give effect to its
        intent to the maximum extent permitted by law.
        
        ## 13. Entire Agreement
        
        This License constitutes the entire agreement between You and the
        Licensor with respect to the Software, except to the extent superseded
        by a Commercial License Agreement that expressly references and
        modifies this License. No oral statement, prior writing, or course of
        dealing shall modify the terms of this License.
        
        ---
        
        ## Owner notes (delete before publishing the license externally)
        
        This document is a starting template, not legal advice. Before relying
        on it for actual customer agreements, have counsel:
        
        - Fill in the **JURISDICTION** placeholder in Section 11.
        - Confirm the Section 3.4 revenue threshold (USD 5M) matches your
          pricing strategy.
        - Review Section 4.6 (no-compete clause) against the laws of any
          jurisdiction where contributors live — some restrict no-compete
          enforceability.
        - Decide whether to add an explicit network-use clause (à la AGPL §13)
          if You want to require Commercial Licenses for SaaS use even without
          redistribution.
        - Confirm Section 10's liability cap is acceptable to Your insurer.
        - Cross-check the trademark assertions in Section 6 with any actual
          trademark filings.
License-File: LICENSE.md
Keywords: ai,ai-act,benchmarking,compliance,gdpr,helm,hipaa,hti-1,inspect-ai,medhelm,rbac,swe-bench,tau-bench,wmdp
Classifier: Development Status :: 4 - Beta
Classifier: Framework :: FastAPI
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Information Technology
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: <3.14,>=3.13
Requires-Dist: alembic>=1.13.0
Requires-Dist: asyncpg>=0.30.0
Requires-Dist: cryptography>=42.0.0
Requires-Dist: datasets>=3.0.0
Requires-Dist: falkordb>=1.0.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: huggingface-hub>=0.30.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: psutil>=6.0.0
Requires-Dist: psycopg2-binary>=2.9.0
Requires-Dist: pydantic>=2.10.0
Requires-Dist: pyjwt[cryptography]>=2.8.0
Requires-Dist: pypdf>=5.1.0
Requires-Dist: python-json-logger>=2.0.0
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: reportlab>=4.5.1
Requires-Dist: slowapi>=0.1.9
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: uvicorn[standard]>=0.34.0
Provides-Extra: connectors
Requires-Dist: azure-storage-blob>=12.19.0; extra == 'connectors'
Requires-Dist: boto3>=1.34.0; extra == 'connectors'
Requires-Dist: boxsdk>=3.9.0; extra == 'connectors'
Requires-Dist: google-api-python-client>=2.110.0; extra == 'connectors'
Requires-Dist: google-auth-oauthlib>=1.2.0; extra == 'connectors'
Provides-Extra: deepeval
Requires-Dist: deepeval>=0.21.0; extra == 'deepeval'
Provides-Extra: dev
Requires-Dist: mypy>=1.10.0; extra == 'dev'
Requires-Dist: playwright>=1.59.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.6.0; extra == 'dev'
Provides-Extra: graph
Requires-Dist: falkordb>=1.0.0; extra == 'graph'
Provides-Extra: ml
Requires-Dist: accelerate>=1.2.0; extra == 'ml'
Requires-Dist: bert-score>=0.3.13; extra == 'ml'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'ml'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'ml'
Requires-Dist: sentence-transformers>=3.0.0; extra == 'ml'
Requires-Dist: spacy>=3.7.0; extra == 'ml'
Requires-Dist: torch>=2.5.0; extra == 'ml'
Requires-Dist: transformers>=4.48.0; extra == 'ml'
Provides-Extra: office
Requires-Dist: openpyxl>=3.1.0; extra == 'office'
Provides-Extra: swebench
Requires-Dist: swebench>=1.1.0; extra == 'swebench'
Description-Content-Type: text/markdown

# Touchstone

> *Touchstone — AI with receipts. Forge it in the Crucible. Ship what proves.*

**Touchstone** is the umbrella AI-certification platform. **Crucible** is its HELM-style
benchmark engine: multi-leaderboard AI benchmarking for enterprise —
legacy modernization, healthcare AI, frontier safety, agentic tool-use, code patch
generation, and per-vertical / per-regulation evaluation, all under one
FastAPI service with shared persistence, an HMAC-chained audit ledger,
evidence-bundle export, and an Inspect-AI-compatible eval-log exporter.

> **40 leaderboards. 19+ metrics. 780+ tests green.**
> Last sync: 2026-05-17 (Wave 9 Track A/B).

---

## What you can do today

> Source: [`docs/diagrams/request-flow.mmd`](docs/diagrams/request-flow.mmd) — render via `npx -y @mermaid-js/mermaid-cli -i docs/diagrams/request-flow.mmd -o request-flow.svg`

```mermaid
flowchart LR
    User["User / CI / dashboard"] --> API["FastAPI<br/>app/main.py"]

    API --> Core["Crucible<br/>HELM-style eval-core<br/>app/eval/"]
    API --> Legacy["Legacy<br/>GymSpec runtime<br/>app/engine.py"]

    subgraph Boards["40 leaderboards (auto-discovered)"]
      Modern["Modernization × 5<br/>classic / evidence / robustness<br/>agentic v1.2 / safety"]
      SWE["SWE-Bench Verified<br/>(struct + Docker exec)"]
      Med["MedHELM (10 RunSpecs)"]
      Vert["Per-vertical × 17<br/>HCLS · retail · fin-svc<br/>hi-tech · industry · edu-K12"]
      Comp["Compliance × 9<br/>HIPAA · PCI-DSS · SOX · GDPR<br/>CMS · HTI-1 · 3 state AI<br/>OWASP × 2 · MITRE ATLAS"]
      Safe["Frontier safety × 2<br/>WMDP · medical_safety<br/>+ agent_threat_safety"]
    end

    Core --> Boards

    Boards --> Snap["Postgres :25432<br/>leaderboard_snapshots"]
    Boards --> Traces["traces/<run_id>/*.json"]
    Boards --> Ledger["audit_ledger.jsonl<br/>(HMAC chained)"]

    Snap --> Bundle["Evidence bundle ZIP<br/>/evidence_bundle"]
    Traces --> Bundle
    Ledger --> Bundle

    Snap --> Inspect["Inspect AI EvalLog v2<br/>/inspect_log"]
    Traces --> Inspect
```

- Run any of 30 leaderboards against any model with hash-keyed request caching
- Stream HuggingFace datasets with `access_tier: public | gated | private`
  (set `HF_TOKEN` for gated, `MODELGYM_HF_OFFLINE=1` for airgap / CI)
- Score with 19+ metrics across 7 groups (core, evidence, HCLS, compliance,
  agentic, safety, code)
- Export an Inspect-AI-compatible EvalLog v2 JSON for any run
- Export an evidence bundle (snapshot + traces + audit slice + scenario
  fingerprints) per run
- Drive the legacy GymSpec onboarding/certification path on the same store

---

## Leaderboard catalog

| Category             | ID prefix              | Boards | Headline metric         |
|----------------------|------------------------|--------|-------------------------|
| Modernization (core) | `modernization_*`      | 5      | varies per board        |
|   · classic          | `modernization_classic`         | 1 | exact_match + judge   |
|   · evidence         | `modernization_evidence`        | 1 | citation_f1 + hallucination_rate |
|   · robustness       | `modernization_robustness`      | 1 | robustness_delta      |
|   · agentic          | `modernization_agentic` v1.2    | 1 | step_success_rate + action_set_jaccard + task_completed_partial (TAU-Bench live mocks) |
|   · safety           | `modernization_safety`          | 1 | wmdp_safety (inverted) |
| SWE-Bench Verified   | `swe_bench_verified` v1.1  | 1      | patch_files_match_gold + swe_bench_resolved (opt-in Docker)  |
| MedHELM              | `medhelm_modernization` | 1     | jury_score + bertscore  |
| Vertical             | `vertical_*`           | 17     | per-vertical jury_score |
| Compliance           | `compliance_*`         | 9      | control_match (HIPAA, PCI-DSS, SOX, GDPR, CMS-Interop, HTI-1, NYC LL 144, CA AI laws, IL AI laws + OWASP LLM Top 10, OWASP Agentic, MITRE ATLAS) |
| Frontier safety      | `modernization_safety`, `medical_safety` | 2 | wmdp_safety, red_team_safety (inverted) |
| Agent threat         | `agent_threat_safety`  | 1 | atr_safety (ATR rule pattern coverage) |

For the full registry: `uv run python scripts/export_registry.py`.

---

## Quick start

```bash
# install (uv is the toolchain)
uv sync

# start Postgres on 25432 + apply migrations
docker compose up -d db
uv run alembic upgrade head

# run the full eval test suite (offline by default)
MODELGYM_HF_OFFLINE=1 uv run pytest tests/eval/ -q

# boot the API + dashboard
uv run uvicorn app.main:app --reload --port 8000
# → http://localhost:8000/leaderboards.html
# → http://localhost:8000/dashboard         (legacy GymSpec UI)
```

### Optional live data

```bash
# pull real HF datasets (MedHELM, WMDP, SWE-Bench Verified, etc.)
unset MODELGYM_HF_OFFLINE
export HF_TOKEN=hf_xxx      # only needed for gated-tier scenarios
uv run pytest tests/eval/test_medhelm_real_data.py -q
```

---

## Install

```bash
# Canonical package name (formerly model-gym):
pip install touchstone-platform

# Docker image:
docker pull ghcr.io/yadavilli-solutions/touchstone
```

See [MIGRATION_TO_TOUCHSTONE.md](MIGRATION_TO_TOUCHSTONE.md) if you are upgrading from `model-gym`.

---

## Environment variables

| Var                              | New name (TOUCHSTONE_*)           | Purpose                                                    | Required in prod |
|----------------------------------|-----------------------------------|------------------------------------------------------------|------------------|
| `DATABASE_URL`                   | —                                 | Postgres DSN (default: `:25432/modelgym`)                  | yes              |
| `AUDIT_HMAC_SECRET`              | —                                 | HMAC key for audit ledger chain                            | yes              |
| `MODELGYM_AUDIT_LEDGER` (deprecated; use `TOUCHSTONE_AUDIT_LEDGER`) | `TOUCHSTONE_AUDIT_LEDGER` | Path to append-only ledger JSONL | yes |
| `MODELGYM_TRACE_ROOT` (deprecated; use `TOUCHSTONE_TRACE_ROOT`) | `TOUCHSTONE_TRACE_ROOT` | Root dir for per-instance trace JSON | yes |
| `MODELGYM_REQUEST_CACHE_PATH` (deprecated; use `TOUCHSTONE_REQUEST_CACHE_PATH`) | `TOUCHSTONE_REQUEST_CACHE_PATH` | SQLite path for hash-keyed model-call cache | yes |
| `MODELGYM_HF_OFFLINE` (deprecated; use `TOUCHSTONE_HF_OFFLINE`) | `TOUCHSTONE_HF_OFFLINE` | `1` to forbid HF network calls during tests/airgap | no |
| `HF_TOKEN`                       | —                                 | HuggingFace token — required for `access_tier=gated`       | when gated used  |

---

## Adding a new benchmark

1. Drop a scenario class in `app/eval/specs/<your_thing>.py` (subclass
   `HuggingFaceDatasetScenario` for HF, or implement `Scenario` directly).
2. Drop a leaderboard registration in `app/eval/leaderboards/<your_thing>.py`
   (calls `REGISTRY.register(Leaderboard(...))` at import time).
3. FastAPI auto-discovers it on next boot. Add a test under
   `tests/eval/test_<your_thing>.py`.

For a worked example, see `app/eval/specs/wmdp_scenarios.py` +
`app/eval/leaderboards/modernization_safety.py` + `tests/eval/test_wmdp.py`.

---

## Documentation map

- [`architecture.md`](architecture.md) — full architectural truth (read this
  first when extending)
- [`ROADMAP.md`](ROADMAP.md) — what shipped, what's next
- [`MIGRATION_TO_TOUCHSTONE.md`](MIGRATION_TO_TOUCHSTONE.md) — rename guide
- [`docs/WAVE_LOG.md`](docs/WAVE_LOG.md) — Plans 1-7 vs ship reality + Waves 1-5
- [`docs/superpowers/plans/`](docs/superpowers/plans/) — original written plans
  (superseded; WAVE_LOG is canonical)
- [`PATH_TO_PROD.md`](PATH_TO_PROD.md) — known prod-readiness boundaries

---

## Boundaries (not yet production-safe)

- `X-Touchstone-Role` header (formerly `X-Gym-Role`) is **dev-mode only**.
  Set `MODELGYM_ENV=production` (or `TOUCHSTONE_ENV=production`) to fail-close
  — Keycloak is then required (or `MODELGYM_ALLOW_DEV_AUTH=1` /
  `TOUCHSTONE_ALLOW_DEV_AUTH=1` for airgap deploys that issue their own local JWTs).
- SWE-Bench Docker execution (FAIL_TO_PASS / PASS_TO_PASS) is **opt-in**
  via `MODELGYM_SWE_BENCH_EXEC=1` (or `TOUCHSTONE_SWE_BENCH_EXEC=1`) + a
  docker daemon + the upstream `swebench/sweb.eval.*` per-instance images.
  Without it, scoring is structural-only.
- WMDP scoring remains structural (frontier-safety MCQ). TAU-Bench has
  live retail + airline mocks (Waves 7.4 + 7.5); 4 of 12 vendored tasks
  have `gold_final_state` for `TaskCompletionMetric` — remaining authoring
  is rolling content work.
- BERTScore falls back to `rouge_L` when the `bert_score` package isn't
  installed (set `MODELGYM_BERTSCORE_REQUIRE_REAL=1` /
  `TOUCHSTONE_BERTSCORE_REQUIRE_REAL=1` to fail-close).
- Inspect AI export emits both `.json` and a real in-tree `.eval` zipfile
  (Wave 7.1) — no `inspect-ai` dep required.

---

## License

See [`LICENSE.md`](LICENSE.md).
