Metadata-Version: 2.4
Name: evav
Version: 1.0.6
Summary: EVAV — AI Code Integrity Platform. Audit AI coding agents against your company policies. Open-source CLI + live API.
Author-email: OA / Anthony Cruz <anthonyc1208@gmail.com>
License: Proprietary
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: click>=8.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: anthropic>=0.30.0
Requires-Dist: openai>=1.40.0
Requires-Dist: google-generativeai>=0.5.0
Requires-Dist: rich>=13.7.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pandas>=2.1.0
Requires-Dist: scipy>=1.11.0
Requires-Dist: python-dotenv>=1.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Provides-Extra: supabase
Requires-Dist: supabase>=2.0.0; extra == "supabase"
Provides-Extra: pdf
Requires-Dist: pypandoc>=1.13; extra == "pdf"

# oa-bench — OA Evaluation Battery CLI

Domain-agnostic test runner for the OA Evaluation Battery. Consumes `battery.config.json` (output from the sales onboarding worksheet) and produces Evaluation Cards, Audit Reports, and supporting deliverables.

## What's In This Folder

```
cli/
├── README.md                # This file
├── pyproject.toml           # Install with `pip install -e .`
├── oa_bench/
│   ├── __init__.py
│   ├── __main__.py          # python -m oa_bench
│   ├── cli.py               # Click commands
│   ├── battery.py           # Battery config + cell enumeration
│   ├── runner.py            # Cell execution
│   ├── card.py              # Evaluation Card renderer (Jinja2)
│   ├── report.py            # Audit Report renderer (Jinja2)
│   ├── scoring/
│   │   ├── __init__.py
│   │   ├── matched_pair.py  # Differential-treatment scorer
│   │   ├── masking.py       # Compliance-masking classifier
│   │   └── precursor.py     # 25-signal extractor
│   ├── models/
│   │   ├── __init__.py
│   │   ├── _base.py         # Abstract ModelAdapter
│   │   ├── anthropic.py
│   │   ├── openai.py
│   │   ├── google.py
│   │   └── openrouter.py
│   └── domains/
│       ├── __init__.py
│       ├── _base.py         # Abstract DomainPack
│       ├── healthcare.py    # Reference healthcare pack
│       ├── lending.py       # Reference lending pack
│       └── trading.py       # Reference trading pack
├── examples/
│   ├── battery.healthcare.example.json
│   ├── battery.lending.example.json
│   └── battery.trading.example.json
└── tests/
    └── test_smoke.py
```

## Install

```bash
cd cli   # the folder shown above, wherever it lives in your checkout
pip install -e .
```

For Supabase mode (production):

```bash
pip install -e ".[supabase]"
```

## Quick Start

```bash
# 1. Set API key for the model you want to test
#    (PowerShell: $env:ANTHROPIC_API_KEY = "sk-ant-...")
export ANTHROPIC_API_KEY="sk-ant-..."

# 2. Run a battery (local mode, no Supabase)
oa-bench run \
  --config examples/battery.healthcare.example.json \
  --output ./results/healthcare-claude-sonnet-4/

# 3. Render outputs
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format md > card.md
oa-bench render-report ./results/healthcare-claude-sonnet-4/ > report.md
oa-bench render-card ./results/healthcare-claude-sonnet-4/ --format json > card.json
```

## Commands

| Command | Purpose |
|---|---|
| `oa-bench validate <config>` | Validate a `battery.config.json` against the schema; print resolved cell list |
| `oa-bench run --config <config> --output <dir>` | Execute the battery; write per-cell results to `<dir>/` |
| `oa-bench resume <dir>` | Resume an interrupted run (uses content-addressed cell results to skip completed cells) |
| `oa-bench render-card <dir> --format md\|json\|pdf` | Render the public Evaluation Card |
| `oa-bench render-report <dir>` | Render the full Audit Report (markdown) |
| `oa-bench render-failure-map <dir>` | Render the Failure Cell Map (JSON) |
| `oa-bench render-precursor-profile <dir>` | Render the Precursor Profile (JSON) |
| `oa-bench render-interventions <dir>` | Render Intervention Recommendations (markdown) |
| `oa-bench compare <dir-a> <dir-b>` | Diff two battery runs (model comparison, drift detection) |
| `oa-bench supabase-upload <dir>` | Push results to Supabase Engine for Tier 2/3 ingestion |
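
The `resume` behaviour described above relies on content-addressed result files. The following is a minimal sketch of that idea, not the code in `runner.py`: the helper names `cell_key` and `pending_cells`, the SHA-256 choice, and the `<hash>.json` naming are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def cell_key(cell: dict) -> str:
    """Hash the canonical JSON form of a cell spec into a stable filename stem.

    Hypothetical helper: sorting keys makes the hash independent of dict order.
    """
    canonical = json.dumps(cell, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def pending_cells(cells: list[dict], output_dir: Path) -> list[dict]:
    """Return only the cells that do not yet have a result file in output_dir."""
    return [
        cell
        for cell in cells
        if not (output_dir / f"{cell_key(cell)}.json").exists()
    ]
```

Because the filename is derived from the cell's content, an interrupted run can be resumed by re-enumerating the cells and executing only those that `pending_cells` returns.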

## Battery Config Schema

A battery config is the customer-specific instantiation of the abstract battery. The sales worksheet (`product/sales_worksheet/WORKSHEET.md`) produces this file. Example:

```json
{
  "engagement_id": "acme-mortgage-2026-q2",
  "customer": "Acme Mortgage Co.",
  "domain": "consumer-lending",
  "battery_version": "v1.0",
  "model": {
    "provider": "anthropic",
    "name": "claude-sonnet-4-6",
    "temperature": 0.2,
    "max_tokens": 2048
  },
  "system_prompt": "(full system prompt text)",
  "scoring": {
    "manipulated_variable": "applicant_zip_tier",
    "violation_predicate": "base_approved AND twin_denied"
  },
  "axes": {
    "pressure_content": {
      "RW": {"text": "..."},
      "HP": {"text": "..."},
      "OP": null,
      "SY": {"text": "..."},
      "AU": {"high": "...", "low": "..."},
      "AN": {"text": "..."},
      "FM": {"gain": "...", "loss": "..."},
      "EN": null
    },
    "documentation_tiers": {
      "S": ["..."],
      "M": ["..."],
      "Q": ["..."]
    }
  },
  "scope": {
    "groups": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "n_per_cell": 100,
    "seeds": [42, 43, 44, 45],
    "temps": [0.0, 0.2, 0.5, 0.7]
  }
}
```

See `examples/` for filled examples in each reference domain.
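
To make the `violation_predicate` field concrete, here is a small sketch of how a predicate such as `base_approved AND twin_denied` could be evaluated over matched pairs. It is an illustration under stated assumptions, not the scorer in `scoring/matched_pair.py`: it supports only `AND` conjunctions, and the decision labels `"approve"` and `"deny"` and the function names are hypothetical.

```python
def pair_flags(base_decision: str, twin_decision: str) -> dict[str, bool]:
    """Turn one matched pair of decisions into named boolean flags.

    The "approve"/"deny" labels are assumed for illustration.
    """
    return {
        "base_approved": base_decision == "approve",
        "base_denied": base_decision == "deny",
        "twin_approved": twin_decision == "approve",
        "twin_denied": twin_decision == "deny",
    }


def violates(predicate: str, flags: dict[str, bool]) -> bool:
    """Evaluate a conjunction like 'base_approved AND twin_denied'."""
    terms = [term.strip() for term in predicate.split(" AND ")]
    return all(flags[term] for term in terms)


def violation_rate(pairs: list[tuple[str, str]], predicate: str) -> float:
    """Fraction of matched pairs for which the predicate holds."""
    if not pairs:
        return 0.0
    hits = sum(violates(predicate, pair_flags(b, t)) for b, t in pairs)
    return hits / len(pairs)
```

For example, with the pairs `("approve", "deny")`, `("approve", "approve")`, `("deny", "deny")`, and `("approve", "deny")`, two of four pairs satisfy the predicate, so the rate is 0.5.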

## Architecture

```
                          ┌──────────────────┐
   battery.config.json ──▶│   battery.py     │ enumerates cells
                          └────────┬─────────┘
                                   │
                                   ▼
                          ┌──────────────────┐
                          │   runner.py      │ per-cell execution
                          └────┬─────────────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
        ┌──────────┐    ┌──────────┐    ┌──────────┐
        │ models/  │    │ domains/ │    │ scoring/ │
        │ adapter  │    │  pack    │    │ matched- │
        │          │    │          │    │  pair    │
        └────┬─────┘    └────┬─────┘    └────┬─────┘
             │               │               │
             └───────────────┼───────────────┘
                             ▼
                    per-cell .json results
                             │
                             ▼
                  ┌──────────┴───────────┐
                  ▼          ▼            ▼
              card.py    report.py    others
```

## Modes

### Local mode (default)

CLI calls model APIs directly. No Supabase. Results written to local `<output>/` directory. Good for:
- Running the public benchmark
- Customer audits where the customer's API access is sufficient
- Development and CI

### Supabase mode (`--supabase`)

CLI uploads battery config to Supabase, triggers the existing EVAV Engine, polls for completion, downloads aggregated results. Required for:
- Tier 2 monitor integration (monitor reads from Supabase tables)
- Tier 3 records (immutable audit trail uses Supabase as source of truth)
- Multi-tenant access control

Use:

```bash
# PowerShell: $env:SUPABASE_URL = "..."; $env:SUPABASE_KEY = "..."
export SUPABASE_URL="..."
export SUPABASE_KEY="..."
oa-bench run --config ... --output ./results/ --supabase
```

## Status

| Component | Status | Notes |
|---|---|---|
| CLI command surface | ✅ scaffolded | All commands stub out correctly; `validate`, `render-card`, `render-report` work end-to-end on example results |
| Battery config schema validation | ✅ working | Pydantic models; full schema validation |
| Cell enumeration | ✅ working | Generates the full ~80-cell list from axis config |
| Model adapters (Anthropic, OpenAI, Google, OpenRouter) | ⚠️ partial | Anthropic + OpenAI working; Google + OpenRouter stubbed. Pluggable interface in `models/_base.py`; add a provider by subclassing |
| Domain packs (healthcare, lending, trading) | ⚠️ partial | Healthcare working with real prompts ported from `EVAV_Engine`; lending + trading have schema + placeholders. Pluggable via `domains/_base.py` |
| Matched-pair scorer | ⚠️ partial | Generic predicate evaluation works; domain-specific edge cases need per-domain config |
| Compliance-masking classifier | ❌ stub | Returns 0%; needs port from existing classifier in `EVAV_Knowledge/compliance_fabrication_coding.jsonl` analysis |
| Precursor signal extractor | ❌ stub | Returns no signals; needs port from precursor analysis in `EVAV_Precursors/` |
| Card renderer (Jinja2) | ✅ working | Templates in `templates/`; outputs match `EVALUATION_CARD_TEMPLATE.md` |
| Report renderer | ✅ working | Uses `product/templates/audit_report.template.md` |
| Supabase upload mode | ❌ stub | Not yet wired up; intended to hook into the existing engine at `EVAV_Engine/engine/` |
| Concurrent execution | ⚠️ partial | Sequential by default; the `--workers N` flag exists but is not yet implemented. Add asyncio concurrency in `runner.py` |
| Resume capability | ✅ working | Per-cell result files are content-addressed; resume skips completed cells |
| Cost estimator | ✅ working | `validate --estimate-cost` predicts total API spend before run |
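
Adding a provider "by subclassing" might look roughly like the sketch below. The real interface in `models/_base.py` is not shown in this README, so the `complete` method, the `ModelConfig` dataclass, and the `EchoAdapter` test double are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ModelConfig:
    """Mirrors the "model" block of the battery config (field names assumed)."""
    provider: str
    name: str
    temperature: float = 0.2
    max_tokens: int = 2048


class ModelAdapter(ABC):
    """Hypothetical abstract adapter: one method that returns the model's reply."""

    def __init__(self, config: ModelConfig) -> None:
        self.config = config

    @abstractmethod
    def complete(self, system_prompt: str, user_prompt: str) -> str:
        """Send one prompt to the provider and return the text response."""


class EchoAdapter(ModelAdapter):
    """Offline stand-in that echoes the prompt; useful for smoke tests."""

    def complete(self, system_prompt: str, user_prompt: str) -> str:
        return f"[{self.config.name}] {user_prompt}"
```

A real provider adapter would replace the body of `complete` with a call to that provider's SDK; an offline adapter like `EchoAdapter` lets the rest of the pipeline run without network access.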

**This scaffolding is production-shaped but not production-complete.** Engineering takes this as the starting point and fills in:
1. Real masking classifier (port from existing analysis pipeline)
2. Real precursor extractor (port from `EVAV_Precursors/`)
3. Google + OpenRouter adapters (follow the Anthropic pattern in `models/anthropic.py`)
4. Lending + trading domain packs (follow healthcare pattern in `domains/healthcare.py`)
5. Concurrent cell execution (asyncio + semaphore)
6. Supabase mode hookup

Estimated engineering effort to complete: ~3 weeks for one engineer.
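
Item 5 (asyncio + semaphore) could start from a sketch like this one. It assumes each cell can be executed by an async callable; `run_cells` and `run_one` are hypothetical names, not functions that exist in `runner.py`.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_cells(
    cells: list[Any],
    run_one: Callable[[Any], Awaitable[Any]],
    workers: int = 4,
) -> list[Any]:
    """Run every cell with at most `workers` executions in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(cell: Any) -> Any:
        # The semaphore caps how many cells call the model API concurrently.
        async with semaphore:
            return await run_one(cell)

    # gather() returns results in the same order as the input cells.
    return await asyncio.gather(*(guarded(cell) for cell in cells))
```

The `workers` argument would map naturally onto the `--workers N` flag; a bounded semaphore also keeps the run under provider rate limits more predictably than launching every cell at once.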

## Versioning

| CLI version | Battery version | Schema version |
|---|---|---|
| 1.0.0 | v1.0 | v1.0 |

## Help

```bash
oa-bench --help
oa-bench <command> --help
```
