Metadata-Version: 2.4
Name: reqfence
Version: 0.1.0
Summary: An agent regression firewall: replay saved agent traces and flag regressions by checking requirements, not text diffs. PASS / FAIL / UNCERTAIN.
Author: Ayush Singh
License: MIT
Project-URL: Homepage, https://github.com/AyushSingh110/reqfence
Project-URL: Repository, https://github.com/AyushSingh110/reqfence
Project-URL: Issues, https://github.com/AyushSingh110/reqfence/issues
Keywords: llm,agents,regression-testing,ci,evaluation,agent-testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pydantic>=2.0
Requires-Dist: click>=8.1
Provides-Extra: groq
Requires-Dist: groq>=0.11; extra == "groq"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == "anthropic"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Provides-Extra: publish
Requires-Dist: build>=1.0; extra == "publish"
Requires-Dist: twine>=5.0; extra == "publish"
Dynamic: license-file

# reqfence

> **An agent regression firewall.** When you change a prompt, model, or tool,
> `reqfence` replays saved agent traces and flags **regressions** by checking
> whether outputs still satisfy their **requirements** — not by text-diffing.
> Every check returns **PASS / FAIL / UNCERTAIN**.

Standalone package: `reqfence` does **not** depend on `ariadx` or `fie-sdk`. The
requirement critic and trace schema are vendored and dependency-cleaned. This
release is Milestone 1 (the CLI).

## Why two tiers

As validated by the [Milestone 0 derisking experiment](../milestone0/RESULTS.md)
(🟢 GREEN), a text diff can't tell a harmless reword from a confidently wrong
answer. `reqfence` therefore runs two tiers over developer-declared requirements,
and each requirement has exactly **one owner**, chosen by whether it can be
checked deterministically:

| Tier | Owns | Role |
|---|---|---|
| **Deterministic** (`checks.py`) | every **checkable** item (JSON-valid, field-present, tool-called, word-count, …) | Primary hard gate, ~100% precision by construction |
| **Semantic** (`semantic.py`) | only the **uncheckable** items (factual correctness) | Catches *confidently-wrong* outputs; abstains (UNCERTAIN) when the judge isn't unanimous |

The semantic judge is **never asked to grade a checkable item** — that alone
removed the false alarms an earlier "grade everything" design produced (the LLM
can't reliably count words). See [RESULTS.md](RESULTS.md).
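
The ownership and abstention rules above can be sketched in a few lines. This is an illustration only, not the code in `semantic.py`: `owner` and `judge_verdict` are hypothetical names, and representing the judge's votes as a list of strings is an assumption.

```python
def owner(check_type: str) -> str:
    """Route a requirement to exactly one tier based on its check type."""
    # Assumption: only the `semantic` check type belongs to the LLM tier.
    return "semantic" if check_type == "semantic" else "deterministic"


def judge_verdict(votes: list[str]) -> str:
    """Collapse several judge votes into one verdict; abstain unless unanimous."""
    distinct = set(votes)
    if distinct == {"PASS"}:
        return "PASS"
    if distinct == {"FAIL"}:
        return "FAIL"
    return "UNCERTAIN"  # split vote, or no votes at all
```

Under this sketch a 2-to-1 split abstains instead of taking the majority, which trades recall for fewer false alarms.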

**Final verdict** (`engine.py`, `schema.combine`): each requirement resolves to
one PASS/FAIL/UNCERTAIN; the candidate **FAILs if any requirement fails**,
**PASSes iff all pass**, else UNCERTAIN. A semantic **UNCERTAIN never fails the
build**; a deterministic **FAIL always does**.
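
A minimal sketch of that combination rule, for readers who prefer code. It is not the actual `schema.combine`: the `Verdict` enum and `combine_verdicts` are hypothetical names, and returning UNCERTAIN for an empty checklist is an assumption the text above does not settle.

```python
from collections.abc import Iterable
from enum import Enum


class Verdict(str, Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    UNCERTAIN = "UNCERTAIN"


def combine_verdicts(verdicts: Iterable[Verdict]) -> Verdict:
    """Fold per-requirement verdicts into one candidate verdict."""
    results = list(verdicts)
    if Verdict.FAIL in results:
        return Verdict.FAIL  # any failing requirement fails the candidate
    if results and all(v is Verdict.PASS for v in results):
        return Verdict.PASS  # every requirement passed
    return Verdict.UNCERTAIN  # at least one abstention (or nothing to check)
```

Because FAIL is tested first, an UNCERTAIN requirement can never mask a failure elsewhere in the checklist.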

## Install

```bash
pip install -e ".[groq]"     # or ".[anthropic]"; core installs with just pydantic+click
```
Python ≥ 3.11 (uses stdlib `tomllib`).

## The three commands

### `reqfence init`
Scaffolds `reqfence.toml` + empty `fixtures.jsonl` / `candidates.jsonl`.

### `reqfence record` — save a baseline
Stores a frozen baseline trace + its developer-declared requirement checklist.
Ingests an already-captured trace (it does not execute an agent):

```bash
# requirements.json: [{"id":"json","desc":"valid JSON","check":{"type":"valid_json"}}, ...]
reqfence record --id weather --task "Return weather as JSON" \
  --requirements requirements.json --from-trace baseline_trace.json
# or convert a framework trace:
reqfence record --id t1 --task "..." --requirements reqs.json --from-langgraph messages.json
reqfence record --id t1 --task "..." --requirements reqs.json --from-openai steps.json --openai-format run_steps
```
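
If the checklist is generated rather than hand-written, a small helper along these lines may be convenient. The `id` / `desc` / `check` shape is taken from the comment in the example above; everything else is an assumption: `render_requirements` is a hypothetical helper, the second entry assumes `no_tool_error` takes no parameters, and treating all three keys as mandatory is a guess.

```python
import json

# Hypothetical checklist; the first entry mirrors the example comment above.
REQUIREMENTS = [
    {"id": "json", "desc": "valid JSON", "check": {"type": "valid_json"}},
    {"id": "tools-ok", "desc": "no tool call errored", "check": {"type": "no_tool_error"}},
]


def render_requirements(items: list[dict]) -> str:
    """Serialize a checklist to JSON text after a basic shape check."""
    for item in items:
        missing = {"id", "desc", "check"} - item.keys()
        if missing:
            raise ValueError(f"requirement {item!r} is missing {sorted(missing)}")
    return json.dumps(items, indent=2) + "\n"


# Usage: pathlib.Path("requirements.json").write_text(render_requirements(REQUIREMENTS))
```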

### `reqfence check` — gate a change
Replays candidate traces against baselines, runs both tiers, prints a
per-requirement table, and **exits non-zero if any FAIL** (UNCERTAIN does not):

```bash
reqfence check                       # uses paths from reqfence.toml
reqfence check --no-semantic         # deterministic gate only (no API key needed)
```
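
Since the gate is communicated through the exit status, a CI step written in Python only needs to inspect the return code. The wrapper below is a sketch: `gate` is a hypothetical helper, and it relies only on the documented behavior that `check` exits non-zero when any requirement FAILs.

```python
import subprocess
import sys


def gate(command: list[str]) -> bool:
    """Run a check command and report whether it exited with status 0."""
    result = subprocess.run(command)
    return result.returncode == 0


# Usage in a CI script (requires reqfence on PATH):
#   sys.exit(0 if gate(["reqfence", "check", "--no-semantic"]) else 1)
```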

The semantic tier runs only when it is enabled **and** an API key
(`GROQ_API_KEY` or `ANTHROPIC_API_KEY`) is set in the environment. For
convenience, `check` also loads keys from a nearby `.env` file, but it never
prints or writes its contents.
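
The activation condition can be expressed as a single predicate. This is a sketch under assumptions, not the tool's own logic: `semantic_tier_active` is a hypothetical name, and the `.env` lookup is deliberately left out.

```python
import os
from collections.abc import Mapping

KEY_VARS = ("GROQ_API_KEY", "ANTHROPIC_API_KEY")


def semantic_tier_active(enabled: bool, env: Mapping[str, str] = os.environ) -> bool:
    """True only when the tier is enabled and at least one key is non-empty."""
    return enabled and any(env.get(name) for name in KEY_VARS)
```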

## Requirement checks (catalog)

**Core six** (the reliable gate, unit-tested for precision):
`valid_json`, `contains_substring` (+ `regex`), `max_words`, `contains_field`,
`tool_called`, `no_tool_error`.

**Extended** (thin, tested): `min_words`, `min_sources`, `json_array_len`, `file_written`.

**Special:** `semantic` always abstains deterministically; only the LLM tier judges it.
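
To give a feel for why these checks are deterministic, here is a sketch of three of them as plain predicates over the output text. These are not the implementations in `checks.py`: the function signatures and the parameter names `limit` and `field` are assumptions, as is counting words by whitespace splitting.

```python
import json


def valid_json(output: str) -> bool:
    """PASS condition: the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def max_words(output: str, limit: int) -> bool:
    """PASS condition: at most `limit` whitespace-separated words."""
    return len(output.split()) <= limit


def contains_field(output: str, field: str) -> bool:
    """PASS condition: the output is a JSON object with a top-level `field`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and field in data
```

Each predicate depends only on the output string, so the same input always yields the same verdict and no model call is involved.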

## Fixtures format

Versioned JSONL, one record per line (`fixtures.jsonl` = baselines + checklists,
`candidates.jsonl` = labeled candidate traces). The Milestone 0 benchmark has
been migrated into [`fixtures/`](fixtures/) via `python scripts/migrate_m0.py`.
The format is a first-class artifact designed to grow.
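
For ad hoc inspection, a JSONL file can be read with the standard library alone. The reader below is a generic sketch: `parse_jsonl` is a hypothetical helper, it makes no assumption about the fields inside each record, and skipping blank lines is a convenience choice that the format itself may not permit.

```python
import json
from collections.abc import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-blank line, with line-numbered errors."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption, see above)
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
        yield record


# Usage:
#   with open("fixtures.jsonl", encoding="utf-8") as fh:
#       records = list(parse_jsonl(fh))
```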

## Tests

```bash
pip install -e ".[dev]" && pytest      # 26 tests: checks, union/abstention, fixtures, CLI
```
