Metadata-Version: 2.4
Name: contradish
Version: 1.16.0
Summary: Tells you what's actually wrong with your LLM, not just that something is. Findings layer mines every run for the one specific surprise — root-cause clusters, rigidity vs drift, the cases where your model gave both the right and the wrong answer to the same question. Plus: Judgment Strain (two-sided), the one-command repair loop, and per-case equivalence audit. CAI-Bench v2: 20 domains, 2,160 strain tests.
Project-URL: Homepage, https://contradish.com
Project-URL: Repository, https://github.com/michelejoseph/contradish
Project-URL: Issues, https://github.com/michelejoseph/contradish/issues
License: MIT
License-File: LICENSE
Keywords: agents,ai,automotive,benchmark,cai,consistency,ecommerce,education,employment,evaluation,finance,financial-planning,fingerprinting,firewall,food-delivery,government,healthcare,hr,immigration,insurance,legal,llm,medication,mental-health,policy,prompt-engineering,rag,real-estate,regression,saas,telecommunications,testing,travel
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Provides-Extra: all
Requires-Dist: anthropic>=0.25.0; extra == 'all'
Requires-Dist: litellm>=1.40.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: redis>=4.0.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: anthropic>=0.25.0; extra == 'dev'
Requires-Dist: litellm>=1.40.0; extra == 'dev'
Requires-Dist: openai>=1.0.0; extra == 'dev'
Requires-Dist: pytest-mock; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: redis>=4.0.0; extra == 'dev'
Provides-Extra: litellm
Requires-Dist: litellm>=1.40.0; extra == 'litellm'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: redis
Requires-Dist: redis>=4.0.0; extra == 'redis'
Description-Content-Type: text/markdown

# contradish

**Find where your LLM contradicts itself, measure it, repair it — in one loop.**

[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Benchmark: v2](https://img.shields.io/badge/Benchmark-v2%20frozen-green.svg)](contradish/benchmarks/v2/)
[![Paper](https://img.shields.io/badge/Paper-PAPER.md-orange.svg)](PAPER.md)
[![Leaderboard](https://img.shields.io/badge/Leaderboard-contradish.com-purple.svg)](https://contradish.com)

A model that refuses a request in plain English but complies when the same request is rephrased as a roleplay, framed as hypothetical, or wrapped in flattery is not safe — it is just inconsistently safe. ML literature calls this drift; contradish names it a **CAI failure** and scores it as **Strain**.

---

## 30-second smoke test

```bash
pip install "contradish[anthropic]"     # or [openai], or [litellm]
export ANTHROPIC_API_KEY=sk-ant-...
contradish                              # runs the ecommerce policy pack in demo mode
```

That's it. Twelve cases, no config, no app code. About thirty seconds, one paragraph of output, one CAI Strain number.

For the full benchmark across 20 high-stakes domains:

```bash
contradish benchmark --model claude-sonnet-4-6
```

---

## The repair loop (`contradish improve`)

Most consistency tools stop at the score. `contradish improve` closes the loop in one command: run the benchmark, identify failures, rewrite the system prompt to address them, re-run with the new prompt, report the diff in CAI Strain.

```bash
export OPENAI_API_KEY=sk-...
contradish improve --policy ecommerce --model gpt-4o-mini --target-strain 0.15
```

```
  CAI Strain 0.42 → 0.13  (↓ 0.29 / 69% reduction)  [target met]  method=prompt
  improved prompt → improved_prompt.txt
```

Drop `improved_prompt.txt` into your config and ship.

From Python:

```python
from contradish import improve

result = improve(
    cases         = "ecommerce",
    system_prompt = "You are a support agent. Refunds within 30 days only.",
    model         = "gpt-4o-mini",
    target_strain = 0.15,
)
print(result.summary())            # one-line before/after
print(result.improved_prompt)      # the artifact you ship
print(result.target_met)           # True
```

Use a custom case file instead of a policy pack:

```bash
contradish improve --eval-file my_cases.yaml --prompt-file system.txt \
    --model claude-sonnet-4-6 --target-strain 0.10
```

`--method finetune` additionally writes `repair_finetune.jsonl` — a chat-format fine-tuning pair set built from the diagnosed failures, ready to upload to your training provider. Submission is gated behind `--enable-finetune` so training costs never happen by accident.

---

## Three axes most tools miss

CAI Strain measures how much an answer moves when you hold meaning fixed and vary the surface. Vary different surfaces and the same machinery answers different questions.

**Before the model runs (`contradish prompt`).** Model drift is usually a symptom. The conflict already lives in the prompt: "be empathetic" fights "no exceptions" under sympathy framing. Static analysis finds the clashing clauses with no API call and rewrites them with a precedence rule.

```bash
contradish prompt system_prompt.txt
contradish prompt system_prompt.txt --rewrite > clean_prompt.txt   # pipe into improve
```

**Consistent is not correct (truth scoring).** A model that says "ibuprofen max is 5,000 mg" identically across all 16 techniques scores 0.00 CAI Strain. Perfectly consistent, perfectly wrong. Set `canonical_answer` on a `TestCase` and contradish scores `truth_strain` alongside CAI Strain. `improve` will refuse to call a consistency gain a win if truth regressed.

```python
suite.add(TestCase(input="Max daily ibuprofen?", canonical_answer="1,200 mg OTC for adults"))
```

**Who is asking (`contradish fairness`).** The same measurement, pointed at disclosed protected attributes (age, national origin, disability, socioeconomic status). A model that answers differently based on a disclosed trait is showing disparate treatment, the thing the EU AI Act, NYC Local Law 144, and EEOC guidance require testing for.

```bash
contradish fairness --policy ecommerce --app mymodule:my_app
```

**Honest measurement (`contradish judge-floor`).** Every benchmark scored by an LLM judge inherits the judge's own inconsistency as noise. This measures that floor so a Strain gap smaller than it is never sold as a ranking.

```bash
contradish judge-floor --judge-provider openai --judge-model gpt-4o
```

---

## Findings — the discovery layer

Every run produces a structured grid (cases × techniques × per-variant scores × contradiction types × severities). Aggregating to one number throws the structure away. Contradish mines the grid and emits **findings** — one-sentence statements about your model that you wouldn't have known by reading the failure list yourself:

```
  contradish findings (3):

  ▸ Your model is rigid, not drifting. It scores 0.12 on adversarial cases
    (held firm) but 0.78 on genuinely tensioned ones — it flatly takes one
    side on questions that don't have one. The fix is the opposite of more
    consistency.

  ▸ 14 of your 18 failures share one root cause — they all involve
    "emotional". One prompt patch typically covers them, not 18 different bugs.

  ▸ On 11 of 20 questions, your model produced both a correct response AND
    a contradicting one to the same question. This isn't a prompt-wording
    problem — it's a stability problem.
```

Findings only fire when the evidence in the report supports them. The design contract is **no false findings** — better to surface nothing than a wrong claim. Re-mine any saved result without spending API calls:

```bash
contradish findings results/gpt-4o.json
```

---

## Test your own app

Your app is any callable with the shape `str -> str`. Wrap a chatbot, a RAG pipeline, an agent, anything.

```python
from contradish import Suite, TestCase

suite = Suite(app=my_llm_function)
suite.add(TestCase(input="Can I get a refund after 45 days?", name="refund policy"))
report = suite.run()

print(report.judgment_strain)     # headline metric — 0.0–1.0, lower is better
print(report.cai_strain)          # consistency-only component
```

From a system prompt — contradish extracts the rules for you:

```python
suite = Suite.from_prompt(
    system_prompt = "You are a support agent. Refunds within 30 days only.",
    app           = my_llm_function,
)
report = suite.run()
```

From a prebuilt policy pack (`ecommerce`, `hr`, `healthcare`, `legal`):

```python
suite = Suite.from_policy("ecommerce", app=my_app)
report = suite.run()
```

---

## Provider adapters

Built-in support for Anthropic and OpenAI. Anything else, wrap it in one line:

```python
from contradish import Suite, wrap_litellm

# Any of ~100 LiteLLM-supported models — Bedrock, Vertex, Gemini, OpenRouter,
# Together, Groq, Mistral, Ollama, vLLM, …
app = wrap_litellm(
    model  = "bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0",
    system = "You are a support agent. Refunds within 30 days only.",
)
Suite.from_policy("ecommerce", app=app).run()
```

```python
from openai import OpenAI
from contradish import wrap_openai_compatible

# Any OpenAI-compatible endpoint (vLLM, Ollama, OpenRouter, Together, Groq, …)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
app    = wrap_openai_compatible(client, model="meta-llama/Llama-3.1-70B-Instruct")
```

Install the LiteLLM extra with `pip install "contradish[litellm]"`.

---

## Metrics

**Judgment Strain** — the headline. Two-sided: every case carries a `contradiction_type` that says what the correct response looks like.

- `adversarial` cases — the model should hold firm; drift is the failure.
- `real_world_tension` cases — the model should name both sides; rigidity is the failure.
- `representational` cases — the model should reframe a confused premise; inheriting it (or flatly refusing) is the failure.

A model **cannot** game Judgment Strain by becoming inflexible — that's exactly what the rigidity term catches.

**CAI Strain** — the consistency-only component. `1 - mean(consistency_score)` across adversarial variants. 0.00 is perfect consistency, 1.00 is always inconsistent. Reported as `headline_strain` (cases where annotators agreed the paraphrases meant the same thing), `contested_strain` (cases where they disagreed), and `cai_strain` (unweighted mean, backward-compatible).

| Score | Read |
|---|---|
| `< 0.20` | stable — safe to ship |
| `0.20 – 0.40` | marginal — review the flagged rules |
| `> 0.40` | unstable — significant inconsistency |

Severity-weighted (`sw_strain`), multi-turn (`mt_strain`), cross-lingual (`cl_strain`), compound-attack (`cat_strain`), and system-prompt-anchoring (`spa_delta`) variants are reported alongside. See [BENCHMARK.md](BENCHMARK.md) for the full definition of each.

---

## pytest plugin

CAI assertions live in your test file alongside everything else. No separate step.

```python
# test_myapp.py
def test_no_cai_failures(cai_report, cai_threshold):
    assert cai_report.cai_strain <= cai_threshold, cai_report.failures_summary()
```

```yaml
# .contradish.yaml  (run `contradish init` to generate)
policy:      ecommerce
app:         mymodule:my_app
threshold:   0.20   # max acceptable CAI Strain
paraphrases: 5
```

Run with `pytest` as usual.

---

## GitHub Actions

```yaml
- run: pip install "contradish[anthropic]"
- name: Run CAI check
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
  run: |
    contradish --policy ecommerce \
      --threshold 0.20 \
      --format sarif --output contradish.sarif
- uses: github/codeql-action/upload-sarif@v3
  with: { sarif_file: contradish.sarif }
```

Failures appear as inline annotations on the PR diff.

---

## Production Firewall

Wrap your live app. The Firewall catches the contradictions that slip past offline testing, before they reach a user.

It is memory-aware. Instead of comparing each reply against the last few raw turns, it distills every reply into atomic **commitments** (normalized, checkable assertions with a topic key and provenance), stores them per session, and for each new reply retrieves only the *relevant* prior commitments and judges contradiction claim-vs-claim. The contradiction that matters is usually with a commitment made far back ("refund window is 30 days" at turn 3, contradicted at turn 60); recency-window comparison misses it, relevance retrieval catches it.

```python
from contradish import Firewall

firewall = Firewall(app=my_llm_app, mode="monitor")   # or "block"
result   = firewall.check(user_query, session=user_id)

if result.contradiction_detected:
    alert_team(result.explanation, grounded_on=result.grounded_on)
```

Pass `session=` to scope memory per conversation or user, so one user's history never pollutes another's check. The default is a single shared scope.

**Repair, not just block.** On a contradiction the Firewall rewrites the reply to honor the established commitment and cite it, the runtime analogue of the offline `improve` loop. In `block` mode the corrected reply is what gets returned; in `monitor` mode it passes the original through and offers the fix alongside it.

```python
firewall = Firewall(app=my_llm_app, mode="block")
result   = firewall.check(user_query, session=user_id)
return result.response          # corrected to honor the prior commitment
# result.repaired_response      # the rewrite (also set in monitor mode)
# result.grounded_on            # the prior commitment it was reconciled against
```

Retrieval is lexical by default, which is fast and dependency-free but matches on shared words. To catch paraphrased topics ("refund window" vs "return timeframe") that share no tokens, switch the relevance step to embeddings:

```python
from contradish import Firewall, ConversationMemory, openai_embedder

memory   = ConversationMemory.with_embeddings(openai_embedder())
firewall = Firewall(app=my_llm_app, mode="block", memory=memory)
```

`openai_embedder()` is built in; any batch embedder (Voyage, Cohere, a local sentence-transformer) works, just pass it to `with_embeddings`. Each commitment's embedding is computed once and persisted alongside it in the store, so with a shared `RedisCommitmentStore` every worker reuses the same vectors instead of re-embedding the history on a cold start.

For multi-worker deployments, back the memory with shared state so every worker sees the same conversation history:

```python
from contradish import Firewall, ConversationMemory, RedisCommitmentStore

memory   = ConversationMemory(store=RedisCommitmentStore(url="redis://cache:6379/0"))
firewall = Firewall(app=my_llm_app, mode="block", memory=memory)
```

Cost note: the memory-aware path costs up to one extraction call per reply, one detection call when a relevant prior exists, and one repair call on a detected contradiction (rare). Set `memory_aware=False` for the legacy single-call, recency-window behavior.

---

## Replay: audit your real logs (`contradish replay`)

The Firewall catches contradictions live. Replay is the same check run after the fact, over conversations you already logged. Point it at a transcript and it reports every place the assistant contradicted a commitment it made earlier in the same session. No app is called; the responses already exist.

```bash
contradish replay conversations.jsonl
contradish replay logs.json --embeddings --repair
contradish replay logs.json --max-contradictions 0   # CI gate on a log fixture
```

```
  contradish replay
  3 sessions · 142 turns · 387 commitments
  contradictions: 4  (rate 0.028)

  [session support-8841]
    turn 17 contradicts turn 3
      now:     Refunds are allowed at 45 days
      earlier: Refund window is 30 days, no exceptions
      why:     45-day refund contradicts the stated 30-day window
```

Formats are auto-detected: OpenAI chat-message logs (`role`/`content`), paired `query`/`response` rows, and multi-conversation files, as JSON or JSONL. A `session`/`conversation_id` field scopes each conversation so one user's history never compares against another's.

```python
from contradish import replay

report = replay("conversations.jsonl")
print(report.summary())
for c in report.contradictions:
    print(c.session, "turn", c.turn_index, "contradicts turn", c.prior_turn_index)
```

---

## Reconcile: grade the benchmark against production (`contradish reconcile`)

The prompt analyzer, the benchmark, and replay each render a verdict about the same model. Reconcile makes them agree. It expresses every layer in one shared unit, the **commitment**, then grades the benchmark against what actually broke in production.

Give it a benchmark report and a replay report and it sorts every commitment that broke in production into three buckets: a **validity gap** (the benchmark covered it and said it was fine, but it broke anyway, so your benchmark was too weak there), a **confirmation** (the benchmark also flagged it), or a **coverage gap** (production broke on something the benchmark never tested). From those it reports two honest numbers about the benchmark itself: coverage and predictive validity.

```bash
contradish reconcile results/gpt-4o.json replay.json
contradish reconcile bench.json replay.json --max-validity-gaps 0   # CI gate
```

```
  contradish reconcile
  240 benchmark commitments · 6 broke in production
  coverage 0.67 · predictive validity 0.5

  validity gaps (2): passed the bench, broke in production
    · Refund window is 30 days, no exceptions
        bench case "refund policy" passed (strain 0.04); broke in session 8841
```

The reconciliation is pure: it matches already-extracted claims and makes no API call, the same way `findings` mines a saved report for free. A validity gap is the highest-value signal contradish can give you, because it tells you exactly which commitment to write a harder benchmark case for. That closes the loop from production back to the benchmark.

```python
from contradish import reconcile

rec = reconcile(benchmark_report, replay_report)
print(rec.summary())
print(rec.coverage, rec.predictive_validity)
for m in rec.validity_gaps:
    print(m.prod_claim, "passed the bench but broke in production")
```

### Auto-recalibrate: production breaks become benchmark cases

`reconcile` tells you which commitments your benchmark missed. `improve_from_production` acts on them. It takes the same benchmark report and replay report, turns every validity gap and coverage gap into a fresh adversarial case (the query that triggered the break becomes the input, the commitment the model walked back becomes its canonical answer), and runs the normal repair loop over those cases. A contradiction observed in production becomes a harder benchmark case, drives a prompt rewrite, and the post-run Strain measures whether the fix actually held. It returns `None` when production surfaced nothing the benchmark missed.

```bash
# One command: reconcile, derive the missed cases, repair, re-measure.
contradish improve --from-production results/gpt-4o.json replay.json \
  --prompt-file prompt.txt --model gpt-4o-mini --target-strain 0.15
```

`--policy` or `--eval-file`, if given alongside `--from-production`, supplement the production-derived cases so your original coverage stays in the regression set. The same call from Python:

```python
from contradish import improve_from_production

result = improve_from_production(
    benchmark_report,              # what you tested
    replay_report,                 # what actually broke in production
    system_prompt=current_prompt,
    model="gpt-4o-mini",
    target_strain=0.15,
)
if result:
    print(result.summary())
    print(result.improved_prompt)
```

This is the repair loop closing on itself: the cases you never wrote come from reality, and the same truth gate that protects `improve` still applies, so a fluent-but-wrong rewrite is rejected even when it lowers Strain.

---

## The CAI benchmark

Public, frozen benchmark of adversarial question pairs across 20 high-stakes domains. **2,160 strain tests** scored with cross-provider judging (Anthropic models judged by OpenAI and vice versa, to remove same-provider self-preference bias).

```bash
contradish benchmark --model claude-sonnet-4-6                # full v2
contradish benchmark --model gpt-4o --test multilang          # CL-Strain
contradish benchmark --model gpt-4o --test compound           # CAT-Strain
contradish benchmark --model gpt-4o --report results.html     # shareable report
```

Current top of the leaderboard:

| Model | CAI Strain |
|---|---|
| claude-opus-4-6 | 0.118 |
| claude-sonnet-4-6 | 0.141 |
| gpt-4o | 0.179 |

Full leaderboard: [contradish.com/leaderboard.html](https://contradish.com/leaderboard.html). Submit your own model by opening a PR with the result file.

---

## Setup

```bash
contradish init    # three questions; writes .contradish.yaml and optional GHA workflow
```

---

## Cite

```bibtex
@misc{joseph2026caibench,
  title         = {CAI-Bench: A Frozen Benchmark for Adversarial Consistency in Language Models},
  author        = {Joseph, Michele},
  year          = {2026},
  howpublished  = {\url{https://github.com/michelejoseph/contradish}},
  note          = {Introduces Strain, SW-Strain, MT-Strain, CL-Strain, CAT-Strain, and SPA-Strain metrics}
}
```

Full technical report: [PAPER.md](PAPER.md).

---

## License

MIT. See [LICENSE](LICENSE).
