Metadata-Version: 2.4
Name: contradish
Version: 0.4.2
Summary: CAI testing for LLM apps. Prebuilt policy packs, regression gating, live Contradiction Firewall, and automated prompt repair.
Project-URL: Homepage, https://contradish.com
Project-URL: Repository, https://github.com/michelejoseph/contradish
Project-URL: Issues, https://github.com/michelejoseph/contradish/issues
License: MIT
License-File: LICENSE
Keywords: agents,ai,cai,compression-aware-intelligence,consistency,ecommerce,evaluation,firewall,healthcare,hr,legal,llm,policy,prompt-engineering,rag,reasoning,regression,testing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Provides-Extra: all
Requires-Dist: anthropic>=0.25.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25.0; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: anthropic>=0.25.0; extra == 'dev'
Requires-Dist: openai>=1.0.0; extra == 'dev'
Requires-Dist: pytest-mock; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Description-Content-Type: text/markdown

# contradish

CAI testing for LLM applications.

A CAI failure is when your app says "refunds within 30 days" to one phrasing and "we can work something out" to a slightly different one. Same policy, same session, opposite answers. Contradish finds these failures, scores them, and gives you the tools to fix them before users run into them.

```
pip install contradish
```

---

## What it does

**Offline testing** — run before deploy. Contradish generates adversarial paraphrases of your test inputs, sends them all to your app, and scores consistency across responses.

**Regression gating** — compare baseline vs candidate on the same test suite. Block merges if the CAI score drops below your threshold.

**Production monitoring** — wrap your live app with the Firewall. It checks each response against recent ones and flags (or blocks) contradictions in real time.

**Prompt repair** — failing tests? Contradish generates 3 improved prompt variants, tests each one, and ranks them by CAI score so you know exactly which fix worked.
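To make the offline-testing loop concrete, here is a toy sketch of "send several phrasings to the app, then score agreement across the responses". It is not Contradish's implementation: this README does not specify how paraphrases are generated or how responses are scored, so the paraphrases are passed in by hand and plain string similarity (`difflib.SequenceMatcher`) stands in for real contradiction detection. `toy_consistency` is a hypothetical name.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import Callable, Sequence


def toy_consistency(app: Callable[[str], str], phrasings: Sequence[str]) -> float:
    """Mean pairwise string similarity of the app's responses (0.0 to 1.0).

    Crude stand-in only: string similarity cannot tell a harmless rewording
    from a real policy contradiction.
    """
    # Ask the app the "same" question in every phrasing.
    responses = [app(p) for p in phrasings]
    pairs = list(combinations(responses, 2))
    if not pairs:
        # Fewer than two responses: nothing to disagree with.
        return 1.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
```

An app that always returns the same text scores 1.0; any variation between responses pulls the number down.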

---

## Quickstart

```python
from contradish import Suite, TestCase

suite = Suite(app=my_llm_function)
suite.add(TestCase(input="Can I get a refund after 45 days?", name="refund policy"))
report = suite.run()

print(report.cai_score)           # 0.0-1.0, higher = more consistent
for r in report.results:
    print(r.test_case.name, r.cai_score)
```
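`my_llm_function` stands for your own code. As a minimal sketch, assuming the app is a plain callable that takes the question string and returns the response text (the shape used by the inner `app(question)` in the Prompt repair example), a deterministic stub is enough to dry-run your wiring without calling a model. The canned replies are made up for illustration.

```python
def my_llm_function(question: str) -> str:
    """Stand-in app: one string in, one string out (assumed signature)."""
    # Hypothetical canned policy text; a real app would call your LLM here.
    lowered = question.lower()
    if "refund" in lowered or "return" in lowered:
        return "Refunds are only available within 30 days of purchase."
    return "Let me get a team member to help with that."
```

Swap the body for a real model call once the rest of the pipeline works.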

Or give it your system prompt and let it figure out the test cases:

```python
suite = Suite.from_prompt(
    system_prompt="You are a support agent. Refunds within 30 days only.",
    app=my_llm_function,
)
report = suite.run()
```

Or from the CLI:

```bash
export ANTHROPIC_API_KEY=sk-ant-...

# test a system prompt directly (uses your LLM as the demo app)
contradish "You are a support agent. Refunds within 30 days only."

# test from a file
contradish --prompt system_prompt.txt

# test your own app
contradish --prompt system_prompt.txt --app mymodule:my_app_function

# JSON output for CI pipelines
contradish --prompt system_prompt.txt --json
```

---

## Policy packs (new in v0.4.2)

No system prompt? No test cases? Start here.

Contradish ships with prebuilt domain packs that let you get real CAI results in under 2 minutes.

```bash
# No --app needed. Runs in demo mode against the raw LLM.
contradish --policy ecommerce

# Test your actual app against the pack.
contradish --policy ecommerce --app mymodule:my_support_bot
contradish --policy hr --app mymodule:my_hr_assistant
contradish --policy healthcare --app mymodule:my_benefits_bot
contradish --policy legal --app mymodule:my_legal_tool
```

From Python:

```python
from contradish import Suite

# Loads 12 e-commerce test cases. No test case writing required.
suite = Suite.from_policy("ecommerce", app=my_app)
report = suite.run()
```

Or load the pack directly to inspect or extend it:

```python
from contradish import load_policy, list_policies

print(list_policies())     # ['ecommerce', 'hr', 'healthcare', 'legal']

pack = load_policy("ecommerce")
print(pack.display_name)   # "E-Commerce Support"
print(len(pack))           # 12

# Add a custom case to the prebuilt pack
from contradish import Suite, TestCase
suite = Suite(app=my_app)
for tc in pack.cases:
    suite.add(tc)
suite.add(TestCase(name="custom", input="My own test question"))
suite.run()
```

**Available packs:**

| Pack | Cases | Covers |
|---|---|---|
| `ecommerce` | 12 | Refunds, returns, price matching, shipping, warranties |
| `hr` | 12 | PTO, benefits, parental leave, termination, overtime |
| `healthcare` | 12 | Coverage, referrals, deductibles, prior auth, eligibility |
| `legal` | 12 | Disclaimers, liability, advice boundaries, data privacy |

Each case targets a real inconsistency vector — the places where LLM support bots most often give different answers to the same underlying question.

---

## CAI score

A number from 0 to 1 measuring how consistently your app responds to semantically equivalent inputs.

- `0.80+` — stable. Safe to ship.
- `0.60-0.79` — marginal. Review the flagged rules.
- `< 0.60` — unstable. CAI failures detected.

```
CAI FAILURE: "refund window"
  input:      "Can I get a refund after 45 days?"
  paraphrase: "I bought this 6 weeks ago, can I still return it?"
  output_a:   "Refunds are only available within 30 days of purchase."
  output_b:   "We can usually make exceptions for recent purchases."
  CAI score:  0.54 (unstable)

1 CAI failure found. 2 rules clean.
```
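If you post-process scores yourself, the bands listed above are easy to encode. A minimal sketch: `cai_band` is a hypothetical helper, not part of contradish, and it assumes that anything from 0.60 up to (but not including) 0.80 counts as marginal.

```python
def cai_band(score: float) -> str:
    """Map a CAI score in [0, 1] to the band names used in this README."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"CAI score must be between 0 and 1, got {score}")
    if score >= 0.80:
        return "stable"
    if score >= 0.60:
        return "marginal"  # assumption: covers everything in [0.60, 0.80)
    return "unstable"
```

For the sample failure above, `cai_band(0.54)` returns `"unstable"`.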

---

## Regression testing

Compare two versions of your app before merging. CI fails automatically if the CAI score falls below your threshold.

```python
from contradish import RegressionSuite, TestCase

suite = RegressionSuite(
    test_cases=[
        TestCase(input="Can I get a refund after 45 days?"),
        TestCase(input="Do you price match competitors?"),
    ]
)

result = suite.compare(
    baseline_app=production_app,
    candidate_app=new_app,
    baseline_label="prod-v12",
    candidate_label="pr-456",
)

print(result)
result.fail_if_below(consistency=0.80)  # raises AssertionError if the score is below 0.80
```

Load test cases from a YAML file:

```python
suite = RegressionSuite.load("evals.yaml")
```

```yaml
# evals.yaml
test_cases:
  - input: "Can I get a refund after 45 days?"
    name: "refund policy"
  - input: "Do you price match competitors?"
    name: "price matching"
```

From the CLI:

```bash
contradish compare evals.yaml \
  --baseline mymodule:production_app \
  --candidate mymodule:new_app \
  --threshold 0.80
```

### GitHub Actions

Drop this in `.github/workflows/cai.yml` to gate every PR:

```yaml
name: CAI regression

on: [pull_request]

jobs:
  cai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install contradish anthropic
      - name: Run CAI regression
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          contradish compare evals.yaml \
            --baseline mymodule:baseline_app \
            --candidate mymodule:candidate_app \
            --threshold 0.80
```

---

## Production Firewall

Wrap your live app to catch contradictions in real traffic before users notice.

```python
from contradish import Firewall

# Monitor mode: log contradictions, pass all responses through
firewall = Firewall(app=my_llm_app, mode="monitor")

result = firewall.check(user_query)
print(result.response)

if result.contradiction_detected:
    # log it, alert your team, route to human review
    print(f"Contradiction: {result.explanation}")
    print(f"Contradicts: {result.cached_query}")
```

```python
from contradish import Firewall

# Block mode: return a safe fallback when a contradiction is detected
firewall = Firewall(
    app=my_llm_app,
    mode="block",
    fallback_response="Let me get a team member to help with that.",
)

# handle() is a placeholder for your own request handler
def handle(user_query):
    result = firewall.check(user_query)
    return result.response  # safe regardless of what the app said
```

Get a traffic summary:

```python
print(firewall.summary())
# {
#   "total_queries": 1240,
#   "contradictions_detected": 18,
#   "responses_blocked": 0,
#   "contradiction_rate": 0.015
# }
```
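The summary is handy for alerting. A minimal sketch, assuming `firewall.summary()` returns a dict with the keys shown above; `contradiction_rate`, `should_alert`, and the 2% default threshold are hypothetical choices, not part of contradish.

```python
def contradiction_rate(summary: dict) -> float:
    """Recompute the rate from raw counts, guarding against zero traffic."""
    total = summary.get("total_queries", 0)
    if total == 0:
        return 0.0
    return summary.get("contradictions_detected", 0) / total


def should_alert(summary: dict, max_rate: float = 0.02) -> bool:
    """True when the observed contradiction rate exceeds max_rate."""
    return contradiction_rate(summary) > max_rate
```

With the sample numbers above (18 contradictions in 1240 queries) the rate is about 0.015, so the default threshold does not fire.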

---

## Prompt repair

Found failures? Contradish generates improved prompt variants, tests each one, and returns them ranked by CAI score.

```python
import anthropic
from contradish import Suite, PromptRepair

client = anthropic.Anthropic()

def make_app(system_prompt):
    def app(question):
        msg = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=256,
            system=system_prompt,
            messages=[{"role": "user", "content": question}],
        )
        return msg.content[0].text.strip()
    return app

original_prompt = "You are a support agent. Refunds within 30 days only."

# Step 1: find the failures
suite = Suite.from_prompt(
    system_prompt=original_prompt,
    app=make_app(original_prompt),
)
report = suite.run()

# Step 2: fix them
repair = PromptRepair(n=3)
results = repair.fix(
    system_prompt=original_prompt,
    report=report,
    app_factory=make_app,
)

best = results[0]
print(f"CAI: {best.original_cai_score:.2f} -> {best.improved_cai_score:.2f} (+{best.delta:.2f})")
print(best.improved_prompt)
```

Output:

```
  Prompt repair results:
  #1: CAI 0.54 -> 0.88 (+0.34)
  #2: CAI 0.54 -> 0.81 (+0.27)
  #3: CAI 0.54 -> 0.76 (+0.22)
```

---

## JSON output

Any command supports `--json` for machine-readable output:

```bash
contradish --prompt system_prompt.txt --json | jq '.cai_score'
```

```json
{
  "cai_score": 0.71,
  "total": 4,
  "passed": 3,
  "failed": 1,
  "results": [...]
}
```
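If `jq` is not available in your pipeline, the same check is a few lines of Python. A minimal sketch, assuming the JSON shape shown above; `passes_gate` is a hypothetical helper, not part of contradish.

```python
import json


def passes_gate(report_json: str, threshold: float = 0.80) -> bool:
    """Return True when the report's overall CAI score meets the threshold."""
    report = json.loads(report_json)
    return report["cai_score"] >= threshold
```

In CI, a small script could call `passes_gate(sys.stdin.read())` on the piped `--json` output and exit non-zero when it returns `False`.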

---

## Test case format

YAML (recommended):

```yaml
test_cases:
  - input: "Can I get a refund after 45 days?"
    name: "refund window"
  - input: "Do you match competitor prices?"
    name: "price matching"
    expected_traits:
      - "should say no"
      - "should not invent exceptions"
```

JSON also works:

```json
[
  {"input": "Can I get a refund after 45 days?", "name": "refund window"},
  {"input": "Do you match competitor prices?", "name": "price matching"}
]
```
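When test questions already live in code or a spreadsheet export, the JSON list form above can be generated instead of hand-written. A minimal sketch using only the standard library; `build_cases` and `dump_cases` are hypothetical helper names, not part of contradish.

```python
import json


def build_cases(questions: dict[str, str]) -> list[dict[str, str]]:
    """Turn a {name: input} mapping into the list-of-objects form shown above."""
    return [{"input": text, "name": name} for name, text in questions.items()]


def dump_cases(cases: list[dict[str, str]]) -> str:
    """Serialize test cases as pretty-printed JSON, ready to write to a file."""
    return json.dumps(cases, indent=2)
```

Write the result of `dump_cases(...)` to a file of your choosing and keep it under version control next to your prompts.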

---

## The CAI benchmark

Contradish ships with a 300-pair human-validated benchmark of adversarial question pairs across support, legal, finance, and policy domains. It is used to produce the [CAI leaderboard](https://contradish.com/leaderboard.html).

Current scores (higher = more consistent):
- Intercom Fin: 0.84
- ChatGPT (GPT-4o): 0.79

---

## License

MIT
