// reasoning stability testing for LLM apps

Your LLM gives different answers to the same question.

Contradish generates semantic variants of your inputs, runs them through your app, and detects when the reasoning breaks — before your users find out.

// get started

$ pip install contradish

your code

from contradish import Suite, TestCase

# your app — any str → str function
def my_app(question: str) -> str:
    return your_llm(question)

suite = Suite(app=my_app)

suite.add(TestCase(
    name="refund policy",
    input="Can I get a refund after 45 days?",
))

suite.add(TestCase(
    name="return window",
    input="How long to return an item?",
))

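# paraphrases each input, runs your app on every variant, and prints the report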
suite.run()

terminal
  [1/2] refund policy  running...
   generating 5 paraphrases
   calling your app 6× across variants
   evaluating consistency
   detecting contradictions
   diagnosing failure patterns
  [2/2] return window  running...
   ...

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  contradish  ·  reasoning stability report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

  Tests: 2   1 passed   1 failed

  Aggregate
    consistency   ██████████████░░░░░░  0.71
    contradiction █████░░░░░░░░░░░░░░░  0.24

    return window  [risk: low]
       consistency   ██████████████████░░  0.88

    refund policy  [risk: high]
       consistency   ███████████░░░░░░░░░  0.54
       contradiction ████████░░░░░░░░░░░░  0.40

       Contradictions detected (2)
       ┌ [policy] Model claims refunds ok after 60 days
       │ A: No, refunds only within 30 days.
       │ B: Yes, refunds up to 60 days after purchase.
       └ diagnosis
           Date-specific phrasings trigger policy hallucination
           Model overgeneralizes when duration stated explicitly

       → Fix: Constrain policy window in system prompt.
          Never state a window other than 30 days.

──────────────────────────────────────────────
  1 test failed.  Reasoning instability detected.
──────────────────────────────────────────────

// how it works

01

You define test cases

A test case is a question your app should answer consistently. Takes 30 seconds to write.

02

Contradish paraphrases them

5 semantically equivalent variants are generated per input. Same meaning, different wording — the way real users ask.
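
For a sense of what that looks like (illustrative examples, not actual Contradish output), the refund question above might come back as:

variants = [
    "Is it possible to get a refund 45 days after purchase?",
    "I bought something 45 days ago. Can I still get my money back?",
    "Would a refund request be honored after a month and a half?",
    "Can purchases be refunded once 45 days have passed?",
    "After 45 days, am I still eligible for a refund?",
]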

03

Your app runs on all variants

6 calls per test case by default: the original input plus 5 paraphrases. Your app doesn't need to change at all — just pass the function.
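
For instance, a thin wrapper around the OpenAI SDK is enough (the model name below is a placeholder, not anything Contradish requires):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_app(question: str) -> str:
    # plain str → str: Contradish only ever sees the final answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whatever your app uses
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content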

04

The judge finds instability

Contradictions are detected, consistency is scored, and failure patterns are diagnosed — not just flagged.
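
Conceptually, the core check is pairwise. A simplified sketch of the idea (not Contradish's actual implementation; the contradicts argument stands in for an LLM judge):

from itertools import combinations

def consistency_score(outputs: list[str], contradicts) -> float:
    # contradicts: any (str, str) -> bool oracle, e.g. an LLM judge prompt
    pairs = list(combinations(outputs, 2))  # 6 outputs -> 15 pairs
    if not pairs:
        return 1.0  # a single output cannot disagree with itself
    conflicting = sum(1 for a, b in pairs if contradicts(a, b))
    return 1 - conflicting / len(pairs)  # fraction of pairs that agree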

// what it checks

v0.1 · live

Consistency scoring

Measures whether outputs for semantically equivalent inputs agree with each other. 0–1 score with breakdown.

v0.1 · live

Contradiction detection

Finds pairs of outputs that make incompatible claims. Shows you exactly which outputs conflict and why.

coming soon

Grounding / hallucination

Checks whether answers are supported by retrieved context, or if the model is inventing facts.

coming soon

Regression testing

Compare baseline vs candidate. Use as a CI gate to prevent prompt or model changes from degrading reliability.
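
Once this ships, a CI gate could look roughly like the sketch below. Everything here is hypothetical: a run() return value with a failed count is an assumption, not current API.

import sys

report = suite.run()   # assumption: run() returns a report object
if report.failed > 0:  # hypothetical attribute
    sys.exit(1)        # fail the build on reasoning instability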

// who it's for

⚙️

LLM app engineers

Know whether your prompt change actually broke something before you ship it.

🔍

RAG pipeline teams

Detect when retrieval changes cause inconsistent or hallucinated answers.

🛡️

High-stakes AI apps

Legal, finance, healthcare — anywhere inconsistency creates liability. Consistency is the floor.

Stop shipping reasoning bugs.

Two minutes to first result. Works with Anthropic or OpenAI.

$ pip install contradish

Then set ANTHROPIC_API_KEY or OPENAI_API_KEY and you're running.
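
For example (the script name is just a placeholder for wherever your Suite lives):

$ export ANTHROPIC_API_KEY=...
$ python your_suite.py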