// reasoning stability testing for LLM apps
Contradish generates semantic variants of your inputs, runs them through your app, and detects when the reasoning breaks — before your users find out.
// what it looks like
your code
from contradish import Suite, TestCase

# your app — any str → str function
def my_app(question: str) -> str:
    return your_llm(question)

suite = Suite(app=my_app)

suite.add(TestCase(
    name="refund policy",
    input="Can I get a refund after 45 days?",
))

suite.add(TestCase(
    name="return window",
    input="How long to return an item?",
))

suite.run()
[1/2] refund policy  running...
  → generating 5 paraphrases
  → calling your app 6× across variants
  → evaluating consistency
  → detecting contradictions
  → diagnosing failure patterns
[2/2] return window  running...
  → ...

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
contradish · reasoning stability report
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Tests: 2      1 passed    1 failed
Aggregate     consistency 0.71    contradiction 0.24

✓ return window   [risk: low]    consistency 0.88
✗ refund policy   [risk: high]   consistency 0.54   contradiction 0.40

Contradictions detected (2)
┌ [policy] Model claims refunds ok after 60 days
│   A: No, refunds only within 30 days.
│   B: Yes, refunds up to 60 days after purchase.
└ ⚠ Date-specific phrasings trigger policy hallucination
  ⚠ Model overgeneralizes when duration stated explicitly

→ Fix: Constrain policy window in system prompt. Never state a different
  number than 30 days.

──────────────────────────────────────────────
1 test failed. Reasoning instability detected.
──────────────────────────────────────────────
// how it works
A test case is a question your app should answer consistently. Takes 30 seconds to write.
5 semantically equivalent variants are generated per input. Same meaning, different wording — the way real users ask.
6 calls per test case by default. Your app doesn't need to change at all — just pass the function.
Contradictions are detected, consistency is scored, and failure patterns are diagnosed — not just flagged.
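Because the app under test is just a str → str callable, wrapping an existing LLM call usually takes a few lines. A minimal sketch using the OpenAI Python SDK — the model name and system prompt are illustrative placeholders, not part of Contradish, and an Anthropic client works the same way:

# A str → str wrapper around an existing LLM call, for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def my_app(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer using the store's refund policy."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content or ""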
// what it checks
Measures whether outputs for semantically equivalent inputs agree with each other. 0–1 score with breakdown.
Finds pairs of outputs that make incompatible claims. Shows you exactly which outputs conflict and why.
Checks whether answers are supported by retrieved context, or if the model is inventing facts.
Compare baseline vs candidate. Use as a CI gate to prevent prompt or model changes from degrading reliability.
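For intuition, an aggregate consistency score of this kind can be read as mean pairwise agreement across the outputs for one test case. A conceptual sketch — not Contradish's actual scoring code; outputs_agree stands in for whatever semantic-equivalence judge is used:

# Conceptual sketch of a 0–1 pairwise consistency score (illustrative only).
from itertools import combinations
from typing import Callable

def consistency_score(
    outputs: list[str],
    outputs_agree: Callable[[str, str], bool],
) -> float:
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output is trivially consistent with itself
    agreeing = sum(1 for a, b in pairs if outputs_agree(a, b))
    return agreeing / len(pairs)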
// who it's for
Know whether your prompt change actually broke something before you ship it.
Detect when retrieval changes cause inconsistent or hallucinated answers.
Legal, finance, healthcare — anywhere inconsistency creates liability. Consistency is the floor.
// get started
Two minutes to first result. Works with Anthropic or OpenAI.
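Installation is likely a one-liner, assuming the package is published under its import name:

pip install contradish   # assumes the PyPI name matches the import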
Then set ANTHROPIC_API_KEY or OPENAI_API_KEY and you're running.