Prove your agents.
Before production. Before customers. Before it matters. inDuat is the pre-production reliability test suite for AI agents — bring your stack, walk the gauntlet, get a verdict.
How scoring works (technical detail)
Two pillars: every reliability task scores both process validity (how the agent thinks) and outcome validity (whether the answer is right). Process validity decomposes into detection (did it notice the error?), diagnosis (did it explain what's wrong?), and causal-chain coherence (do those steps hang together?). Outcome validity is recovery (did it land on the right answer?). Reports also include false-positive (FP) resistance, scored on the process side for tempting-but-correct cases.
Five dimensions surface in each report:
- Detection — process · did the agent flag the error?
- Diagnosis — process · did it explain the failure mode?
- Recovery — outcome · did it produce the right answer?
- Causal chain — process · do detection → diagnosis → recovery cohere?
- FP resistance — process · on tempting-but-correct cases, did it avoid crying wolf (raising a false positive)?
A weighted composite blends measured dimensions into a single headline. Dimensions that weren't tested show as "—" instead of being faked. The verdict — Production Ready, Not Production Ready, or Unsafe to Deploy — derives from measured dimensions only.
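One way to read the composite rule above is as a weighted mean taken over measured dimensions only, with the weights renormalized so that untested dimensions neither help nor hurt. The sketch below assumes that reading. The weight values, the `composite` and `render` helpers, and the 0-to-1 score scale are all hypothetical, since the text does not state inDuat's actual weights or scale.

```python
from typing import Dict, Optional

# Hypothetical weights for illustration only; the real weights are not given.
WEIGHTS: Dict[str, float] = {
    "detection": 0.25,
    "diagnosis": 0.20,
    "recovery": 0.30,
    "causal_chain": 0.15,
    "fp_resistance": 0.10,
}


def composite(scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean over measured dimensions; None marks 'not tested'."""
    measured = {dim: s for dim, s in scores.items() if s is not None}
    if not measured:
        return None  # nothing was measured, so there is no headline to report
    total_weight = sum(WEIGHTS[dim] for dim in measured)
    # Renormalize by the weight actually in play (an assumption of this sketch).
    return sum(WEIGHTS[dim] * s for dim, s in measured.items()) / total_weight


def render(value: Optional[float]) -> str:
    """Show an untested value as a dash instead of inventing a number."""
    return "—" if value is None else f"{value:.2f}"


run = {
    "detection": 1.0,
    "diagnosis": 0.5,
    "recovery": 1.0,
    "causal_chain": None,   # not tested in this run
    "fp_resistance": None,  # not tested in this run
}
for dim, score in run.items():
    print(dim, render(score))
print("composite", render(composite(run)))
```

Under these assumed weights, the two untested dimensions drop out of both the numerator and the denominator, so the headline reflects only detection, diagnosis, and recovery.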
Tasks: 117 built-in reliability tasks across 17 domains, plus 6 demo tasks where the gap between good and bad agents is obvious. You can also author your own tasks; they are scored with a transparent keyword rubric and persist in your browser.
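A keyword rubric for a user-authored task could look roughly like the sketch below: each dimension lists the phrases a good answer should contain, and the score is the fraction of those phrases found. The `keyword_score` function, the rubric layout, and the example task are hypothetical illustrations, not inDuat's actual rubric format.

```python
from typing import Dict, List


def keyword_score(response: str, keywords: List[str]) -> float:
    """Fraction of rubric keywords present in the response (case-insensitive)."""
    if not keywords:
        return 0.0
    text = response.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


# Hypothetical rubric for a custom task about a unit-conversion error.
rubric: Dict[str, List[str]] = {
    "detection": ["unit mismatch", "inconsistent"],
    "diagnosis": ["miles", "kilometers"],
    "recovery": ["160.9"],
}

response = (
    "The figures are inconsistent: there is a unit mismatch between miles "
    "and kilometers. Corrected distance: 160.9 km."
)

scores = {dim: keyword_score(response, kws) for dim, kws in rubric.items()}
print(scores)
```

Plain substring matching keeps the rubric transparent, since anyone can see why a response scored as it did. The cost is that paraphrases which avoid the listed keywords score zero.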
Bring your API keys.
Keys stay in your browser, are sent once per run, and are never logged or stored on our side. Skip any provider you don't need. Without keys, runs execute in stub mode and return synthetic scores.
Pick your stack.
What does your agent look like? The model is the brain; the context layer is what it knows; search is what it can fetch. Your selections are saved.
Choose what to test.
Browse domains. Click any domain to drill in and pick specific tasks. You can select tasks across multiple domains.
Ready to walk the gauntlet?
Review what's about to run. Hit the button when you're ready.
Reports.
Every run is saved here automatically. Click any report to view it again. The list is capped at the 20 most recent reports.
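The cap on saved reports behaves like a fixed-size history that drops the oldest entry when a new one arrives. Here is a minimal sketch of that behavior, assuming simple oldest-first eviction. The `save_report` helper and the report shape are hypothetical, and the real app stores reports in the browser rather than in Python.

```python
from collections import deque
from typing import Deque, Dict

MAX_REPORTS = 20  # the cap stated in the text

# A deque with maxlen discards the oldest item once the cap is reached.
reports: Deque[Dict[str, int]] = deque(maxlen=MAX_REPORTS)


def save_report(report: Dict[str, int]) -> None:
    """Append a report; the oldest one falls off when the history is full."""
    reports.append(report)


# Simulate 25 runs: only the 20 most recent remain.
for run_id in range(25):
    save_report({"run_id": run_id})

print(len(reports), reports[0]["run_id"], reports[-1]["run_id"])
```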