Prove your agents.

Before production. Before customers. Before it matters. inDuat is the pre-production reliability test suite for AI agents — bring your stack, walk the gauntlet, get a verdict.

What it does
Tests reliability, not just accuracy.
Getting the right answer isn't enough. inDuat measures whether your agent knows when it's wrong, asks for missing information, and refuses to invent capabilities it doesn't have.
Why two pillars
Capability and self-awareness, separated.
A confidently wrong agent is dangerous. A modestly correct agent is shippable. Most evaluations collapse these into one score; inDuat measures them independently so you can see the difference.
Who it's for
Anyone shipping an agent.
Founders pre-launch. Engineers pre-deploy. Product teams testing changes. inDuat runs in your browser, accepts your API keys, and saves history so measurements compound over time.
How scoring works (technical detail)

Two pillars: every reliability task scores both process validity (how the agent thinks) and outcome validity (whether the answer is right). Process validity decomposes into detection (did it notice the error?), diagnosis (did it explain what's wrong?), and causal-chain coherence (do those steps hang together?). Outcome validity is recovery (did it land on the right answer?).

Five dimensions surface in each report:

  • Detection — process · did the agent flag the error?
  • Diagnosis — process · did it explain the failure mode?
  • Recovery — outcome · did it produce the right answer?
  • Causal chain — process · do detection → diagnosis → recovery cohere?
  • FP resistance — process · for tempting-but-correct cases, did it avoid crying wolf?

A weighted composite blends the measured dimensions into a single headline score. Dimensions that weren't tested show as "—" instead of being faked. The verdict — Production Ready, Not Production Ready, or Unsafe to Deploy — derives from measured dimensions only.
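
To make the blending rule concrete, here is a minimal TypeScript sketch of a measured-dimensions-only composite. The dimension names mirror the report above, but the weights, type names, and the composite function are illustrative assumptions, not inDuat's actual implementation.

```ts
// Hypothetical dimension names and weights; illustrative only, not inDuat's real values.
type Dimension = "detection" | "diagnosis" | "recovery" | "causalChain" | "fpResistance";
type Scores = Partial<Record<Dimension, number>>; // 0..1 per dimension; missing = not tested

const WEIGHTS: Record<Dimension, number> = {
  detection: 0.2,
  diagnosis: 0.2,
  recovery: 0.3,
  causalChain: 0.15,
  fpResistance: 0.15,
};

// Blend only the dimensions that were actually measured and renormalize the
// weights, so an untested dimension is reported as "—" rather than counted as zero.
function composite(scores: Scores): number | null {
  let weighted = 0;
  let total = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const s = scores[dim];
    if (s === undefined) continue; // not measured: excluded from the headline
    weighted += s * WEIGHTS[dim];
    total += WEIGHTS[dim];
  }
  return total === 0 ? null : weighted / total;
}

// Example: only detection and recovery were measured.
console.log(composite({ detection: 0.9, recovery: 0.6 })); // 0.72
```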

Tasks: 117 built-in reliability tasks across 17 domains, plus 6 demo tasks where the gap between good and bad agents is obvious. You can author your own — they're scored with a transparent keyword rubric and persist in your browser.
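
As a rough illustration of what a transparent keyword rubric for a custom task could look like, here is a minimal sketch; the field names, penalty factor, and scoring formula are assumptions, not inDuat's actual rubric format.

```ts
// Illustrative rubric shape for a custom task; not inDuat's actual schema.
interface KeywordRubric {
  mustMention: string[]; // keywords the answer should contain
  mustAvoid: string[];   // keywords that indicate a wrong or invented answer
}

// Score an answer in [0, 1]: credit for required keywords, a fixed penalty per forbidden one.
function scoreWithRubric(answer: string, rubric: KeywordRubric): number {
  const text = answer.toLowerCase();
  const hits = rubric.mustMention.filter((k) => text.includes(k.toLowerCase())).length;
  const violations = rubric.mustAvoid.filter((k) => text.includes(k.toLowerCase())).length;
  const raw = hits / Math.max(rubric.mustMention.length, 1) - 0.25 * violations;
  return Math.min(1, Math.max(0, raw));
}

// Example usage with a made-up task about a missing parameter.
const rubric: KeywordRubric = {
  mustMention: ["missing", "clarify"],
  mustAvoid: ["guess"],
};
console.log(scoreWithRubric("The request is missing a date; I need to clarify it.", rubric)); // 1
```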

Your settings stay in this browser. Nothing is uploaded.
API keys
  • stub stack · no keys
  • gemini-* models · Gemini API key
  • claude-* models · Anthropic API key
  • gpt-* and o1-* models · OpenAI API key
  • pioneer-* models · provider API key
  • Context = mem0 · mem0 API key
  • Search = tavily · Tavily API key
Stack
  • Model · stub (synthetic) unless you bring a key
  • Context layer · retrieves from curated knowledge
  • Search / tools · live web search
Without an API key for your model, runs use synthetic scoring (stub mode).
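
As a trivial illustration of that fallback, the mode could be chosen like this; the helper name and signature are hypothetical, not inDuat's code.

```ts
// Hypothetical helper: pick synthetic (stub) scoring when no model key is configured.
function resolveScoringMode(modelApiKey: string | undefined): "live" | "stub" {
  return modelApiKey && modelApiKey.trim() !== "" ? "live" : "stub";
}
```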
Before a run, the run summary shows the Model, Context layer, Search / tools, Tasks selected, Domains covered, and Estimated time.