Introduction
AgentIQ evaluates every interaction at every step, in real time — not a sample, not after the fact — and tells you exactly what broke, where, and why.
The problem
Enterprise AI agents run complex multi-step workflows. Unlike traditional software, agents think and act differently in every interaction — standard quality checks aren't built for that. Eval and observability platforms exist, but they use sampled data and after-the-fact analysis. They miss failures mid-workflow, including one-off failures that could be catastrophic.
Documented incident · November 2025
Four agents in production. Two entered an infinite loop. 11 days. $47,000 in API costs. Nobody noticed. The team had monitoring — the loop never appeared in the sample.
What AgentIQ does
- Evaluates every interaction at every step, the moment it happens
- Gives every interaction a quality score calibrated to your industry
- Surfaces one-off catastrophic failures immediately in the Failure Feed
- Groups repeating failures into named patterns with plain English root cause
Quickstart
From pip install to your first scored interaction in under two minutes.
1. Install
2. Initialize
agent.py
import agentiq
agentiq.init(
    api_key="sk-aiq-...",
    industry="retail",  # healthcare | retail | coding | marketing
)
3. Watch your agent
agent.py
@agentiq.watch
def support_agent(user_input):
    ...  # nothing else changes
4. Run and open the dashboard
$ python agent.py
✓ AgentIQ connected
Evaluating every step in real time
Dashboard: https://agentiq.dev/a/abc123
Interaction 1 · Score 91/100 · PASSED
Interaction 2 · Score 34/100 · FAILED → tool_failure at step 3
Manual trace, for agents you can't decorate
If your agent can't be wrapped with a decorator, call agentiq.trace() directly at each step. See the trace() reference.
What happens if AgentIQ is down
The trace call never blocks and never raises. If AgentIQ's API is unreachable, your agent keeps running. Failed sends are logged locally and retried automatically.
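The non-blocking behavior described above can be pictured with a small sketch. This is an illustration of the general technique (a background queue plus a local error log), not AgentIQ's actual implementation. The class name `NonBlockingSender`, the `send` callable, and the log format are all assumptions made for the example.

```python
import json
import queue
import threading
from pathlib import Path
from typing import Callable


class NonBlockingSender:
    """Illustrative only: hand payloads to a background thread so the caller
    never waits on the network and never sees a send error."""

    def __init__(self, send: Callable[[dict], None], error_log: Path,
                 max_pending: int = 1000) -> None:
        self._send = send              # hypothetical transport; may raise
        self._error_log = error_log    # local file for failed sends
        self._pending: queue.Queue = queue.Queue(maxsize=max_pending)
        threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, payload: dict) -> None:
        """Hot path: enqueue and return immediately."""
        try:
            self._pending.put_nowait(payload)
        except queue.Full:
            pass  # drop the payload rather than stall the agent

    def _worker(self) -> None:
        while True:
            payload = self._pending.get()
            try:
                self._send(payload)
            except Exception as exc:
                # Record the failed send locally so it can be retried later.
                with self._error_log.open("a", encoding="utf-8") as log:
                    log.write(json.dumps({"error": str(exc), "payload": payload}) + "\n")
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        """Wait until every queued payload has been sent or logged."""
        self._pending.join()

    def retry_failed(self) -> int:
        """Resend logged payloads and keep the ones that still fail.
        Returns the number of payloads resent successfully."""
        if not self._error_log.exists():
            return 0
        lines = self._error_log.read_text(encoding="utf-8").splitlines()
        still_failing, resent = [], 0
        for line in lines:
            entry = json.loads(line)
            try:
                self._send(entry["payload"])
                resent += 1
            except Exception as exc:
                entry["error"] = str(exc)
                still_failing.append(entry)
        self._error_log.write_text(
            "".join(json.dumps(e) + "\n" for e in still_failing), encoding="utf-8")
        return resent
```

The agent only ever calls `submit()`, which returns immediately whether or not the network is up. Failures surface in the log file and are picked up again by `retry_failed()`.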
Quality scoring
Every interaction gets a quality score from 0 to 100, computed from four dimensions evaluated by an LLM judge on every step.
The four dimensions
accuracy
0.0–1.0
Was the response factually correct given available context and tool outputs?
goal_alignment
0.0–1.0
Did the agent serve what the user actually needed — not just the literal request?
decision_quality
0.0–1.0
Was the reasoning sound and tool selection appropriate and well-sequenced?
completeness
0.0–1.0
Was the request fully resolved, or did the agent stop short?
Why goal_alignment is separate from accuracy
A response can be factually accurate and still fail the user. If someone says "cancel my subscription" but actually wants to pause it, completing the cancellation is accurate — but goal-misaligned. This is what makes agent evaluation different from simple QA evaluation.
Pass threshold
An interaction passes when overall_score ≥ 0.7 (70 on the 0–100 scale). The threshold is configurable per agent from the dashboard or with pass_threshold in agentiq.configure().
Industries
A healthcare agent and a marketing agent shouldn't be graded the same way. AgentIQ ships with quality weightings tuned per industry, applied automatically when you set industry in init().
| Industry | Accuracy | Goal alignment | Decision quality | Completeness |
| --- | --- | --- | --- | --- |
| Healthcare | 0.50 | 0.30 | 0.10 | 0.10 |
| Coding | 0.50 | 0.20 | 0.20 | 0.10 |
| Retail / Customer service | 0.25 | 0.40 | 0.10 | 0.25 |
| Marketing | 0.20 | 0.45 | 0.20 | 0.15 |
| Default (any) | 0.35 | 0.35 | 0.15 | 0.15 |
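To see how much the weights matter, the sketch below applies two rows of the table to the same four dimension scores. It assumes the overall score is a plain weighted average, as the data model describes, and that the pass threshold is the default 0.7. The function names and the sample scores are made up for illustration.

```python
# Weights copied from the industry table (each row sums to 1.0).
INDUSTRY_WEIGHTS = {
    "healthcare": {"accuracy": 0.50, "goal_alignment": 0.30,
                   "decision_quality": 0.10, "completeness": 0.10},
    "coding":     {"accuracy": 0.50, "goal_alignment": 0.20,
                   "decision_quality": 0.20, "completeness": 0.10},
    "retail":     {"accuracy": 0.25, "goal_alignment": 0.40,
                   "decision_quality": 0.10, "completeness": 0.25},
    "marketing":  {"accuracy": 0.20, "goal_alignment": 0.45,
                   "decision_quality": 0.20, "completeness": 0.15},
    "default":    {"accuracy": 0.35, "goal_alignment": 0.35,
                   "decision_quality": 0.15, "completeness": 0.15},
}


def overall_score(scores: dict[str, float], industry: str = "default") -> float:
    """Weighted average of the four dimension scores, in the range 0.0 to 1.0."""
    weights = INDUSTRY_WEIGHTS[industry]
    return sum(weights[dim] * scores[dim] for dim in weights)


def passes(score: float, pass_threshold: float = 0.7) -> bool:
    """An interaction passes when its score reaches the threshold."""
    return score >= pass_threshold


# Hypothetical judge output: accurate, but weak on goal alignment.
scores = {"accuracy": 0.9, "goal_alignment": 0.5,
          "decision_quality": 0.8, "completeness": 0.7}

for industry in ("healthcare", "retail"):
    score = overall_score(scores, industry)
    print(f"{industry}: {score:.2f} -> {'PASSED' if passes(score) else 'FAILED'}")
# healthcare: 0.75 -> PASSED
# retail: 0.68 -> FAILED
```

The same interaction passes under the healthcare weights and fails under the retail weights, because retail puts more weight on goal alignment than on accuracy.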
Overriding defaults
agentiq.configure(
weight_accuracy=0.40,
weight_goal_alignment=0.30,
weight_decision_quality=0.15,
weight_completeness=0.15,
pass_threshold=0.75,
)
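Every shipped weighting sums to 1.0, so it is probably worth checking custom weights before applying them. The helper below is a hypothetical pre-flight check, not part of the AgentIQ SDK. It assumes that custom weights should also sum to 1.0, which the documentation does not state explicitly.

```python
import math

DIMENSIONS = ("accuracy", "goal_alignment", "decision_quality", "completeness")


def checked_config(pass_threshold: float = 0.7, **weights: float) -> dict[str, float]:
    """Validate weights and return keyword arguments for configure().

    Raises ValueError if a weight is missing, out of range, or if the
    weights do not sum to 1.0 (an assumption, see the note above).
    """
    expected = {f"weight_{dim}" for dim in DIMENSIONS}
    if set(weights) != expected:
        raise ValueError(f"expected exactly these weights: {sorted(expected)}")
    if any(not 0.0 <= value <= 1.0 for value in weights.values()):
        raise ValueError("each weight must be between 0.0 and 1.0")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total:.2f}, expected 1.00")
    if not 0.0 <= pass_threshold <= 1.0:
        raise ValueError("pass_threshold must be between 0.0 and 1.0")
    return {**weights, "pass_threshold": pass_threshold}


config = checked_config(
    weight_accuracy=0.40,
    weight_goal_alignment=0.30,
    weight_decision_quality=0.15,
    weight_completeness=0.15,
    pass_threshold=0.75,
)
# The validated values could then be applied with agentiq.configure(**config).
```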
Failure taxonomy
Every failed interaction is classified into one of seven failure types. This taxonomy is published research — the same definitions used across our research papers.
wrong_answer
The response is factually incorrect given available context and tool outputs.
tool_failure
A tool call failed, timed out, or returned an unhandled error.
goal_drift
The task completed, but the user's actual underlying goal remains unresolved.
incomplete
The agent stopped before fully resolving the request.
hallucination
The agent asserts specific facts not grounded in context or tool outputs.
context_loss
The agent contradicts or ignores a user statement from 3+ steps prior.
loop
The agent repeats the same action or response without progressing.
Designed for detection, not just description
Each type is defined by what's observable in a trace — so an autonomous LLM judge can classify it reliably at production scale, not just a human reviewer working from a transcript.
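As an example of a definition that is observable in a trace, the loop type can be approximated with a few lines of code. This is a simple heuristic for illustration, not the LLM judge AgentIQ uses. The function name and the choice of three repeats are assumptions.

```python
from typing import Sequence


def looks_like_loop(steps: Sequence[tuple[str, str]], min_repeats: int = 3) -> bool:
    """Return True if the last `min_repeats` (step_name, output) pairs are identical.

    `steps` is the trace so far, oldest first. Checking only the tail keeps
    the test cheap enough to run after every new step.
    """
    if min_repeats < 2 or len(steps) < min_repeats:
        return False
    tail = steps[-min_repeats:]
    return all(step == tail[0] for step in tail)


trace = [
    ("retrieve_policy", "Policy 12 found"),
    ("call_refund_api", "Error: timeout"),
    ("call_refund_api", "Error: timeout"),
    ("call_refund_api", "Error: timeout"),
]
print(looks_like_loop(trace))  # True
```

An exact-match check like this only catches the crudest loops. An agent that rephrases the same failing action on each pass would need a fuzzier comparison.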
The dashboard
Three views. Each answers one question: is my agent healthy, what's failing right now, and exactly what happened in this one interaction.
1. Agent Overview
The landing view. Quality score trend over 7 days, today's pass rate, active failure count, and top 3 failure types ranked by frequency. If no failures in 24h: "No failures detected. Your agent is healthy."
2. Failure Feed
Every failure the moment it happens — not on a batch schedule. One-off failures shown individually, flagged in red. Repeating patterns grouped automatically once 5+ failures share the same type and step, with a one-sentence root cause in the pattern card.
3. Interaction Detail
Full step-by-step trace for one interaction. Per-step quality score, green or red. The failing step highlighted, not buried. Failure reason in one plain English sentence — not a JSON dump, not a stack trace.
The goal
A developer must be able to read the Interaction Detail view and know exactly what to fix without opening any other tool.
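The Failure Feed grouping rule (5+ failures sharing the same type and step) can be sketched as a simple bucketing pass. The field names `failure_type` and `failure_step` follow the EvalResult table. The function name and the sample data are hypothetical, and the real grouping logic may differ.

```python
from collections import defaultdict

PATTERN_THRESHOLD = 5  # failures with the same type and step needed to form a pattern


def split_feed(failures: list[dict]) -> tuple[dict, list[dict]]:
    """Split failures into grouped patterns and individually shown one-offs."""
    buckets: dict[tuple[str, int], list[dict]] = defaultdict(list)
    for failure in failures:
        buckets[(failure["failure_type"], failure["failure_step"])].append(failure)

    patterns = {key: group for key, group in buckets.items()
                if len(group) >= PATTERN_THRESHOLD}
    one_offs = [failure for group in buckets.values()
                if len(group) < PATTERN_THRESHOLD for failure in group]
    return patterns, one_offs


# Hypothetical feed: six tool failures at step 3 and one stray hallucination.
feed = [{"failure_type": "tool_failure", "failure_step": 3} for _ in range(6)]
feed.append({"failure_type": "hallucination", "failure_step": 2})

patterns, one_offs = split_feed(feed)
print(list(patterns))  # [('tool_failure', 3)]
print(len(one_offs))   # 1
```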
API reference
agentiq.init()
Authenticates AgentIQ and sets defaults for the current process. Call once at startup.
agentiq.init(
    api_key: str,
    industry: str = "default",
    agent_id: str | None = None,
) -> None
api_key
str
Required. Your AgentIQ API key.
industry
str
One of healthcare, retail, coding, marketing, or default. Sets scoring weights.
agent_id
str | None
Optional. Auto-generated if omitted. Groups interactions per agent.
Non-blocking, always
init() never raises. If AgentIQ is unreachable at startup, your agent starts normally.
@agentiq.watch
A decorator that instruments any agent function. Internally calls agentiq.trace() after every step the wrapped function emits.
@agentiq.watch
def your_agent(user_input: str) -> str:
    ...
Works with any Python callable — LangChain chains, CrewAI agents, AutoGen agents, or a plain function. No framework-specific integration required.
Non-blocking, always
@watch never raises and never blocks. If AgentIQ is unreachable, the step is logged locally to ~/.agentiq/errors.log and retried on the next successful connection.
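The "never raises" guarantee comes down to keeping instrumentation errors away from the agent. The decorator below is a minimal sketch of that idea, not the real @agentiq.watch. The names `watch_sketch` and `report` are hypothetical, and it records one step per call rather than one per emitted step.

```python
import functools
from typing import Any, Callable


def watch_sketch(report: Callable[[dict], None]) -> Callable:
    """Build a decorator that reports each call to `report` without letting
    a reporting failure reach the caller."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = fn(*args, **kwargs)  # errors from the agent itself propagate unchanged
            try:
                report({
                    "step_name": fn.__name__,
                    "input": repr(args),
                    "output": repr(result),
                })
            except Exception:
                pass  # instrumentation errors are swallowed, never raised
            return result
        return wrapper
    return decorator


def unreachable_backend(step: dict) -> None:
    """Stand-in for a reporting call that fails because the API is down."""
    raise ConnectionError("AgentIQ unreachable")


@watch_sketch(unreachable_backend)
def support_agent(user_input: str) -> str:
    return f"Handled: {user_input}"


print(support_agent("where is my order?"))  # Handled: where is my order?
```

Only the reporting call sits inside the `try` block. Exceptions raised by the agent are deliberately left alone so that instrumentation never hides a real bug.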
agentiq.trace()
Manually log one step of an interaction. Use when your agent can't be wrapped with @watch.
agentiq.trace(
    session_id: str,
    step_number: int,
    step_name: str,
    input: str,
    output: str,
    tool_calls: list | None = None,
    metadata: dict | None = None,
    session_ended: bool = False,
) -> None
session_id
str
Groups steps of one workflow run together.
step_number
int
Position in the workflow — 1, 2, 3…
step_name
str
Human-readable label e.g. "retrieve_policy".
input
str
What was sent to the agent at this step.
output
str
What the agent responded.
tool_calls
list | None
Optional. List of {name, input, output, success, latency_ms}.
session_ended
bool
Set True on the final step to trigger SessionSummary.
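A two-step workflow traced by hand might look like the sketch below. To keep the example runnable without the SDK, the trace function is passed in as a parameter. In a real agent you would presumably pass agentiq.trace. The workflow, the `lookup_policy` tool, and the step names are all invented for illustration. Only the keyword arguments follow the signature documented above.

```python
import time
import uuid
from typing import Callable


def lookup_policy(query: str) -> str:
    """Hypothetical tool; stands in for a real retrieval call."""
    return f"Policy text matching: {query}"


def run_refund_workflow(user_input: str, trace: Callable[..., None]) -> str:
    session_id = str(uuid.uuid4())  # one ID groups every step of this run

    # Step 1: a tool-backed step, timed so latency_ms can be reported.
    started = time.perf_counter()
    policy = lookup_policy(user_input)
    latency_ms = int((time.perf_counter() - started) * 1000)
    trace(
        session_id=session_id,
        step_number=1,
        step_name="retrieve_policy",
        input=user_input,
        output=policy,
        tool_calls=[{
            "name": "lookup_policy",
            "input": user_input,
            "output": policy,
            "success": True,
            "latency_ms": latency_ms,
        }],
    )

    # Step 2: the final step sets session_ended=True.
    answer = f"Based on our policy: {policy}"
    trace(
        session_id=session_id,
        step_number=2,
        step_name="draft_reply",
        input=policy,
        output=answer,
        session_ended=True,
    )
    return answer


# Demo with a recording stand-in for the trace function.
recorded: list[dict] = []
run_refund_workflow("refund for order 1182", trace=lambda **step: recorded.append(step))
print([step["step_name"] for step in recorded])  # ['retrieve_policy', 'draft_reply']
```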
Data model
Four tables. Everything AgentIQ stores, nothing more.
AgentLog
Written by the SDK on every trace call. The raw record of one step.
id UUID primary key
agent_id str
session_id str
step_number int
step_name str
input str
output str
tool_calls jsonb
latency_ms int
session_ended bool
timestamp timestamptz
metadata jsonb
EvalResult
Written by the judge. One row per step evaluated.
id UUID primary key
log_id UUID -- FK to AgentLog
session_id str
agent_id str
accuracy float
goal_alignment float
decision_quality float
completeness float
overall_score float -- weighted average, 0.0–1.0
passed bool -- overall_score >= pass_threshold (default 0.7)
failure_type str -- null if passed
failure_step int
failure_reason str -- one plain English sentence
evaluated_at timestamptz
LossPattern
Written hourly. Aggregated, named failure patterns.
id UUID primary key
agent_id str
pattern_type str -- intent | workflow_step | tool_call
pattern_value str
failure_count int
failure_rate float
pct_of_all_failures float
root_cause str
is_worsening bool
first_seen / last_seen timestamptz
AgentQualityConfig
Your per-agent quality configuration.
agent_id str primary key
industry str
weight_accuracy float
weight_goal_alignment float
weight_decision float
weight_completeness float
pass_threshold float -- default 0.7