How to Use Argus
A step-by-step walkthrough of every section in the Argus dashboard — from browsing your run history to reading failure details, comparing executions, and rerunning from a broken node.
Quick-start — paste this into your LLM (Claude Code, Cursor, etc.) to add ARGUS to your pipeline:
Add ARGUS monitoring to my LangGraph pipeline. ARGUS watches every node and catches silent failures, missing fields, and semantic issues.
Install: pip install argus-agents
Then add 2 lines to the file where the graph is built:
from argus import ArgusWatcher
watcher = ArgusWatcher(graph) # pass your StateGraph directly
app = graph.compile()
result = app.invoke(initial_state) # run auto-saves when done
print(watcher.run_id) # access the run ID directly
If the graph is already compiled, use watch_compiled() instead:
watcher = ArgusWatcher()
app = watcher.watch_compiled(app)
result = app.invoke(initial_state)
That's it. Run the pipeline, then use "argus show last" to see what ARGUS caught.
All parameters are keyword-only (except graph):
watcher = ArgusWatcher(
    graph,
    record_http=True,        # save API calls to disk for deterministic reruns
    semantic_judge=True,     # LLM judge on every node (needs OPENAI_API_KEY)
    judge_model="gpt-4o",    # model for the judge. default: gpt-4o
    strict=True,             # extra checks for CI/staging
    max_field_size=50000,    # max chars per field before truncation
    investigate=True,        # LLM root-cause analysis (True/False/"always")
    redact_keys={"token"},   # field names to redact from stored outputs
    persist_state=True,      # save run records to .argus/runs/
    validators={             # per-node semantic validators
        "summarize": lambda o: (len(o.get("summary", "")) > 10, "Summary too short"),
        "*": lambda o: ("error" not in o, "error key present"),
    },
)
Runs auto-save for linear and fan-out/fan-in graphs.
Only cyclic graphs need watcher.finalize().
1. Runs List
The home page is your pipeline execution history. Every time your pipeline runs with Argus attached, an entry appears here automatically.

The runs list — aggregate stats at the top, evaluation panel, and the full run table below.
Summary cards
Run table columns
Evaluation panel
The Evaluation section lets you filter runs by criteria — set a goal description and add constraints like overall_status == clean to find runs that meet specific conditions. Hit Evaluate to filter the table.
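As a rough illustration of what a constraint such as overall_status == clean does to the table, the sketch below filters a list of run records in plain Python. The record fields other than overall_status (run_id, duration_s) and the filter_runs helper are hypothetical stand-ins; this is not the dashboard's implementation.

```python
import operator

# Hypothetical run records; field names are assumptions for illustration.
RUNS = [
    {"run_id": "run-a", "overall_status": "clean", "duration_s": 4.2},
    {"run_id": "run-b", "overall_status": "failed", "duration_s": 9.8},
    {"run_id": "run-c", "overall_status": "clean", "duration_s": 12.5},
]

# Comparison operators a constraint may use.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt, ">": operator.gt}

def filter_runs(runs, constraints):
    """Keep runs that satisfy every (field, op, value) constraint."""
    return [
        run for run in runs
        if all(f in run and OPS[op](run[f], value) for f, op, value in constraints)
    ]

clean_runs = filter_runs(RUNS, [("overall_status", "==", "clean")])
fast_clean = filter_runs(
    RUNS, [("overall_status", "==", "clean"), ("duration_s", "<", 10)]
)
```

Constraints combine with AND in this sketch, so each added constraint can only narrow the result.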
Click any run ID to open its full detail page.
2. Run Detail
The run detail page gives you a complete picture of what happened during a single pipeline execution — metrics, the execution trace, AI analysis, and the initial state.

Top of the run detail page: run ID, status, root cause chain, metrics grid, and the execution timeline.
Header
Shows the run ID, overall status badge, timestamp, total duration, step count, and Argus version. The Compare button lets you immediately diff this run against another.
Root cause chain
When a failure propagates downstream, Argus traces back to find the originating node. The red banner shows the chain — e.g. extract_skills → generate_summary — so you know exactly which node to fix, not which node complained.
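The trace-back idea can be sketched in a few lines. The example below assumes each timeline entry records which upstream node degraded it (a hypothetical caused_by field), then walks upstream until it reaches a node whose failure has no upstream cause. The record layout, the parse_resume node, and the root_cause_chain helper are illustrative assumptions, not Argus internals.

```python
# Hypothetical timeline records: "caused_by" names the upstream node whose
# missing output degraded this node (None means the failure started here).
TIMELINE = {
    "parse_resume": {"status": "clean", "caused_by": None},
    "extract_skills": {"status": "failed", "caused_by": None},
    "generate_summary": {"status": "degraded_input", "caused_by": "extract_skills"},
}

def root_cause_chain(timeline, complaining_node):
    """Walk upstream from the node that complained to the node that broke first."""
    chain = [complaining_node]
    seen = {complaining_node}
    while (upstream := timeline[chain[-1]]["caused_by"]) is not None:
        if upstream in seen:  # guard against malformed, circular records
            break
        chain.append(upstream)
        seen.add(upstream)
    return list(reversed(chain))  # origin first, complaining node last

chain = root_cause_chain(TIMELINE, "generate_summary")
print(" → ".join(chain))  # prints: extract_skills → generate_summary
```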
Metrics
Execution timeline
Each node is listed in order with its name, output type tag, duration, and status. Nodes with failures show an indented root cause annotation — the specific field that was missing and which upstream node failed to produce it. Expand any row with the arrow to see the full input/output JSON.

Lower execution timeline showing degraded_input propagation, followed by the AI Analysis panel.
AI Analysis
When OPENAI_API_KEY is set, Argus automatically investigates non-clean runs. The panel breaks down the failure into three parts:
Root Cause Node
The specific node Argus identified as the origin of the failure — not the node that complained, but the one that first produced the broken state.
Reason
A concise explanation of why that node failed and how the bad state propagated through downstream nodes.
How to Fix It
Numbered action items — each targeting a specific node — telling you exactly what to change to prevent the failure from recurring.
A confidence score is shown in the top-right of the panel. The footer shows how many causal hypotheses were evaluated and how many observations were used.

AI fix steps, the Correlation panel (origin node + confidence), and the Behavior/Initial State sections.
Correlation
Argus runs a correlation analysis to confirm which node is the true origin of the degradation. Shows the origin node name, step index, failure signals (e.g. missing_field), and a confidence score.
Behavior & Initial State
The Behavior section shows the raw initial state your pipeline received — the exact input dict at invocation time. Useful for reproducing the failure locally.
3. Compare
Compare two runs side-by-side to see exactly what changed — useful for verifying a fix worked, catching regressions, or understanding why one run is faster than another.

Compare page: winner verdict at the top, aggregate stats table, then a node-by-node status comparison.
How to compare
Open Compare
Enter two run IDs
Read the verdict
Read the node diff
The aggregate table shows Failures, Duration, and Success Rate side-by-side with a winner indicator (B ✓) for each metric.
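To make the per-metric winner concrete, here is a small sketch that compares two sets of aggregate stats. The numbers are made up, and the rule that fewer failures, shorter duration, and higher success rate win is an assumption about how the verdict is decided.

```python
# Hypothetical aggregate stats for two runs; values are made up.
run_a = {"failures": 2, "duration_s": 14.1, "success_rate": 0.60}
run_b = {"failures": 0, "duration_s": 11.7, "success_rate": 1.00}

# For each metric, say whether a lower or a higher value is better.
LOWER_IS_BETTER = {"failures": True, "duration_s": True, "success_rate": False}

def pick_winners(a, b):
    """Return 'A', 'B', or 'tie' for each metric."""
    winners = {}
    for metric, lower_wins in LOWER_IS_BETTER.items():
        if a[metric] == b[metric]:
            winners[metric] = "tie"
        elif (a[metric] < b[metric]) == lower_wins:
            winners[metric] = "A"
        else:
            winners[metric] = "B"
    return winners

winners = pick_winners(run_a, run_b)
```

With these sample values run B wins every metric, which corresponds to a B ✓ in each row of the table.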
4. Rerun
Rerun re-executes your pipeline from a specific node using the frozen input state captured from a previous run. You can test a fix without re-running the full pipeline or making new LLM calls for the nodes before the broken one. For the nodes that do re-execute, external API calls (OpenAI, search tools, etc.) run live by default. Set record_http=True on the original run to capture those calls and replay them from disk for fully deterministic reruns.
How rerun works
When Argus records a run, it saves the input state at every node. When you rerun from node X, Argus loads the exact input that node X received originally, then re-executes node X and everything downstream with your current code. A new run ID is created for the result.
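The mechanism described above can be mimicked with a toy linear pipeline. This is a minimal sketch under stated assumptions: nodes are plain functions that return state updates, and the run and rerun_from helpers are hypothetical. It shows the idea of snapshotting each node's input and restarting from one of them with newer code; it does not reproduce how Argus stores or loads state.

```python
# Minimal linear pipeline: each node takes the state dict and returns updates.
def extract(state):
    return {"skills": state["text"].split(",")}

def summarize_buggy(state):
    return {"summary": ""}  # the bug: always returns an empty summary

def summarize_fixed(state):
    return {"summary": f"{len(state['skills'])} skills found"}

def run(nodes, state, recorded_inputs, start=0):
    """Run nodes[start:] and snapshot the input state each node receives."""
    for name, fn in nodes[start:]:
        recorded_inputs[name] = dict(state)  # shallow snapshot of this node's input
        state = {**state, **fn(state)}
    return state

def rerun_from(nodes, recorded_inputs, node_name):
    """Re-execute node_name and everything downstream from its recorded input."""
    start = [name for name, _ in nodes].index(node_name)
    return run(nodes, dict(recorded_inputs[node_name]), {}, start=start)

inputs = {}
original = run(
    [("extract", extract), ("summarize", summarize_buggy)],
    {"text": "python,sql"},
    inputs,
)
# After fixing the code, rerun only "summarize" using the saved input.
rerun = rerun_from(
    [("extract", extract), ("summarize", summarize_fixed)],
    inputs,
    "summarize",
)
```

The extract node is not executed again during the rerun; the fixed summarize node sees exactly the input the buggy one received.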
Step by step — from the dashboard
Open the failing run
Find the root cause node
Click the rerun icon
Wait for the new run
Compare to confirm
Step by step — from the CLI
argus replay <run-id> <node-name>
argus replay <run-id> <node-name> --app my_pipeline:build_graph
The --app flag takes a module:function path to your graph factory function. It is only needed if node function references weren't captured at recording time. After the rerun, use argus diff to compare the two runs:
argus diff <original-run-id> <rerun-run-id>
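The module:function form passed to --app follows a convention used by many Python tools. How Argus resolves it internally is not covered in this guide; the following is a generic sketch of the convention using only importlib, with load_factory as a hypothetical helper name.

```python
import importlib

def load_factory(app_path):
    """Resolve a 'module:function' string to the callable it names."""
    module_name, sep, func_name = app_path.partition(":")
    if not sep or not module_name or not func_name:
        raise ValueError(f"expected 'module:function', got {app_path!r}")
    module = importlib.import_module(module_name)
    factory = getattr(module, func_name)
    if not callable(factory):
        raise TypeError(f"{app_path!r} does not name a callable")
    return factory

# Demonstrated on a standard-library function so the snippet runs anywhere;
# for the replay command you would pass "my_pipeline:build_graph" instead.
dumps = load_factory("json:dumps")
print(dumps({"ok": True}))  # prints: {"ok": true}
```

For this to work with your own pipeline, the module must be importable from the directory where you run the command, and the named function must be defined at module level.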
Screenshots for the rerun UI will be added in a future update.
5. ArgusWatcher Parameters
Pass your graph as the first argument, then use keyword arguments for everything else. Mix and match whatever fits your pipeline:
watcher = ArgusWatcher(
    graph,                              # your LangGraph StateGraph — auto-calls watch()
    # --- Output control ---
    max_field_size=50_000,              # max chars per field before truncation (default: 50k)
    redact_keys={"token", "api_key"},   # field names to scrub from stored outputs
    persist_state=True,                 # save run records to .argus/runs/ (default: True)
    # --- Detection strictness ---
    strict=True,                        # extra checks: nested error keys, rate-limit responses,
                                        # empty lists, type mismatches. recommended for CI/staging.
    # --- Semantic validators ---
    validators={
        "summarize": lambda o: (len(o.get("summary", "")) > 10, "Summary too short"),
        "*": lambda o: ("error" not in o, "error key present"),  # runs on every node
    },
    # --- LLM investigation ---
    investigate=True,                   # LLM root-cause analysis on failures (default: True)
                                        # set to "always" for every node, False to disable
    # --- Deterministic rerun ---
    record_http=True,                   # saves every outbound API call (OpenAI, search, etc.)
                                        # to disk. reruns replay from disk instead of calling
                                        # live APIs — zero extra cost, fully reproducible.
    # --- LLM semantic judge ---
    semantic_judge=True,                # enables an LLM that reviews every node's output for
                                        # subtle quality issues (wrong tone, unhelpful, outdated).
                                        # runs AFTER deterministic checks — so it only adds cost
                                        # where structural checks can't tell if something's off.
                                        # heads up: this calls GPT-4o once per node, so your
                                        # OpenAI bill will go up. worth it for complex pipelines,
                                        # probably overkill for simple ones.
    judge_model="gpt-4o",               # which model the semantic judge uses. default is gpt-4o.
                                        # swap to gpt-4o-mini if you want cheaper runs.
)
After the run, read watcher.run_id to get the run ID. Runs auto-save for linear and fan-out/fan-in (DAG) graphs; you only need to call watcher.finalize() for cyclic graphs with back-edges.
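The validators entries are ordinary callables, so they can be written as named functions and exercised on sample outputs before being attached. The sketch below assumes, based on the lambdas shown above, that a validator receives the node's output dict and returns a (passed, message) tuple; the function names are hypothetical.

```python
def summary_long_enough(output):
    """Pass when the node produced a summary longer than 10 characters."""
    summary = output.get("summary", "")
    return (len(summary) > 10, "Summary too short")

def no_error_key(output):
    """Pass when the node output has no top-level 'error' key."""
    return ("error" not in output, "error key present")

validators = {
    "summarize": summary_long_enough,  # runs only on the "summarize" node
    "*": no_error_key,                 # wildcard: runs on every node
}

# Call the validators directly on sample outputs before wiring them in.
ok, msg = validators["summarize"]({"summary": "short"})
print(ok, msg)  # prints: False Summary too short
```

Named functions behave the same as the lambdas but are easier to unit-test and leave room for a docstring explaining the rule.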
record_http — deterministic reruns
By default, when you rerun from a node, any external API calls (OpenAI, DuckDuckGo, etc.) execute live again. Different results, extra cost. With record_http=True, ARGUS captures every HTTP request and response during the original run. On rerun, it serves the recorded responses back — same data, zero cost, fully reproducible.
Use it when your pipeline calls paid APIs and you want cheap, identical reruns. Skip it when you want the rerun to hit the real API — e.g. testing a new prompt.
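The record-then-replay idea can be shown with a small stand-alone class. This is a conceptual sketch only: the RecordReplay class, the hashing scheme, the JSON file layout, and the example.com URL are all assumptions for illustration and say nothing about how ARGUS intercepts or stores HTTP traffic.

```python
import hashlib
import json
import tempfile
from pathlib import Path

class RecordReplay:
    """Record responses to disk on first use, replay them afterwards."""

    def __init__(self, directory, send):
        self.directory = Path(directory)
        self.send = send      # callable that performs the real request
        self.live_calls = 0   # how many times the real API was hit

    def _path(self, method, url, body):
        # Identical requests map to the same file name.
        key = json.dumps([method, url, body], sort_keys=True)
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.directory / f"{digest}.json"

    def request(self, method, url, body=None):
        path = self._path(method, url, body)
        if path.exists():                        # replay: no network, no cost
            return json.loads(path.read_text())
        response = self.send(method, url, body)  # record: call the live API
        self.live_calls += 1
        path.write_text(json.dumps(response))
        return response

def fake_api(method, url, body):
    # Stand-in for a paid external API.
    return {"echo": body, "url": url}

with tempfile.TemporaryDirectory() as tmp:
    client = RecordReplay(tmp, fake_api)
    first = client.request("POST", "https://api.example.com/v1/chat", {"q": "hi"})
    second = client.request("POST", "https://api.example.com/v1/chat", {"q": "hi"})
    live_calls = client.live_calls
```

The second request is served from disk, which is why a replayed rerun returns the same data at no extra cost. It also shows the trade-off: a changed prompt produces a different request, so a sketch like this would fall through to the live API for it.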
semantic_judge — LLM-powered quality checks
ARGUS catches ~80% of production failures deterministically (missing fields, empty results, type mismatches, placeholder outputs) — zero cost, zero false positives. The remaining ~20% are subtle: wrong tone, unhelpful responses, outdated info, answering the wrong question. That's what the semantic judge is for.
Requires OPENAI_API_KEY in your environment. Uses GPT-4o by default.
Use it for complex multi-agent pipelines, customer-facing outputs, or when deterministic checks show “pass” but outputs still feel wrong. Skip it for simple pipelines, CI/CD speed runs, or if you want zero monitoring costs.
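The split between deterministic checks and the judge can be pictured as a two-stage review. The sketch below is one possible reading of "runs after deterministic checks": it assumes the judge is consulted only when the structural checks find nothing, and it uses a stub in place of a real LLM call. The function names, the answer field, and the result layout are hypothetical.

```python
def deterministic_checks(output):
    """Cheap structural checks; returns a list of problems (empty means pass)."""
    problems = []
    if not output.get("answer"):
        problems.append("missing_field: answer")
    return problems

def stub_judge(output):
    # Stand-in for an LLM call; a real judge would send the output to a model.
    return ["unhelpful"] if "I cannot help" in output["answer"] else []

def review(output, judge):
    """Run structural checks first and consult the judge only if they pass."""
    problems = deterministic_checks(output)
    if problems:
        return {"source": "deterministic", "problems": problems, "judge_called": False}
    return {"source": "judge", "problems": judge(output), "judge_called": True}

structural = review({"answer": ""}, stub_judge)
subtle = review({"answer": "I cannot help with that."}, stub_judge)
```

The empty answer is caught for free by the structural check, while the unhelpful but well-formed answer is only flagged by the judge.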