How to Use Argus

A step-by-step walkthrough of every section in the Argus dashboard — from browsing your run history to reading failure details, comparing executions, and rerunning from a broken node.

Quick-start — paste this into your LLM (Claude Code, Cursor, etc.) to add ARGUS to your pipeline:

Add ARGUS monitoring to my LangGraph pipeline. ARGUS watches every node and catches silent failures, missing fields, and semantic issues.

Install: pip install argus-agents

Then add 2 lines to the file where the graph is built:

from argus import ArgusWatcher

watcher = ArgusWatcher(graph)          # pass your StateGraph directly
app = graph.compile()
result = app.invoke(initial_state)     # run auto-saves when done
print(watcher.run_id)                  # access the run ID directly

If the graph is already compiled, use watch_compiled() instead:

watcher = ArgusWatcher()
app = watcher.watch_compiled(app)
result = app.invoke(initial_state)

That's it. Run the pipeline, then use "argus show last" to see what ARGUS caught.

All parameters are keyword-only (except graph):

watcher = ArgusWatcher(graph,
    record_http=True,       # save API calls to disk for deterministic reruns
    semantic_judge=True,    # LLM judge on every node (needs OPENAI_API_KEY)
    judge_model="gpt-4o",   # model for the judge. default: gpt-4o
    strict=True,            # extra checks for CI/staging
    max_field_size=50000,   # max chars per field before truncation
    investigate=True,       # LLM root-cause analysis (True/False/"always")
    redact_keys={"token"},  # field names to redact from stored outputs
    persist_state=True,     # save run records to .argus/runs/
    validators={            # per-node semantic validators
        "summarize": lambda o: (len(o.get("summary","")) > 10, "Summary too short"),
        "*": lambda o: ("error" not in o, "error key present"),
    },
)

Runs auto-save for linear and fan-out/fan-in graphs.
Only cyclic graphs need watcher.finalize().

1. Runs List

The home page is your pipeline execution history. Every time your pipeline runs with Argus attached, an entry appears here automatically.

The runs list — aggregate stats at the top, evaluation panel, and the full run table below.

Summary cards

Total Runs: Number of pipeline executions recorded in your workspace.
Clean: Runs where every node passed with no failures detected.
Failed: Runs with at least one silent failure, crash, or semantic failure.
Pass Rate: Percentage of clean runs over the total.
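
The four cards reduce to simple counts over run statuses. A minimal sketch, assuming each run is represented by a status string like the ones in the STATUS column; the function name and the exact status strings are illustrative, not Argus internals:

```python
def summarize_runs(statuses):
    """Compute the four summary-card values from a list of run statuses."""
    total = len(statuses)
    clean = sum(1 for s in statuses if s == "clean")
    failed = total - clean  # assumption: any non-clean status counts as failed
    pass_rate = (clean / total * 100) if total else 0.0
    return {"total": total, "clean": clean, "failed": failed, "pass_rate": pass_rate}


cards = summarize_runs(["clean", "silent_failure", "clean", "crashed"])
print(cards)
```

An empty workspace is reported here as a 0% pass rate rather than raising a division error; the dashboard may display that case differently.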

Run table columns

RUN ID: Unique identifier for the run. Click to open the full detail view.
STATUS: Overall result (clean, silent failure, crashed, or semantic fail).
GRAPH: The node execution path, summarised as a chain.
STEPS: Total number of nodes that executed in this run.
FIRST FAILURE: The first node that produced bad output, which is the likely root cause.
SHAPE: Whether all expected nodes ran (full) or the run was cut short (partial).

Evaluation panel

The Evaluation section lets you filter runs by criteria — set a goal description and add constraints like overall_status == clean to find runs that meet specific conditions. Hit Evaluate to filter the table.
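
To make the constraint idea concrete, here is a small sketch of equality filtering over run records. It assumes constraints of the form field == value, as in the example above; the run dictionaries, the run IDs, and both function names are hypothetical, and the panel may support operators this sketch does not model:

```python
def parse_constraint(text):
    """Split a 'field == value' constraint into its two parts."""
    field, _, value = text.partition("==")
    return field.strip(), value.strip()


def filter_runs(runs, constraints):
    """Keep only the runs that satisfy every equality constraint."""
    parsed = [parse_constraint(c) for c in constraints]
    return [
        run for run in runs
        if all(str(run.get(field)) == value for field, value in parsed)
    ]


# Hypothetical run records for illustration.
runs = [
    {"run_id": "run-a", "overall_status": "clean", "steps": 5},
    {"run_id": "run-b", "overall_status": "crashed", "steps": 2},
]
matching = filter_runs(runs, ["overall_status == clean"])
print([r["run_id"] for r in matching])
```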

Click any run ID to open its full detail page.

2. Run Detail

The run detail page gives you a complete picture of what happened during a single pipeline execution — metrics, the execution trace, AI analysis, and the initial state.

Top of the run detail page: run ID, status, root cause chain, metrics grid, and the execution timeline.

Header

Shows the run ID, overall status badge, timestamp, total duration, step count, and Argus version. The Compare button lets you immediately diff this run against another.

Root cause chain

When a failure propagates downstream, Argus traces back to find the originating node. The red banner shows the chain — e.g. extract_skills → generate_summary — so you know exactly which node to fix, not which node complained.
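
The trace-back can be pictured as following blame links upstream until a node has nobody left to blame. The sketch below is an illustration of that idea only: the blame mapping and the function name are assumptions, and Argus's actual analysis is not shown in this guide.

```python
def root_cause_chain(blamed_upstream, complaining_node):
    """Follow upstream blame links back to the originating node.

    blamed_upstream maps a failing node to the upstream node that failed to
    supply its input. The origin is the first node with no entry.
    """
    chain = [complaining_node]
    seen = {complaining_node}
    while chain[-1] in blamed_upstream:
        upstream = blamed_upstream[chain[-1]]
        if upstream in seen:  # guard against cycles in the blame links
            break
        chain.append(upstream)
        seen.add(upstream)
    chain.reverse()  # origin first, matching the order shown in the banner
    return chain


blame = {"generate_summary": "extract_skills"}
print(" → ".join(root_cause_chain(blame, "generate_summary")))
```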

Metrics

Duration: Total wall-clock time for the full pipeline execution.
Success Rate: Percentage of nodes in this run that passed.
Failures: Number of nodes with any failure status.
Severity: Worst severity level seen: ok, warning, or critical.
Completed: Whether the pipeline ran to the final node or was cut short.
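
As a rough sketch of how the node-level metrics roll up, assume each node record carries a status and a severity. The record layout, the function name, and the node names format_report and publish are hypothetical; only the three severity levels come from this page.

```python
SEVERITY_ORDER = ["ok", "warning", "critical"]  # least to most severe


def run_metrics(nodes):
    """Aggregate per-node results into run-level metrics."""
    total = len(nodes)
    failures = sum(1 for n in nodes if n["status"] != "pass")
    worst = max(
        (n["severity"] for n in nodes),
        key=SEVERITY_ORDER.index,
        default="ok",
    )
    return {
        "success_rate": (total - failures) / total * 100 if total else 0.0,
        "failures": failures,
        "severity": worst,
    }


nodes = [
    {"name": "extract_skills", "status": "silent_failure", "severity": "critical"},
    {"name": "generate_summary", "status": "degraded_input", "severity": "warning"},
    {"name": "format_report", "status": "pass", "severity": "ok"},
    {"name": "publish", "status": "pass", "severity": "ok"},
]
metrics = run_metrics(nodes)
print(metrics)
```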

Execution timeline

Each node is listed in order with its name, output type tag, duration, and status. Nodes with failures show an indented root cause annotation — the specific field that was missing and which upstream node failed to produce it. Expand any row with the arrow to see the full input/output JSON.

Lower execution timeline showing degraded_input propagation, followed by the AI Analysis panel.

AI Analysis

When OPENAI_API_KEY is set, Argus automatically investigates non-clean runs. The panel breaks down the failure into three parts:

Root Cause Node

The specific node Argus identified as the origin of the failure — not the node that complained, but the one that first produced the broken state.

Reason

A concise explanation of why that node failed and how the bad state propagated through downstream nodes.

How to Fix It

Numbered action items — each targeting a specific node — telling you exactly what to change to prevent the failure from recurring.

A confidence score is shown in the top-right of the panel. The footer shows how many causal hypotheses were evaluated and how many observations were used.

AI fix steps, the Correlation panel (origin node + confidence), and the Behavior/Initial State sections.

Correlation

Argus runs a correlation analysis to confirm which node is the true origin of the degradation. Shows the origin node name, step index, failure signals (e.g. missing_field), and a confidence score.

Behavior & Initial State

The Behavior section shows the raw initial state your pipeline received — the exact input dict at invocation time. Useful for reproducing the failure locally.

3. Compare

Compare two runs side-by-side to see exactly what changed — useful for verifying a fix worked, catching regressions, or understanding why one run is faster than another.

Compare page: winner verdict at the top, aggregate stats table, then a node-by-node status comparison.

How to compare

Step 1: Open Compare
Click Compare in the sidebar, or use the Compare button on any run detail page (pre-fills Run A).

Step 2: Enter two run IDs
Paste a Run A (typically the older / broken run) and Run B (the newer / fixed run).

Step 3: Read the verdict
The winner banner shows which run performed better and why: fewer failures, faster duration, higher success rate.

Step 4: Read the node diff
Each node is listed with its status in A and B. Nodes only present in one run are labelled only in A or only in B. Status changes are highlighted.

The aggregate table shows Failures, Duration, and Success Rate side-by-side with a winner indicator (B ✓) for each metric.
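
The node diff is essentially a keyed comparison of two status maps. A minimal sketch, assuming each run is reduced to a node-name-to-status dictionary; the function name, the "changed"/"same" notes, and the legacy_step and publish nodes are illustrative:

```python
def diff_nodes(run_a, run_b):
    """Return one row per node: (name, status in A, status in B, note)."""
    rows = []
    # Keep A's node order, then append nodes that appear only in B.
    names = list(run_a) + [name for name in run_b if name not in run_a]
    for name in names:
        a, b = run_a.get(name), run_b.get(name)
        if b is None:
            note = "only in A"
        elif a is None:
            note = "only in B"
        elif a != b:
            note = "changed"
        else:
            note = "same"
        rows.append((name, a, b, note))
    return rows


run_a = {"extract_skills": "silent_failure", "generate_summary": "degraded_input", "legacy_step": "pass"}
run_b = {"extract_skills": "pass", "generate_summary": "pass", "publish": "pass"}
for row in diff_nodes(run_a, run_b):
    print(row)
```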

4. Rerun

Rerun re-executes your pipeline from a specific node using the frozen input state captured from a previous run. You can test a fix without re-running the full pipeline or making new LLM calls for the nodes before the broken one. External API calls (OpenAI, search tools, etc.) made by the rerun nodes execute live by default; set record_http=True on the original run to capture them and replay them from disk for fully deterministic reruns.

How rerun works

When Argus records a run, it saves the input state at every node. When you rerun from node X, Argus loads the exact input that node X received originally, then re-executes node X and everything downstream with your current code. A new run ID is created for the result.
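
The mechanism can be sketched with a toy linear pipeline: save a copy of the state each node receives, then restart from one node's saved input using the current code. Everything here (function names, the node list format, the toy nodes) is an assumption for illustration and not how Argus stores or executes runs.

```python
import copy

calls = []  # tracks which nodes actually executed


def load(state):
    calls.append("load")
    return {"text": state["resume"].lower()}


def extract_broken(state):
    return {}  # silently drops the "skills" field


def extract_fixed(state):
    return {"skills": sorted(set(state["text"].split()) & {"python", "sql"})}


def summarize(state):
    return {"summary": f"{len(state.get('skills', []))} skills found"}


def record_run(nodes, initial_state):
    """Run nodes in order, saving a copy of the state each node receives."""
    saved_inputs, state = {}, dict(initial_state)
    for name, fn in nodes:
        saved_inputs[name] = copy.deepcopy(state)
        state = {**state, **fn(state)}
    return state, saved_inputs


def rerun_from(nodes, saved_inputs, start):
    """Re-execute `start` and everything downstream from its saved input."""
    names = [name for name, _ in nodes]
    state = copy.deepcopy(saved_inputs[start])
    for _, fn in nodes[names.index(start):]:
        state = {**state, **fn(state)}
    return state


original = [("load", load), ("extract_skills", extract_broken), ("summarize", summarize)]
final, saved = record_run(original, {"resume": "Python and SQL"})

# After fixing the node, rerun from it; "load" is not executed again.
fixed = [("load", load), ("extract_skills", extract_fixed), ("summarize", summarize)]
rerun = rerun_from(fixed, saved, "extract_skills")
print(final["summary"], "|", rerun["summary"], "|", calls)
```

The upstream node runs only once across both executions, which is the point of rerunning from a saved input.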

Step by step — from the dashboard

Step 1: Open the failing run
Click the run ID on the runs list to open its detail page.

Step 2: Find the root cause node
Check the red root cause banner at the top; it names the node that first produced bad output. That's the node you want to rerun from.

Step 3: Click the rerun icon
In the execution timeline, each node row has a rerun icon (↺) on the right. Click it on the root cause node.

Step 4: Wait for the new run
Argus re-executes from that node forward. When done, you're taken to the new run's detail page with a fresh set of results.

Step 5: Compare to confirm
Use the Compare button to diff the original run against the rerun. The broken nodes should now show pass.

Step by step — from the CLI

Rerun from a specific node:
argus replay <run-id> <node-name>

If node functions weren't stored in the run:
argus replay <run-id> <node-name> --app my_pipeline:build_graph

The --app flag takes a module:function path to your graph factory function. It is only needed if node function references weren't captured at recording time. After the rerun, use argus diff to compare the two runs:

argus diff <original-run-id> <rerun-run-id>

Screenshots for the rerun UI will be added in a future update.

5. ArgusWatcher Parameters

Pass your graph as the first argument, then use keyword arguments for everything else. Mix and match whatever fits your pipeline:

All parameters explained
watcher = ArgusWatcher(
    graph,                  # your LangGraph StateGraph — auto-calls watch()

    # --- Output control ---
    max_field_size=50_000,  # max chars per field before truncation (default: 50k)
    redact_keys={"token", "api_key"},  # field names to scrub from stored outputs
    persist_state=True,     # save run records to .argus/runs/ (default: True)

    # --- Detection strictness ---
    strict=True,            # extra checks: nested error keys, rate-limit responses,
                            # empty lists, type mismatches. recommended for CI/staging.

    # --- Semantic validators ---
    validators={
        "summarize": lambda o: (len(o.get("summary","")) > 10, "Summary too short"),
        "*": lambda o: ("error" not in o, "error key present"),  # runs on every node
    },

    # --- LLM investigation ---
    investigate=True,       # LLM root-cause analysis on failures (default: True)
                            # set to "always" for every node, False to disable

    # --- Deterministic rerun ---
    record_http=True,       # saves every outbound API call (OpenAI, search, etc.)
                            # to disk. reruns replay from disk instead of calling
                            # live APIs — zero extra cost, fully reproducible.

    # --- LLM semantic judge ---
    semantic_judge=True,    # enables an LLM that reviews every node's output for
                            # subtle quality issues (wrong tone, unhelpful, outdated).
                            # runs AFTER deterministic checks — so it only adds cost
                            # where structural checks can't tell if something's off.
                            # heads up: this calls the judge model (gpt-4o by default)
                            # once per reviewed node, so your OpenAI bill will go up.
                            # worth it for complex pipelines, probably overkill for simple ones.

    judge_model="gpt-4o",  # which model the semantic judge uses. default is gpt-4o.
                            # swap to gpt-4o-mini if you want cheaper runs.
)

After the run, access watcher.run_id to get the run ID. Runs auto-save for linear and fan-out/fan-in (DAG) graphs — you only need to call watcher.finalize() for cyclic graphs with back-edges.
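
The validators shown above follow a simple contract: each takes a node's output dict and returns an (ok, message) pair, and the "*" entry applies to every node. The sketch below illustrates that contract with a hypothetical run_validators helper; how Argus actually dispatches validators, including the order in which the node-specific and wildcard checks run, is an assumption here.

```python
def run_validators(validators, node_name, output):
    """Return the failure messages produced for one node's output."""
    messages = []
    for key in (node_name, "*"):  # node-specific check first, then the wildcard
        check = validators.get(key)
        if check is None:
            continue
        ok, message = check(output)
        if not ok:
            messages.append(message)
    return messages


validators = {
    "summarize": lambda o: (len(o.get("summary", "")) > 10, "Summary too short"),
    "*": lambda o: ("error" not in o, "error key present"),
}

print(run_validators(validators, "summarize", {"summary": "ok"}))
print(run_validators(validators, "fetch", {"error": "timeout"}))
print(run_validators(validators, "summarize", {"summary": "A complete summary."}))
```

Trying validators against sample outputs this way is a cheap check that a lambda returns a two-element tuple before you attach it to a pipeline.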

record_http — deterministic reruns

By default, when you rerun from a node, any external API calls (OpenAI, DuckDuckGo, etc.) execute live again, which can return different results and costs extra. With record_http=True, ARGUS captures every HTTP request and response during the original run. On rerun, it serves the recorded responses back: same data, zero cost, fully reproducible.

Use it when your pipeline calls paid APIs and you want cheap, identical reruns. Skip it when you want the rerun to hit the real API — e.g. testing a new prompt.
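
The record-and-replay idea can be sketched without any network access. The Cassette class below is a generic illustration, not ARGUS's implementation: it keeps recordings in memory rather than on disk, and the class name, request signature, and example URL are all assumptions.

```python
import hashlib
import json


class Cassette:
    """Record responses on the first run and replay them afterwards."""

    def __init__(self, transport):
        self.transport = transport  # callable that performs the real request
        self.recordings = {}
        self.replay = False

    def _key(self, method, url, body):
        raw = json.dumps([method, url, body], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def request(self, method, url, body=None):
        key = self._key(method, url, body)
        if self.replay:
            # A KeyError here means the request was never recorded.
            return self.recordings[key]
        response = self.transport(method, url, body)
        self.recordings[key] = response
        return response


live_calls = []


def fake_api(method, url, body):
    """Stand-in for a paid API: every live call returns a different answer."""
    live_calls.append(url)
    return {"answer": f"response #{len(live_calls)}"}


cassette = Cassette(fake_api)
first = cassette.request("POST", "https://api.example.com/v1/complete", {"prompt": "hi"})

cassette.replay = True  # simulate a rerun
second = cassette.request("POST", "https://api.example.com/v1/complete", {"prompt": "hi"})

print(first == second, len(live_calls))
```

Keying on method, URL, and body means a changed prompt produces a different key, which is one reason to skip recording when you are testing a new prompt.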

semantic_judge — LLM-powered quality checks

ARGUS catches ~80% of production failures deterministically (missing fields, empty results, type mismatches, placeholder outputs) — zero cost, zero false positives. The remaining ~20% are subtle: wrong tone, unhelpful responses, outdated info, answering the wrong question. That's what the semantic judge is for.

Deterministic first: Structural checks always run first. They are free, instant, and reproducible.
LLM second: The judge only reviews what the structural layer couldn't decide.
Per-node: Each node's output is evaluated in context of its input and the pipeline's purpose.
Hypotheses: The judge generates causal hypotheses ranked by confidence, with supporting evidence.

Requires OPENAI_API_KEY in your environment. Uses GPT-4o by default.

Use it for complex multi-agent pipelines, customer-facing outputs, or when deterministic checks show “pass” but outputs still feel wrong. Skip it for simple pipelines, CI/CD speed runs, or if you want zero monitoring costs.
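
One way to picture the two-layer ordering is a gate: a structural failure is reported immediately, and only outputs the structural layer cannot fault are passed to the judge. This is a sketch of that interpretation with a stub in place of the LLM; the function names, the "summary" field, and the stub's rule are all hypothetical.

```python
def structural_check(output):
    """Deterministic layer: 'fail' on a missing or empty summary, else 'undecided'."""
    summary = output.get("summary")
    if not isinstance(summary, str) or not summary.strip():
        return "fail"
    return "undecided"


def evaluate(output, judge):
    """Run the free structural check first; call the judge only if needed."""
    if structural_check(output) == "fail":
        return {"status": "fail", "checked_by": "structural"}
    return {"status": judge(output), "checked_by": "judge"}


judge_calls = []


def stub_judge(output):
    """Stand-in for the LLM judge; flags one obviously unhelpful phrase."""
    judge_calls.append(output)
    return "fail" if "i cannot help" in output["summary"].lower() else "pass"


print(evaluate({"summary": ""}, stub_judge))
print(evaluate({"summary": "I cannot help with that."}, stub_judge))
print(evaluate({"summary": "Five years of Python experience."}, stub_judge))
print(len(judge_calls))
```

The empty summary never reaches the stub, so only two of the three outputs incur a judge call.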