Research Software | Open Source | MIT

Evidence-First Snapshot Capture for LLM-Assisted Debugging

llmdebug records failure-time evidence as a local, inspectable artifact: exception context, prioritized stack frames, and summarized local state that can be reviewed from the CLI, notebooks, and MCP clients.

Problem

Tracebacks alone rarely preserve enough runtime state for reliable diagnosis.

Solution

Structured snapshots preserve the failing context before it disappears.

$ pip install 'llmdebug[cli]'

Abstract

Context. LLM-assisted debugging often degrades when the model only sees a traceback and partial surrounding code. Method. llmdebug captures a structured snapshot at the exception boundary, prioritizes the crash frame, and summarizes local state for downstream inspection. Output. The resulting artifact is available through CLI, notebook, and MCP interfaces, with output-scope and redaction controls for different workflows. Scope. The project improves evidence availability; it does not certify diagnoses, patches, or root-cause correctness.
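The capture step in the Method sentence above can be approximated with the standard library alone. The sketch below is not llmdebug's implementation or API: capture_snapshot, summarize, and the toy transform function are hypothetical names, and the output keys only loosely mirror the fields named on this page.

```python
import traceback
from typing import Any


def summarize(value: Any, max_len: int = 80) -> str:
    """Return a bounded repr so large objects do not bloat the snapshot."""
    try:
        text = repr(value)
    except Exception:  # a broken __repr__ must not break capture
        text = f"<unrepresentable {type(value).__name__}>"
    return text if len(text) <= max_len else text[: max_len - 3] + "..."


def capture_snapshot(exc: BaseException) -> dict:
    """Build a snapshot dict from an exception (hypothetical helper)."""
    frames = []
    # walk_tb yields (frame, lineno) pairs from the catching frame to the crash site
    for frame, lineno in traceback.walk_tb(exc.__traceback__):
        frames.append(
            {
                "file": frame.f_code.co_filename,
                "function": frame.f_code.co_name,
                "line": lineno,
                "locals": {k: summarize(v) for k, v in frame.f_locals.items()},
            }
        )
    frames.reverse()  # put the crash frame first
    return {
        "exception": {"type": type(exc).__name__, "message": str(exc)},
        "crash_frame": frames[0] if frames else None,
        "frames": frames,
    }


def transform(rows):
    """Toy stand-in for code under test; fails on rows that are not 5 wide."""
    width = len(rows[0])
    if width != 5:
        raise ValueError("shape mismatch")
    return rows


try:
    transform([[1, 2, 3]])
except ValueError as err:
    snapshot = capture_snapshot(err)
```

Here snapshot["crash_frame"] points at transform and carries the summarized local width, which is the kind of state a bare traceback does not preserve.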

Features

Shipped capabilities, described in the same evidence-first vocabulary as the README and protocol docs.

Default failure capture

After installation, pytest failures emit a snapshot artifact without per-test wiring.

Structured, inspectable artifacts

Machine-readable snapshots preserve exception context, frames, locals, and selected environment metadata.

Consistent multi-surface access

The same evidence model can be inspected from terminal and notebook surfaces or integrated into production hook and MCP workflows.

Differential analysis

Snapshot diffing supports run-to-run comparison during regression triage and debugging review.

Hypothesis engine

A pattern-based engine ranks common failure mechanisms to support triage, not to prove root cause.
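To illustrate what pattern-based ranking can look like, here is a minimal sketch. The Pattern class, the pattern table, the labels, and the weights are all invented for this example and are not taken from llmdebug; only the exception fields (type, message) follow the snapshot excerpt shown on this page.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    label: str
    exception_type: str
    message_regex: str
    weight: float


# Illustrative patterns only; labels and weights are assumptions for this sketch.
PATTERNS = [
    Pattern("array shape mismatch", "ValueError", r"shape", 0.8),
    Pattern("missing dictionary key", "KeyError", r".*", 0.7),
    Pattern("None used as object", "AttributeError", r"'NoneType' object", 0.9),
    Pattern("generic value error", "ValueError", r".*", 0.3),
]


def rank_hypotheses(snapshot: dict) -> list[tuple[str, float]]:
    """Return (label, weight) pairs for matching patterns, strongest first."""
    exc = snapshot["exception"]
    matches = [
        (p.label, p.weight)
        for p in PATTERNS
        if p.exception_type == exc["type"]
        and re.search(p.message_regex, exc["message"])
    ]
    return sorted(matches, key=lambda item: item[1], reverse=True)


ranked = rank_hypotheses(
    {"exception": {"type": "ValueError", "message": "shape mismatch"}}
)
```

The result is an ordered list of candidate mechanisms. Nothing in it verifies that the top entry is the actual cause, which is why such output is triage support only.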

Production safety controls

Redaction profiles and rate limiting help keep snapshot capture usable in production settings.

Method Overview

From failing execution to reviewable evidence in three steps.

  1. Failure boundary

    # transform and data stand in for the code under test
    def test_transform():
        result = transform(data)
        assert result.shape == (100, 5)
  2. Evidence summary

    {
      "exception": {
        "type": "ValueError",
        "message": "shape mismatch"
      },
      "crash_frame": {
        "file_rel": "pipeline.py",
        "line": 47
      }
    }
  3. Inspection and comparison

    $ llmdebug show
    $ llmdebug show --detail context
    # after a second failing run (quote the refs; an unquoted # starts a shell comment):
    $ llmdebug diff '#2' '#1'
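
The comparison step can be pictured as a field-by-field diff of two snapshot documents. The sketch below is a generic nested-dict diff, not the algorithm behind llmdebug diff. flatten and diff_snapshots are hypothetical helpers, and the function key in the second run is an invented field added to show an "added" entry.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b.c": value} so fields compare by path."""
    if not isinstance(obj, dict):
        return {prefix: obj}
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_snapshots(old: dict, new: dict) -> dict:
    """Report which dotted paths were added, removed, or changed."""
    a, b = flatten(old), flatten(new)
    return {
        "added": {k: b[k] for k in b.keys() - a.keys()},
        "removed": {k: a[k] for k in a.keys() - b.keys()},
        "changed": {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]},
    }


run_1 = {
    "exception": {"type": "ValueError", "message": "shape mismatch"},
    "crash_frame": {"file_rel": "pipeline.py", "line": 47},
}
run_2 = {
    "exception": {"type": "ValueError", "message": "shape mismatch"},
    # "function" is a hypothetical extra field for illustration
    "crash_frame": {"file_rel": "pipeline.py", "line": 52, "function": "transform"},
}
delta = diff_snapshots(run_1, run_2)
```

In this example the exception is unchanged while the crash line moved from 47 to 52, which is the sort of run-to-run signal useful during regression triage.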

Capabilities

Current-release capabilities and their primary documentation entrypoints.

Capability | Status | Documentation
Pytest failures produce snapshots by default | Available | README: Quick Start
CLI inspection (show, list, frames, diff, git-context) | Available | README: CLI
Detail levels (crash, full, context) for evidence size control | Available | CLI Reference: Detail Levels
Production hooks with rate limiting and redaction controls | Available | Configuration: Production Hooks
MCP server with evidence tools and RCA state tools | Available | README: MCP Server

Reproducible Minimal Example

Install the package, trigger a failure, then inspect the emitted artifact.

terminal
$ pip install 'llmdebug[cli]'
$ pytest
$ llmdebug show

Interfaces

The evidence model stays the same; only the access surface changes.

shell
# zero additional setup after installation
$ pip install 'llmdebug[cli]'
$ pytest
$ llmdebug

# failure artifact:
#   .llmdebug/latest.json
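
Because the artifact is plain JSON, it can also be read without the CLI. The sketch below assumes only the fields shown in the evidence summary on this page (exception.type, exception.message, crash_frame.file_rel, crash_frame.line); the full schema is not documented here, so every lookup is defensive. load_snapshot and headline are hypothetical helper names.

```python
import json
from pathlib import Path


def load_snapshot(path: str = ".llmdebug/latest.json") -> dict:
    """Read a snapshot artifact from disk as a plain dict."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def headline(snapshot: dict) -> str:
    """One-line summary; .get() is used because the full schema is assumed, not known."""
    exc = snapshot.get("exception", {})
    frame = snapshot.get("crash_frame", {})
    return (
        f"{exc.get('type', '?')}: {exc.get('message', '?')} "
        f"at {frame.get('file_rel', '?')}:{frame.get('line', '?')}"
    )
```

After a failing pytest run, headline(load_snapshot()) would give a single line suitable for a notebook cell or a log message.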

Safety and Data Handling

Defaults are designed to balance diagnostic value, local usability, and operational caution.

Redaction-aware capture

Redaction policies can mask common secret-like strings before snapshots are written to disk.
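
As a rough illustration of masking secret-like strings before a write, consider the sketch below. The regular expressions, the mask text, and the function names are assumptions made for this example; they are not llmdebug's shipped redaction profiles, and real policies would need broader coverage.

```python
import re

# Illustrative patterns only (assumptions, not llmdebug's profiles).
OPAQUE_TOKENS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),            # AWS access key ID shape
    re.compile(r"\bBearer\s+[A-Za-z0-9._~+/=-]+"),  # HTTP bearer tokens
]
KEY_VALUE = re.compile(
    r"\b(password|passwd|secret|token|api[_-]?key)\b(\s*[=:]\s*)\S+",
    re.IGNORECASE,
)


def redact(text: str, mask: str = "[REDACTED]") -> str:
    """Mask secret-like substrings; opaque tokens first, then key=value pairs."""
    for pattern in OPAQUE_TOKENS:
        text = pattern.sub(mask, text)
    # Keep the key name and separator, replace only the value.
    return KEY_VALUE.sub(lambda m: f"{m.group(1)}{m.group(2)}{mask}", text)


def redact_locals(local_vars: dict) -> dict:
    """Stringify and redact every captured local before it is serialized."""
    return {name: redact(str(value)) for name, value in local_vars.items()}
```

Running redaction on the stringified values, before anything is serialized, means the unmasked text never reaches disk in this sketch.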

Production rate limiting

Exception hooks apply rate limits to reduce runaway snapshot writes during repeated failures.
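
One common way to bound snapshot writes is a sliding-window limiter in front of the exception hook. The sketch below shows that general technique under stated assumptions: SnapshotRateLimiter, make_hook, and the write_snapshot callback are hypothetical names, and the limits are arbitrary. It is not a description of how llmdebug implements its limits.

```python
import sys
import time
from collections import deque


class SnapshotRateLimiter:
    """Allow at most max_writes snapshot writes per sliding window."""

    def __init__(self, max_writes: int, per_seconds: float, clock=time.monotonic):
        self.max_writes = max_writes
        self.per_seconds = per_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._stamps = deque()

    def allow(self) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._stamps and now - self._stamps[0] >= self.per_seconds:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_writes:
            return False
        self._stamps.append(now)
        return True


def make_hook(limiter, write_snapshot, previous=sys.__excepthook__):
    """Build an excepthook-compatible callable that writes only when allowed."""

    def hook(exc_type, exc, tb):
        if limiter.allow():
            write_snapshot(exc)
        previous(exc_type, exc, tb)  # always keep the normal traceback output

    return hook
```

Assigning the result of make_hook to sys.excepthook would install it. A crash loop then produces at most max_writes artifacts per window while the usual traceback is still printed every time.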

Local-first storage

Snapshots are stored locally by default, which supports offline and air-gapped debugging workflows.

Evidence, not answers

Snapshots improve evidence availability for diagnosis. Root-cause judgment remains yours.

Limitations

Use the snapshot as high-signal evidence, not as exhaustive ground truth.

  • Snapshots describe observed failing executions; unexercised paths and some nondeterministic conditions may still require reruns.
  • Large objects may be summarized, truncated, or redacted for compactness and safety, which can omit low-level detail.
  • Hypothesis ranking is heuristic and should be treated as triage support, not proof of root cause.
  • Benchmark and statistical evaluation workflows live in evals/ and should be interpreted separately from this overview page.
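
The truncation limitation above is easiest to see with a bounded repr. This sketch uses Python's standard reprlib module with arbitrary limits chosen for the example; summarize_value is a hypothetical helper, and llmdebug's actual summarization rules may differ.

```python
import reprlib

# Limits are arbitrary values for this sketch.
_bounded = reprlib.Repr()
_bounded.maxlist = 3
_bounded.maxdict = 3
_bounded.maxstring = 20
_bounded.maxlevel = 2


def summarize_value(value) -> dict:
    """Record type, true length (when defined), and a size-bounded repr."""
    try:
        full_len = len(value)
    except TypeError:  # ints, floats, and other unsized objects
        full_len = None
    return {
        "type": type(value).__name__,
        "len": full_len,
        "repr": _bounded.repr(value),
    }


summary = summarize_value(list(range(100)))
# summary == {"type": "list", "len": 100, "repr": "[0, 1, 2, ...]"}
```

Keeping the true length next to the shortened repr tells the reader that 97 elements were omitted, but not what they were. Inspecting them still requires a rerun or a larger detail level.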

Citation and Further Reading

BibTeX (software citation template)
@software{schuler2026llmdebug,
  author  = {Schuler, Nicolas},
  title   = {llmdebug: Structured Debug Snapshots
             for LLM-Assisted Debugging},
  year    = {2026},
  url     = {https://github.com/NicolasSchuler/llmdebug},
  license = {MIT}
}