Architecture overview

How analyze_workbook() turns a workbook into a normalized graph.

End-to-end pipeline

  1. analyze_workbook() builds the AnalysisContext + WorkbookGraph.
  2. Backends run in order, adding nodes + edges.
  3. Reports are collected per backend.
  4. Optional post-analysis distillation condenses the graph.
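The four steps above can be sketched as a small loop. This is a toy model, not excelminer's actual code: ToyContext and ToyGraph stand in for AnalysisContext and WorkbookGraph, and run_pipeline, the backend classes, the report fields, and the node ids are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyGraph:
    """Stand-in for WorkbookGraph: node ids plus (src, kind, dst) edges."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src, kind, dst):
        self.nodes.update((src, dst))
        self.edges.append((src, kind, dst))


@dataclass
class ToyContext:
    """Stand-in for AnalysisContext: here it only carries the file path."""
    path: str


class SheetBackend:
    """Hypothetical backend that contributes one sheet edge."""
    name = "sheets"

    def run(self, ctx, graph):
        graph.add_edge("Sheet:Data", "contains", "FormulaCell:Data!A1")
        return {"backend": self.name, "edges_added": 1}


class PivotBackend:
    """Hypothetical backend that contributes one pivot edge."""
    name = "pivots"

    def run(self, ctx, graph):
        graph.add_edge("PivotTable:pt1", "uses_cache", "PivotCache:c1")
        return {"backend": self.name, "edges_added": 1}


def run_pipeline(path, backends, distill=None):
    # Step 1: build the context and an empty graph.
    ctx = ToyContext(path)
    graph = ToyGraph()
    # Steps 2 and 3: run backends in order, keeping one report per backend.
    reports = [backend.run(ctx, graph) for backend in backends]
    # Step 4: optional post-analysis pass that may return a condensed graph.
    if distill is not None:
        graph = distill(graph)
    return graph, reports


graph, reports = run_pipeline("book.xlsx", [SheetBackend(), PivotBackend()])
for report in reports:
    print(report)
```

Because every backend writes into the same graph object, a later backend can link to nodes an earlier one created, which is one plausible reason the run order matters.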

Graph-centric view

Common relationships

  • Sheet → FormulaCell (contains)
  • Sheet → CellBlock (contains)
  • PivotTable → PivotCache (uses_cache)
  • Connection → Source (uses_source)
  • PowerQuery → MScript (has_script)
  • Chart → DefinedName (uses_defined_name)
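One way to picture the relationships listed above is as typed edges between node ids. The sketch below is illustrative only: the Edge class, the "Kind:name" id format, the sample names, and both helper functions are assumptions, while the edge kinds (contains, uses_cache, and so on) come from the list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    kind: str
    dst: str


# One sample edge per relationship in the list; ids are made up.
EDGES = [
    Edge("Sheet:Sales", "contains", "FormulaCell:Sales!B2"),
    Edge("Sheet:Sales", "contains", "CellBlock:Sales!A1:D20"),
    Edge("PivotTable:pt1", "uses_cache", "PivotCache:1"),
    Edge("Connection:c1", "uses_source", "Source:db"),
    Edge("PowerQuery:q1", "has_script", "MScript:q1"),
    Edge("Chart:chart1", "uses_defined_name", "DefinedName:SalesRange"),
]


def index_by_kind(edges):
    """Group (src, dst) pairs under their edge kind."""
    grouped = defaultdict(list)
    for edge in edges:
        grouped[edge.kind].append((edge.src, edge.dst))
    return dict(grouped)


def neighbors(edges, src, kind):
    """Targets reachable from src over edges of the given kind."""
    return [e.dst for e in edges if e.src == src and e.kind == kind]


print(neighbors(EDGES, "Sheet:Sales", "contains"))
```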

Control gates

  • Options flags (include_*)
  • File-type checks (OOXML vs legacy)
  • Limits (max_sheets, max_cells_per_sheet)
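A rough sketch of how the three kinds of gate might combine. Only max_sheets and max_cells_per_sheet are names from the list above; ToyOptions, the specific include_* flag names, the default values, and the helper functions are hypothetical. The ZIP test is a coarse stand-in for a real OOXML check, since any ZIP archive passes it.

```python
import zipfile
from dataclasses import dataclass


@dataclass
class ToyOptions:
    include_vba: bool = True            # hypothetical include_* flag
    include_formulas: bool = True       # hypothetical include_* flag
    max_sheets: int = 100               # default value is made up
    max_cells_per_sheet: int = 10_000   # default value is made up


def looks_like_ooxml(path):
    """OOXML workbooks are ZIP containers; legacy .xls files are not.

    This only checks for a ZIP container, so it is necessary but not
    sufficient for a real OOXML workbook.
    """
    return zipfile.is_zipfile(path)


def should_run(flag_name, needs_ooxml, options, is_ooxml):
    """Apply the options-flag gate, then the file-type gate."""
    if not getattr(options, flag_name):
        return False
    if needs_ooxml and not is_ooxml:
        return False
    return True


def clamp_sheets(sheet_names, options):
    """Apply the max_sheets limit by truncating the sheet list."""
    return sheet_names[: options.max_sheets]


opts = ToyOptions(include_vba=False, max_sheets=2)
print(should_run("include_vba", True, opts, is_ooxml=True))
print(clamp_sheets(["A", "B", "C"], opts))
```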

Backend responsibilities

Backend               Primary artifacts                                    Notes
--------------------  ---------------------------------------------------  -----------------------------------------------
OOXMLZipBackend       Sheets, defined names, charts, connections, sources  Structural pass that builds the graph backbone.
VbaZipBackend         VBA project nodes                                    Extracts VBA module text when available.
PowerQueryZipBackend  Power Query nodes, scripts, sources                  Parses XML + detects mashup containers.
PivotZipBackend       Pivot tables, caches                                 Links pivots to caches + connections.
CalamineBackend       Cell blocks                                          Requires excelminer[calamine].
OpenpyxlBackend       Formula cells                                        Formula text, no evaluation.
ComBackend            Enrichment                                           Windows + Excel required.
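The two requirement notes in the table (the calamine extra, and Windows + Excel) suggest that some backends are skipped when their prerequisites are missing. The sketch below shows one way such a check could look. It rests on several assumptions: that the table order matches the run order, that the calamine extra provides a module importable as python_calamine, and that a platform check is an acceptable proxy for "Windows + Excel" (it cannot tell whether Excel is installed). available_backends and module_available are hypothetical helpers.

```python
import importlib.util
import sys

# Backend names from the table; treating this as the run order is an assumption.
TABLE_ORDER = [
    "OOXMLZipBackend",
    "VbaZipBackend",
    "PowerQueryZipBackend",
    "PivotZipBackend",
    "CalamineBackend",
    "OpenpyxlBackend",
    "ComBackend",
]


def module_available(name):
    """True if a top-level module with this name can be imported."""
    return importlib.util.find_spec(name) is not None


def available_backends(platform=sys.platform, has_module=module_available):
    """Return (runnable, skipped) where skipped maps a name to a reason."""
    skipped = {}
    # Assumed module name for the optional calamine extra.
    if not has_module("python_calamine"):
        skipped["CalamineBackend"] = "optional calamine extra not installed"
    # Platform check only; it does not verify that Excel is installed.
    if platform != "win32":
        skipped["ComBackend"] = "requires Windows + Excel"
    runnable = [name for name in TABLE_ORDER if name not in skipped]
    return runnable, skipped


runnable, skipped = available_backends()
print(runnable)
print(skipped)
```

Passing platform and has_module as parameters keeps the check testable on any machine, for example available_backends(platform="linux", has_module=lambda name: False).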