TraceBi

Code-first, traceable BI and analytics framework for Python

v0.5.0  ·  163 tests passing  ·  Phase 1–4 complete

What is TraceBi?

TraceBi is a Python library for building BI and analytics workflows entirely in code — no GUI builders, no config files. Every data operation from raw ingest to formatted report is tracked in a full, immutable lineage chain. You always know exactly where a number came from and how it was calculated.

Source DB / CSV / Memory
Bronze
Silver
Gold
Reports & Dashboard

The same DataModel and lineage chain powers every output — whether that's an Excel file, an HTML report, a live Dash dashboard, or a scheduled pipeline writing back to a database.

Phases

✓ Done

Phase 1 — Foundation

Connectors, DataModel, DataSet with immutable lineage chain

✓ Done

Phase 2 — Report Engine

Excel + HTML renderers, lineage manifest saved with every render

✓ Done

Phase 2.5 — Medallion + Star Schema

Bronze/Silver/Gold layers, StarSchema queries, LineageDiagram

✓ Done

Phase 3 — Dashboard

Live Dash app with associative filters driven by the same DataModel

✓ Done

Phase 4 — Pipelines

PipelineRunner with APScheduler, DB write-back, cross-layer lineage

✓ Done

Phase 5 — Web UI

FastAPI + Jinja2 browser UI with embedded Dash dashboards, medallion-aware demo, and registry pattern

Connectors

Connectors are the entry point to any data source. Every connector implements connect(), load(source), and write(df, table). They are registered in a DataModel by name.

ClassSourceExtra depNotes
CSVConnector CSV / Excel files openpyxl directory= + filename as source
SQLConnector Any SQL database sqlalchemy SQLite, Postgres, MySQL, etc. Supports write()
BigQueryConnector Google BigQuery google-cloud-bigquery Uses Application Default Credentials
SnowflakeConnector Snowflake snowflake-connector-python Keyword auth supported
MemoryConnector In-memory DataFrames none Dict of DataFrames. Supports write(). Used in tests and demos.
from tracebi import SQLConnector, MemoryConnector, DataModel

db = SQLConnector("sales_db", url="sqlite:///data/tracebi.db")

model = DataModel("SalesModel")
model.add_connector(db)
model.add_table("orders",    connector="sales_db", source="orders")
model.add_table("customers", connector="sales_db", source="customers")
model.add_relationship("orders_customers", "orders", "customers",
                        left_key="customer_id", how="left")

DataSet & Lineage

DataSet is an immutable pandas DataFrame wrapper. Every fluent operation returns a new DataSet — the original is never mutated. Each operation appends a LineageNode to the chain.

Available operations

MethodWhat it does
.filter(expr)pandas query string — records rows before/after
.transform(func)arbitrary (DataFrame) → DataFrame
.sort(by, ascending)sort by one or more columns
.select(columns)keep only the specified columns
.rename(columns){old: new} column mapping
.print_lineage()pretty-print the full lineage chain
.fingerprint()MD5 hash of DataFrame content
orders = (
    model.load("orders")
    .filter("status == 'shipped'", description="Shipped orders only")
    .transform(
        lambda df: df.assign(margin=df["revenue"] - df["cost"]),
        description="margin = revenue - cost",
    )
    .sort("margin", ascending=False)
)

orders.print_lineage()
# ============================================================
#   Lineage for DataSet: 'orders'   Shape: 198 rows × 7 cols
# ============================================================
#   Step 1: [LOAD]       Loaded 'orders' from connector 'sales_db'
#   Step 2: [FILTER]     Shipped orders only  (250 → 198 rows)
#   Step 3: [TRANSFORM]  margin = revenue - cost
#   Step 4: [SORT]       Sorted by margin (desc)
# ============================================================

LineageNode fields

FieldTypeDescription
operationstrload, filter, transform, join, bronze, silver, gold, …
descriptionstrhuman-readable step description
connectordictconnector name + type (load/bronze steps)
sourcestrtable name, file path, or query
timestampstrUTC ISO-8601, auto-set
metadatadictrows_before/after, agg spec, cast map, …

Lineage node colors

Used in LineageDiagram to color-code each step:

load
bronze
silver
gold
filter
transform
join
sort
select
rename

Report Engine (Phase 2)

Reports are built with a fluent builder and rendered to Excel or HTML. Every render saves a .manifest.json alongside the output file containing the full lineage for each section's dataset.

Section types

  • TextSection — heading, paragraph, or note
  • TableSection — DataFrame table with optional totals row
  • ChartSection — line, bar, or pie chart via matplotlib
  • .spacer() — blank row between sections

Renderers

  • ExcelRenderer().render(report, path) — multi-sheet .xlsx
  • HTMLRenderer().render(report, path) — standalone .html
  • HTMLRenderer().serve(report, port=8080) — local browser
  • HTMLRenderer().preview(report) — Jupyter inline
from tracebi.reports.report import Report, TextSection, TableSection, ChartSection
from tracebi.reports.excel_renderer import ExcelRenderer
from tracebi.reports.html_renderer import HTMLRenderer

report = (
    Report("Q2 Sales Report")
    .author("Data Team")
    .parameter("period", "Q2 2024")
    .add(TextSection(title="Summary", content="Summary", style="heading1"))
    .add(TextSection(content="Revenue up 12% vs Q1.", style="normal"))
    .add(ChartSection(title="Revenue Trend", dataset=trend_ds,
                      chart_type="line", x="month", y="revenue"))
    .add(TableSection(title="Orders", dataset=orders_ds,
                      columns=["region", "product", "revenue"],
                      totals=["revenue"],
                      number_formats={"revenue": "{:,.2f}"}))
)

ExcelRenderer().render(report, "output/q2_sales.xlsx")
HTMLRenderer().serve(report, port=8080)   # opens browser

Medallion Architecture (Phase 2.5)

Three distinct layer classes enforce the Bronze → Silver → Gold contract. Each layer can run standalone or as part of a PipelineRunner that coordinates scheduling and writes results back to a database.

Bronze — Raw ingest

Load data from any connector with zero transforms. Stamps lineage with operation="bronze" + ingestion timestamp.

Silver — Clean

Declarative pipeline: .cast(), .drop_nulls(), .deduplicate(), .rename(), .transform(). Each step adds a operation="silver" lineage node.

Gold — Aggregated

Delegates to StarSchema.query(). Stamps operation="gold" with query parameters in metadata.

StarSchema

Dimension references use dot notation: "dim_customer.region". Measures are a dict: {"revenue": "sum", "order_id": "count"}. Supported aggregations: sum, count, mean, min, max, nunique.

from tracebi.model.star_schema import StarSchema

schema = StarSchema("Sales", model=model)
schema.add_dimension("dim_customer", table_name="customers_silver",
                     key_col="customer_id", attributes=["region", "segment"])
schema.add_fact("fact_orders", table_name="orders_silver",
                measures=["revenue", "qty"],
                foreign_keys={"dim_customer": "customer_id"})

ds = schema.query(
    fact="fact_orders",
    measures={"revenue": "sum", "qty": "sum"},
    dimensions=["dim_customer.region"],
    filters={"status": "shipped"},
    aggregate=True,
)

Lineage Diagram

from tracebi.lineage.diagram import LineageDiagram

diag = LineageDiagram(ds)        # or LineageDiagram(report)
diag.show()                       # matplotlib window / Jupyter inline
diag.to_html("lineage.html")      # standalone HTML with embedded SVG
print(diag.to_mermaid())          # paste into GitHub markdown

Pipeline Runner (Phase 4)

PipelineRunner registers each layer with its own independent cron schedule and persists every run to a SQLite database. Layers are linked via depends_on, enabling refresh=True to walk the full chain.

System tables in SQLite

  • tracebi_layers — registered layer config + schedules
  • tracebi_runs — every run: status, rows_in/out, timestamps, upstream_run_id
  • tracebi_schemas — StarSchema definitions
  • tracebi_dimensions — dimension table registrations
  • tracebi_facts — fact table registrations
  • tracebi_relationships — DataModel join definitions

PipelineRunner API

  • .register(layer, name, schedule, depends_on)
  • .run(name, refresh=False) — on-demand
  • .start(blocking=True) — APScheduler
  • .stop()
  • .lineage(name) — print run history
  • .status() — all layers + last run
  • .register_schema(schema)
  • .register_model(model)
from tracebi.pipeline.runner import PipelineRunner

runner = PipelineRunner(db_url="sqlite:///data/tracebi.db")

# Each layer registered with its own schedule
runner.register(orders_bronze,   name="orders_bronze",     schedule="0 * * * *")
runner.register(orders_silver,   name="orders_silver",     schedule="15 * * * *",
                depends_on="orders_bronze")
runner.register(revenue_by_region, name="revenue_by_region", schedule="30 6 * * *",
                depends_on="orders_silver")

# On-demand: single layer only
runner.run("orders_silver")

# On-demand: full chain refresh (bronze → silver → gold)
runner.run("revenue_by_region", refresh=True)

# Start all scheduled jobs (blocking — Ctrl+C to stop)
runner.start()

Cross-layer lineage

Each run record stores an upstream_run_id pointing to the most recent successful run of its upstream layer. This means you can always trace a gold table row back through silver → bronze to the exact source ingestion.

runner.lineage("revenue_by_region")
# ============================================================
#   Pipeline Lineage — 'revenue_by_region'
#   Chain: orders_bronze → orders_silver → revenue_by_region
# ============================================================
#   [orders_bronze]  schedule=0 * * * *  runs=3
#     ✓ run_id=  1  2026-05-12T02:27:22  rows 11 → 11  [success]
#
#   [orders_silver]  schedule=15 * * * * ← orders_bronze  runs=2
#     ✓ run_id=  3  2026-05-12T02:27:25  rows 11 → 10  [success]  upstream_run=1
#
#   [revenue_by_region]  schedule=30 6 * * * ← orders_silver  runs=1
#     ✓ run_id=  5  2026-05-12T02:27:27  rows 10 → 4   [success]  upstream_run=3

One-command database setup

$ python seeds/seed_db.py

This script: creates data/tracebi.db, seeds source tables (orders_raw, customers_raw), registers all six layers with the runner, and runs the initial Bronze load so the DB is ready immediately.

Dashboard (Phase 3)

A live Dash 2.x application driven by the same DataModel. Filters are associative — selecting a value in any FilterPanel automatically filters every panel whose dataset contains that column.

Panel types

  • FilterPanel — dropdown filter (single or multi-select)
  • MetricPanel — KPI card with aggregation (sum/count/mean)
  • ChartPanel — bar, line, or pie chart via Plotly
  • TablePanel — paginated data table

Panel data sources

Each panel accepts either:

  • A pre-built dataset=DataSet
  • A table_name= string — loaded via DataModel.load() on each render, always fresh
  • An optional transform_fn= applied after loading
from tracebi.dashboard import Dashboard, DashboardServer
from tracebi.dashboard import FilterPanel, MetricPanel, ChartPanel, TablePanel

dashboard = (
    Dashboard("Q2 Sales Dashboard")
    .columns(2)
    .add_filter(FilterPanel("region-filter", label="Region",
                            column="region", table_name="orders"))
    .add_panel(MetricPanel("total-revenue", title="Total Revenue",
                           table_name="orders", column="revenue",
                           aggregation="sum", prefix="$"))
    .add_panel(ChartPanel("by-region", title="Revenue by Region",
                          table_name="orders", chart_type="bar",
                          x="region", y="revenue"))
    .add_panel(TablePanel("orders-table", title="Orders",
                          table_name="orders",
                          columns=["order_id", "region", "revenue"],
                          width=2))   # span both columns
)

DashboardServer(dashboard, model=model).run(port=8050)
# Open http://localhost:8050/

Web UI (Phase 5)

A FastAPI + Jinja2 browser interface over your TraceBi registry. Connectors, models, reports, pipelines, and live Dash dashboards — all accessible from one server on a single port.

Pages

  • Home — registry summary (connector / model / report counts)
  • Connectors — registered sources and table lists
  • Models — DataModel tables and relationships
  • Reports — render any registered report to HTML or Excel
  • Pipelines — Bronze/Silver/Gold layer status, run history, ▶ Run button
  • Dashboards — lists registered dashboards; each opens the live Dash app

How dashboards are embedded

Each DashboardServer is mounted inside FastAPI at /dashboards/<name>/ using Starlette's WSGIMiddleware. One port, no second server. The standalone DashboardServer.run() path is unaffected.

# Install web dependencies
pip install -r web/requirements.txt

# Start the dev server (hot-reload on)
python web/run.py
# → http://localhost:8000

# Point at your own app module
TRACEBI_APP=mypackage.tracebi_config python web/run.py

Registry pattern

Everything flows through a single Registry singleton. Your app module (default: web/demo_app.py) registers objects at startup; every API route reads from the registry at request time.

from web.api.registry import registry

registry.add_connector(connector)
registry.add_model(model)

@registry.report("my_report", description="...")
def my_report():
    ds = model.load("orders")
    return Report("My Report").add(TableSection(dataset=ds, title="Orders"))

registry.add_dashboard("sales", DashboardServer(dashboard, model=model))
registry.add_pipeline("main", runner)

Medallion-aware demo

demo_app.py detects data/tracebi.db at startup and adapts:

Project Structure

tracebi/
  tracebi/ ← the library
    connectors/ base, csv, sql, bigquery, snowflake, memory
    model/ dataset.py, data_model.py, star_schema.py
    etl/ bronze.py, silver.py, gold.py
    reports/ report.py, excel_renderer.py, html_renderer.py
    dashboard/ panels.py, dashboard.py, server.py
    pipeline/ runner.py
    lineage/ diagram.py
  web/ ← browser UI
    api/ FastAPI app, routers, registry singleton
    templates/ Jinja2 HTML templates (base, reports, pipelines, dashboards…)
    demo_app.py medallion demo — auto-detects SQLite, falls back to in-memory
    run.py dev server entrypoint (uvicorn + hot-reload)
    requirements.txt fastapi, jinja2, uvicorn, python-multipart
  examples/ phase1–4 runnable demos
  tests/ 163 tests across 5 files
  seeds/ seed_db.py — one-command DB setup
  requests/ _template.py + ad hoc report scripts
  data/ tracebi.db lives here (gitignored)
  NOTES.md design decisions + architecture reference
  pyproject.toml v0.5.0, optional dep groups

Install groups

GroupCommandIncludes
reportspip install -e ".[reports]"openpyxl, matplotlib
dashboardpip install -e ".[dashboard]"dash, plotly
pipelinepip install -e ".[pipeline]"apscheduler, sqlalchemy
lineagepip install -e ".[lineage]"networkx, matplotlib
sqlpip install -e ".[sql]"sqlalchemy

Quick Start

Run the examples

$ python examples/phase1_example.py # connectors + DataModel + lineage
$ python examples/phase2_example.py # report engine (opens browser)
$ python examples/phase25_example.py # medallion + star schema + lineage diagram
$ python examples/phase3_example.py # live Dash dashboard at localhost:8050
$ python seeds/seed_db.py # set up SQLite DB
$ python examples/phase4_example.py # full pipeline

Run tests

$ pytest tests/ # 163 passed

New ad hoc report

$ cp requests/_template.py requests/2024_q3_sales.py
# Edit the file, fill in connect / datasets / report / render sections
$ python requests/2024_q3_sales.py

Run the web UI

$ pip install -r web/requirements.txt # first time only
$ python seeds/seed_db.py # seed SQLite DB (first time only)
$ python -c "from seeds.seed_db import build_pipeline, DB_URL; r,_,_=build_pipeline(DB_URL); r.run('orders_silver'); r.run('customers_silver')" # run Silver layer
$ python web/run.py # → http://localhost:8000

Pull latest (after a session)

$ cd ~/Downloads/tracebi && git pull