{% extends "base.html" %} {% block title %}TraceBi — Home{% endblock %} {% block content %}
Code-first · Traceable · Open Source

Build analytics pipelines that explain themselves.

TraceBi is a Python framework for building BI workflows — from raw data ingestion to star schemas, reports, and dashboards — where every transformation is tracked with a full lineage chain. No black boxes. No mystery queries.

▤ View Reports ⊞ Explore Models ⧖ Pipelines
{{ connector_count }}
Connector{{ 's' if connector_count != 1 }}
{{ model_count }}
Data Model{{ 's' if model_count != 1 }}
{{ report_count }}
Report{{ 's' if report_count != 1 }}
{{ pipeline_count }}
Pipeline{{ 's' if pipeline_count != 1 }}
Core concepts

Four ideas that everything else is built on. Understand these and the rest follows.

1
DataSet — immutable by design

The core data container in TraceBi is the DataSet — a thin wrapper around a pandas DataFrame. The critical rule: every operation returns a new DataSet. Nothing mutates the original.

This means you can chain transformations freely — filter, transform, sort, rename, select — and each step produces a clean, independent result. If you filter the wrong rows, the original is still there. If you want to branch a dataset two different ways, both branches start from the same immutable source.

# Every step returns a NEW DataSet — nothing mutates
orders = model.load("orders")

shipped = orders.filter("status == 'shipped'")
with_margin = shipped.transform(
  lambda df: df.assign(margin=df["revenue"] - df["cost"])
)
top10 = with_margin.sort("margin", ascending=False)

# Original is unchanged — still has all statuses
orders.shape # same as when first loaded
2
Lineage — every step is recorded

Every DataSet carries a lineage list — a chain of LineageNode records. Each node records the operation type, a human-readable description, row counts before and after, and a timestamp.

Because each operation appends a new node rather than replacing anything, a DataSet at the end of a chain holds the complete history of everything that produced it — connector, filters, joins, transforms, aggregations — traceable back to the original source.

top10.print_lineage()

# Step 1: [LOAD] Loaded 'orders' from 'sales_db'
# Step 2: [FILTER] Shipped orders only (250 → 198 rows)
# Step 3: [TRANSFORM] margin = revenue - cost
# Step 4: [SORT] Sorted by margin (desc)

# Or visualise as a DAG diagram
from tracebi.lineage.diagram import LineageDiagram

LineageDiagram(top10).to_html("lineage.html")
3
Medallion architecture — structured ETL

TraceBi structures pipelines as three named layers, each with a distinct purpose:

Bronze Raw ingest — data copied as-is from the source. No transforms. Acts as the permanent audit record of exactly what arrived.
Silver Declarative cleaning — type casting, null removal, deduplication, column renames. Produces a trusted, analysis-ready table.
Gold Aggregated analytics — groups measures by dimensions via a StarSchema query. The output feeds reports and dashboards directly.

Each layer reads from the previous, writes to a configurable sink (SQL table, CSV, memory), and records its lineage. Every layer can be scheduled independently and re-run on demand.

# Bronze — raw ingest, zero transforms
bronze = BronzeLayer(
  connector=db, source="orders_raw",
  sink=db, sink_table="orders_bronze",
)

# Silver — declarative cleaning
silver = (
  SilverLayer(source=db, source_table="orders_bronze",
             sink=db, sink_table="orders_silver")
  .cast({"qty": "int64", "order_date": "datetime64[ns]"})
  .drop_nulls(subset=["order_id"])
  .deduplicate(subset=["order_id"])
)

# Gold — aggregated via StarSchema
gold = GoldLayer(
  schema=schema, fact="fact_orders",
  measures={"revenue": "sum"},
  dimensions=["dim_customer.region"],
  sink=db, sink_table="gold_revenue_by_region",
)
4
Star Schema — declarative analytics

A StarSchema sits above the connector layer and adds BI semantics. You declare two types of tables:

Facts Transactional tables with numeric measures to aggregate (revenue, qty, count).
Dimensions Lookup tables with categorical attributes to group by (region, segment, product).

Once defined, schema.query() is fully declarative. You describe the result you want — which measures, grouped by which dimension attributes, filtered how — and TraceBi resolves all the joins, applies filters, and aggregates automatically. You never write the join logic by hand.

schema = StarSchema("Sales", model=model)

schema.add_dimension(
  "dim_customer",
  table_name="customers_silver",
  key_col="customer_id",
  attributes=["region", "segment"],
)
schema.add_fact(
  "fact_orders",
  table_name="orders_silver",
  measures=["revenue", "qty"],
  foreign_keys={"dim_customer": "customer_id"},
)

# Declarative query — joins resolved automatically
ds = schema.query(
  fact="fact_orders",
  measures={"revenue": "sum", "qty": "sum"},
  dimensions=["dim_customer.region"],
  filters={"status": "shipped"},
)
What TraceBi includes

Six building blocks — each independent, all composable.

Connectors

CSV, SQL (any SQLAlchemy dialect), BigQuery, Snowflake, and in-memory DataFrames — all through a single connector.load(name) interface. Swap sources without touching transform code.

Medallion ETL

Bronze → Silver → Gold layers, each writing to a configurable sink. Declarative cleaning in Silver (cast, deduplicate, drop nulls), StarSchema-powered aggregation in Gold.

Star Schema

Declare facts and dimensions once. Query with dot-notation dimension references ("dim_customer.region"). Auto-joins, filters, and aggregates — no SQL.

Reports

Compose reports from TextSection, TableSection, and ChartSection blocks. Render to Excel or HTML. A lineage manifest is written alongside every render.

Pipelines

Register layers, assign cron schedules, declare dependencies between layers. Run history is persisted to SQLite with row counts and upstream run IDs linking the full cross-layer chain.

Dashboards

Interactive Plotly/Dash dashboards with associative filters — selecting one panel auto-filters all others that share the same column. Mounted directly inside the TraceBi web server.

End-to-end walkthrough

A complete pipeline from raw data to scheduled report — six steps.

Connect to your data

The fastest way to get started is the built-in demo model — no files, no database, no setup. It comes pre-loaded with sample orders, customers, and trend data so you can follow every step of this walkthrough immediately after pip install tracebi.

from tracebi.demo import load_demo_model

model = load_demo_model()
# Ready — tables: orders, customers, trend

When you're ready to connect to your own data, register connectors and logical table names in a DataModel. You can mix sources — orders from SQL, a lookup from CSV, a reference table from BigQuery — and reference them all by name in your transforms. Call model.connect() once to open all connections.

from tracebi import DataModel, SQLConnector, CSVConnector

db = SQLConnector("sales_db", url="sqlite:///path/to/sales.db")
csv = CSVConnector("lookups", directory="path/to/data/")

model = DataModel("SalesModel")
model.add_connector(db)
model.add_connector(csv)
model.add_table("orders", connector="sales_db", source="orders")
model.add_table("customers", connector="sales_db", source="customers")
model.add_table("regions", connector="lookups", source="regions.csv")
model.connect()
Load and transform with full lineage

Every method on DataSet — filter, transform, sort, select, rename — returns a new immutable DataSet with the step appended to its lineage chain. Call print_lineage() at any point to see the full audit trail, or fingerprint() to get a hash of the current data state.

orders = model.load("orders")

result = (
  orders
  .filter("status == 'shipped'", description="Shipped orders only")
  .transform(
    lambda df: df.assign(margin=df["revenue"] - df["cost"]),
    description="margin = revenue - cost",
  )
  .sort("margin", ascending=False)
  .select(["order_id", "region", "product", "revenue", "margin"])
)

result.print_lineage()
# Step 1: [LOAD] Loaded 'orders' from 'sales_db'
# Step 2: [FILTER] Shipped orders only (250 → 198 rows)
# Step 3: [TRANSFORM] margin = revenue - cost
# Step 4: [SORT] Sorted by margin (desc)
# Step 5: [SELECT] Selected 5 columns
Structure as a Medallion pipeline

For production pipelines, structure your work as Bronze → Silver → Gold layers. Each layer reads from the previous sink and writes its output to the next. SilverLayer transforms are declared once and reused on every run. GoldLayer queries a StarSchema to produce the final analytic output.

from tracebi import BronzeLayer, SilverLayer, GoldLayer
from tracebi.model.star_schema import StarSchema

bronze = BronzeLayer(connector=db, source="orders_raw",
                   sink=db, sink_table="orders_bronze")

silver = (
  SilverLayer(source=db, source_table="orders_bronze",
             sink=db, sink_table="orders_silver")
  .cast({"qty": "int64", "order_date": "datetime64[ns]"})
  .drop_nulls(subset=["order_id"])
  .deduplicate(subset=["order_id"])
)

schema = StarSchema("Sales", model=model)
schema.add_dimension("dim_customer", table_name="customers_silver",
                   key_col="customer_id", attributes=["region", "segment"])
schema.add_fact("fact_orders", table_name="orders_silver",
              measures=["revenue", "qty"],
              foreign_keys={"dim_customer": "customer_id"})

gold = GoldLayer(schema=schema, fact="fact_orders",
               measures={"revenue": "sum"}, dimensions=["dim_customer.region"],
               sink=db, sink_table="gold_by_region")
Build and render a report

A Report is assembled from section objects — text headings, tables, and charts. Pass any DataSet directly into a TableSection or ChartSection. Render to Excel or HTML; both renderers write a manifest.json alongside the file capturing the full lineage of every dataset in the report.

from tracebi.reports.report import Report, TextSection, TableSection, ChartSection
from tracebi.reports.excel_renderer import ExcelRenderer
from tracebi.reports.html_renderer import HTMLRenderer

report = (
  Report("Q2 Sales Report")
  .author("Data Team")
  .parameter("period", "Q2 2024")
  .add(TextSection(title="Summary", content="Summary", style="heading1"))
  .add(ChartSection(title="Revenue by Region", dataset=gold_ds,
                   chart_type="bar", x="dim_customer.region", y="revenue"))
  .add(TableSection(title="Detail", dataset=gold_ds,
                   columns=["dim_customer.region", "revenue"],
                   totals=["revenue"]))
)

ExcelRenderer().render(report, "output/q2_sales.xlsx")
HTMLRenderer().render(report, "output/q2_sales.html")
HTMLRenderer().serve(report, port=8080) # open in browser
HTMLRenderer().preview(report) # inline in Jupyter
Schedule with a PipelineRunner

Register all your layers with a PipelineRunner. Assign each layer a cron schedule and optionally declare which layer it depends on. Run any layer on demand with runner.run(name), or pass refresh=True to cascade through all upstream dependencies first. Every run is persisted to SQLite with row counts and an upstream_run_id linking the full cross-layer chain.

from tracebi.pipeline.runner import PipelineRunner

runner = PipelineRunner(db_url="sqlite:///data/tracebi.db")

runner.register(bronze, name="orders_bronze", schedule="0 * * * *")
runner.register(silver, name="orders_silver", schedule="15 * * * *",
               depends_on="orders_bronze")
runner.register(gold, name="revenue_by_region", schedule="30 6 * * *",
               depends_on="orders_silver")

# Run one layer on demand
runner.run("orders_silver")

# Full refresh — runs bronze → silver → gold in order
runner.run("revenue_by_region", refresh=True)

# Start the APScheduler (blocking)
runner.start()
Add a live dashboard

Build a Dashboard from panel components — metrics, charts, tables, and filters. Filters are associative: selecting a region in the filter panel automatically updates every chart and table that shares that column, without any event-wiring code. Mount the dashboard into the TraceBi web server via DashboardServer — no separate Dash process required.

from tracebi.dashboard import Dashboard, DashboardServer
from tracebi.dashboard import FilterPanel, MetricPanel, ChartPanel, TablePanel

dashboard = (
  Dashboard("Sales Dashboard").columns(2)
  .add_filter(FilterPanel("region-filter", label="Region",
                       column="region", table_name="orders"))
  .add_panel(MetricPanel("total-revenue", title="Total Revenue",
                      table_name="orders", column="revenue",
                      aggregation="sum", prefix="$"))
  .add_panel(ChartPanel("by-region", title="Revenue by Region",
                      table_name="orders", chart_type="bar",
                      x="region", y="revenue"))
)

# Standalone
DashboardServer(dashboard, model=model).run(port=8050)

# Or mount into the web server (registry.add_dashboard)
How to use it

Two ways to work with TraceBi — pick what fits your workflow.

Option 1 — Python library

Install TraceBi and use it from a notebook or script. Define your connectors, build your medallion pipeline, query your star schema, and render reports to HTML or Excel — no web server required.

pip install -e ".[reports,pipeline,lineage,sql]"

Then follow the walkthrough above. Run any example to see it in action:

python examples/phase25_example.py
Option 2 — Web UI (this app)

Register your connectors, models, reports, and pipelines in web/demo_app.py. The web server exposes a UI for running reports, browsing pipelines, and viewing lineage — plus a full REST API.

# web/demo_app.py
registry.add_connector(my_connector)
registry.add_model(my_model)
registry.add_pipeline("sales", runner)

@registry.report("my_report")
def my_report(): ...
TRACEBI_APP=mypackage.config \
  python -m uvicorn web.api.main:app --reload
Full lineage, always

Every DataSet carries a chain of LineageNode records describing exactly which connector, transformation, and timestamp produced each result. Run any report here to visualise the full DAG.

Explore lineage →
REST API included

Every UI feature is backed by a documented REST API. Trigger report runs, kick pipeline layers, and pull lineage data programmatically — without touching the UI.

Swagger UI ReDoc
{% endblock %}