18  Lineage Tracking

Track transformation history and understand data provenance using the @lineage mode.

18.1 Example 1: Basic Lineage Tracking

Enable lineage tracking by adding lineage=True to your operations, then use @lineage mode to see the transformation history.

import pandas as pd
import additory as add

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})

# Perform operation with lineage tracking enabled
result = add.transform(
    '@calc',
    df,
    columns=['price', 'quantity'],
    expression='price * quantity',
    as_='total',
    lineage=True  # Enable lineage tracking
)

# Get lineage report
lineage_report = add.scan(
    '@lineage',
    result
)

print(lineage_report)

Output:

═══════════════════════════════════════════════════════════════
LINEAGE REPORT
═══════════════════════════════════════════════════════════════

DataFrame: 3 rows × 3 columns
Operations: 1 transformations applied

───────────────────────────────────────────────────────────────
Step 1: add.transform - 2026-03-11T09:53:59
───────────────────────────────────────────────────────────────
  Rows: 3 → 3 (no change)
  Columns Added: total
  
  Parameters:
    columns: ["price","quantity"]
    expression: ["price * quantity"]
    strategy: null
    by: null

Key Points:

  • Add lineage=True to any operation (transform, to, synthetic)
  • Lineage metadata is stored with the DataFrame
  • Use @lineage mode to generate a human-readable report
  • The report shows operation history, parameters, and column changes

Note: This also works with polars DataFrames.


18.2 Example 2: Multi-Step Lineage Tracking

Lineage accumulates across multiple operations, giving you a complete transformation history.

import pandas as pd
import additory as add

df = pd.DataFrame({
    'price': [100, 200, 150],
    'cost': [60, 120, 90],
    'quantity': [2, 1, 3]
})

# Step 1: Calculate profit with lineage tracking
result = add.transform(
    '@calc',
    df,
    columns=['price', 'cost'],
    expression='price - cost',
    as_='profit',
    lineage=True
)

# Step 2: Calculate total revenue (lineage continues)
result = add.transform(
    '@calc',
    result,
    columns=['price', 'quantity'],
    expression='price * quantity',
    as_='revenue',
    lineage=True
)

# Get complete lineage report
lineage_report = add.scan(
    '@lineage',
    result
)

print(lineage_report)

Output:

═══════════════════════════════════════════════════════════════
LINEAGE REPORT
═══════════════════════════════════════════════════════════════

DataFrame: 3 rows × 5 columns
Operations: 2 transformations applied

───────────────────────────────────────────────────────────────
Step 1: add.transform - 2026-03-11T10:15:23
───────────────────────────────────────────────────────────────
  Rows: 3 → 3 (no change)
  Columns Added: profit
  
  Parameters:
    columns: ["price","cost"]
    expression: ["price - cost"]

───────────────────────────────────────────────────────────────
Step 2: add.transform - 2026-03-11T10:15:23
───────────────────────────────────────────────────────────────
  Rows: 3 → 3 (no change)
  Columns Added: revenue
  
  Parameters:
    columns: ["price","quantity"]
    expression: ["price * quantity"]

Key Insights:

  • Lineage persists across multiple operations
  • Each step is numbered and timestamped
  • You can see the complete transformation pipeline
  • Useful for debugging complex data workflows

Note: This also works with polars DataFrames.
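
When a pipeline has many steps, it is easy to leave lineage=True off one of them. The helper below is a minimal sketch of one way to avoid that. The names run_calc_steps and STEPS are hypothetical and not part of additory, and the transform function is passed in as an argument so the sketch relies only on the call shape shown in this example.

```python
from typing import Any, Callable, Iterable, Mapping

# Hypothetical step definitions mirroring the two '@calc' calls above.
STEPS = [
    {'columns': ['price', 'cost'], 'expression': 'price - cost', 'as_': 'profit'},
    {'columns': ['price', 'quantity'], 'expression': 'price * quantity', 'as_': 'revenue'},
]


def run_calc_steps(
    df: Any,
    steps: Iterable[Mapping[str, Any]],
    transform: Callable[..., Any],
) -> Any:
    """Apply each '@calc' step in order, always passing lineage=True."""
    result = df
    for step in steps:
        result = transform(
            '@calc',
            result,
            columns=step['columns'],
            expression=step['expression'],
            as_=step['as_'],
            lineage=True,  # set once here so no step is left untracked
        )
    return result
```

With the DataFrame from this example, the call would be run_calc_steps(df, STEPS, add.transform), and add.scan('@lineage', ...) on the result should then list both steps.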


18.3 Example 3: Lineage Without Tracking - Error Handling

If you forget to enable lineage tracking, @lineage mode raises a ValueError whose message explains how to turn it on.

import pandas as pd
import additory as add

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6]
})

# Transform WITHOUT lineage tracking
result = add.transform(
    '@calc',
    df,
    columns=['a', 'b'],
    expression='a + b',
    as_='c'
    # Note: lineage=True is missing!
)

# Try to get lineage report
try:
    lineage_report = add.scan('@lineage', result)
except ValueError as e:
    print(e)

Output:

No lineage metadata found. Lineage tracking must be enabled by adding
lineage=True to add.to(), add.transform(), or add.synthetic() calls.

Example:
  df = add.transform('@calc', df, strategy={'total': 'price * qty'}, lineage=True)
  result = add.scan('@lineage', df)

Error Handling:

  • Clear error message when lineage is missing
  • Provides an example of how to enable lineage
  • Helps you quickly fix the issue

Note: This also works with polars DataFrames.
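
In code that may receive DataFrames with or without lineage, the try/except above can be wrapped once. This is a sketch only: lineage_or_none is a hypothetical helper, not an additory function. It assumes, as in the example above, that a missing-lineage DataFrame raises ValueError. The scan function is passed in so the sketch does not depend on anything beyond that behavior.

```python
from typing import Any, Callable, Optional


def lineage_or_none(scan: Callable[..., Any], df: Any) -> Optional[Any]:
    """Return the '@lineage' report, or None if scan raises ValueError."""
    try:
        return scan('@lineage', df)
    except ValueError:
        # Assumption: ValueError here means "no lineage metadata found".
        # Any other ValueError raised by scan would be hidden as well.
        return None
```

A call such as lineage_or_none(add.scan, result) then returns None for the untracked DataFrame in this example instead of raising.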


18.4 Lineage Tracking Features

18.4.1 What Gets Tracked

  • Operation Type: Which function was called (transform, to, synthetic)
  • Timestamp: When the operation occurred
  • Parameters: All operation parameters
  • Row Changes: Rows before and after
  • Column Changes: Columns added or modified
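
To make the list above concrete, the sketch below models one tracked step as a plain Python record. It is purely illustrative: LineageStep and its field names are assumptions for this example and do not describe how additory stores lineage internally.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class LineageStep:
    """Hypothetical record holding the five tracked items for one operation."""
    operation: str               # operation type, e.g. 'add.transform'
    timestamp: datetime          # when the operation occurred
    parameters: Dict[str, Any]   # parameters passed to the operation
    rows_before: int             # row count before the operation
    rows_after: int              # row count after the operation
    columns_added: List[str] = field(default_factory=list)

    def summary(self) -> str:
        """One-line description of the row and column changes."""
        delta = self.rows_after - self.rows_before
        change = "no change" if delta == 0 else f"{delta:+d}"
        added = ", ".join(self.columns_added) or "none"
        return (
            f"{self.operation}: rows {self.rows_before} -> {self.rows_after} "
            f"({change}); columns added: {added}"
        )


# Values taken from Step 1 of the report in Example 1.
step = LineageStep(
    operation='add.transform',
    timestamp=datetime(2026, 3, 11, 9, 53, 59),
    parameters={'columns': ['price', 'quantity'], 'expression': ['price * quantity']},
    rows_before=3,
    rows_after=3,
    columns_added=['total'],
)
print(step.summary())
```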

18.4.2 When to Use Lineage

  • Debugging: Understand how data was transformed
  • Auditing: Track data provenance for compliance
  • Documentation: Auto-generate transformation documentation
  • Collaboration: Share transformation history with team

18.4.3 Performance Impact

  • Minimal overhead (under 100 ms per operation)
  • Metadata stored in memory only
  • The computation itself is unaffected; the only added cost is recording metadata
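
The overhead figure above can be checked on your own data by timing the same operation with and without lineage=True. The helpers below are a generic sketch using only the standard library. median_seconds and lineage_overhead_ms are hypothetical names, and the two callables are supplied by you.

```python
import statistics
import time
from typing import Callable


def median_seconds(fn: Callable[[], object], repeats: int = 5) -> float:
    """Run fn several times and return the median wall-clock duration."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def lineage_overhead_ms(
    plain: Callable[[], object],
    tracked: Callable[[], object],
    repeats: int = 5,
) -> float:
    """Median extra time of tracked over plain, in milliseconds."""
    return (median_seconds(tracked, repeats) - median_seconds(plain, repeats)) * 1000.0
```

Here plain would be a zero-argument function that calls add.transform(...) without lineage, and tracked the same call with lineage=True. On small inputs the difference can be dominated by timing noise, so a slightly negative result is possible.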

18.5 Parameters

18.5.1 Required Parameters

  • mode: '@lineage' to generate a lineage report
  • df: Input DataFrame with lineage metadata

18.5.2 Optional Parameters

  • columns: Filter report to specific columns (not yet implemented)
  • trace: Trace specific cell transformations (not yet implemented)
  • as_type: Output format (currently only 'text' is supported)

18.5.3 Positional Parameters

# mode and df can be passed positionally, without naming them:
lineage_report = add.scan('@lineage', df)

18.6 Enabling Lineage Tracking

18.6.1 In add.transform()

result = add.transform(
    '@calc',
    df,
    columns=['a', 'b'],
    expression='a + b',
    as_='c',
    lineage=True  # Enable lineage
)

18.6.2 In add.to()

result = add.to(
    orders,
    bring_from=customers,
    bring='name',
    against='customer_id',
    lineage=True  # Enable lineage
)

18.6.3 In add.synthetic()

result = add.synthetic(
    '@new',
    n=100,
    strategy={'a': {'distribution': 'normal', 'mean': 0, 'std': 1}},
    lineage=True  # Enable lineage
)

18.7 Limitations (v0.1.4)

18.7.1 Session-Only Lineage

  • Lineage metadata is stored in memory only
  • Not persisted when saving DataFrames to disk
  • Lost when Python session ends
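
Until lineage can be persisted, one workaround is to write the text report to a file next to the saved data. The sketch below assumes that str() of the report returned by add.scan('@lineage', ...) yields the text shown in the examples. save_lineage_sidecar and the .lineage.txt suffix are hypothetical choices, not additory features.

```python
from pathlib import Path
from typing import Any


def save_lineage_sidecar(report: Any, data_path: str) -> Path:
    """Write the report text to '<data_path>.lineage.txt' and return that path."""
    sidecar = Path(str(data_path) + ".lineage.txt")
    # Only the human-readable text is kept; the in-memory lineage metadata
    # itself is not saved, so reloading the data will not restore lineage.
    sidecar.write_text(str(report), encoding="utf-8")
    return sidecar
```

For example, after saving a result to orders.csv, save_lineage_sidecar(lineage_report, "orders.csv") leaves the report in orders.csv.lineage.txt for later reference.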

18.7.2 Future Enhancements (v0.2.0)

  • Persistent lineage with add.save() and add.load()
  • Cell-level tracing with trace parameter
  • Column-specific lineage with columns parameter
  • Lineage visualization and export

18.8 Next Steps

  • Page 1: Basic data scanning with @analyze mode
  • Page 3: Real-world data quality workflows