20 Lineage Tracking Basics
Track transformation history and understand data provenance by enabling lineage tracking in your operations.
20.1 Example 1: Enable and View Lineage
Enable lineage tracking by adding lineage=True to any operation, then use add.scan('@lineage', df) to view the transformation history.
import pandas as pd
import additory as add
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': [100, 200, 150],
    'quantity': [2, 1, 3]
})
# Transform with lineage tracking enabled
result = add.transform(
    '@calc',
    df,
    columns=['price', 'quantity'],
    expression='price * quantity',
    as_='total',
    lineage=True  # Enable lineage tracking
)
# View lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)
Output:
═══════════════════════════════════════════════════════════════
LINEAGE REPORT
═══════════════════════════════════════════════════════════════
DataFrame: 3 rows × 4 columns
Operations: 1 transformations applied
───────────────────────────────────────────────────────────────
Step 1: add.transform - 2026-03-11T14:23:45
───────────────────────────────────────────────────────────────
Rows: 3 → 3 (no change)
Columns Added: total
Parameters:
columns: ["price","quantity"]
expression: ["price * quantity"]
mode: "@calc"
Key Points:
- Add lineage=True to any operation (transform, to, synthetic)
- Lineage metadata is stored with the DataFrame
- Use add.scan('@lineage', df) to generate a human-readable report
- Report shows operation history, parameters, and column changes
Note: This also works with polars DataFrames.
20.2 Example 2: Lineage with Filtering
Lineage tracking works with all transform modes, including filtering operations.
import pandas as pd
import additory as add
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig'],
    'price': [100, 200, 150, 50],
    'stock': [10, 5, 0, 15]
})
# Filter with lineage tracking
result = add.transform(
    '@filter',
    df,
    columns=['product', 'price', 'stock'],
    where='stock > 0',
    lineage=True
)
# View lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)
Output:
═══════════════════════════════════════════════════════════════
LINEAGE REPORT
═══════════════════════════════════════════════════════════════
DataFrame: 3 rows × 3 columns
Operations: 1 transformations applied
───────────────────────────────────────────────────────────────
Step 1: add.transform - 2026-03-11T14:25:12
───────────────────────────────────────────────────────────────
Rows: 4 → 3 (1 row filtered out)
Columns Added: none
Parameters:
columns: ["product","price","stock"]
where: "stock > 0"
mode: "@filter"
Key Insights:
- Lineage tracks row changes (4 → 3 rows)
- Filter conditions are recorded in parameters
- You can see exactly what was filtered and why
Note: This also works with polars DataFrames.
20.3 Example 3: Multi-Operation Lineage Chain
Lineage accumulates across multiple operations, giving you a complete transformation history.
import pandas as pd
import additory as add
df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey'],
    'price': [100, 200, 150],
    'quantity': [2, 2, 3]
})
# Step 1: Calculate total
result = add.transform(
    '@calc',
    df,
    columns=['price', 'quantity'],
    expression='price * quantity',
    as_='total',
    lineage=True
)
# Step 2: Filter high-value items (lineage continues)
result = add.transform(
    '@filter',
    result,
    columns=['product', 'total'],
    where='total > 200',
    lineage=True
)
# View complete lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)
Output:
═══════════════════════════════════════════════════════════════
LINEAGE REPORT
═══════════════════════════════════════════════════════════════
DataFrame: 2 rows × 2 columns
Operations: 2 transformations applied
───────────────────────────────────────────────────────────────
Step 1: add.transform - 2026-03-11T14:27:33
───────────────────────────────────────────────────────────────
Rows: 3 → 3 (no change)
Columns Added: total
Parameters:
columns: ["price","quantity"]
expression: ["price * quantity"]
mode: "@calc"
───────────────────────────────────────────────────────────────
Step 2: add.transform - 2026-03-11T14:27:33
───────────────────────────────────────────────────────────────
Rows: 3 → 2 (1 row filtered out)
Columns Added: none
Parameters:
columns: ["product","total"]
where: "total > 200"
mode: "@filter"
Key Features:
- Lineage persists across multiple operations
- Each step is numbered and timestamped
- You can see the complete transformation pipeline
- Useful for debugging complex data workflows
Note: This also works with polars DataFrames.
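To make the idea of accumulating lineage concrete, here is a minimal sketch in plain Python. It is not additory's implementation: the `record_step`, `calc`, and `keep_where` helpers, the record keys, and the sample data are all hypothetical and only illustrate how each operation could append one numbered step to a shared history.

```python
from datetime import datetime


def record_step(lineage, operation, rows_before, rows_after, columns_added, parameters):
    """Return a new history list with one more step appended (hypothetical layout)."""
    step = {
        "step": len(lineage) + 1,
        "operation": operation,
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "rows": (rows_before, rows_after),
        "columns_added": columns_added,
        "parameters": parameters,
    }
    return lineage + [step]


def calc(rows, lineage, left, right, as_):
    """Multiply two columns into a new column and log the step."""
    out = [{**row, as_: row[left] * row[right]} for row in rows]
    lineage = record_step(
        lineage, "calc", len(rows), len(out), [as_], {"columns": [left, right]}
    )
    return out, lineage


def keep_where(rows, lineage, column, minimum):
    """Keep rows whose value in `column` exceeds `minimum` and log the step."""
    out = [row for row in rows if row[column] > minimum]
    lineage = record_step(
        lineage, "filter", len(rows), len(out), [], {"where": f"{column} > {minimum}"}
    )
    return out, lineage


# Illustrative data, unrelated to the examples above.
rows = [
    {"item": "pen", "price": 3, "quantity": 10},
    {"item": "ink", "price": 8, "quantity": 1},
    {"item": "pad", "price": 9, "quantity": 5},
]

rows, lineage = calc(rows, [], "price", "quantity", as_="total")
rows, lineage = keep_where(rows, lineage, "total", 20)

for step in lineage:
    before, after = step["rows"]
    print(f"Step {step['step']}: {step['operation']} rows {before} -> {after}")
```

Each helper returns the data together with the extended history, so the history travels with the data from one call to the next, which is the behavior the chained example relies on.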
20.4 Lineage Tracking Features
20.4.1 What Gets Tracked
- Operation Type: Which function was called (transform, to, synthetic)
- Timestamp: When the operation occurred
- Parameters: All operation parameters
- Row Changes: Rows before and after
- Column Changes: Columns added or modified
- Mode: Specific transformation mode used
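The tracked fields listed above could be modelled as a small record. The sketch below is an assumption for illustration only: the `LineageStep` class and its field names are hypothetical and do not describe additory's internal schema. It also shows that such a record round-trips through JSON without loss.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime


@dataclass
class LineageStep:
    """One tracked operation; field names are illustrative only."""
    operation: str        # which function was called, e.g. "transform"
    mode: str             # transformation mode, e.g. "@filter"
    timestamp: str        # when the operation occurred (ISO 8601)
    parameters: dict      # parameters passed to the operation
    rows_before: int
    rows_after: int
    columns_added: list = field(default_factory=list)


step = LineageStep(
    operation="transform",
    mode="@filter",
    timestamp=datetime(2026, 3, 11, 14, 25, 12).isoformat(),
    parameters={"columns": ["product", "price", "stock"], "where": "stock > 0"},
    rows_before=4,
    rows_after=3,
)

# Serialize to JSON and rebuild the record from the decoded text.
encoded = json.dumps(asdict(step), indent=2)
decoded = LineageStep(**json.loads(encoded))
print(encoded)
```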
20.4.2 When to Use Lineage
- Debugging: Understand how data was transformed
- Auditing: Track data provenance for compliance
- Documentation: Auto-generate transformation documentation
- Collaboration: Share transformation history with team
- Quality Assurance: Verify transformation steps
20.4.3 Performance Impact
- Minimal overhead (less than 100 ms per operation)
- Metadata stored in memory only
- The computation itself runs at the same speed; the overhead comes only from recording metadata
- Lineage data is lightweight (JSON format)
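If you want to check the overhead on your own workload, a simple timing harness is enough. The sketch below uses only the standard library; `median_seconds` is a hypothetical helper, and the two stand-in functions are placeholders that you would replace with the same operation called without and with lineage=True.

```python
import statistics
import time


def median_seconds(func, repeats=5):
    """Run func() several times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-ins for the real calls. In practice these would wrap the same
# operation, once without and once with lineage=True.
def run_without_lineage():
    return sum(i * i for i in range(10_000))


def run_with_lineage():
    result = sum(i * i for i in range(10_000))
    _metadata = {"rows": 10_000}  # placeholder for metadata bookkeeping
    return result


baseline = median_seconds(run_without_lineage)
tracked = median_seconds(run_with_lineage)
overhead_ms = (tracked - baseline) * 1000
print(f"Estimated overhead: {overhead_ms:.3f} ms per operation")
```

Using the median rather than the mean keeps a single slow run from distorting the estimate. On small inputs the difference can be within timing noise and may even come out negative.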
20.5 Enabling Lineage Tracking
20.5.1 In add.transform()
result = add.transform(
    '@calc',
    df,
    columns=['a', 'b'],
    expression='a + b',
    as_='c',
    lineage=True  # Enable lineage
)
20.5.2 In add.to()
result = add.to(
    orders,
    bring_from=customers,
    bring='name',
    against='customer_id',
    lineage=True  # Enable lineage
)
20.5.3 In add.synthetic()
result = add.synthetic(
    '@new',
    n=100,
    strategy={'a': {'distribution': 'normal', 'mean': 0, 'std': 1}},
    lineage=True  # Enable lineage
)
20.6 Parameters
20.6.1 Required Parameters
- mode: '@lineage' for lineage tracking
- df: Input DataFrame with lineage metadata
20.6.2 Optional Parameters
- columns: Filter report to specific columns (not yet implemented)
- trace: Trace specific cell transformations (not yet implemented)
- as_type: Output format (currently only 'text' supported)
20.6.3 Positional Parameters
# The mode and the DataFrame can be passed positionally, without parameter names:
lineage_report = add.scan('@lineage', df)
20.7 Error Handling
If you forget to enable lineage tracking, add.scan('@lineage', df) raises a ValueError whose message explains how to fix the problem:
# Transform WITHOUT lineage tracking
result = add.transform('@calc', df, expression='a + b', as_='c')
# Try to get lineage report
try:
    lineage_report = add.scan('@lineage', result)
except ValueError as e:
    print(e)
Output:
No lineage metadata found. Lineage tracking must be enabled by adding
lineage=True to add.to(), add.transform(), or add.synthetic() calls.
Example:
df = add.transform('@calc', df, strategy={'total': 'price * qty'}, lineage=True)
result = add.scan('@lineage', df)
Solution: Add lineage=True to your operations to enable tracking.
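When a pipeline may or may not have lineage enabled, the try/except pattern above can be wrapped once and reused. This is a minimal sketch: `lineage_or_none` is a hypothetical helper, and `fake_scan` is a stand-in for add.scan so the example runs without the library. The only behavior it relies on is that a ValueError is raised when no lineage metadata is found.

```python
def lineage_or_none(scan, df):
    """Return the lineage report, or None when df carries no lineage metadata.

    `scan` is the callable that produces the report (add.scan in real use).
    Only ValueError is caught; any other exception still propagates.
    """
    try:
        return scan('@lineage', df)
    except ValueError:
        return None


# Stand-in for add.scan, for illustration only.
def fake_scan(mode, df):
    if not df.get("lineage"):
        raise ValueError("No lineage metadata found.")
    return f"{len(df['lineage'])} step(s) recorded"


untracked = lineage_or_none(fake_scan, {"lineage": []})
tracked = lineage_or_none(fake_scan, {"lineage": ["calc"]})
print(untracked)
print(tracked)
```

Passing the scan function in as an argument keeps the helper testable and avoids swallowing unrelated errors.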
20.8 Limitations (v0.1.3)
20.8.1 Session-Only Lineage
- Lineage metadata is stored in memory only
- Not persisted when saving DataFrames to disk
- Lost when Python session ends
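Because the metadata does not survive the session, one workaround is to save the text report next to the data before the session ends. The sketch below is an assumption, not a library feature: `save_report` is a hypothetical helper, the `.lineage.txt` suffix is an arbitrary choice, and it relies only on the report being convertible with str(). It preserves the human-readable report, not the lineage metadata itself.

```python
import tempfile
from pathlib import Path


def save_report(report, data_path):
    """Write str(report) to a sidecar file next to data_path and return its path."""
    data_path = Path(data_path)
    sidecar = data_path.with_name(data_path.name + ".lineage.txt")
    sidecar.write_text(str(report), encoding="utf-8")
    return sidecar


# Demonstration in a temporary directory with a placeholder report string.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = save_report("LINEAGE REPORT\nStep 1: add.transform", Path(tmp) / "sales.csv")
    sidecar_name = sidecar.name
    saved_text = sidecar.read_text(encoding="utf-8")

print(sidecar_name)
print(saved_text)
```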
20.8.2 Type Conversion Restriction
- Cannot use lineage=True with the as_type parameter
- Lineage metadata would be lost during type conversion
- Use lineage OR type conversion, not both
20.8.3 Future Enhancements (v0.2.0)
- Persistent lineage with add.save() and add.load()
- Cell-level tracing with the trace parameter
- Column-specific lineage with the columns parameter
- Lineage visualization and export
- Lineage comparison and diff
20.9 Best Practices
- Enable lineage early: Add lineage=True from the start of your pipeline
- Check lineage regularly: Use add.scan('@lineage', df) to verify transformations
- Document with lineage: Include lineage reports in documentation
- Debug with lineage: When data looks wrong, check the lineage first
- Share lineage: Include lineage reports when asking for help
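One way to follow the first practice without repeating the flag on every call is to bind it once with functools.partial. This is a sketch under assumptions: `with_lineage` is a hypothetical helper, and `fake_transform` stands in for add.transform so the example runs on its own. In real use you would pass the library function instead.

```python
from functools import partial


def with_lineage(operation):
    """Return a version of `operation` that passes lineage=True by default."""
    return partial(operation, lineage=True)


# Stand-in for add.transform, for illustration only.
def fake_transform(mode, data, lineage=False, **options):
    return {"mode": mode, "lineage": lineage, "options": options}


tracked_transform = with_lineage(fake_transform)
result = tracked_transform('@calc', [1, 2, 3], expression='a + b', as_='c')
print(result["lineage"])  # True
```

A keyword bound by partial can still be overridden per call, so passing lineage=False explicitly turns tracking off for a single step.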
20.10 Next Steps
- Page 2: Advanced lineage workflows and real-world use cases
- Troubleshooting Guide: Common errors and solutions