12 add.transform() - Real-World Workflows
12.1 Overview
Learn how to chain multiple add.transform() operations to create complete data transformation pipelines. Real-world data analysis often requires multiple steps - this page shows you how to combine operations effectively.
What you’ll learn: - How to chain transform operations - How to build multi-step data pipelines - Best practices for workflow design - Common transformation patterns
Prerequisites: - Calculations - Basic calculations - Filter & Sort - Data selection - Aggregation - Grouping and summarizing
12.2 Example 1: Sales Data Pipeline
Business Context: You have raw sales transactions and need to calculate total sales, filter out small transactions, aggregate by region, and rank regions by performance.
Code:
import additory as add
import pandas as pd
# Raw sales transactions
df = pd.DataFrame({
'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
'region': ['North', 'South', 'North', 'South', 'North'],
'product': ['Widget', 'Widget', 'Gadget', 'Widget', 'Gadget'],
'quantity': [10, 15, 8, 12, 20],
'price': [30, 30, 50, 30, 50]
})
# Step 1: Calculate total sales
result = add.transform(
'@calc',
df,
columns=['date', 'region', 'product', 'quantity', 'price'],
expression='quantity * price',
as_='total_sales'
)
print("After calculation:")
print(result)
# Step 2: Filter out low-value transactions
result = add.transform(
'@filter',
result,
columns=['date', 'region', 'product', 'total_sales'],
where='total_sales > 300'
)
print("\nAfter filtering:")
print(result)
# Step 3: Aggregate by region
result = add.transform(
'@aggregate',
result,
columns=['region', 'total_sales'],
by='region',
strategy={'total_sales': 'sum'}
)
print("\nAfter aggregation:")
print(result)
# Step 4: Sort by total sales descending
result = add.transform(
'@sort',
result,
columns=['region', 'total_sales'],
by='total_sales',
order='desc'
)
print("\nFinal result:")
print(result)Output:
After calculation:
date region product quantity price total_sales
0 2024-01-01 North Widget 10 30 300
1 2024-01-01 South Widget 15 30 450
2 2024-01-02 North Gadget 8 50 400
3 2024-01-02 South Widget 12 30 360
4 2024-01-03 North Gadget 20 50 1000
After filtering:
date region product total_sales
0 2024-01-01 South Widget 450
1 2024-01-02 North Gadget 400
2 2024-01-02 South Widget 360
3 2024-01-03 North Gadget 1000
After aggregation:
region total_sales
0 North 1400
1 South 810
Final result:
region total_sales
0 North 1400
1 South 810
Explanation: - Step 1: Calculate total_sales = quantity × price for each transaction - Step 2: Keep only transactions over $300 (filters out first row) - Step 3: Sum total_sales by region (North: 400+1000=1400, South: 450+360=810) - Step 4: Sort regions by total_sales descending (North first) - Each step builds on the previous result - Pipeline transforms raw data into actionable insights
Note: This also works with polars DataFrames.
12.3 Example 2: Product Performance Analysis
Business Context: You have product sales and returns data. You need to aggregate by product, filter out products with high returns, and rank by sales performance.
Code:
import additory as add
import pandas as pd
# Product transactions
df = pd.DataFrame({
'product': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
'sales': [100, 150, 200, 80, 90, 85, 300, 350],
'returns': [5, 10, 8, 15, 20, 18, 2, 3]
})
# Step 1: Aggregate by product
result = add.transform(
'@aggregate',
df,
columns=['product', 'sales', 'returns'],
by='product',
strategy={'sales': 'sum', 'returns': 'sum'}
)
print("After aggregation:")
print(result)
# Step 2: Filter products with low returns
result = add.transform(
'@filter',
result,
columns=['product', 'sales', 'returns'],
where='returns < 50'
)
print("\nAfter filtering:")
print(result)
# Step 3: Sort by sales descending
result = add.transform(
'@sort',
result,
columns=['product', 'sales', 'returns'],
by='sales',
order='desc'
)
print("\nFinal result:")
print(result)Output:
After aggregation:
product sales returns
0 A 450 23
1 B 255 53
2 C 650 5
After filtering:
product sales returns
0 A 450 23
1 C 650 5
Final result:
product sales returns
0 C 650 5
1 A 450 23
Explanation: - Step 1: Sum sales and returns for each product - Product A: 450 sales, 23 returns - Product B: 255 sales, 53 returns - Product C: 650 sales, 5 returns - Step 2: Filter out products with 50+ returns (removes Product B) - Step 3: Sort remaining products by sales (C first, then A) - Result shows top-performing products with acceptable return rates
Note: This also works with polars DataFrames.
12.4 Workflow Design Patterns
12.4.1 Pattern 1: Calculate → Filter → Aggregate → Sort
# Common analytics pipeline
df = add.transform('@calc', df, ...) # Derive metrics
df = add.transform('@filter', df, ...) # Remove noise
df = add.transform('@aggregate', df, ...) # Summarize
df = add.transform('@sort', df, ...) # Rank resultsUse when: Building reports, dashboards, or analytics
12.4.2 Pattern 2: Filter → Calculate → Aggregate
# Focus on relevant data first
df = add.transform('@filter', df, ...) # Select subset
df = add.transform('@calc', df, ...) # Compute on subset
df = add.transform('@aggregate', df, ...) # SummarizeUse when: Working with large datasets, want to reduce data early
12.4.3 Pattern 3: Aggregate → Filter → Sort
# Summarize then refine
df = add.transform('@aggregate', df, ...) # Group and summarize
df = add.transform('@filter', df, ...) # Keep top performers
df = add.transform('@sort', df, ...) # RankUse when: Finding top N items, identifying outliers
12.4.4 Pattern 4: Calculate → Calculate → Aggregate
# Multi-step calculations
df = add.transform('@calc', df, expression='a * b', as_='ab')
df = add.transform('@calc', df, expression='ab + c', as_='result')
df = add.transform('@aggregate', df, by='group', strategy={'result': 'sum'})Use when: Complex derived metrics, multi-step formulas
12.5 Best Practices
Start with a plan: Sketch out your pipeline before coding
# Good: Clear sequence # 1. Calculate revenue # 2. Filter valid transactions # 3. Aggregate by customer # 4. Sort by total revenueFilter early: Reduce data size as soon as possible
# Good: Filter first df = add.transform('@filter', df, where='valid == 1') df = add.transform('@calc', df, ...) # Work with less data # Less efficient: Filter last df = add.transform('@calc', df, ...) # Process all data df = add.transform('@filter', df, where='valid == 1')Use intermediate variables: Make pipelines readable
# Good: Clear steps with_sales = add.transform('@calc', df, ...) filtered = add.transform('@filter', with_sales, ...) aggregated = add.transform('@aggregate', filtered, ...) final = add.transform('@sort', aggregated, ...) # Less clear: Chained without names result = add.transform('@sort', add.transform('@aggregate', add.transform('@filter', add.transform('@calc', df, ...), ...), ...), ...)Verify each step: Check intermediate results
# Good: Verify as you go result = add.transform('@calc', df, ...) print(f"After calc: {len(result)} rows") result = add.transform('@filter', result, ...) print(f"After filter: {len(result)} rows")Keep operations simple: One transformation per step
# Good: Separate concerns df = add.transform('@calc', df, expression='a * b', as_='ab') df = add.transform('@calc', df, expression='ab + c', as_='result') # Avoid: Complex expressions df = add.transform('@calc', df, expression='(a * b) + c', as_='result')Document your pipeline: Add comments explaining business logic
# Calculate customer lifetime value df = add.transform('@calc', df, expression='orders * avg_value', as_='ltv') # Focus on high-value customers (LTV > $1000) df = add.transform('@filter', df, where='ltv > 1000') # Rank by lifetime value df = add.transform('@sort', df, by='ltv', order='desc')
12.6 Common Pitfalls
12.6.1 Pitfall 1: Forgetting to include grouping columns
# Wrong: Missing 'region' in columns
result = add.transform('@aggregate', df,
columns=['sales'], # Missing 'region'!
by='region',
strategy={'sales': 'sum'})
# Correct: Include grouping column
result = add.transform('@aggregate', df,
columns=['region', 'sales'],
by='region',
strategy={'sales': 'sum'})12.6.2 Pitfall 2: Filtering on wrong data type
# Wrong: Filtering on float columns (not yet supported)
df = add.transform('@filter', df, where='price > 29.99') # May fail
# Correct: Use integer columns for filtering
df = add.transform('@filter', df, where='quantity > 10') # Works12.6.3 Pitfall 3: Losing track of column names
# Wrong: Using old column name after calculation
df = add.transform('@calc', df, expression='a + b', as_='total')
df = add.transform('@filter', df, where='a > 10') # 'a' might not exist
# Correct: Use new column names
df = add.transform('@calc', df, expression='a + b', as_='total')
df = add.transform('@filter', df, where='total > 10')12.7 Key Takeaways
- Chain operations by passing results from one step to the next
- Design pipelines with clear, logical steps
- Filter early to improve performance
- Use intermediate variables for readability
- Verify results at each step
- Keep individual operations simple
- Document business logic with comments
- Common pattern: Calculate → Filter → Aggregate → Sort
12.8 Common Questions
Q: How many operations can I chain?
A: As many as needed. Each operation returns a DataFrame that can be input to the next.
Q: Does chaining operations affect performance?
A: Each operation creates a new DataFrame. For very large datasets, consider filtering early to reduce data size.
Q: Can I reuse intermediate results?
A: Yes! Store intermediate results in variables and use them in multiple pipelines.
Q: What if an operation fails mid-pipeline?
A: The pipeline stops at the failed operation. Check the error message and fix that step.
Q: Can I mix pandas and polars in a pipeline?
A: No. All operations in a pipeline must use the same DataFrame type (all pandas or all polars).
Q: Should I use method chaining or intermediate variables?
A: Intermediate variables are more readable and easier to debug. Use them for complex pipelines.
12.9 Next Steps
- add.to() Examples - Learn about data enrichment
- Calculations - Review calculation operations
- API Reference - Complete
add.transform()documentation
Version: 0.1.3
Last Updated: March 9, 2026