12  add.transform() - Real-World Workflows

12.1 Overview

Learn how to chain multiple add.transform() operations to create complete data transformation pipelines. Real-world data analysis often requires multiple steps - this page shows you how to combine operations effectively.

What you’ll learn: - How to chain transform operations - How to build multi-step data pipelines - Best practices for workflow design - Common transformation patterns

Prerequisites: - Calculations - Basic calculations - Filter & Sort - Data selection - Aggregation - Grouping and summarizing


12.2 Example 1: Sales Data Pipeline

Business Context: You have raw sales transactions and need to calculate total sales, filter out small transactions, aggregate by region, and rank regions by performance.

Code:

import additory as add
import pandas as pd

# Raw sales transactions
df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03'],
    'region': ['North', 'South', 'North', 'South', 'North'],
    'product': ['Widget', 'Widget', 'Gadget', 'Widget', 'Gadget'],
    'quantity': [10, 15, 8, 12, 20],
    'price': [30, 30, 50, 30, 50]
})

# Step 1: Calculate total sales
result = add.transform(
    '@calc',
    df,
    columns=['date', 'region', 'product', 'quantity', 'price'],
    expression='quantity * price',
    as_='total_sales'
)

print("After calculation:")
print(result)

# Step 2: Filter out low-value transactions
result = add.transform(
    '@filter',
    result,
    columns=['date', 'region', 'product', 'total_sales'],
    where='total_sales > 300'
)

print("\nAfter filtering:")
print(result)

# Step 3: Aggregate by region
result = add.transform(
    '@aggregate',
    result,
    columns=['region', 'total_sales'],
    by='region',
    strategy={'total_sales': 'sum'}
)

print("\nAfter aggregation:")
print(result)

# Step 4: Sort by total sales descending
result = add.transform(
    '@sort',
    result,
    columns=['region', 'total_sales'],
    by='total_sales',
    order='desc'
)

print("\nFinal result:")
print(result)

Output:

After calculation:
         date region product  quantity  price  total_sales
0  2024-01-01  North  Widget        10     30          300
1  2024-01-01  South  Widget        15     30          450
2  2024-01-02  North  Gadget         8     50          400
3  2024-01-02  South  Widget        12     30          360
4  2024-01-03  North  Gadget        20     50         1000

After filtering:
         date region product  total_sales
0  2024-01-01  South  Widget          450
1  2024-01-02  North  Gadget          400
2  2024-01-02  South  Widget          360
3  2024-01-03  North  Gadget         1000

After aggregation:
  region  total_sales
0  North         1400
1  South          810

Final result:
  region  total_sales
0  North         1400
1  South          810

Explanation: - Step 1: Calculate total_sales = quantity × price for each transaction - Step 2: Keep only transactions over $300 (filters out first row) - Step 3: Sum total_sales by region (North: 400+1000=1400, South: 450+360=810) - Step 4: Sort regions by total_sales descending (North first) - Each step builds on the previous result - Pipeline transforms raw data into actionable insights

Note: This also works with polars DataFrames.


12.3 Example 2: Product Performance Analysis

Business Context: You have product sales and returns data. You need to aggregate by product, filter out products with high returns, and rank by sales performance.

Code:

import additory as add
import pandas as pd

# Product transactions
df = pd.DataFrame({
    'product': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'sales': [100, 150, 200, 80, 90, 85, 300, 350],
    'returns': [5, 10, 8, 15, 20, 18, 2, 3]
})

# Step 1: Aggregate by product
result = add.transform(
    '@aggregate',
    df,
    columns=['product', 'sales', 'returns'],
    by='product',
    strategy={'sales': 'sum', 'returns': 'sum'}
)

print("After aggregation:")
print(result)

# Step 2: Filter products with low returns
result = add.transform(
    '@filter',
    result,
    columns=['product', 'sales', 'returns'],
    where='returns < 50'
)

print("\nAfter filtering:")
print(result)

# Step 3: Sort by sales descending
result = add.transform(
    '@sort',
    result,
    columns=['product', 'sales', 'returns'],
    by='sales',
    order='desc'
)

print("\nFinal result:")
print(result)

Output:

After aggregation:
  product  sales  returns
0       A    450       23
1       B    255       53
2       C    650        5

After filtering:
  product  sales  returns
0       A    450       23
1       C    650        5

Final result:
  product  sales  returns
0       C    650        5
1       A    450       23

Explanation: - Step 1: Sum sales and returns for each product - Product A: 450 sales, 23 returns - Product B: 255 sales, 53 returns - Product C: 650 sales, 5 returns - Step 2: Filter out products with 50+ returns (removes Product B) - Step 3: Sort remaining products by sales (C first, then A) - Result shows top-performing products with acceptable return rates

Note: This also works with polars DataFrames.


12.4 Workflow Design Patterns

12.4.1 Pattern 1: Calculate → Filter → Aggregate → Sort

# Common analytics pipeline
df = add.transform('@calc', df, ...)      # Derive metrics
df = add.transform('@filter', df, ...)    # Remove noise
df = add.transform('@aggregate', df, ...) # Summarize
df = add.transform('@sort', df, ...)      # Rank results

Use when: Building reports, dashboards, or analytics

12.4.2 Pattern 2: Filter → Calculate → Aggregate

# Focus on relevant data first
df = add.transform('@filter', df, ...)    # Select subset
df = add.transform('@calc', df, ...)      # Compute on subset
df = add.transform('@aggregate', df, ...) # Summarize

Use when: Working with large datasets, want to reduce data early

12.4.3 Pattern 3: Aggregate → Filter → Sort

# Summarize then refine
df = add.transform('@aggregate', df, ...) # Group and summarize
df = add.transform('@filter', df, ...)    # Keep top performers
df = add.transform('@sort', df, ...)      # Rank

Use when: Finding top N items, identifying outliers

12.4.4 Pattern 4: Calculate → Calculate → Aggregate

# Multi-step calculations
df = add.transform('@calc', df, expression='a * b', as_='ab')
df = add.transform('@calc', df, expression='ab + c', as_='result')
df = add.transform('@aggregate', df, by='group', strategy={'result': 'sum'})

Use when: Complex derived metrics, multi-step formulas


12.5 Best Practices

  1. Start with a plan: Sketch out your pipeline before coding

    # Good: Clear sequence
    # 1. Calculate revenue
    # 2. Filter valid transactions
    # 3. Aggregate by customer
    # 4. Sort by total revenue
  2. Filter early: Reduce data size as soon as possible

    # Good: Filter first
    df = add.transform('@filter', df, where='valid == 1')
    df = add.transform('@calc', df, ...)  # Work with less data
    
    # Less efficient: Filter last
    df = add.transform('@calc', df, ...)  # Process all data
    df = add.transform('@filter', df, where='valid == 1')
  3. Use intermediate variables: Make pipelines readable

    # Good: Clear steps
    with_sales = add.transform('@calc', df, ...)
    filtered = add.transform('@filter', with_sales, ...)
    aggregated = add.transform('@aggregate', filtered, ...)
    final = add.transform('@sort', aggregated, ...)
    
    # Less clear: Chained without names
    result = add.transform('@sort', 
                 add.transform('@aggregate',
                     add.transform('@filter',
                         add.transform('@calc', df, ...), ...), ...), ...)
  4. Verify each step: Check intermediate results

    # Good: Verify as you go
    result = add.transform('@calc', df, ...)
    print(f"After calc: {len(result)} rows")
    
    result = add.transform('@filter', result, ...)
    print(f"After filter: {len(result)} rows")
  5. Keep operations simple: One transformation per step

    # Good: Separate concerns
    df = add.transform('@calc', df, expression='a * b', as_='ab')
    df = add.transform('@calc', df, expression='ab + c', as_='result')
    
    # Avoid: Complex expressions
    df = add.transform('@calc', df, expression='(a * b) + c', as_='result')
  6. Document your pipeline: Add comments explaining business logic

    # Calculate customer lifetime value
    df = add.transform('@calc', df, expression='orders * avg_value', as_='ltv')
    
    # Focus on high-value customers (LTV > $1000)
    df = add.transform('@filter', df, where='ltv > 1000')
    
    # Rank by lifetime value
    df = add.transform('@sort', df, by='ltv', order='desc')

12.6 Common Pitfalls

12.6.1 Pitfall 1: Forgetting to include grouping columns

# Wrong: Missing 'region' in columns
result = add.transform('@aggregate', df,
                       columns=['sales'],  # Missing 'region'!
                       by='region',
                       strategy={'sales': 'sum'})

# Correct: Include grouping column
result = add.transform('@aggregate', df,
                       columns=['region', 'sales'],
                       by='region',
                       strategy={'sales': 'sum'})

12.6.2 Pitfall 2: Filtering on wrong data type

# Wrong: Filtering on float columns (not yet supported)
df = add.transform('@filter', df, where='price > 29.99')  # May fail

# Correct: Use integer columns for filtering
df = add.transform('@filter', df, where='quantity > 10')  # Works

12.6.3 Pitfall 3: Losing track of column names

# Wrong: Using old column name after calculation
df = add.transform('@calc', df, expression='a + b', as_='total')
df = add.transform('@filter', df, where='a > 10')  # 'a' might not exist

# Correct: Use new column names
df = add.transform('@calc', df, expression='a + b', as_='total')
df = add.transform('@filter', df, where='total > 10')

12.7 Key Takeaways

  • Chain operations by passing results from one step to the next
  • Design pipelines with clear, logical steps
  • Filter early to improve performance
  • Use intermediate variables for readability
  • Verify results at each step
  • Keep individual operations simple
  • Document business logic with comments
  • Common pattern: Calculate → Filter → Aggregate → Sort

12.8 Common Questions

Q: How many operations can I chain?
A: As many as needed. Each operation returns a DataFrame that can be input to the next.

Q: Does chaining operations affect performance?
A: Each operation creates a new DataFrame. For very large datasets, consider filtering early to reduce data size.

Q: Can I reuse intermediate results?
A: Yes! Store intermediate results in variables and use them in multiple pipelines.

Q: What if an operation fails mid-pipeline?
A: The pipeline stops at the failed operation. Check the error message and fix that step.

Q: Can I mix pandas and polars in a pipeline?
A: No. All operations in a pipeline must use the same DataFrame type (all pandas or all polars).

Q: Should I use method chaining or intermediate variables?
A: Intermediate variables are more readable and easier to debug. Use them for complex pipelines.


12.9 Next Steps


Version: 0.1.3
Last Updated: March 9, 2026