10 add.transform() - Aggregation
10.1 Overview
Learn how to group and aggregate data using add.transform(). Aggregation is essential for summarizing data, calculating statistics, and creating reports.
What you’ll learn: - How to group data by one or more columns - How to apply aggregation strategies (sum, mean, count, etc.) - How to aggregate multiple columns with different strategies - Best practices for data summarization
Prerequisites: - Calculations - Understanding of basic transform operations - Filter & Sort - Data selection and ordering - Basic understanding of grouping concepts
10.2 Example 1: Group and Aggregate
Business Context: You have sales data from multiple regions and want to see total sales by region.
Code:
import additory as add
import pandas as pd
# Sales data
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'East'],
'product': ['Widget', 'Gadget', 'Widget', 'Gadget', 'Widget'],
'sales': [100, 150, 200, 250, 175]
})
# Group by region and sum sales
result = add.transform(
'@aggregate',
df,
columns=['region', 'sales'],
by='region',
strategy={'sales': 'sum'}
)
# Positional parameters (also works without naming certain parameters):
# result = add.transform('@aggregate', df, ['region', 'sales'], by='region', strategy={'sales': 'sum'})
print(result)Output:
region sales
0 East 175
1 North 250
2 South 450
Explanation: - '@aggregate' mode groups and summarizes data - columns specifies which columns to include (grouping column + aggregated columns) - by='region' groups rows by the region column - strategy={'sales': 'sum'} sums sales values within each group - North: 100 + 150 = 250 - South: 200 + 250 = 450 - East: 175 (only one row)
Note: This also works with polars DataFrames.
10.3 Example 2: Multiple Aggregations
Business Context: You want to see average price and total quantity for each product category.
Code:
import additory as add
import pandas as pd
# Product data
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B', 'C'],
'price': [100, 150, 200, 250, 175],
'quantity': [10, 15, 20, 25, 30]
})
# Group by category with multiple aggregations
result = add.transform(
'@aggregate',
df,
columns=['category', 'price', 'quantity'],
by='category',
strategy={'price': 'mean', 'quantity': 'sum'}
)
# Positional parameters (also works without naming certain parameters):
# result = add.transform('@aggregate', df, ['category', 'price', 'quantity'],
# by='category', strategy={'price': 'mean', 'quantity': 'sum'})
print(result)Output:
category price quantity
0 A 125.0 25
1 B 225.0 45
2 C 175.0 30
Explanation: - Different aggregation strategies for different columns - 'price': 'mean' calculates average price per category - Category A: (100 + 150) / 2 = 125 - Category B: (200 + 250) / 2 = 225 - Category C: 175 (only one row) - 'quantity': 'sum' sums quantities per category - Category A: 10 + 15 = 25 - Category B: 20 + 25 = 45 - Category C: 30 (only one row) - Each column can have its own aggregation strategy
Note: This also works with polars DataFrames.
10.4 Example 3: Group by Multiple Columns
Business Context: You want to see total sales for each combination of region and product.
Code:
import additory as add
import pandas as pd
# Sales data
df = pd.DataFrame({
'region': ['North', 'North', 'South', 'South', 'North', 'South'],
'product': ['A', 'B', 'A', 'B', 'A', 'A'],
'sales': [100, 150, 200, 250, 120, 180]
})
# Group by region and product
result = add.transform(
'@aggregate',
df,
columns=['region', 'product', 'sales'],
by=['region', 'product'],
strategy={'sales': 'sum'}
)
# Positional parameters (also works without naming certain parameters):
# result = add.transform('@aggregate', df, ['region', 'product', 'sales'],
# by=['region', 'product'], strategy={'sales': 'sum'})
print(result)Output:
region product sales
0 North A 220
1 North B 150
2 South A 380
3 South B 250
Explanation: - by=['region', 'product'] groups by multiple columns - Creates a group for each unique combination - North + A: 100 + 120 = 220 (2 rows combined) - North + B: 150 (1 row) - South + A: 200 + 180 = 380 (2 rows combined) - South + B: 250 (1 row) - Result has 4 rows (4 unique combinations)
Note: This also works with polars DataFrames.
10.5 Aggregation Strategies
10.5.1 Available Strategies
| Strategy | Description | Example Use Case |
|---|---|---|
'sum' |
Add all values | Total sales, total quantity |
'mean' |
Average of values | Average price, average rating |
'count' |
Count of values | Number of transactions |
'min' |
Minimum value | Lowest price, earliest date |
'max' |
Maximum value | Highest price, latest date |
'first' |
First value in group | Initial status |
'last' |
Last value in group | Final status |
10.5.2 Strategy Examples
# Sum - Total sales per region
strategy={'sales': 'sum'}
# Mean - Average price per category
strategy={'price': 'mean'}
# Count - Number of orders per customer
strategy={'order_id': 'count'}
# Min/Max - Price range per product
strategy={'price_min': 'min', 'price_max': 'max'}
# First/Last - Status tracking
strategy={'initial_status': 'first', 'final_status': 'last'}10.6 Common Patterns
10.6.1 Pattern 1: Sales Summary by Region
result = add.transform('@aggregate', df,
columns=['region', 'sales'],
by='region',
strategy={'sales': 'sum'})10.6.2 Pattern 2: Customer Statistics
result = add.transform('@aggregate', df,
columns=['customer_id', 'order_count', 'total_spent'],
by='customer_id',
strategy={'order_count': 'count', 'total_spent': 'sum'})10.6.3 Pattern 3: Product Performance
result = add.transform('@aggregate', df,
columns=['product', 'avg_rating', 'review_count'],
by='product',
strategy={'avg_rating': 'mean', 'review_count': 'count'})10.6.4 Pattern 4: Time-Based Aggregation
result = add.transform('@aggregate', df,
columns=['date', 'region', 'daily_sales'],
by=['date', 'region'],
strategy={'daily_sales': 'sum'})10.7 Best Practices
Include grouping columns in columns parameter: Always list the columns you’re grouping by
# Good columns=['region', 'sales'] # Include 'region' (the grouping column) by='region' # Will fail columns=['sales'] # Missing 'region' by='region'Choose appropriate aggregation strategies: Match the strategy to your data type
# Good strategy={'sales': 'sum'} # Numeric data strategy={'rating': 'mean'} # Numeric data strategy={'order_id': 'count'} # Any data type # Less useful strategy={'customer_name': 'sum'} # Text data - doesn't make senseUse multiple strategies for comprehensive analysis:
# Good: Multiple metrics strategy={ 'sales': 'sum', # Total sales 'price': 'mean', # Average price 'orders': 'count' # Number of orders }Group by meaningful dimensions: Choose grouping columns that make business sense
# Good: Business dimensions by='region' # Geographic analysis by='product_category' # Product analysis by=['date', 'store'] # Time and location analysisVerify your results: Check that aggregated values make sense
# After aggregation, verify totals print(f"Total sales: {result['sales'].sum()}") print(f"Number of groups: {len(result)}")
10.8 Key Takeaways
- Use
@aggregatemode to group and summarize data byparameter specifies grouping column(s)strategyparameter defines how to aggregate each column- Multiple columns can have different aggregation strategies
- Group by multiple columns for multi-dimensional analysis
- Always include grouping columns in the
columnsparameter - Choose aggregation strategies that match your data type
10.9 Common Questions
Q: Can I aggregate without grouping?
A: Yes, omit the by parameter to aggregate the entire DataFrame into a single row.
Q: What happens if a group has only one row?
A: The aggregation still works - sum/mean/etc. of one value is that value.
Q: Can I use different strategies for the same column?
A: Not in a single operation. Run multiple aggregations and combine results.
Q: How do I count distinct values?
A: Use 'count' strategy, but note it counts all values, not distinct ones. For distinct counts, use pandas/polars native methods.
Q: Can I aggregate text columns?
A: Yes, but only 'first', 'last', and 'count' strategies make sense for text.
Q: What if my grouping column has null values?
A: Null values are treated as a separate group.
10.10 Next Steps
- Advanced Modes - Learn about (harmonize?), (onehot?), (extract?), (deduce?)
- Calculations - Review calculation operations
- API Reference - Complete
add.transform()documentation
Version: 0.1.3
Last Updated: March 9, 2026