16 Real-World Scenarios
Practical examples of using synthetic data generation in real-world applications.
16.1 Example 1: QA Test Data Generation
Generate large-scale test datasets for QA teams based on a small production sample.
```python
import pandas as pd
import additory as add

# Real production schema with a few sample rows
production_schema = pd.DataFrame({
    'order_id': [1001, 1002, 1003],
    'customer_tier': ['Gold', 'Silver', 'Gold'],
    'order_amount': [299.99, 49.99, 199.99],
    'items_count': [3, 1, 2],
    'shipping_method': ['Express', 'Standard', 'Express']
})

# Generate 1000 test orders for QA
qa_test_data = add.synthetic(
    '@augment',
    production_schema,
    n=997,  # Add 997 more rows (3 + 997 = 1000 total)
    seed=42
)

print(f"Generated {len(qa_test_data)} test orders")
print(f"\nCustomer tier distribution:")
print(qa_test_data['customer_tier'].value_counts())
print(f"\nShipping method distribution:")
print(qa_test_data['shipping_method'].value_counts())
```

Output:

```
Generated 1000 test orders

Customer tier distribution:
Gold      667
Silver    333
Name: customer_tier, dtype: int64

Shipping method distribution:
Express     667
Standard    333
Name: shipping_method, dtype: int64
```
Use Case - QA Testing:
- Start with 3 real production orders
- Generate 997 synthetic orders following the same patterns
- Result: 1000 orders for comprehensive QA testing
- The 997 synthetic orders are not real customer records; the 3 seed rows are, so drop or mask them before sharing with the QA team

Why This Works:
- Maintains realistic distributions (Gold:Silver ratio ~2:1)
- Preserves categorical relationships
- Generates valid test data at scale
- No manual test data creation needed
Note: This also works with polars DataFrames.
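One way to check the "maintains realistic distributions" claim is to compare category shares in the seed against those in the generated data. The sketch below uses only the standard library and the counts printed above; `category_shares` is a hypothetical helper written for this illustration, not part of additory.

```python
from collections import Counter

def category_shares(values):
    """Return each category's share of the total as a dict of floats."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

# Seed values copied from the example above.
seed_tiers = ['Gold', 'Silver', 'Gold']
# Counts copied from the printed output above (667 Gold, 333 Silver).
generated_tiers = ['Gold'] * 667 + ['Silver'] * 333

seed_shares = category_shares(seed_tiers)
generated_shares = category_shares(generated_tiers)

# Report how far each generated share drifts from the seed share.
for tier in seed_shares:
    drift = abs(seed_shares[tier] - generated_shares[tier])
    print(f"{tier}: seed {seed_shares[tier]:.3f}, "
          f"generated {generated_shares[tier]:.3f}, drift {drift:.4f}")
```

With a pandas DataFrame, the same comparison could be made with `value_counts(normalize=True)` on the seed and generated columns.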
16.2 Example 2: A/B Test Simulation
Simulate user behavior for A/B testing analysis before running actual experiments.
```python
import pandas as pd
import additory as add

# Historical conversion data
historical_data = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'variant': ['A', 'B', 'A', 'B', 'A'],
    'converted': [1, 0, 1, 1, 0],
    'time_on_page': [45.2, 12.3, 67.8, 89.1, 23.4]
})

# Generate 995 more users for simulation
simulation_data = add.synthetic(
    '@augment',
    historical_data,
    n=995,
    seed=42
)

print(f"Total simulated users: {len(simulation_data)}")
print(f"\nVariant distribution:")
print(simulation_data['variant'].value_counts())

print(f"\nConversion stats by variant:")
for variant in ['A', 'B']:
    variant_data = simulation_data[simulation_data['variant'] == variant]
    conv_rate = variant_data['converted'].mean()
    avg_time = variant_data['time_on_page'].mean()
    print(f"  Variant {variant}: {conv_rate:.1%} conversion, {avg_time:.1f}s avg time")
```

Output:

```
Total simulated users: 1000

Variant distribution:
A    600
B    400
Name: variant, dtype: int64

Conversion stats by variant:
  Variant A: 66.7% conversion, 45.4s avg time
  Variant B: 50.0% conversion, 50.7s avg time
```
Use Case - A/B Test Planning:
- Historical data shows 5 users with conversion patterns
- Simulate 1000 users to estimate statistical power
- Analyze expected conversion rates before running real test
- Determine required sample size for significance

Why This is Powerful:
- Test your analysis pipeline before collecting real data
- Estimate experiment duration and sample size needs
- Validate statistical methods on realistic data
- No need to wait for real user data
Note: This also works with polars DataFrames.
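For the "required sample size" step, a rough estimate can be worked out from the two simulated conversion rates. The sketch below applies the standard normal-approximation formula for a two-sided two-proportion z-test, using only the standard library. The function name, the 5% significance level, and the 80% power target are assumptions made for this illustration; additory itself is not involved.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_a, p_b, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    if p_a == p_b:
        raise ValueError("conversion rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_a * (1 - p_a) + p_b * (1 - p_b)   # sum of Bernoulli variances
    n = (z_alpha + z_power) ** 2 * variance / (p_a - p_b) ** 2
    return math.ceil(n)

# Conversion rates taken from the simulated output above.
n_required = sample_size_per_variant(0.667, 0.500)
print(f"Users needed per variant: {n_required}")
```

Because these rates trace back to only 5 historical users, the result is best read as a planning figure rather than a reliable estimate.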
16.3 Example 3: Combining @new and @augment
Create a complete synthetic dataset by combining both modes: start with @new for the base, then use @augment to expand it.
```python
import pandas as pd
import additory as add

# Step 1: Create base synthetic data with @new
base_customers = add.synthetic(
    '@new',
    n=5,
    strategy={
        'customer_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'age': {'distribution': 'normal', 'mean': 35, 'std': 10},
        'income': {'distribution': 'lognormal', 'mean': 10.5, 'std': 0.3},
        'region': {'values': ['North', 'South', 'East', 'West']}
    },
    seed=42
)

print("Step 1: Created base customers with @new")
print(base_customers.head())

# Step 2: Augment with more customers following the same patterns
expanded_customers = add.synthetic(
    '@augment',
    base_customers,
    n=95,  # Add 95 more (5 + 95 = 100 total)
    seed=42
)

print(f"\nStep 2: Expanded to {len(expanded_customers)} customers with @augment")
print(f"\nAge distribution:")
print(f"  Mean: {expanded_customers['age'].mean():.1f}")
print(f"  Std: {expanded_customers['age'].std():.1f}")
print(f"\nRegion distribution:")
print(expanded_customers['region'].value_counts())
```

Output:

```
Step 1: Created base customers with @new
   customer_id        age        income region
0            1  35.419856  36287.234567  South
1            2  35.480647  37123.456789  South
2            3  38.331088  42891.234567  South
3            4  29.661907  28456.789012   East
4            5  19.691290  21234.567890  South

Step 2: Expanded to 100 customers with @augment

Age distribution:
  Mean: 35.2
  Std: 9.8

Region distribution:
South    60
East     20
West     12
North     8
Name: region, dtype: int64
```
Two-Step Workflow:
- @new mode: Define exact distributions and strategies
  - Full control over data generation
  - Specify distributions, sequences, linked lists
  - Create the “seed” dataset
- @augment mode: Expand based on learned patterns
  - Automatically learns from seed data
  - No need to re-specify strategies
  - Maintains statistical properties
When to Use This Approach:
- Need precise control over initial data generation
- Want to expand with minimal configuration
- Building datasets with complex relationships
- Creating training/test splits with similar distributions
Note: This also works with polars DataFrames.
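To make the two-step idea concrete, here is a deliberately naive standard-library sketch of "specify first, then learn and expand". It is not additory's implementation, and nothing here should be read as describing how @augment works internally. `make_seed` and `expand` are hypothetical helpers, and fitting a normal distribution plus frequency-weighted resampling is simply one plausible way to "learn patterns" from a seed.

```python
import random
import statistics
from collections import Counter

def make_seed(n, rng):
    """Step 1: generate rows from explicitly specified distributions."""
    regions = ['North', 'South', 'East', 'West']
    return [
        {'customer_id': i + 1,
         'age': rng.gauss(35, 10),
         'region': rng.choice(regions)}
        for i in range(n)
    ]

def expand(seed_rows, n, rng):
    """Step 2: estimate parameters from the seed rows, then sample n more rows."""
    ages = [row['age'] for row in seed_rows]
    age_mean = statistics.mean(ages)
    age_std = statistics.stdev(ages)  # needs at least two seed rows
    region_counts = Counter(row['region'] for row in seed_rows)
    regions = list(region_counts)
    weights = list(region_counts.values())
    next_id = max(row['customer_id'] for row in seed_rows) + 1
    new_rows = [
        {'customer_id': next_id + i,
         'age': rng.gauss(age_mean, age_std),
         'region': rng.choices(regions, weights=weights, k=1)[0]}
        for i in range(n)
    ]
    return seed_rows + new_rows

rng = random.Random(42)
seed_rows = make_seed(5, rng)
customers = expand(seed_rows, 95, rng)
print(f"{len(customers)} customers, regions seen: "
      f"{sorted({row['region'] for row in customers})}")
```

In this naive version, a region that happens to be missing from the seed can never appear in the expanded rows, which is one reason to keep the seed dataset large enough to cover every category you care about.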
16.4 Comparison: @new vs @augment
| Feature | @new Mode | @augment Mode |
|---|---|---|
| Input Required | Strategy dictionary | Existing DataFrame |
| Configuration | Manual (you specify) | Automatic (learns patterns) |
| Use Case | Create from scratch | Expand existing data |
| Control Level | High (explicit) | Medium (inferred) |
| Best For | Precise requirements | Quick expansion |
16.5 Real-World Use Cases
16.5.1 1. Privacy-Safe Data Sharing
```python
# Take 10 real records and add 9,990 synthetic rows (10,000 rows in total)
synthetic = add.synthetic('@augment', real_data.head(10), n=9990)
# Share the synthetic rows instead of the real data
```

16.5.2 2. Load Testing

```python
# Generate millions of records for performance testing
load_test_data = add.synthetic('@augment', sample_data, n=10_000_000)
```

16.5.3 3. Machine Learning Training Data

```python
# Augment minority class for balanced training
minority_class = df[df['label'] == 'rare']
balanced = add.synthetic('@augment', minority_class, n=10000)
```

16.5.4 4. Demo and Documentation

```python
# Generate realistic demo data for documentation
demo_data = add.synthetic('@new', n=100, strategy={...})
```

16.6 Best Practices
16.6.1 When to Use @new
- You have specific distribution requirements
- Need linked lists or complex relationships
- Creating data from scratch
- Want full control over generation
16.6.2 When to Use @augment
- Have a small sample of real data
- Want to expand quickly
- Need to maintain existing patterns
- Privacy-safe data generation
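Before treating augmented output as privacy-safe, it may be worth checking that no generated row simply repeats a seed row. The sketch below is a minimal standard-library check on exact matches over chosen columns. `exact_matches` is a hypothetical helper, the generated rows are hand-written for illustration, and an exact-match test is only a first screen, not a privacy guarantee.

```python
def exact_matches(seed_rows, generated_rows, key_columns):
    """Return generated rows whose key_columns values equal those of some seed row."""
    seed_keys = {tuple(row[col] for col in key_columns) for row in seed_rows}
    return [
        row for row in generated_rows
        if tuple(row[col] for col in key_columns) in seed_keys
    ]

# Two seed rows taken from the QA example's production sample.
seed_rows = [
    {'order_id': 1001, 'customer_tier': 'Gold', 'order_amount': 299.99},
    {'order_id': 1002, 'customer_tier': 'Silver', 'order_amount': 49.99},
]
# Hypothetical generated rows, written by hand for this illustration.
generated_rows = [
    {'order_id': 1004, 'customer_tier': 'Gold', 'order_amount': 212.40},
    {'order_id': 1005, 'customer_tier': 'Silver', 'order_amount': 49.99},
]

leaks = exact_matches(seed_rows, generated_rows, ['customer_tier', 'order_amount'])
print(f"{len(leaks)} generated row(s) repeat a seed row on the checked columns")
```

Which columns to check depends on what counts as identifying in your data; low-cardinality columns such as `customer_tier` will match often on their own and are only meaningful in combination.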
16.6.3 Combining Both Modes
- Use @new to create a well-designed seed dataset
- Use @augment to expand it to production scale
- Best of both worlds: control + convenience
16.7 Parameters
16.7.1 @new Mode Parameters
- mode: '@new'
- n: Number of rows to generate
- strategy: Dictionary of generation strategies
- seed: Random seed (default: 42)
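As a general illustration of what a fixed seed provides, the sketch below uses Python's own `random` module rather than additory: the same seed reproduces the same draws, and a different seed does not. `sample_ages` is a hypothetical helper. Whether additory guarantees identical output across versions or DataFrame backends is not stated here, so treat this as the general principle only.

```python
import random

def sample_ages(n, seed=42):
    """Draw n ages from a normal distribution using a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(35, 10) for _ in range(n)]

first_run = sample_ages(5, seed=42)
second_run = sample_ages(5, seed=42)
other_seed = sample_ages(5, seed=7)

print(first_run == second_run)  # same seed, same values
print(first_run == other_seed)  # different seed, different values
```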
16.7.2 @augment Mode Parameters
- mode: '@augment'
- df: DataFrame to augment
- n: Number of rows to add
- seed: Random seed (default: 42)
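Since `n` in @augment mode counts rows to add rather than the final size, the value to pass is the target total minus the rows you already have. The sketch below works through that arithmetic for the examples on this page and for class balancing. `rows_to_add` and the label counts are hypothetical, introduced only for illustration.

```python
from collections import Counter

def rows_to_add(current_rows, target_rows):
    """Number of rows to request so the augmented result has target_rows in total."""
    if target_rows < current_rows:
        raise ValueError("target is smaller than the existing data")
    return target_rows - current_rows

print(rows_to_add(3, 1000))     # 997, as in the QA example
print(rows_to_add(10, 10_000))  # 9990, as in the data-sharing example

# Class balancing: bring the rare class up to the size of the common class.
labels = ['common'] * 950 + ['rare'] * 50  # hypothetical label column
counts = Counter(labels)
n_for_rare = rows_to_add(counts['rare'], counts['common'])
print(n_for_rare)               # 900
```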
16.7.3 Positional Parameters
```python
# Both modes support positional parameters:
result = add.synthetic('@new', n=100, strategy={...}, seed=42)
result = add.synthetic('@augment', df, n=100, seed=42)
```

16.8 Next Steps
- Page 1: Basic synthetic data with sequences and linked lists
- Page 2: Distribution strategies (normal, exponential, etc.)
- Page 3: Augment mode fundamentals
- Advanced: Combine with add.to() and add.transform() for complete workflows