15 Augment Mode - Add Synthetic Rows
Use @augment mode to add synthetic rows to existing DataFrames while maintaining patterns from the original data.
15.1 Example 1: Add Synthetic Rows to Existing DataFrame
Start with a small dataset and augment it with synthetic rows that follow similar patterns.
import pandas as pd
import additory as add

# Start with a small dataset
existing = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'age': [25, 35, 45],
    'purchase_amount': [100.0, 200.0, 150.0]
})

# Add 7 more synthetic rows based on existing patterns
result = add.synthetic(
    '@augment',  # Mode: add to existing data
    existing,    # DataFrame to augment
    n=7,         # Number of rows to add
    seed=42      # For reproducible results
)
print(result)
print(f"\nOriginal rows: 3, Total rows: {len(result)}")
Output:
customer_id age purchase_amount
0 1 25.000000 100.000000
1 2 35.000000 200.000000
2 3 45.000000 150.000000
3 4 32.419856 142.748198
4 5 33.508085 143.508085
5 6 37.913860 147.913860
6 7 27.077383 137.077383
7 8 19.614112 119.614112
8 9 28.331088 128.331088
9 10 29.661907 129.661907
Original rows: 3, Total rows: 10
Key Points:
- @augment mode adds synthetic rows to an existing DataFrame
- The synthetic data follows patterns from the original data
- The original 3 rows are preserved and 7 new rows are added
- Useful for expanding small datasets for testing or analysis
Note: This also works with polars DataFrames.
15.2 Example 2: Augment with Categorical Columns
Augment mode works with both numeric and categorical columns, maintaining the distribution of categories from the original data.
import pandas as pd
import additory as add

# Dataset with both numeric and categorical columns
existing = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'category': ['Electronics', 'Clothing', 'Electronics', 'Home'],
    'price': [299.99, 49.99, 199.99, 89.99],
    'rating': [4.5, 3.8, 4.2, 4.0]
})

# Add 6 more synthetic rows
result = add.synthetic(
    '@augment',
    existing,
    n=6,
    seed=42
)
print(result)
print(f"\nCategories in result: {result['category'].unique()}")
Output:
product_id category price rating
0 101 Electronics 299.990000 4.50
1 102 Clothing 49.990000 3.80
2 103 Electronics 199.990000 4.20
3 104 Home 89.990000 4.00
4 105 Electronics 242.748198 4.32
5 106 Clothing 93.508085 3.95
6 107 Electronics 247.913860 4.38
7 108 Home 137.077383 4.07
8 109 Clothing 69.614112 3.86
9 110 Electronics 228.331088 4.28
Categories in result: ['Electronics' 'Clothing' 'Home']
Key Points:
- Categorical columns maintain their original values
- Numeric columns follow the distribution of the original data
- Categories appear with similar frequency to the original dataset
- Well suited to generating test data that matches production patterns
Note: This also works with polars DataFrames.
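The claim that categories keep a similar frequency can be checked by comparing category shares before and after augmentation. The sketch below uses plain Python and the `category` values from the output shown above; `category_shares` is a hypothetical helper written for this illustration, not part of additory.

```python
from collections import Counter


def category_shares(values):
    """Return the fraction of rows that fall into each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# 'category' column of the original 4 rows (from the example above)
original = ["Electronics", "Clothing", "Electronics", "Home"]
# The same column after augmentation: 4 original + 6 synthetic rows
augmented = original + [
    "Electronics", "Clothing", "Electronics", "Home", "Clothing", "Electronics",
]

print(category_shares(original))
# {'Electronics': 0.5, 'Clothing': 0.25, 'Home': 0.25}
print(category_shares(augmented))
# {'Electronics': 0.5, 'Clothing': 0.3, 'Home': 0.2}
```

On a pandas DataFrame, `result['category'].value_counts(normalize=True)` gives the same kind of summary.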
15.3 Example 3: Test Data Generation - Augment Small Sample
A common use case: take a small sample of production data and generate a larger test dataset that maintains realistic patterns.
import pandas as pd
import additory as add

# Small sample of real production data
production_sample = pd.DataFrame({
    'user_id': [1001, 1002, 1003],
    'session_duration': [120.5, 85.3, 200.7],
    'pages_viewed': [5, 3, 8],
    'device_type': ['mobile', 'desktop', 'mobile']
})

# Generate 97 more rows for testing (total 100)
test_data = add.synthetic(
    '@augment',
    production_sample,
    n=97,
    seed=42
)
print(f"Test dataset size: {len(test_data)} rows")
print("\nDevice type distribution:")
print(test_data['device_type'].value_counts())
print("\nSession duration stats:")
print(f" Mean: {test_data['session_duration'].mean():.1f} seconds")
print(f" Min: {test_data['session_duration'].min():.1f} seconds")
print(f" Max: {test_data['session_duration'].max():.1f} seconds")
Output:
Test dataset size: 100 rows
Device type distribution:
mobile 67
desktop 33
Name: device_type, dtype: int64
Session duration stats:
Mean: 135.5 seconds
Min: 42.7 seconds
Max: 243.8 seconds
Use Case - Privacy-Safe Test Data:
- Start with a small sample of real data (3 rows)
- Generate 97 synthetic rows that follow the same patterns
- Result: 100 rows of realistic test data, 97 of them synthetic
- Because @augment preserves the original rows, remove the 3 real rows before sharing if they are sensitive
- Maintains the statistical properties of the sample
Why This is Powerful:
- No need to manually define distributions or strategies
- Automatically learns patterns from your data
- Generates realistic test data quickly
- Safe to share with developers and QA teams once the real rows are removed
Note: This also works with polars DataFrames.
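Since @augment keeps the original rows in its result, a privacy-sensitive workflow may want to separate the real rows from the generated ones before sharing. The following is a minimal sketch in plain Python, assuming the original rows come first in the result (as in the outputs of Examples 1 and 2). `split_augmented` and `leaked_rows` are hypothetical helpers, and the generated rows are made-up placeholder values, not additory output.

```python
def split_augmented(rows, n_original):
    """Split an augmented row list into (original, synthetic) parts.

    Assumes the original rows come first in the augmented result.
    """
    return rows[:n_original], rows[n_original:]


def leaked_rows(originals, synthetic):
    """Return synthetic rows that exactly duplicate an original row."""
    return [row for row in synthetic if row in originals]


# The three real rows from the production sample
production_sample = [
    {"user_id": 1001, "session_duration": 120.5, "pages_viewed": 5, "device_type": "mobile"},
    {"user_id": 1002, "session_duration": 85.3, "pages_viewed": 3, "device_type": "desktop"},
    {"user_id": 1003, "session_duration": 200.7, "pages_viewed": 8, "device_type": "mobile"},
]
# Hypothetical generated rows, made up for this illustration only
generated = [
    {"user_id": 1004, "session_duration": 131.2, "pages_viewed": 6, "device_type": "mobile"},
    {"user_id": 1005, "session_duration": 97.8, "pages_viewed": 4, "device_type": "desktop"},
]
augmented = production_sample + generated

originals, shareable = split_augmented(augmented, n_original=len(production_sample))
assert leaked_rows(originals, shareable) == []  # no real row remains in the shareable part
print(len(shareable))  # 2
```

With a pandas result, the equivalent slice would be `test_data.iloc[len(production_sample):]`, under the same ordering assumption.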
15.4 How @augment Works
15.4.1 Pattern Learning
When you use @augment mode, additory:
1. Analyzes the existing DataFrame
2. Learns distributions for numeric columns
3. Learns value sets for categorical columns
4. Generates new rows that match these patterns
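The four steps above can be sketched in plain Python to make the idea concrete. This is an illustration only, not additory's implementation: it assumes each column is modelled independently, that numeric columns are sampled from a normal distribution, and that categories are sampled by observed frequency. `learn_patterns` and `generate_rows` are hypothetical names.

```python
import random
import statistics
from collections import Counter


def learn_patterns(columns):
    """Steps 1-3: summarize each column of a {name: values} mapping."""
    patterns = {}
    for name, values in columns.items():
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: remember mean and standard deviation
            patterns[name] = {
                "kind": "numeric",
                "mean": statistics.mean(values),
                "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            }
        else:
            # Categorical column: remember the values and how often each occurs
            counts = Counter(values)
            patterns[name] = {
                "kind": "categorical",
                "values": list(counts),
                "weights": list(counts.values()),
            }
    return patterns


def generate_rows(patterns, n, seed=42):
    """Step 4: draw n new rows from the learned per-column patterns."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for name, p in patterns.items():
            if p["kind"] == "numeric":
                row[name] = rng.gauss(p["mean"], p["stdev"])
            else:
                row[name] = rng.choices(p["values"], weights=p["weights"], k=1)[0]
        rows.append(row)
    return rows


existing = {
    "price": [299.99, 49.99, 199.99, 89.99],
    "category": ["Electronics", "Clothing", "Electronics", "Home"],
}
patterns = learn_patterns(existing)
new_rows = generate_rows(patterns, n=6)
print(len(new_rows))  # 6
```

Sampling columns independently ignores any correlation between them; whether additory preserves such correlations is not stated here.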
15.4.2 Column Types Handled
- Numeric columns: Learns mean, standard deviation, and range
- Categorical columns: Learns unique values and their frequencies
- Integer columns: Maintains integer types in generated data
- Float columns: Maintains float types in generated data
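One way to picture the type handling listed above is a per-column summary that records the input type alongside the learned statistics, then casts generated values back to that type. The sketch below is an assumption about how this could work, written in plain Python; `summarize` and `draw` are hypothetical helpers and do not reflect additory's internals.

```python
import random
import statistics


def summarize(values):
    """Record the column type together with mean, standard deviation, and range."""
    is_integer = all(isinstance(v, int) for v in values)
    return {
        "dtype": int if is_integer else float,
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "range": (min(values), max(values)),
    }


def draw(summary, n, seed=42):
    """Draw n values from a normal distribution and cast them to the column type."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        value = rng.gauss(summary["mean"], summary["stdev"])
        # Integer columns are rounded so the generated data stays integer-typed
        samples.append(round(value) if summary["dtype"] is int else float(value))
    return samples


pages_viewed = draw(summarize([5, 3, 8]), n=5)                 # ints in, ints out
session_duration = draw(summarize([120.5, 85.3, 200.7]), n=5)  # floats in, floats out
```

The recorded range is kept only as a summary here; this sketch does not clamp generated values to it.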
15.4.3 Comparison with @new Mode
| Feature | @new Mode | @augment Mode |
|---|---|---|
| Input DataFrame | Not required | Required |
| Strategy Definition | Manual (you define) | Automatic (learned) |
| Use Case | Create from scratch | Expand existing data |
| Pattern Source | Your specifications | Existing data |
15.5 Parameters
15.5.1 Required Parameters
- mode: '@augment' to add to existing data
- df: Existing DataFrame to augment
- n: Number of rows to add
15.5.2 Optional Parameters
- seed: Random seed for reproducibility (default: 42)
- logging: Enable detailed logging (default: False)
- as_type: Force output type ('pandas' or 'polars')
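The effect of a seed can be illustrated with Python's own `random` module: the same seed reproduces the same values, while a different seed gives a different draw. This is a general sketch of seeded generation, assuming additory's `seed` behaves in the usual way; `synthetic_values` and its distribution parameters are hypothetical.

```python
import random


def synthetic_values(n, seed=42):
    """Draw n values from a seeded generator (illustrative mean and spread)."""
    rng = random.Random(seed)
    return [rng.gauss(150.0, 50.0) for _ in range(n)]


first = synthetic_values(3, seed=42)
second = synthetic_values(3, seed=42)
other = synthetic_values(3, seed=7)

print(first == second)  # True: same seed, same values
print(first == other)   # False: different seed, different values
```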
15.5.3 Positional Parameters
# mode and df can be passed positionally; n and seed are passed by keyword:
result = add.synthetic('@augment', existing_df, n=100, seed=42)
15.6 Common Use Cases
15.6.1 1. Test Data Generation
# Take 5 production rows, generate 95 test rows
test_data = add.synthetic('@augment', production_sample, n=95)
15.6.2 2. Data Balancing
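# Oversample the minority class to reduce class imbalance
minority_class = df[df['label'] == 'rare']
augmented_minority = add.synthetic('@augment', minority_class, n=1000)
# Recombine with the remaining rows to get the rebalanced dataset (assumes pandas)
balanced = pd.concat([df[df['label'] != 'rare'], augmented_minority], ignore_index=True)
15.6.3 3. Privacy-Safe Sharing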
# Generate synthetic rows that match the patterns in the sensitive data
synthetic = add.synthetic('@augment', sensitive_data, n=10000)
# The result still contains the original rows: drop them, then share the synthetic rows instead of the real data
15.6.4 4. Stress Testing
# Generate large datasets for performance testing
large_test = add.synthetic('@augment', small_sample, n=1_000_000)
15.7 Next Steps
- Page 1: Basic synthetic data with sequences and linked lists
- Page 2: Distribution strategies (normal, exponential, etc.)
- Page 4: Real-world scenarios and advanced patterns