13 Basic Synthetic Data Generation
Generate synthetic data from scratch using the @new mode.
13.1 Example 1: Create Synthetic DataFrame
Create a synthetic DataFrame with multiple column types - sequences, distributions, and categorical values.
import pandas as pd
import additory as add
result = add.synthetic(
    '@new',  # Mode: create from scratch
    n=5,     # Number of rows to generate
    strategy={
        'id': {'type': 'sequence', 'start': 1, 'step': 1},
        'age': {'distribution': 'uniform', 'min': 18, 'max': 65},
        'status': {'values': ['active', 'inactive', 'pending']}
    },
    seed=42  # For reproducible results
)
print(result)

Output:
id age status
0 1 42.748198 inactive
1 2 43.508085 inactive
2 3 47.913860 inactive
3 4 37.077383 active
4 5 19.614112 inactive
Key Points:
- The @new mode creates a DataFrame from scratch (no input DataFrame needed).
- The n parameter specifies the number of rows.
- The seed parameter ensures reproducible results.
- Different strategies can be mixed across columns.
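To make the role of seed concrete, here is a minimal sketch in plain Python. It does not use additory or reproduce its internals; generate_ages is a hypothetical helper, and the only assumption is the general behaviour of seeded pseudo-random generators: the same seed yields the same values.

```python
import random

def generate_ages(n, seed=None, low=18, high=65):
    """Hypothetical helper: draw n uniform values from a dedicated, seeded generator."""
    rng = random.Random(seed)  # private generator, so the global random state is untouched
    return [rng.uniform(low, high) for _ in range(n)]

first = generate_ages(5, seed=42)
second = generate_ages(5, seed=42)
different = generate_ages(5, seed=7)

print(first == second)     # True: same seed, same values
print(first == different)  # False: a different seed gives a different sample
```

The values differ from the output shown above, since the library's own random generator is not replicated here.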
Note: This also works with polars DataFrames.
13.2 Example 2: Linked Lists - Symptoms and Medications
Linked lists ensure realistic combinations by maintaining valid relationships between values. They suit symptoms→medications, products→categories, or any other hierarchical data.
import pandas as pd
import additory as add
result = add.synthetic(
    '@new',
    n=6,
    strategy={
        'patient_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'symptom': {
            'type': 'linked_list',
            'levels': [
                ['Headache', 'Headache', 'Nausea', 'Nausea', 'Fever', 'Fever'],
                ['Ibuprofen', 'Acetaminophen', 'Ondansetron', 'Ginger', 'Aspirin', 'Paracetamol']
            ]
        }
    },
    seed=42
)
print(result)

Output:
patient_id symptom
0 1 Nausea Ginger
1 2 Nausea Ginger
2 3 Nausea Ondansetron
3 4 Headache Ibuprofen
4 5 Nausea Ondansetron
5 6 Fever Aspirin
How Linked Lists Work:
- Each level is a list of values, and all levels have the same length.
- Values at the same index across levels form a valid combination:
  - Index 0: Headache → Ibuprofen
  - Index 1: Headache → Acetaminophen
  - Index 2: Nausea → Ondansetron
  - Index 3: Nausea → Ginger
  - Index 4: Fever → Aspirin
  - Index 5: Fever → Paracetamol

Why This is Powerful:
- Ensures only valid combinations (no “Headache → Aspirin” if it is not in your list).
- Simple to define: just list valid combinations at matching indices.
- Generates realistic test data automatically.
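The index-matching idea can be sketched in plain Python. This is an illustration of the concept, not additory's implementation: sample_linked is a hypothetical function, and it assumes every index is equally likely to be drawn.

```python
import random

levels = [
    ['Headache', 'Headache', 'Nausea', 'Nausea', 'Fever', 'Fever'],
    ['Ibuprofen', 'Acetaminophen', 'Ondansetron', 'Ginger', 'Aspirin', 'Paracetamol'],
]

def sample_linked(levels, n, seed=None):
    """Hypothetical sketch: pick one shared index per row, then read every level at it."""
    rng = random.Random(seed)
    width = len(levels[0])  # all levels are assumed to have this length
    rows = []
    for _ in range(n):
        i = rng.randrange(width)  # one index for the whole row
        rows.append(tuple(level[i] for level in levels))
    return rows

rows = sample_linked(levels, n=6, seed=42)
valid = set(zip(*levels))  # the six combinations defined by matching indices

print(all(row in valid for row in rows))  # True
```

Because a single index is drawn per row, a pairing such as Headache with Aspirin can never appear: no index holds both values.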
Note: This also works with polars DataFrames.
13.3 Example 3: Linked Lists - Geographic Hierarchy
Linked lists work with any number of levels. Here’s a 3-level example showing country→state→city relationships, including cities with the same name in different locations (Salem appears in both USA and India).
import pandas as pd
import additory as add
result = add.synthetic(
    '@new',
    n=6,
    strategy={
        'location_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'location': {
            'type': 'linked_list',
            'levels': [
                ['USA', 'USA', 'USA', 'India', 'India', 'India'],
                ['Oregon', 'Massachusetts', 'Virginia', 'TamilNadu', 'Karnataka', 'UttarPradesh'],
                ['Salem', 'Salem', 'Salem', 'Salem', 'Bangalore', 'Lucknow']
            ]
        }
    },
    seed=42
)
print(result)

Output:
location_id location
0 1 India TamilNadu Salem
1 2 India TamilNadu Salem
2 3 USA Virginia Salem
3 4 USA Oregon Salem
4 5 USA Virginia Salem
5 6 India Karnataka Bangalore
Valid Combinations:
- USA → Oregon → Salem
- USA → Massachusetts → Salem
- USA → Virginia → Salem
- India → TamilNadu → Salem
- India → Karnataka → Bangalore
- India → UttarPradesh → Lucknow
Key Insight: Salem appears 4 times in the levels (indices 0-3), but each Salem is correctly paired with its country and state. This demonstrates how linked lists handle ambiguous values by maintaining the full context.
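A short plain-Python sketch shows why the full context matters. It uses only the levels from this example and the standard library; the names combos and by_city are illustrative and not part of additory.

```python
from collections import defaultdict

levels = [
    ['USA', 'USA', 'USA', 'India', 'India', 'India'],
    ['Oregon', 'Massachusetts', 'Virginia', 'TamilNadu', 'Karnataka', 'UttarPradesh'],
    ['Salem', 'Salem', 'Salem', 'Salem', 'Bangalore', 'Lucknow'],
]

# Zipping the levels yields one (country, state, city) tuple per index.
combos = list(zip(*levels))

# Grouping by city alone shows the ambiguity that the full tuple resolves.
by_city = defaultdict(list)
for country, state, city in combos:
    by_city[city].append((country, state))

print(len(by_city['Salem']))  # 4
print(by_city['Bangalore'])   # [('India', 'Karnataka')]
```

Looking up "Salem" on its own returns four candidate locations, while each complete tuple identifies exactly one.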
Note: This also works with polars DataFrames.
13.4 Strategy Types
13.4.1 Sequence
{'type': 'sequence', 'start': 1, 'step': 1}

Generates sequential integers: 1, 2, 3, 4, …
13.4.2 Uniform Distribution
{'distribution': 'uniform', 'min': 18, 'max': 65}

Generates random numbers uniformly distributed between min and max.
13.4.3 Categorical
{'values': ['active', 'inactive', 'pending']}

Randomly selects from the provided list of values.
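As a rough mental model, the sequence, uniform, and categorical strategies from 13.4.1 to 13.4.3 can be mimicked with the standard library. The three functions below are hypothetical stand-ins, not additory's code, and the categorical sketch assumes every value is equally likely, which the library may or may not do.

```python
import random

def sequence(n, start=1, step=1):
    """Sequential integers: start, start + step, start + 2 * step, ..."""
    return [start + i * step for i in range(n)]

def uniform(n, low, high, rng):
    """Floats drawn uniformly between low and high."""
    return [rng.uniform(low, high) for _ in range(n)]

def categorical(n, values, rng):
    """One value picked per row, each assumed equally likely."""
    return [rng.choice(values) for _ in range(n)]

rng = random.Random(42)  # one shared seeded generator for all random columns
columns = {
    'id': sequence(5, start=1, step=1),
    'age': uniform(5, 18, 65, rng),
    'status': categorical(5, ['active', 'inactive', 'pending'], rng),
}

print(columns['id'])  # [1, 2, 3, 4, 5]
```

The resulting dictionary of equal-length lists has the shape that pandas.DataFrame accepts directly, which is one plausible way to picture how per-column strategies combine into a table.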
13.4.4 Linked List
{
    'type': 'linked_list',
    'levels': [
        ['Level1_Value1', 'Level1_Value2', ...],
        ['Level2_Value1', 'Level2_Value2', ...],
        ...
    ]
}

Maintains valid relationships between hierarchical values. All levels must have the same length.
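Since all levels must share one length, it can help to check a levels definition before passing it on. The check below is a hypothetical pre-flight helper written for this page; how additory itself reports mismatched levels is not covered here.

```python
def validate_levels(levels):
    """Hypothetical check: return the common length, or raise if the levels disagree."""
    if not levels:
        raise ValueError("levels must contain at least one list")
    lengths = [len(level) for level in levels]
    if len(set(lengths)) != 1:
        raise ValueError(f"all levels must have the same length, got {lengths}")
    return lengths[0]

ok = [
    ['Headache', 'Nausea'],
    ['Ibuprofen', 'Ginger'],
]
print(validate_levels(ok))  # 2

broken = [
    ['Headache', 'Nausea'],
    ['Ibuprofen'],  # one value short, so index 1 has no partner
]
try:
    validate_levels(broken)
except ValueError as exc:
    print(exc)  # all levels must have the same length, got [2, 1]
```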
13.5 Parameters
13.5.1 Required Parameters
- mode: '@new' to create from scratch
- n: Number of rows to generate
- strategy: Dictionary mapping column names to generation strategies
13.5.2 Optional Parameters
- seed: Random seed for reproducibility (default: 42)
- logging: Enable detailed logging (default: False)
- as_type: Force output type ('pandas' or 'polars')
13.5.3 Positional Parameters
# Also works without naming certain parameters:
result = add.synthetic('@new', n=100, strategy={...}, seed=42)

13.6 Next Steps
- Page 2: More distribution strategies (normal, lognormal, exponential, etc.)
- Page 3: Augment mode - add synthetic rows to existing DataFrames
- Page 4: Real-world scenarios and advanced patterns