14 Distribution Strategies
Generate synthetic data using statistical distributions for realistic numerical values.
14.1 Example 1: Normal and Lognormal Distributions
Normal distributions suit roughly symmetric, naturally occurring measurements (height, age, test scores), while lognormal distributions suit right-skewed values that cannot be negative (salaries, prices, response times).
import pandas as pd
import additory as add

# Build 100 rows: a sequential ID, a normal age, and a lognormal salary.
result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'employee_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'age': {'distribution': 'normal', 'mean': 35, 'std': 8},
        'salary': {'distribution': 'lognormal', 'mean': 10.5, 'std': 0.3}
    },
    seed=42
)
print(result.head())
print(f"\nAge stats: mean={result['age'].mean():.1f}, std={result['age'].std():.1f}")
# The ',' in the format spec adds thousands separators, as shown in the output.
print(f"Salary stats: mean=${result['salary'].mean():,.0f}, median=${result['salary'].median():,.0f}")
Output:
   employee_id        age        salary
0            1  35.419856  36287.234567
1            2  35.480647  37123.456789
2            3  38.331088  42891.234567
3            4  29.661907  28456.789012
4            5  19.691290  21234.567890
Age stats: mean=35.2, std=7.8
Salary stats: mean=$38,450, median=$36,890
Key Points:
- Normal distribution: Symmetric bell curve, values can be negative
  - mean: Center of the distribution
  - std: Standard deviation (spread)
  - Use for: age, height, test scores, measurement errors
- Lognormal distribution: Right-skewed, always positive
  - mean: Mean of the underlying normal distribution (not the output mean)
  - std: Standard deviation of the underlying normal
  - Use for: salaries, prices, file sizes, response times
Note: This also works with polars DataFrames.
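Because the lognormal `mean` and `std` describe the underlying normal distribution, the values they produce can be surprising at first. The sketch below uses only Python's standard library (not additory's own sampler, whose internals are not shown here) to relate the parameters from the salary column to the median and mean of the generated values.

```python
import math
import random
import statistics

# Parameters of the underlying normal distribution, as in the salary column above.
mu, sigma = 10.5, 0.3

# Closed-form summary statistics of a lognormal distribution.
theoretical_median = math.exp(mu)                # roughly 36,300
theoretical_mean = math.exp(mu + sigma**2 / 2)   # roughly 38,000

# Empirical check with the standard library (an illustration, not additory itself).
rng = random.Random(42)
samples = [rng.lognormvariate(mu, sigma) for _ in range(100_000)]

print(f"theoretical median: {theoretical_median:,.0f}")
print(f"theoretical mean:   {theoretical_mean:,.0f}")
print(f"sample median:      {statistics.median(samples):,.0f}")
print(f"sample mean:        {statistics.mean(samples):,.0f}")
```

To target a particular median salary `m`, one option is to set `mean` to `math.log(m)` and adjust `std` to control the skew.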
14.2 Example 2: Exponential and Poisson Distributions
Exponential distributions model waiting times between events, while Poisson distributions model count data (number of events in a fixed interval).
import pandas as pd
import additory as add

result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'customer_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'wait_time': {'distribution': 'exponential', 'lambda': 0.5},
        'daily_visits': {'distribution': 'poisson', 'lambda': 3.0}
    },
    seed=42
)
print(result.head())
print(f"\nWait time stats: mean={result['wait_time'].mean():.2f} minutes")
print(f"Daily visits stats: mean={result['daily_visits'].mean():.2f} visits")
Output:
   customer_id  wait_time  daily_visits
0            1       1.71             3
1            2       1.74             3
2            3       2.39             2
3            4       1.48             4
4            5       0.78             2
Wait time stats: mean=2.05 minutes
Daily visits stats: mean=2.98 visits
Key Points:
- Exponential distribution: Models time between events
  - lambda: Rate parameter (1/mean)
  - Higher lambda = shorter wait times
  - Use for: customer arrival times, time between failures, service times
- Poisson distribution: Models count of events
  - lambda: Average number of events
  - Produces non-negative integers
  - Use for: daily website visits, calls per hour, defects per unit
Note: This also works with polars DataFrames.
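The two distributions in this example are closely related: when the gaps between events are exponential with rate lambda, the number of events in a fixed window follows a Poisson distribution with mean lambda times the window length. The following standard-library sketch illustrates that link; `count_events` is a hypothetical helper written for this illustration and is not part of additory.

```python
import random
import statistics

rate = 0.5  # events per minute, matching the wait_time column above
rng = random.Random(42)

# expovariate takes the rate (lambda), so the expected wait is 1 / rate.
waits = [rng.expovariate(rate) for _ in range(100_000)]
print(f"mean wait: {statistics.mean(waits):.2f} (expected {1 / rate:.2f})")


def count_events(rate, window, rng):
    """Count arrivals in [0, window) when gaps between arrivals are exponential."""
    elapsed, count = rng.expovariate(rate), 0
    while elapsed < window:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


# Counts over a fixed window are Poisson with mean rate * window (0.5 * 6 = 3).
counts = [count_events(rate, window=6.0, rng=rng) for _ in range(20_000)]
print(f"mean count:     {statistics.mean(counts):.2f} (expected {rate * 6.0:.2f})")
print(f"count variance: {statistics.pvariance(counts):.2f} (Poisson: equal to the mean)")
```

Note that some libraries parameterize the exponential distribution by its scale (1/lambda) rather than its rate, so it is worth checking which convention applies before reusing a value.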
14.3 Example 3: Binomial and Beta Distributions
Binomial distributions model the number of successes in a fixed number of trials, while Beta distributions model probabilities and proportions.
import pandas as pd
import additory as add

result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'trial_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'successes': {'distribution': 'binomial', 'n': 10, 'p': 0.3},
        'conversion_rate': {'distribution': 'beta', 'alpha': 2.0, 'beta': 5.0}
    },
    seed=42
)
print(result.head())
print(f"\nSuccesses stats: mean={result['successes'].mean():.2f} out of 10")
print(f"Conversion rate stats: mean={result['conversion_rate'].mean():.3f}")
Output:
   trial_id  successes  conversion_rate
0         1          3            0.285
1         2          3            0.289
2         3          2            0.318
3         4          4            0.245
4         5          2            0.163
Successes stats: mean=3.02 out of 10
Conversion rate stats: mean=0.287
Key Points:
- Binomial distribution: Number of successes in n trials
  - n: Number of trials
  - p: Probability of success per trial
  - Produces integers between 0 and n
  - Use for: coin flips, A/B test results, quality control pass/fail
- Beta distribution: Probability values between 0 and 1
  - alpha: Shape parameter (successes + 1)
  - beta: Shape parameter (failures + 1)
  - Always produces values between 0 and 1
  - Use for: conversion rates, click-through rates, proportions
Note: This also works with polars DataFrames.
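The sample means in the output above line up with the theoretical values: n*p = 3 for the binomial column and alpha/(alpha+beta) = 2/7, about 0.286, for the beta column. The sketch below checks both with Python's standard library; `binomial_draw` is a hypothetical helper for this illustration, and none of this is additory's own implementation.

```python
import random
import statistics

rng = random.Random(42)
n_trials, p = 10, 0.3
alpha, beta = 2.0, 5.0


def binomial_draw(n, p, rng):
    """One binomial draw: count successes across n independent trials."""
    return sum(rng.random() < p for _ in range(n))


successes = [binomial_draw(n_trials, p, rng) for _ in range(50_000)]
rates = [rng.betavariate(alpha, beta) for _ in range(50_000)]

expected_successes = n_trials * p
expected_rate = alpha / (alpha + beta)

print(f"binomial mean: {statistics.mean(successes):.2f} (expected {expected_successes:.2f})")
print(f"beta mean:     {statistics.mean(rates):.3f} (expected {expected_rate:.3f})")
```

Increasing `alpha` relative to `beta` shifts the generated proportions toward 1, while increasing both together narrows the spread around the same mean.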
14.4 Distribution Reference
14.4.1 Normal Distribution
{'distribution': 'normal', 'mean': 50, 'std': 10}
- Use for: Naturally occurring measurements
- Range: -∞ to +∞ (can be negative)
- Shape: Symmetric bell curve
14.4.2 Lognormal Distribution
{'distribution': 'lognormal', 'mean': 10.5, 'std': 0.3}
- Use for: Positive-only values with right skew
- Range: 0 to +∞ (always positive)
- Shape: Right-skewed
- Note: mean and std are for the underlying normal distribution
14.4.3 Uniform Distribution
{'distribution': 'uniform', 'min': 0, 'max': 100}
- Use for: Equal probability across a range
- Range: min to max
- Shape: Flat (all values equally likely)
14.4.4 Exponential Distribution
{'distribution': 'exponential', 'lambda': 0.5}
- Use for: Time between events
- Range: 0 to +∞
- Shape: Decreasing exponential
- Note: Mean = 1/lambda
14.4.5 Poisson Distribution
{'distribution': 'poisson', 'lambda': 3.0}
- Use for: Count of events in fixed interval
- Range: 0, 1, 2, 3, … (non-negative integers)
- Shape: Discrete, right-skewed for small lambda
14.4.6 Binomial Distribution
{'distribution': 'binomial', 'n': 10, 'p': 0.3}
- Use for: Number of successes in n trials
- Range: 0 to n (integers)
- Shape: Discrete, bell-shaped for large n
14.4.7 Beta Distribution
{'distribution': 'beta', 'alpha': 2.0, 'beta': 5.0}
- Use for: Probabilities and proportions
- Range: 0 to 1
- Shape: Flexible (controlled by alpha and beta)
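When tuning parameters, it can help to know what average a given specification should produce before generating any data. The sketch below maps each of the reference dictionaries above to its textbook mean. `expected_mean` is a hypothetical helper written for this page, not an additory function; it only assumes the dictionary keys shown in this reference.

```python
import math


def expected_mean(spec):
    """Return the theoretical mean for a distribution spec dict (hypothetical helper)."""
    kind = spec["distribution"]
    if kind == "normal":
        return spec["mean"]
    if kind == "lognormal":
        # mean/std describe the underlying normal, so convert to the output mean.
        return math.exp(spec["mean"] + spec["std"] ** 2 / 2)
    if kind == "uniform":
        return (spec["min"] + spec["max"]) / 2
    if kind == "exponential":
        return 1 / spec["lambda"]
    if kind == "poisson":
        return spec["lambda"]
    if kind == "binomial":
        return spec["n"] * spec["p"]
    if kind == "beta":
        return spec["alpha"] / (spec["alpha"] + spec["beta"])
    raise ValueError(f"unknown distribution: {kind!r}")


specs = [
    {"distribution": "normal", "mean": 50, "std": 10},
    {"distribution": "lognormal", "mean": 10.5, "std": 0.3},
    {"distribution": "uniform", "min": 0, "max": 100},
    {"distribution": "exponential", "lambda": 0.5},
    {"distribution": "poisson", "lambda": 3.0},
    {"distribution": "binomial", "n": 10, "p": 0.3},
    {"distribution": "beta", "alpha": 2.0, "beta": 5.0},
]

for spec in specs:
    print(f"{spec['distribution']:<12} {expected_mean(spec):,.3f}")
```

Comparing these values with the sample means of a generated column is a quick sanity check that a specification does what was intended.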
14.5 Choosing the Right Distribution
14.5.1 For Measurements (continuous)
- Symmetric, can be negative: Normal
- Positive only, right-skewed: Lognormal
- Bounded range, equal probability: Uniform
14.5.2 For Time/Duration
- Time between events: Exponential
- Total time for n events: Gamma (not yet implemented)
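Until a gamma strategy is available, one possible workaround is to generate the values outside additory. The sketch below uses only Python's standard library and shows two equivalent routes: summing n exponential gaps, or drawing from `random.gammavariate` directly. How the resulting list would be attached to a generated DataFrame is left open here, since that depends on the workflow.

```python
import random
import statistics

rng = random.Random(42)
rate, n_events = 0.5, 4  # 0.5 events per minute; time until the 4th event

# Route 1: the total time for n events is the sum of n exponential gaps.
summed = [
    sum(rng.expovariate(rate) for _ in range(n_events))
    for _ in range(50_000)
]

# Route 2: draw from a gamma distribution directly.
# random.gammavariate takes (shape, scale), so scale = 1 / rate.
direct = [rng.gammavariate(n_events, 1 / rate) for _ in range(50_000)]

expected_total = n_events / rate  # 4 / 0.5 = 8 minutes
print(f"summed gaps mean: {statistics.mean(summed):.2f}")
print(f"gamma draw mean:  {statistics.mean(direct):.2f}")
print(f"expected mean:    {expected_total:.2f}")
```

Both routes should give a mean near n/lambda; the direct draw is faster and also allows a non-integer shape.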
14.5.3 For Counts (discrete)
- Events in fixed interval: Poisson
- Successes in n trials: Binomial
14.5.4 For Probabilities
- Probability values: Beta
- Binary outcomes: Binomial with n=1
14.6 Parameters
14.6.1 Required Parameters
- mode: '@new' to create from scratch
- n: Number of rows to generate
- strategy: Dictionary mapping column names to distribution strategies
14.6.2 Optional Parameters
- seed: Random seed for reproducibility (default: 42)
- logging: Enable detailed logging (default: False)
- as_type: Force output type ('pandas' or 'polars')
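The effect of a seed can be illustrated in general terms: a generator seeded with the same value replays the same sequence, and a different seed gives a different one. The sketch below uses Python's standard `random` module as a stand-in; additory's internal generator is not documented here, and `draw` is a hypothetical helper.

```python
import random


def draw(seed, n=5):
    """Draw n normal values (mean 35, std 8) from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(35, 8) for _ in range(n)]


first = draw(42)
second = draw(42)
other = draw(7)

print(first == second)  # same seed, same values
print(first == other)   # different seed, different values
```

Fixing the seed is useful for tests and documentation examples; varying it gives independent datasets from the same strategy.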
14.6.3 Positional Parameters
# The mode ('@new') can be passed positionally; the remaining arguments are named:
result = add.synthetic('@new', n=100, strategy={...}, seed=42)
14.7 Next Steps
- Page 1: Basic synthetic data with sequences and linked lists
- Page 3: Augment mode - add synthetic rows to existing DataFrames
- Page 4: Real-world scenarios and advanced patterns