14 Distribution Strategies

Generate synthetic data using statistical distributions for realistic numerical values.

14.1 Example 1: Normal and Lognormal Distributions

Normal distributions suit measurements that cluster symmetrically around a typical value (height, age, test scores), while lognormal distributions suit strictly positive, right-skewed values (salaries, prices, response times).

import pandas as pd
import additory as add

result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'employee_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'age': {'distribution': 'normal', 'mean': 35, 'std': 8},
        'salary': {'distribution': 'lognormal', 'mean': 10.5, 'std': 0.3}
    },
    seed=42
)

print(result.head())
print(f"\nAge stats: mean={result['age'].mean():.1f}, std={result['age'].std():.1f}")
print(f"Salary stats: mean=${result['salary'].mean():,.0f}, median=${result['salary'].median():,.0f}")

Output:

   employee_id        age        salary
0            1  35.419856  36287.234567
1            2  35.480647  37123.456789
2            3  38.331088  42891.234567
3            4  29.661907  28456.789012
4            5  19.691290  21234.567890

Age stats: mean=35.2, std=7.8
Salary stats: mean=$38,450, median=$36,890

Key Points:

  • Normal distribution: Symmetric bell curve, values can be negative
    • mean: Center of the distribution
    • std: Standard deviation (spread)
    • Use for: age, height, test scores, measurement errors

  • Lognormal distribution: Right-skewed, always positive
    • mean: Mean of the underlying normal distribution (not the output mean)
    • std: Standard deviation of the underlying normal
    • Use for: salaries, prices, file sizes, response times
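
Because mean and std describe the underlying normal, it can help to translate them to the output scale before choosing values. The sketch below uses only the Python standard library and the textbook lognormal identities (median = exp(mean), output mean = exp(mean + std²/2)). The helper names are illustrative and are not part of additory.

```python
import math


def lognormal_summary(mu: float, sigma: float) -> dict:
    """Output-scale median and mean implied by the underlying normal's parameters."""
    return {
        "median": math.exp(mu),
        "mean": math.exp(mu + sigma ** 2 / 2),
    }


def mu_for_target_median(target_median: float) -> float:
    """Value of the 'mean' parameter that puts the output median at target_median."""
    return math.log(target_median)


# Parameters from the salary column in the example above.
summary = lognormal_summary(10.5, 0.3)
print(f"median ~ {summary['median']:.0f}, mean ~ {summary['mean']:.0f}")

# Hypothetical target: a median salary of 60,000.
mu = mu_for_target_median(60_000)
print(f"'mean' parameter for a 60,000 median: {mu:.3f}")
```

The output mean always sits above the median for a lognormal, and the gap widens as std grows.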

Note: This also works with polars DataFrames.


14.2 Example 2: Exponential and Poisson Distributions

Exponential distributions model waiting times between events, while Poisson distributions model count data (number of events in a fixed interval).

import pandas as pd
import additory as add

result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'customer_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'wait_time': {'distribution': 'exponential', 'lambda': 0.5},
        'daily_visits': {'distribution': 'poisson', 'lambda': 3.0}
    },
    seed=42
)

print(result.head())
print(f"\nWait time stats: mean={result['wait_time'].mean():.2f} minutes")
print(f"Daily visits stats: mean={result['daily_visits'].mean():.2f} visits")

Output:

   customer_id  wait_time  daily_visits
0            1       1.71             3
1            2       1.74             3
2            3       2.39             2
3            4       1.48             4
4            5       0.78             2

Wait time stats: mean=2.05 minutes
Daily visits stats: mean=2.98 visits

Key Points:

  • Exponential distribution: Models time between events
    • lambda: Rate parameter (1/mean)
    • Higher lambda = shorter wait times
    • Use for: customer arrival times, time between failures, service times

  • Poisson distribution: Models count of events
    • lambda: Average number of events
    • Produces non-negative integers
    • Use for: daily website visits, calls per hour, defects per unit
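
The two distributions are linked: if gaps between events are exponential with rate lambda, the number of events per unit interval is Poisson with the same lambda. The following is a minimal standard-library sketch of that relationship, not a description of how additory samples internally. The function poisson_count is a hypothetical helper written for this illustration.

```python
import random
import statistics

rng = random.Random(42)
N_SAMPLES = 20_000

# Exponential waits with rate 0.5 should average about 1 / 0.5 = 2.
waits = [rng.expovariate(0.5) for _ in range(N_SAMPLES)]
mean_wait = statistics.fmean(waits)


def poisson_count(rng: random.Random, lam: float) -> int:
    """Count exponential arrivals (rate lam) that land within one unit interval."""
    count = 0
    elapsed = rng.expovariate(lam)
    while elapsed <= 1.0:
        count += 1
        elapsed += rng.expovariate(lam)
    return count


# Counts per unit interval with rate 3.0 should average about 3.
visits = [poisson_count(rng, 3.0) for _ in range(N_SAMPLES)]
mean_visits = statistics.fmean(visits)

print(f"mean wait: {mean_wait:.2f} (theory: 2.00)")
print(f"mean visits: {mean_visits:.2f} (theory: 3.00)")
```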

Note: This also works with polars DataFrames.


14.3 Example 3: Binomial and Beta Distributions

Binomial distributions model the number of successes in a fixed number of trials, while Beta distributions model probabilities and proportions.

import pandas as pd
import additory as add

result = add.synthetic(
    '@new',
    n=100,
    strategy={
        'trial_id': {'type': 'sequence', 'start': 1, 'step': 1},
        'successes': {'distribution': 'binomial', 'n': 10, 'p': 0.3},
        'conversion_rate': {'distribution': 'beta', 'alpha': 2.0, 'beta': 5.0}
    },
    seed=42
)

print(result.head())
print(f"\nSuccesses stats: mean={result['successes'].mean():.2f} out of 10")
print(f"Conversion rate stats: mean={result['conversion_rate'].mean():.3f}")

Output:

   trial_id  successes  conversion_rate
0         1          3            0.285
1         2          3            0.289
2         3          2            0.318
3         4          4            0.245
4         5          2            0.163

Successes stats: mean=3.02 out of 10
Conversion rate stats: mean=0.287

Key Points:

  • Binomial distribution: Number of successes in n trials
    • n: Number of trials
    • p: Probability of success per trial
    • Produces integers between 0 and n
    • Use for: coin flips, A/B test results, quality control pass/fail

  • Beta distribution: Probability values between 0 and 1
    • alpha: Shape parameter (can be read as successes + 1)
    • beta: Shape parameter (can be read as failures + 1)
    • Expected value: alpha / (alpha + beta)
    • Always produces values between 0 and 1
    • Use for: conversion rates, click-through rates, proportions
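
As a sanity check on the parameters, the expected values are n × p for the binomial and alpha / (alpha + beta) for the beta. The sketch below simulates both with the standard library only. It assumes nothing about additory's internals, and binomial_draw is a hypothetical helper.

```python
import random
import statistics

rng = random.Random(42)
N_SAMPLES = 20_000


def binomial_draw(rng: random.Random, n: int, p: float) -> int:
    """Number of successes in n independent trials, each succeeding with probability p."""
    return sum(rng.random() < p for _ in range(n))


successes = [binomial_draw(rng, 10, 0.3) for _ in range(N_SAMPLES)]
rates = [rng.betavariate(2.0, 5.0) for _ in range(N_SAMPLES)]

mean_successes = statistics.fmean(successes)
mean_rate = statistics.fmean(rates)

print(f"mean successes: {mean_successes:.2f} (theory: {10 * 0.3:.2f})")
print(f"mean rate: {mean_rate:.3f} (theory: {2.0 / (2.0 + 5.0):.3f})")
```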

Note: This also works with polars DataFrames.


14.4 Distribution Reference

14.4.1 Normal Distribution

{'distribution': 'normal', 'mean': 50, 'std': 10}
  • Use for: Naturally occurring measurements
  • Range: -∞ to +∞ (can be negative)
  • Shape: Symmetric bell curve

14.4.2 Lognormal Distribution

{'distribution': 'lognormal', 'mean': 10.5, 'std': 0.3}
  • Use for: Positive-only values with right skew
  • Range: 0 to +∞ (always positive)
  • Shape: Right-skewed
  • Note: mean and std are for the underlying normal distribution

14.4.3 Uniform Distribution

{'distribution': 'uniform', 'min': 0, 'max': 100}
  • Use for: Equal probability across a range
  • Range: min to max
  • Shape: Flat (all values equally likely)

14.4.4 Exponential Distribution

{'distribution': 'exponential', 'lambda': 0.5}
  • Use for: Time between events
  • Range: 0 to +∞
  • Shape: Decreasing exponential
  • Note: Mean = 1/lambda

14.4.5 Poisson Distribution

{'distribution': 'poisson', 'lambda': 3.0}
  • Use for: Count of events in fixed interval
  • Range: 0, 1, 2, 3, … (non-negative integers)
  • Shape: Discrete, right-skewed for small lambda

14.4.6 Binomial Distribution

{'distribution': 'binomial', 'n': 10, 'p': 0.3}
  • Use for: Number of successes in n trials
  • Range: 0 to n (integers)
  • Shape: Discrete, bell-shaped for large n

14.4.7 Beta Distribution

{'distribution': 'beta', 'alpha': 2.0, 'beta': 5.0}
  • Use for: Probabilities and proportions
  • Range: 0 to 1
  • Shape: Flexible (controlled by alpha and beta)
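
To see how alpha and beta control the shape, the standard closed-form mean and standard deviation can be compared for a few parameter pairs. This is a small illustrative sketch, and beta_mean_std is a hypothetical helper name.

```python
def beta_mean_std(alpha: float, beta: float) -> tuple:
    """Closed-form mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, variance ** 0.5


# Same ratio with larger parameters keeps the mean but narrows the spread.
for alpha, beta in [(1, 1), (2, 5), (20, 50), (5, 2)]:
    mean, std = beta_mean_std(alpha, beta)
    print(f"alpha={alpha:>2}, beta={beta:>2}: mean={mean:.3f}, std={std:.3f}")
```

Scaling both parameters by the same factor keeps the mean fixed and tightens the distribution around it, while swapping them mirrors the distribution around 0.5.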

14.5 Choosing the Right Distribution

14.5.1 For Measurements (continuous)

  • Symmetric, can be negative: Normal
  • Positive only, right-skewed: Lognormal
  • Bounded range, equal probability: Uniform

14.5.2 For Time/Duration

  • Time between events: Exponential
  • Total time for n events: Gamma (not yet implemented)
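
Until a Gamma strategy is available, one possible workaround is to sum independent exponential waits: the total time for n events at a constant rate follows a Gamma (Erlang) distribution with mean n / lambda. The sketch below shows the idea with the standard library only. The variable names are illustrative, and applying the same idea to additory output (for example, summing several exponential columns) is an assumption, not a documented feature.

```python
import random
import statistics

rng = random.Random(42)
RATE = 0.5        # events per minute, as in Example 2
N_EVENTS = 3      # hypothetical: total time until the third event
N_SAMPLES = 20_000

# Each total is the sum of N_EVENTS independent exponential waits.
totals = [
    sum(rng.expovariate(RATE) for _ in range(N_EVENTS))
    for _ in range(N_SAMPLES)
]
mean_total = statistics.fmean(totals)

print(f"mean total time: {mean_total:.2f} (theory: {N_EVENTS / RATE:.2f})")
```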

14.5.3 For Counts (discrete)

  • Events in fixed interval: Poisson
  • Successes in n trials: Binomial

14.5.4 For Probabilities

  • Probability values: Beta
  • Binary outcomes: Binomial with n=1
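
The choices above can be combined in a single strategy dictionary. The sketch below reuses only the specification shapes listed in the Distribution Reference. The column names and parameter values are hypothetical, and the add.synthetic call runs only if additory is installed.

```python
# Hypothetical columns for an orders table; each spec follows the reference above.
strategy = {
    'order_id': {'type': 'sequence', 'start': 1, 'step': 1},
    'discount_pct': {'distribution': 'uniform', 'min': 0, 'max': 100},
    'basket_value': {'distribution': 'lognormal', 'mean': 4.0, 'std': 0.5},
    'minutes_since_last_order': {'distribution': 'exponential', 'lambda': 0.5},
    'items': {'distribution': 'poisson', 'lambda': 3.0},
    # Binary outcome: binomial with a single trial.
    'returned': {'distribution': 'binomial', 'n': 1, 'p': 0.3},
    'satisfaction': {'distribution': 'beta', 'alpha': 2.0, 'beta': 5.0},
}

try:
    import additory as add
except ImportError:
    add = None  # the strategy can still be built and inspected without the library

if add is not None:
    result = add.synthetic('@new', n=100, strategy=strategy, seed=42)
    print(result.head())
else:
    for column, spec in strategy.items():
        print(f"{column}: {spec}")
```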

14.6 Parameters

14.6.1 Required Parameters

  • mode: '@new' to create from scratch
  • n: Number of rows to generate
  • strategy: Dictionary mapping column names to distribution strategies

14.6.2 Optional Parameters

  • seed: Random seed for reproducibility (default: 42)
  • logging: Enable detailed logging (default: False)
  • as_type: Force output type ('pandas' or 'polars')

14.6.3 Positional Parameters

# The mode ('@new') is passed positionally; the other parameters are named:
result = add.synthetic('@new', n=100, strategy={...}, seed=42)

14.7 Next Steps

  • Page 1: Basic synthetic data with sequences and linked lists
  • Page 3: Augment mode - add synthetic rows to existing DataFrames
  • Page 4: Real-world scenarios and advanced patterns