add.synthetic()

Generate synthetic data from scratch

What does add.synthetic() do?

The add.synthetic() function generates synthetic data from scratch, using a generation strategy that you specify for each column.

Common use cases: creating test data, building mock datasets, and generating sample data for development.

📖 Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| df | str | ✅ Yes | Use '@new' to create data from scratch |
| n_rows | int | ❌ No | Number of rows to generate (default: 5) |
| strategy | dict | ✅ Yes | Dictionary mapping column names to generation strategies |
| seed | int | ❌ No | Random seed for reproducibility (default: None) |

💡 Note: This is an alpha release. The function currently supports create mode ('@new') with generative strategies.

🚀 Example 1: Simplest Usage (Defaults Only)

Scenario: Generate a simple dataset with sequential IDs.

Generate 5 rows with auto-incrementing IDs
import additory as add

# Generate 5 rows (default) with incrementing IDs starting from 1
df = add.synthetic(
    '@new',
    strategy={'id': 'increment:start=1'}
)

print(df)
Output
   id
0   1
1   2
2   3
3   4
4   5

📦 Example 2: With Strategies

Scenario: Generate a more complex dataset with multiple columns using different strategies.

Generate 10 rows with multiple strategies
import additory as add

# Generate 10 rows with different strategies for each column
df = add.synthetic(
    '@new',
    n_rows=10,
    strategy={
        'id': 'increment:start=1',
        'age': 'range:18-65',
        'status': 'choice:[active,inactive,pending]'
    },
    seed=42
)

print(df)
Output
   id  age    status
0   1   52    active
1   2   33  inactive
2   3   45   pending
3   4   28    active
4   5   61  inactive
5   6   22   pending
6   7   39    active
7   8   56  inactive
8   9   31   pending
9  10   48    active

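Available Strategies (those demonstrated on this page):

'increment:start=1' produces sequential integers starting from the given value.
'range:18-65' produces values between the two bounds.
'choice:[active,inactive,pending]' picks each value from the listed options.
'lists@AE_CM_SEV' draws related columns from a linked list variable (see Example 3).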
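
The strategy strings in Example 2 can be read as small instructions, one per column. The stand-alone sketch below imitates that idea with Python's random module. It is not additory's implementation: the helper names generate_column and sketch_synthetic are hypothetical, and details such as inclusive range bounds and uniform choice are assumptions rather than documented behavior.

```python
import random


def generate_column(spec: str, n_rows: int, rng: random.Random) -> list:
    """Interpret one strategy string (illustrative only, not additory's code)."""
    kind, _, arg = spec.partition(":")
    if kind == "increment":
        # 'increment:start=1' -> 1, 2, 3, ...
        start = int(arg.split("=", 1)[1])
        return list(range(start, start + n_rows))
    if kind == "range":
        # 'range:18-65' -> random integers; inclusive bounds are an assumption
        low, high = (int(part) for part in arg.split("-", 1))
        return [rng.randint(low, high) for _ in range(n_rows)]
    if kind == "choice":
        # 'choice:[a,b,c]' -> one option per row; uniform picking is an assumption
        options = [item.strip() for item in arg.strip("[]").split(",")]
        return [rng.choice(options) for _ in range(n_rows)]
    raise ValueError(f"unknown strategy: {spec!r}")


def sketch_synthetic(strategy: dict, n_rows: int = 5, seed=None) -> dict:
    """Build a dict of column name -> list of values from strategy strings."""
    rng = random.Random(seed)
    return {name: generate_column(spec, n_rows, rng) for name, spec in strategy.items()}


columns = sketch_synthetic(
    {
        "id": "increment:start=1",
        "age": "range:18-65",
        "status": "choice:[active,inactive,pending]",
    },
    n_rows=10,
    seed=42,
)
print(columns["id"])  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

The random columns will not match the output shown in Example 2, since the library's generator and sampling rules are not specified on this page.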

🔗 Example 3: Linked Lists

Scenario: Generate data with semantic relationships using linked lists. This is useful for related data, such as adverse events with their medications and severity levels.

Define a linked list with column names
import additory as add

# Define a linked list with explicit column names
# Format: [Column_Names:[name1,name2,name3]]
# Then: [primary_key, [related_values1], [related_values2]]
AE_CM_SEV = [
    ['Column_Names:[adverse_event,medication,severity]'],
    ['Headache', ['Aspirin', 'Ibuprofen'], ['mild', 'moderate']],
    ['Nausea', ['Ondansetron'], ['severe']]
]

# Generate 10 rows using the linked list
df = add.synthetic(
    '@new',
    n_rows=10,
    strategy={'col1': 'lists@AE_CM_SEV'},
    seed=42
)

print(df)
Output
   adverse_event   medication  severity
0       Headache      Aspirin      mild
1       Headache      Aspirin      mild
2       Headache    Ibuprofen      mild
3         Nausea  Ondansetron    severe
4       Headache      Aspirin  moderate
5       Headache    Ibuprofen      mild
6         Nausea  Ondansetron    severe
7       Headache      Aspirin      mild
8       Headache    Ibuprofen  moderate
9         Nausea  Ondansetron    severe

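How Linked Lists Work:

The first row names the output columns, in the form 'Column_Names:[name1,name2,name3]'.
Each following row starts with a primary value, followed by one list of allowed values for each remaining column.
In the strategy dictionary, 'lists@AE_CM_SEV' refers to the linked list variable by name.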

💡 Tip: Linked lists preserve semantic relationships. In this example, "Headache" will only appear with "Aspirin" or "Ibuprofen", never with "Ondansetron" (which is linked to "Nausea").
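
To illustrate why a linked list keeps related values together, here is a minimal pure-Python sketch that samples rows from the same AE_CM_SEV structure. It does not call additory. The function sample_linked_rows is a hypothetical helper, and picking entries and related values uniformly at random is an assumption about the behavior, not something this page specifies.

```python
import random

AE_CM_SEV = [
    ['Column_Names:[adverse_event,medication,severity]'],
    ['Headache', ['Aspirin', 'Ibuprofen'], ['mild', 'moderate']],
    ['Nausea', ['Ondansetron'], ['severe']],
]


def sample_linked_rows(linked_list, n_rows, seed=None):
    """Sample rows so related values always come from the same entry."""
    rng = random.Random(seed)
    header, *entries = linked_list
    # 'Column_Names:[a,b,c]' -> ['a', 'b', 'c']
    names = header[0].split(":", 1)[1].strip("[]").split(",")
    rows = []
    for _ in range(n_rows):
        # Pick one entry first, then pick each related value from that entry only
        primary, *related = rng.choice(entries)
        values = [primary] + [rng.choice(options) for options in related]
        rows.append(dict(zip(names, values)))
    return rows


rows = sample_linked_rows(AE_CM_SEV, n_rows=10, seed=42)
for row in rows:
    print(row)
```

Because each row is built from a single entry, "Headache" can never be paired with "Ondansetron" in this sketch, which mirrors the guarantee described in the tip above.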

⚠️ Important Notes

Alpha Release: This is an alpha version. Currently supports create mode ('@new') only.
Linked Lists Scope: Linked list variables must be defined in the same scope (cell/function) as the synthetic() call.
Seed for Reproducibility: Use the seed parameter to get consistent results across runs.
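
The reproducibility note can be checked by generating twice with the same seed and comparing the results. The sketch below shows the pattern with a stand-in generator built on Python's random module. The name stand_in_generator is hypothetical, and applying the same check to add.synthetic() would require a comparison suited to whatever object it returns, which this page does not specify.

```python
import random


def stand_in_generator(n_rows, seed=None):
    """Stand-in for a seeded generator; not part of additory."""
    rng = random.Random(seed)
    return [rng.randint(18, 65) for _ in range(n_rows)]


# The same seed gives identical data on every call
first = stand_in_generator(10, seed=42)
second = stand_in_generator(10, seed=42)
print(first == second)  # True
```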

🎯 Quick Reference

Basic syntax templates
# Simple increment
df = add.synthetic('@new', strategy={'id': 'increment:start=1'})

# Multiple strategies
df = add.synthetic('@new', n_rows=10, strategy={
    'id': 'increment:start=1',
    'age': 'range:18-65',
    'status': 'choice:[active,inactive]'
}, seed=42)

# Linked lists
MY_LIST = [
    ['Column_Names:[col1,col2,col3]'],
    ['A', ['B'], ['C']]
]
df = add.synthetic('@new', n_rows=5, strategy={'col1': 'lists@MY_LIST'}, seed=42)