add.synth()

Generate synthetic data from schema files

What does add.synth() do?

The add.synth() function generates realistic synthetic data from a schema file in TOML format (.toml). Define your data structure once, then generate as many rows as you need.

Common use cases:

📖 Parameters

Parameter    | Type        | Required | Description
schema_path  | str         | ✅ Yes   | Path to the .toml schema file defining the data structure
rows         | int         | ❌ No    | Number of rows to generate (default: 1000)
engine       | str or None | ❌ No    | Output engine: "pandas" or "polars" (default: from config)

🚀 Example 1: Simple Customer Data

Scenario: You need a customer dataset for testing your e-commerce application.

Create customer schema file (customer.toml)
# Save this as customer.toml
[metadata]
name = "customer_data"
description = "E-commerce customer dataset"

[columns.customer_id]
type = "increment"
start = 1

[columns.first_name]
type = "list"
source = "us.first_names"

[columns.last_name]
type = "list"
source = "us.last_names"

[columns.email]
type = "pattern"
pattern = "{first_name}.{last_name}@{domain}"
domain = ["gmail.com", "yahoo.com", "hotmail.com"]

[columns.age]
type = "range"
min = 18
max = 75

[columns.income]
type = "range"
min = 25000
max = 150000

[columns.region]
type = "choice"
values = ["North", "South", "East", "West", "Central"]
Generate customer data
import additory as add

# Generate 500 customers from the schema
customers = add.synth("customer.toml", rows=500)

print("Generated customer data:")
print(customers.head())
print(f"\nDataset shape: {customers.shape}")
print(f"Columns: {list(customers.columns)}")
Output
Generated customer data:
   customer_id first_name last_name                   email  age  income   region
0            1       John     Smith    john.smith@gmail.com   34   67000    North
1            2       Jane       Doe      jane.doe@yahoo.com   28   45000    South
2            3       Mike     Brown  mike.brown@hotmail.com   42   89000     East
3            4      Sarah       Lee     sarah.lee@gmail.com   31   72000     West
4            5        Tom    Wilson    tom.wilson@yahoo.com   38   58000  Central

Dataset shape: (500, 7)
Columns: ['customer_id', 'first_name', 'last_name', 'email', 'age', 'income', 'region']
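
The email column above is built from other columns through a pattern. As a rough illustration of how such placeholders could be resolved, the sketch below fills {first_name}, {last_name} and {domain} using Python's str.format. This is not additory's implementation: the helper name expand_pattern is hypothetical, and the lowercasing step is an assumption inferred from the sample output.

```python
import random

def expand_pattern(pattern, row, choices, rng):
    """Fill {placeholders} from the row's columns, or from a random pick of an option list."""
    values = dict(row)
    for key, options in choices.items():
        # Assumption: a list-valued key such as `domain` is sampled once per row.
        values[key] = rng.choice(options)
    # Assumption: the result is lowercased, as the sample output suggests.
    return pattern.format(**values).lower()

rng = random.Random(0)
row = {"first_name": "John", "last_name": "Smith"}
domains = {"domain": ["gmail.com", "yahoo.com", "hotmail.com"]}

email = expand_pattern("{first_name}.{last_name}@{domain}", row, domains, rng)
print(email)  # john.smith@<one of the three domains>
```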

🏥 Example 2: Complex Healthcare Data

Scenario: You need realistic patient data for healthcare application testing with complex relationships.

Create healthcare schema file (patients.toml)
# Save this as patients.toml
[metadata]
name = "patient_data"
description = "Healthcare patient dataset"

[columns.patient_id]
type = "pattern"
pattern = "P{:06d}"
start = 1

[columns.first_name]
type = "list"
source = "us.first_names"

[columns.last_name]
type = "list"
source = "us.last_names"

[columns.date_of_birth]
type = "date_range"
start = "1940-01-01"
end = "2005-12-31"

[columns.gender]
type = "weighted_choice"
values = ["Male", "Female", "Other"]
weights = [48, 48, 4]

[columns.blood_type]
type = "weighted_choice"
values = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
weights = [37, 36, 8, 3, 7, 6, 2, 1]

[columns.height_cm]
type = "normal"
mean = 170
std = 10
min = 140
max = 210

[columns.weight_kg]
type = "normal"
mean = 75
std = 15
min = 40
max = 150

[columns.diagnosis]
type = "choice"
values = ["Hypertension", "Diabetes", "Asthma", "Arthritis", "Depression", "Healthy"]

[columns.admission_date]
type = "date_range"
start = "2023-01-01"
end = "2024-12-31"
Generate patient data with polars output
import additory as add

# Generate 1000 patients using polars engine
patients = add.synth("patients.toml", rows=1000, engine="polars")

print("Generated patient data:")
print(patients.head())
print(f"\nDataset shape: {patients.shape}")

# Show some statistics
print("\nGender distribution:")
print(patients['gender'].value_counts())

print("\nBlood type distribution:")
print(patients['blood_type'].value_counts())
Output
Generated patient data:
shape: (5, 9)
┌────────────┬────────────┬───────────┬───────────────┬────────┬───────────┬───────────┬──────────┬────────────────┐
│ patient_id ┆ first_name ┆ last_name ┆ date_of_birth ┆ gender ┆ blood_type┆ height_cm ┆ weight_kg┆ diagnosis      │
│ ---        ┆ ---        ┆ ---       ┆ ---           ┆ ---    ┆ ---       ┆ ---       ┆ ---      ┆ ---            │
│ str        ┆ str        ┆ str       ┆ date          ┆ str    ┆ str       ┆ f64       ┆ f64      ┆ str            │
╞════════════╪════════════╪═══════════╪═══════════════╪════════╪═══════════╪═══════════╪══════════╪════════════════╡
│ P000001    ┆ John       ┆ Smith     ┆ 1975-03-15    ┆ Male   ┆ O+        ┆ 175.2     ┆ 78.5     ┆ Hypertension   │
│ P000002    ┆ Jane       ┆ Doe       ┆ 1982-07-22    ┆ Female ┆ A+        ┆ 162.8     ┆ 65.2     ┆ Diabetes       │
│ P000003    ┆ Mike       ┆ Brown     ┆ 1968-11-10    ┆ Male   ┆ B+        ┆ 180.1     ┆ 85.7     ┆ Healthy        │
│ P000004    ┆ Sarah      ┆ Lee       ┆ 1990-01-18    ┆ Female ┆ O-        ┆ 168.5     ┆ 62.3     ┆ Asthma         │
│ P000005    ┆ Tom        ┆ Wilson    ┆ 1955-09-03    ┆ Male   ┆ AB+       ┆ 172.9     ┆ 79.8     ┆ Arthritis      │
└────────────┴────────────┴───────────┴───────────────┴────────┴───────────┴───────────┴──────────┴────────────────┘

Dataset shape: (1000, 9)

Gender distribution:
shape: (3, 2)
┌────────┬───────┐
│ gender ┆ count │
│ ---    ┆ ---   │
│ str    ┆ u32   │
╞════════╪═══════╡
│ Male   ┆ 485   │
│ Female ┆ 479   │
│ Other  ┆ 36    │
└────────┴───────┘
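
To see why the counts land near 480 / 480 / 40 rather than exactly on them, it can help to compare the weights against one random draw. The sketch below uses only Python's standard library and assumes that weights are relative (normalized by their sum); it does not call additory, and the seed is arbitrary.

```python
import random
from collections import Counter

values = ["Male", "Female", "Other"]
weights = [48, 48, 4]
rows = 1000

# Expected count per value = rows * weight / sum(weights)
total = sum(weights)
expected = {v: rows * w / total for v, w in zip(values, weights)}
print(expected)  # {'Male': 480.0, 'Female': 480.0, 'Other': 40.0}

# One random draw of 1000 rows; the observed counts scatter around the expectation.
rng = random.Random(42)
sample = Counter(rng.choices(values, weights=weights, k=rows))
print(dict(sample))
```

The counts in the output above (485, 479, 36) are consistent with this kind of sampling variation.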

📋 Schema Column Types

increment: Sequential numbers (1, 2, 3...)
list: Random selection from predefined lists (names, cities, etc.)
choice: Random selection from custom values
weighted_choice: Random selection with custom probabilities
range: Random numbers between min and max
normal: Normally distributed numbers (mean, std deviation)
pattern: Text patterns with placeholders
date_range: Random dates between start and end dates
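
As a mental model for the types listed above, here is a minimal pure-Python sketch of how several of them could be sampled. It is an illustration under stated assumptions, not additory's code: the function name sample_column is hypothetical, range is assumed to produce inclusive integers, and normal is assumed to clip out-of-bounds draws to min and max. The list and pattern types are left out because they depend on built-in data and on other columns.

```python
import random
from datetime import date, timedelta

def sample_column(spec, n, rng):
    """Return n values for one column spec (a dict shaped like a [columns.x] table)."""
    kind = spec["type"]
    if kind == "increment":
        start = spec.get("start", 1)
        return list(range(start, start + n))
    if kind == "choice":
        return [rng.choice(spec["values"]) for _ in range(n)]
    if kind == "weighted_choice":
        return rng.choices(spec["values"], weights=spec["weights"], k=n)
    if kind == "range":
        # Assumption: integer values, both bounds inclusive.
        return [rng.randint(spec["min"], spec["max"]) for _ in range(n)]
    if kind == "normal":
        # Assumption: draws outside [min, max] are clipped to the nearest bound.
        lo = spec.get("min", float("-inf"))
        hi = spec.get("max", float("inf"))
        return [min(max(rng.gauss(spec["mean"], spec["std"]), lo), hi) for _ in range(n)]
    if kind == "date_range":
        start = date.fromisoformat(spec["start"])
        span = (date.fromisoformat(spec["end"]) - start).days
        return [start + timedelta(days=rng.randint(0, span)) for _ in range(n)]
    raise ValueError(f"unsupported column type: {kind}")

rng = random.Random(1)
ids = sample_column({"type": "increment", "start": 1}, 5, rng)
heights = sample_column({"type": "normal", "mean": 170, "std": 10, "min": 140, "max": 210}, 5, rng)
births = sample_column({"type": "date_range", "start": "1940-01-01", "end": "2005-12-31"}, 5, rng)
print(ids)  # [1, 2, 3, 4, 5]
```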

⚠️ Important Notes

Schema Files: Must be in TOML format with proper column definitions.
Built-in Lists: Includes realistic names, cities, countries, and more.
Output Formats: Supports pandas and polars DataFrames.
Performance: Intended to scale to large datasets, such as the 1,000,000-row call in the Quick Reference.

🎯 Quick Reference

Basic syntax templates
# Basic usage
df = add.synth("schema.toml")

# Specify number of rows
df = add.synth("schema.toml", rows=5000)

# Use polars output
df = add.synth("schema.toml", rows=1000, engine="polars")

# Large dataset
df = add.synth("schema.toml", rows=1000000)