Generate synthetic data from schema files
What does add.synth() do?
The add.synth() function generates realistic synthetic data from a schema file in TOML format (.toml). Define the data structure once, then generate as many rows as you need.
Common use cases:
| Parameter | Type | Required | Description |
|---|---|---|---|
| schema_path | str | ✅ Yes | Path to the .toml schema file defining the data structure |
| rows | int | ❌ No | Number of rows to generate (default: 1000) |
| engine | str or None | ❌ No | Output engine: "pandas" or "polars" (default: from config) |
Scenario: You need a customer dataset for testing your e-commerce application.
# Save this as customer.toml
[metadata]
name = "customer_data"
description = "E-commerce customer dataset"
[columns.customer_id]
type = "increment"
start = 1
[columns.first_name]
type = "list"
source = "us.first_names"
[columns.last_name]
type = "list"
source = "us.last_names"
[columns.email]
type = "pattern"
pattern = "{first_name}.{last_name}@{domain}"
domain = ["gmail.com", "yahoo.com", "hotmail.com"]
[columns.age]
type = "range"
min = 18
max = 75
[columns.income]
type = "range"
min = 25000
max = 150000
[columns.region]
type = "choice"
values = ["North", "South", "East", "West", "Central"]
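The `pattern` column type builds each value from other columns. As a rough illustration of the idea (not additory's actual implementation), the sketch below fills the same template with Python's `str.format`. It assumes that placeholders are replaced by the row's other column values, that `domain` is picked at random from the list, and that the result is lowercased, as the sample output suggests. `build_email` is a hypothetical helper, not part of additory.

```python
import random

# Template and domain list copied from the [columns.email] entry above.
PATTERN = "{first_name}.{last_name}@{domain}"
DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com"]


def build_email(row: dict, rng: random.Random) -> str:
    """Hypothetical helper: fill the template from one row's values."""
    domain = rng.choice(DOMAINS)  # assumed: one domain chosen per row
    return PATTERN.format(**row, domain=domain).lower()  # assumed: lowercased


rng = random.Random(7)  # fixed seed so the sketch is repeatable
email = build_email({"first_name": "Jane", "last_name": "Doe"}, rng)
print(email)
```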
import additory as add
# Generate 500 customers from the schema
customers = add.synth("customer.toml", rows=500)
print("Generated customer data:")
print(customers.head())
print(f"\nDataset shape: {customers.shape}")
print(f"Columns: {list(customers.columns)}")
Generated customer data:
customer_id first_name last_name email age income region
0 1 John Smith john.smith@gmail.com 34 67000 North
1 2 Jane Doe jane.doe@yahoo.com 28 45000 South
2 3 Mike Brown mike.brown@hotmail.com 42 89000 East
3 4 Sarah Lee sarah.lee@gmail.com 31 72000 West
4 5 Tom Wilson tom.wilson@yahoo.com 38 58000 Central
Dataset shape: (500, 7)
Columns: ['customer_id', 'first_name', 'last_name', 'email', 'age', 'income', 'region']
Scenario: You need realistic patient data for testing a healthcare application, with weighted categories, bounded normal distributions, and date ranges.
# Save this as patients.toml
[metadata]
name = "patient_data"
description = "Healthcare patient dataset"
[columns.patient_id]
type = "pattern"
pattern = "P{:06d}"
start = 1
[columns.first_name]
type = "list"
source = "us.first_names"
[columns.last_name]
type = "list"
source = "us.last_names"
[columns.date_of_birth]
type = "date_range"
start = "1940-01-01"
end = "2005-12-31"
[columns.gender]
type = "weighted_choice"
values = ["Male", "Female", "Other"]
weights = [48, 48, 4]
[columns.blood_type]
type = "weighted_choice"
values = ["O+", "A+", "B+", "AB+", "O-", "A-", "B-", "AB-"]
weights = [37, 36, 8, 3, 7, 6, 2, 1]
[columns.height_cm]
type = "normal"
mean = 170
std = 10
min = 140
max = 210
[columns.weight_kg]
type = "normal"
mean = 75
std = 15
min = 40
max = 150
[columns.diagnosis]
type = "choice"
values = ["Hypertension", "Diabetes", "Asthma", "Arthritis", "Depression", "Healthy"]
[columns.admission_date]
type = "date_range"
start = "2023-01-01"
end = "2024-12-31"
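The `normal` columns in this schema combine a mean and standard deviation with hard `min` and `max` bounds. How additory enforces those bounds is not stated here. One common approach is rejection sampling, sketched below with the standard library only: draw from the normal distribution and redraw whenever the value falls outside the bounds. `bounded_normal` is a hypothetical helper for illustration.

```python
import random


def bounded_normal(rng: random.Random, mean: float, std: float,
                   lo: float, hi: float) -> float:
    """Draw from N(mean, std), redrawing until the value lies in [lo, hi]."""
    while True:
        x = rng.gauss(mean, std)
        if lo <= x <= hi:
            return x


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Same parameters as the height_cm column above.
heights = [bounded_normal(rng, 170, 10, 140, 210) for _ in range(1000)]
print(f"min={min(heights):.1f} max={max(heights):.1f}")
```

With bounds three or more standard deviations from the mean, as here, redraws are rare and the result stays close to an unbounded normal distribution.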
import additory as add
# Generate 1000 patients using the polars engine
patients = add.synth("patients.toml", rows=1000, engine="polars")
print("Generated patient data:")
print(patients.head())
print(f"\nDataset shape: {patients.shape}")
# Show some statistics
print("\nGender distribution:")
print(patients['gender'].value_counts())
print("\nBlood type distribution:")
print(patients['blood_type'].value_counts())
Generated patient data:
shape: (5, 9)
┌────────────┬────────────┬───────────┬───────────────┬────────┬───────────┬───────────┬──────────┬────────────────┐
│ patient_id ┆ first_name ┆ last_name ┆ date_of_birth ┆ gender ┆ blood_type┆ height_cm ┆ weight_kg┆ diagnosis │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ date ┆ str ┆ str ┆ f64 ┆ f64 ┆ str │
╞════════════╪════════════╪═══════════╪═══════════════╪════════╪═══════════╪═══════════╪══════════╪════════════════╡
│ P000001 ┆ John ┆ Smith ┆ 1975-03-15 ┆ Male ┆ O+ ┆ 175.2 ┆ 78.5 ┆ Hypertension │
│ P000002 ┆ Jane ┆ Doe ┆ 1982-07-22 ┆ Female ┆ A+ ┆ 162.8 ┆ 65.2 ┆ Diabetes │
│ P000003 ┆ Mike ┆ Brown ┆ 1968-11-10 ┆ Male ┆ B+ ┆ 180.1 ┆ 85.7 ┆ Healthy │
│ P000004 ┆ Sarah ┆ Lee ┆ 1990-01-18 ┆ Female ┆ O- ┆ 168.5 ┆ 62.3 ┆ Asthma │
│ P000005 ┆ Tom ┆ Wilson ┆ 1955-09-03 ┆ Male ┆ AB+ ┆ 172.9 ┆ 79.8 ┆ Arthritis │
└────────────┴────────────┴───────────┴───────────────┴────────┴───────────┴───────────┴──────────┴────────────────┘
Dataset shape: (1000, 9)
Gender distribution:
shape: (3, 2)
┌────────┬───────┐
│ gender ┆ count │
│ --- ┆ --- │
│ str ┆ u32 │
╞════════╪═══════╡
│ Male ┆ 485 │
│ Female ┆ 479 │
│ Other ┆ 36 │
└────────┴───────┘
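The counts above (485, 479, 36) are close to the 48/48/4 weights without matching them exactly, which is what random sampling would produce. The sketch below assumes that `weighted_choice` treats `weights` as relative sampling probabilities. That is an assumption about additory's behavior, and the code uses only Python's standard `random.choices` to illustrate it.

```python
import random

# Values and weights copied from the [columns.gender] entry.
values = ["Male", "Female", "Other"]
weights = [48, 48, 4]
rows = 1000

# Expected count per value: rows * weight / sum(weights)
total = sum(weights)
expected = {v: rows * w / total for v, w in zip(values, weights)}
print(expected)  # {'Male': 480.0, 'Female': 480.0, 'Other': 40.0}

# Illustration only: one weighted sample drawn with the standard library.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
sample = rng.choices(values, weights=weights, k=rows)
counts = {v: sample.count(v) for v in values}
print(counts)  # varies around the expected counts
```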
import additory as add
# Basic usage (generates the default 1000 rows)
df = add.synth("schema.toml")
# Specify number of rows
df = add.synth("schema.toml", rows=5000)
# Use polars output
df = add.synth("schema.toml", rows=1000, engine="polars")
# Large dataset
df = add.synth("schema.toml", rows=1000000)