add.augment()

Generate additional data rows or create data from scratch

What does add.augment() do?

The add.augment() function has three modes: it can augment an existing DataFrame with more rows, create an entirely new dataset from scratch, or load sample data for testing.

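Three modes:

Augment: Pass a DataFrame as df to generate more rows similar to the existing ones.
Create: Pass "@new" as df, together with a strategy dict, to build a dataset from scratch.
Sample: Pass "@sample" as df to load built-in sample data.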

📖 Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| df | DataFrame or str | ✅ Yes | DataFrame to augment, "@new" to create, or "@sample" for sample data |
| n_rows | int | ❌ No | Number of rows to generate (default: 5) |
| strategy | str or dict | ❌ No | "auto" for augment mode, dict for create mode (default: "auto") |
| seed | int or None | ❌ No | Random seed for reproducible results |
| output_format | str | ❌ No | Output format: "pandas", "polars", "cudf" (default: "pandas") |
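
The way the df argument selects a mode can be sketched as a small dispatch helper. The name resolve_mode is hypothetical and not part of additory; it only mirrors the rules in the parameter table and the 3-row minimum for augment mode stated in the notes on this page. How the library itself reports invalid arguments is not documented here, so the ValueError is an assumption.

```python
import pandas as pd

def resolve_mode(df, strategy="auto"):
    """Return the mode a call would use, following the documented rules.

    Hypothetical helper for illustration only; not part of additory.
    """
    if isinstance(df, str):
        if df == "@new":
            # Create mode is documented as taking a per-column strategy dict.
            if not isinstance(strategy, dict):
                raise ValueError('"@new" needs strategy to be a dict')
            return "create"
        if df == "@sample":
            return "sample"
        raise ValueError(f"unrecognized string argument: {df!r}")
    # Anything that is not a string is treated as a DataFrame to augment.
    if len(df) < 3:
        raise ValueError("augment mode requires at least 3 input rows")
    return "augment"

customers = pd.DataFrame({"age": [25, 35, 45]})
print(resolve_mode(customers))                             # augment
print(resolve_mode("@new", {"id": "increment:start=1"}))   # create
print(resolve_mode("@sample"))                             # sample
```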

🚀 Example 1: Augment Existing Data (Simplest)

Scenario: You have a small customer dataset and want to generate more similar customers for testing.

Setup: Create sample customer data
import pandas as pd
import additory as add

# Small customer dataset
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'age': [25, 35, 45],
    'income': [50000, 75000, 90000],
    'region': ['North', 'South', 'East']
})

print("Original customer data:")
print(customers)

Augment with more similar customers
# Generate 10 more customers similar to existing ones
result = add.augment(customers, n_rows=10)

print(f"\nAugmented data ({len(result)} rows):")
print(result)

Output
Augmented data (13 rows):
   customer_id  age  income region
0            1   25   50000  North
1            2   35   75000  South
2            3   45   90000   East
3            4   28   52000  North
4            5   38   78000  South
5            6   42   87000   East
6            7   31   68000  North
7            8   29   55000  South
8            9   47   92000   East
9           10   26   51000  North
10          11   36   76000  South
11          12   44   89000   East
12          13   33   71000  North
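
A quick sanity check on an augmented result might look like the sketch below. check_augmented is a hypothetical helper, not an additory function. It assumes, based only on the output shown above, that the original rows come first and unchanged and that the new rows are appended after them. The augmented frame here is a stand-in built from rows 3 and 4 of that output so the snippet runs without the library.

```python
import pandas as pd

def check_augmented(original, augmented, n_rows):
    """Return True if `augmented` looks like `original` plus n_rows new rows.

    Hypothetical helper; assumes original rows are kept first and unchanged.
    """
    rows_ok = len(augmented) == len(original) + n_rows
    cols_ok = list(augmented.columns) == list(original.columns)
    head = augmented.head(len(original)).reset_index(drop=True)
    head_ok = head.equals(original.reset_index(drop=True))
    return rows_ok and cols_ok and head_ok

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [25, 35, 45],
    "income": [50000, 75000, 90000],
    "region": ["North", "South", "East"],
})

# Stand-in for add.augment(customers, n_rows=2), using rows 3 and 4 shown above.
new_rows = pd.DataFrame({
    "customer_id": [4, 5],
    "age": [28, 38],
    "income": [52000, 78000],
    "region": ["North", "South"],
})
augmented = pd.concat([customers, new_rows], ignore_index=True)

print(check_augmented(customers, augmented, n_rows=2))  # True
```

With the library installed, the stand-in frame would be replaced by the actual return value of add.augment(customers, n_rows=2).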

🎯 Example 2: Create Data from Scratch

Scenario: You need to create a completely new dataset with specific column types and patterns.

Create employee data from scratch
import pandas as pd
import additory as add

# Define strategy for each column
strategy = {
    'employee_id': 'increment:start=1',       # Sequential IDs starting from 1
    'name': 'choice:[John Smith,Jane Doe,Mike Brown,Sarah Lee,Tom Wilson]',  # Pick from list
    'age': 'range:22-65',                     # Ages between 22 and 65
    'department': 'choice:[HR,IT,Sales,Marketing]',  # Pick from list
    'salary': 'range:40000-120000'            # Salary range
}

# Create 50 employees from scratch
result = add.augment("@new", n_rows=50, strategy=strategy)

print("Created employee data:")
print(result.head(10))  # Show first 10 rows

Output
Created employee data:
   employee_id        name  age department  salary
0            1  John Smith   28         IT   65000
1            2    Jane Doe   34      Sales   72000
2            3  Mike Brown   45         HR   58000
3            4   Sarah Lee   29  Marketing   69000
4            5  Tom Wilson   38         IT   85000
5            6  John Smith   31      Sales   71000
6            7    Jane Doe   42         HR   62000
7            8  Mike Brown   27  Marketing   67000
8            9   Sarah Lee   35         IT   78000
9           10  Tom Wilson   29      Sales   73000
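
One way to confirm that created data follows the strategy is to test each column against its rule with plain pandas. The helper check_employee_rules below is hypothetical and specific to this example's strategy dict. It assumes that range bounds are inclusive and that increment steps by 1, neither of which is stated on this page. The employees frame is a stand-in taken from the first three output rows above.

```python
import pandas as pd

def check_employee_rules(df):
    """Report, per column, whether every value satisfies its strategy rule.

    Hypothetical helper; assumes inclusive range bounds and a step of 1.
    """
    expected_ids = list(range(1, len(df) + 1))
    return {
        # increment:start=1 -> 1, 2, 3, ... in row order
        "employee_id": df["employee_id"].tolist() == expected_ids,
        # range:22-65
        "age": bool(df["age"].between(22, 65).all()),
        # choice:[HR,IT,Sales,Marketing]
        "department": bool(df["department"].isin(["HR", "IT", "Sales", "Marketing"]).all()),
        # range:40000-120000
        "salary": bool(df["salary"].between(40000, 120000).all()),
    }

# Stand-in for the created data: the first three rows of the output above.
employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "name": ["John Smith", "Jane Doe", "Mike Brown"],
    "age": [28, 34, 45],
    "department": ["IT", "Sales", "HR"],
    "salary": [65000, 72000, 58000],
})

print(check_employee_rules(employees))
```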

📦 Example 3: Load Sample Data

Scenario: You need some realistic sample data quickly for testing or demos.

Load built-in sample data
import additory as add

# Load 100 rows of sample data
sample_data = add.augment("@sample", n_rows=100)

print("Sample data loaded:")
print(sample_data.head())
print(f"\nDataset shape: {sample_data.shape}")
print(f"Columns: {list(sample_data.columns)}")

Output
Sample data loaded:
   customer_id        name  age  income    region
0            1  John Smith   28   52000     North
1            2    Jane Doe   34   68000     South
2            3  Mike Brown   45   85000      East
3            4   Sarah Lee   29   47000      West
4            5  Tom Wilson   38   72000     North

Dataset shape: (100, 5)
Columns: ['customer_id', 'name', 'age', 'income', 'region']

🎨 Strategy Options for Create Mode

increment:start=N: Sequential numbers starting from N (e.g., "increment:start=1")
choice:[A,B,C]: Random selection from list (e.g., "choice:[Red,Blue,Green]")
range:min-max: Random numbers in range (e.g., "range:18-65")
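
To make the string format concrete, here is a minimal sketch of how the three strategy strings above could be split into a kind and its parameters. parse_strategy is a hypothetical helper and is not how additory is known to parse them; details such as whitespace handling, negative or decimal range bounds, and extra increment options are assumptions or simply left out.

```python
import re

def parse_strategy(spec):
    """Split a strategy string into a (kind, params) pair.

    Hypothetical helper for illustration; not part of additory.
    """
    kind, _, arg = spec.partition(":")
    if kind == "increment":
        match = re.fullmatch(r"start=(-?\d+)", arg)
        if match:
            return "increment", {"start": int(match.group(1))}
    elif kind == "choice":
        if arg.startswith("[") and arg.endswith("]"):
            # Options are comma-separated; surrounding spaces are trimmed (assumption).
            options = [item.strip() for item in arg[1:-1].split(",")]
            return "choice", {"options": options}
    elif kind == "range":
        match = re.fullmatch(r"(\d+)-(\d+)", arg)
        if match:
            return "range", {"min": int(match.group(1)), "max": int(match.group(2))}
    raise ValueError(f"unrecognized strategy: {spec!r}")

print(parse_strategy("increment:start=1"))        # ('increment', {'start': 1})
print(parse_strategy("choice:[Red,Blue,Green]"))  # ('choice', {'options': ['Red', 'Blue', 'Green']})
print(parse_strategy("range:18-65"))              # ('range', {'min': 18, 'max': 65})
```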

⚠️ Important Notes

Minimum Rows: Augment mode requires at least 3 rows in the input DataFrame.
Data Types: Works with pandas, polars, and cuDF DataFrames.
Reproducibility: Use the seed parameter for consistent results.
Smart Augmentation: In augment mode, the function analyzes patterns in the input data and generates new rows that follow them.
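
The reproducibility note can be verified with a small check that calls an augment-style function twice with the same seed and compares the results. is_reproducible is a hypothetical helper that takes the function as an argument, so it runs here against fake_augment, a stand-in that merely resamples existing rows. fake_augment is not how additory generates rows; it only shares the df, n_rows, and seed parameters documented above.

```python
import pandas as pd

def is_reproducible(augment, df, n_rows, seed):
    """Return True if two calls with the same seed give identical frames."""
    first = augment(df, n_rows=n_rows, seed=seed)
    second = augment(df, n_rows=n_rows, seed=seed)
    return first.equals(second)

def fake_augment(df, n_rows=5, seed=None):
    """Stand-in for add.augment: appends rows resampled from df (illustration only)."""
    extra = df.sample(n=n_rows, replace=True, random_state=seed)
    return pd.concat([df, extra], ignore_index=True)

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [25, 35, 45],
    "income": [50000, 75000, 90000],
    "region": ["North", "South", "East"],
})

print(is_reproducible(fake_augment, customers, n_rows=10, seed=42))  # True
```

With the library installed, passing add.augment in place of fake_augment would run the same check against the real function, assuming it returns a pandas DataFrame.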

🎯 Quick Reference

Basic syntax templates
# Augment existing data
result = add.augment(df, n_rows=100)

# Create from scratch
result = add.augment("@new", n_rows=50, strategy={'id': 'increment:start=1', 'name': 'choice:[John,Jane,Bob]'})

# Load sample data (if available)
result = add.augment("@sample", n_rows=1000)

# With reproducible seed
result = add.augment(df, n_rows=100, seed=42)