add.onehotencoding()

Convert categorical columns to one-hot encoded columns

What does add.onehotencoding() do?

The add.onehotencoding() function converts a categorical column into binary (0/1) columns, creating one new column for each unique category. Many machine learning algorithms accept only numeric input, so this step is often required before training a model.
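To make the transformation concrete, here is a minimal pure-Python sketch of one-hot encoding applied to a list of values. It illustrates the technique only and is not additory's implementation: the helper name `one_hot` is hypothetical, and the `prefix_category` naming and alphabetical column order are assumptions chosen to mirror the outputs shown in this guide.

```python
def one_hot(values, prefix):
    """Return {column_name: [0/1, ...]} with one column per unique value.

    Hypothetical helper for illustration; not part of additory.
    """
    categories = sorted(set(values))  # one output column per unique category
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


regions = ["North", "South", "East", "North", "West"]
encoded = one_hot(regions, prefix="region")

for name, column in encoded.items():
    print(name, column)
```

Each input row ends up with exactly one 1 across the new columns, which is the defining property of a one-hot encoding.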

Common use cases:

📖 Parameters

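Parameter      Type       Required  Description
df             DataFrame  ✅ Yes    The dataframe containing the categorical column to encode
columns        str        ✅ Yes    Name of the column to encode (a single column)
drop_original  bool       No        Whether to remove the original column after encoding; it is removed by default (pass False to keep it)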

🚀 Example 1: Encode a Single Column (Simplest)

Scenario: You have a customer dataset and want to encode the region column.

Setup: Create sample data
import pandas as pd
import additory as add

# Customer data with categorical columns
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive'],
    'age': [25, 35, 45, 30, 40]
})

print("Original data:")
print(customers)

Encode the region column
# Encode the region column
result = add.onehotencoding(customers, columns='region', max_cardinality_ratio=1.0)

print("\nAfter one-hot encoding:")
print(result)

Output
   customer_id     tier    status  age  region_East  region_North  region_South  region_West
0            1     Gold    Active   25            0             1             0            0
1            2   Silver  Inactive   35            0             0             1            0
2            3   Bronze    Active   45            1             0             0            0
3            4     Gold    Active   30            0             1             0            0
4            5   Silver  Inactive   40            0             0             0            1
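For readers who want to cross-check this result without additory, pandas' built-in `pd.get_dummies` should produce the same column layout for this data. This is a comparison sketch only; it does not describe how add.onehotencoding() works internally, and the variable name `expected` is chosen just for this example.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive'],
    'age': [25, 35, 45, 30, 40]
})

# pandas' own one-hot encoder; dtype=int yields 0/1 instead of True/False
expected = pd.get_dummies(customers, columns=['region'], dtype=int)

print(expected.columns.tolist())
print(expected)
```

Comparing `expected` against the additory result is a quick sanity check when moving existing pandas code over.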

🎯 Example 2: Keep Original Column

Scenario: You want to encode a column but keep the original for reference.

Setup: Product survey data
import pandas as pd
import additory as add

# Survey data
survey = pd.DataFrame({
    'response_id': [1, 2, 3, 4, 5],
    'product_rating': ['Excellent', 'Good', 'Fair', 'Excellent', 'Good'],
    'recommend': ['Yes', 'Yes', 'No', 'Yes', 'Maybe']
})

print("Original survey data:")
print(survey)

Encode but keep original column
# Encode the recommend column but keep the original
result = add.onehotencoding(
    survey, 
    columns='recommend',
    drop_original=False
)

print("\nAfter encoding (original kept):")
print(result)

Output
   response_id product_rating recommend  recommend_Maybe  recommend_No  recommend_Yes
0            1      Excellent       Yes                0             0              1
1            2           Good       Yes                0             0              1
2            3           Fair        No                0             1              0
3            4      Excellent       Yes                0             0              1
4            5           Good     Maybe                1             0              0

⚠️ Important Notes

Column Naming: New columns are named <original_column>_<category> (e.g., "region_North", "tier_Gold").
Original Columns: The original categorical column is removed by default (use drop_original=False to keep it).
Single Column: The function currently encodes one column per call. To encode multiple columns, call it once per column, passing each result into the next call.
Data Types: Works with both pandas and polars DataFrames.
Memory Usage: Each unique category becomes its own column, so encoding a column with many distinct values can greatly increase the width of the dataframe.
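The column-count arithmetic behind the naming and memory notes can be sketched in plain Python. The helper `columns_after_encoding` is hypothetical (it is not part of additory) and assumes the data is given as a list of dicts with identical keys; it estimates how wide the result will be before you encode.

```python
def columns_after_encoding(rows, column, drop_original=True):
    """Estimate the column count after one-hot encoding `column`.

    `rows` is a list of dicts sharing the same keys.
    Hypothetical helper for illustration; not part of additory.
    """
    n_original = len(rows[0])
    n_categories = len({row[column] for row in rows})
    # One new column per category; the source column is removed unless kept.
    return n_original + n_categories - (1 if drop_original else 0)


customers = [
    {"customer_id": 1, "region": "North", "tier": "Gold", "status": "Active", "age": 25},
    {"customer_id": 2, "region": "South", "tier": "Silver", "status": "Inactive", "age": 35},
    {"customer_id": 3, "region": "East", "tier": "Bronze", "status": "Active", "age": 45},
    {"customer_id": 4, "region": "North", "tier": "Gold", "status": "Active", "age": 30},
    {"customer_id": 5, "region": "West", "tier": "Silver", "status": "Inactive", "age": 40},
]

print(columns_after_encoding(customers, "region"))                       # 5 + 4 - 1 = 8
print(columns_after_encoding(customers, "region", drop_original=False))  # 5 + 4 = 9
```

The first figure matches the eight columns in the Example 1 output. Running a check like this on a high-cardinality column (IDs, free text) shows quickly whether encoding it is practical.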

🎯 Quick Reference

Basic syntax templates
# Encode a single column (drops original)
result = add.onehotencoding(df, columns='column_name')

# Keep the original column
result = add.onehotencoding(df, columns='column_name', drop_original=False)

# Encode multiple columns (call multiple times)
result = add.onehotencoding(df, columns='region')
result = add.onehotencoding(result, columns='tier')
result = add.onehotencoding(result, columns='status')
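The repeated calls above can be folded into a loop. The sketch below is one possible way to do that, not an additory feature: `encode_columns` is a hypothetical helper that takes the single-column encoder as an argument, so the runnable demo can use a stand-in function instead of the real library.

```python
from functools import reduce


def encode_columns(df, columns, encode_one):
    """Apply a single-column encoder to each column in turn.

    `encode_one(df, column)` is supplied by the caller.
    Hypothetical helper for illustration; not part of additory.
    """
    return reduce(encode_one, columns, df)


# Stand-in encoder so the sketch runs without additory:
# it just records the order in which columns would be encoded.
def fake_encoder(df, column):
    return df + [column]


result = encode_columns([], ["region", "tier", "status"], fake_encoder)
print(result)  # ['region', 'tier', 'status']

# With additory installed, the same helper would be called as:
# result = encode_columns(
#     customers,
#     ["region", "tier", "status"],
#     lambda d, c: add.onehotencoding(d, columns=c),
# )
```

Passing the encoder in as a parameter keeps the helper independent of the library and easy to test.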