add.onehotencoding()

Convert categorical columns to one-hot encoded columns

What does add.onehotencoding() do?

The add.onehotencoding() function converts categorical columns into binary (0/1) columns, creating one new column for each unique category. This is essential for machine learning algorithms that require numeric input.
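
To make the idea concrete, here is a minimal pure-Python sketch of one-hot encoding a single column. The `one_hot` helper is hypothetical and only illustrates the concept; it is not how additory implements the function internally.

```python
def one_hot(values, prefix):
    """Hypothetical helper: build one 0/1 list per unique category."""
    categories = sorted(set(values))  # one output column per unique value
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }

regions = ['North', 'South', 'East', 'North', 'West']
encoded = one_hot(regions, 'region')

for name, column in encoded.items():
    print(name, column)
```

Each row ends up with exactly one 1 across the new columns, marking which category that row belonged to.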

Common use cases:

📖 Parameters

Parameter | Type | Required | Description
df | DataFrame | ✅ Yes | The dataframe containing categorical columns to encode
columns | list or None | ❌ No | Specific columns to encode. If None, auto-detects categorical columns
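
This page doesn't say how auto-detection decides which columns count as categorical. As a rough sketch, assuming detection is based on column dtype (string/object or pandas categorical), a hypothetical helper could look like the following. `detect_categorical` is an illustrative name, not part of additory.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def detect_categorical(df):
    """Hypothetical helper: list columns holding strings or pandas categoricals."""
    return [
        col for col in df.columns
        if isinstance(df[col].dtype, pd.CategoricalDtype) or is_string_dtype(df[col])
    ]

sample = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'region': ['North', 'South', 'East'],                # string column
    'tier': pd.Categorical(['Gold', 'Silver', 'Gold']),  # categorical column
    'age': [25, 35, 45],
})

detected = detect_categorical(sample)
print(detected)
```

If your categories are stored as integers (for example, store codes), a dtype-based check like this would skip them, which is a good reason to pass `columns` explicitly in that case.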

🚀 Example 1: Auto-detect Categories (Simplest)

Scenario: You have a customer dataset with categorical columns and want to encode all of them automatically.

Setup: Create sample data
import pandas as pd
import additory as add

# Customer data with categorical columns
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive'],
    'age': [25, 35, 45, 30, 40]
})

print("Original data:")
print(customers)
Auto-detect and encode all categorical columns
# Let additory automatically detect and encode categorical columns
result = add.onehotencoding(customers)

print("\nAfter one-hot encoding:")
print(result)
Output
   customer_id  age  region_East  region_North  region_South  region_West  tier_Bronze  tier_Gold  tier_Silver  status_Active  status_Inactive
0            1   25            0             1             0            0            0          1            0              1                0
1            2   35            0             0             1            0            0          0            1              0                1
2            3   45            1             0             0            0            1          0            0              1                0
3            4   30            0             1             0            0            0          1            0              1                0
4            5   40            0             0             0            1            0          0            1              0                1
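
For comparison, a similar table can be produced with plain pandas using `pd.get_dummies`. This is a sketch for readers who want to cross-check the output; whether additory uses `get_dummies` internally is not stated here, so treat it as an equivalent in spirit only.

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive'],
    'age': [25, 35, 45, 30, 40]
})

# With no columns given, get_dummies encodes the string/categorical columns;
# dtype=int yields 0/1 values instead of booleans.
pandas_result = pd.get_dummies(customers, dtype=int)

print(list(pandas_result.columns))
```

The numeric columns (`customer_id`, `age`) come first, followed by the encoded columns in their original order, with categories sorted alphabetically within each group.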

🎯 Example 2: Specify Columns to Encode

Scenario: You only want to encode specific categorical columns, not all of them.

Setup: Product survey data
import pandas as pd
import additory as add

# Survey data with multiple categorical columns
survey = pd.DataFrame({
    'response_id': [1, 2, 3, 4, 5],
    'product_rating': ['Excellent', 'Good', 'Fair', 'Excellent', 'Good'],
    'recommend': ['Yes', 'Yes', 'No', 'Yes', 'Maybe'],
    'purchase_intent': ['Definitely', 'Probably', 'Unlikely', 'Definitely', 'Maybe'],
    'age_group': ['25-34', '35-44', '45-54', '25-34', '35-44'],
    'comments': ['Great product!', 'Could be better', 'Not for me', 'Love it!', 'It\'s okay']
})

print("Original survey data:")
print(survey)
Encode only specific columns
# Only encode the rating and recommendation columns (skip purchase_intent, age_group, and comments)
result = add.onehotencoding(
    survey, 
    columns=['product_rating', 'recommend']
)

print("\nAfter encoding specific columns:")
print(result)
Output
   response_id purchase_intent age_group         comments  product_rating_Excellent  product_rating_Fair  product_rating_Good  recommend_Maybe  recommend_No  recommend_Yes
0            1      Definitely     25-34   Great product!                         1                    0                    0                0             0              1
1            2        Probably     35-44  Could be better                         0                    0                    1                0             0              1
2            3        Unlikely     45-54       Not for me                         0                    1                    0                0             1              0
3            4      Definitely     25-34         Love it!                         1                    0                    0                0             0              1
4            5           Maybe     35-44        It's okay                         0                    0                    1                1             0              0

⚠️ Important Notes

Column Naming: Each new column is named with the original column name, an underscore, and the category value (e.g., "region_North", "tier_Gold").
Original Columns: The original categorical columns are removed from the result.
Data Types: Works with both pandas and polars DataFrames.
Memory Usage: Each encoded column is replaced by one new column per unique category, so columns with many distinct values can greatly increase the width and memory footprint of the result.
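
Before encoding a large dataset, it can help to estimate how wide the result will be. The sketch below uses plain pandas; `estimate_encoded_width` is a hypothetical helper, not an additory function, and it assumes one new column per unique non-null category.

```python
import pandas as pd

def estimate_encoded_width(df, columns):
    """Hypothetical helper: column count after one-hot encoding `columns`."""
    # nunique() ignores missing values by default
    new_columns = sum(df[col].nunique() for col in columns)
    # The original categorical columns are dropped, the new ones are added
    return df.shape[1] - len(columns) + new_columns

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'region': ['North', 'South', 'East', 'North', 'West'],
    'tier': ['Gold', 'Silver', 'Bronze', 'Gold', 'Silver'],
    'status': ['Active', 'Inactive', 'Active', 'Active', 'Inactive'],
    'age': [25, 35, 45, 30, 40]
})

width = estimate_encoded_width(customers, ['region', 'tier', 'status'])
print(width)  # 11, matching the 11 columns in Example 1's output
```

If the estimate is in the thousands, consider encoding only a subset of columns via the `columns` parameter.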

🎯 Quick Reference

Basic syntax templates
# Auto-detect all categorical columns
result = add.onehotencoding(df)

# Encode specific columns only
result = add.onehotencoding(df, columns=['column1', 'column2'])

# Single column encoding
result = add.onehotencoding(df, columns=['status'])