# Feature Categorization

Feature categorization is the first step in Pilz's algorithm. Each feature is binned into n_cat categories to enable multi-dimensional correlation analysis.

## Why Categorization?

Before we can build correlation tables for feature combinations, we need discrete bins:

```mermaid
flowchart TB
    subgraph Raw_Data
        R["Continuous values: 1.2, 3.5, 7.8, 12.4, ..."]
    end

    subgraph Binning
        B[Bin into n_cat categories]
    end

    subgraph Correlation_Ready
        C[Category 0, 1, 2, ...]
    end

    R --> B
    B --> C

    style B fill:#e0f0ff
```

![Diagram](images/feature_categorization_1.svg)

## Two Types of Binning

### Numerical Features: Quantile Binning

For continuous values, Pilz creates equal-size bins based on quantiles:


![Diagram](images/feature_categorization_2.svg)

```mermaid
flowchart TB
    subgraph Input
        I["Values: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89"]
    end

    subgraph Sorted
        S["1, 2, 3, 5, 8, 13, 21, 34, 55, 89"]
    end

    subgraph Cumulative
        C["1, 3, 6, 11, 19, 32, 53, 87, 142, 231"]
    end

    subgraph Bins_n_cat_2
        M1["Median: 10.5"]
        B1["Bin 0: ≤ 10.5 (5 values)"]
        B2["Bin 1: > 10.5 (5 values)"]
    end

    I --> S
    S --> C
    C --> M1
    M1 --> B1
    M1 --> B2
```
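The median split in the diagram can be reproduced with a few lines of standard-library Python. This is a minimal sketch, not Pilz's implementation: the function name `quantile_bin` is hypothetical, every sample is assumed to carry equal weight, and a value equal to a cut point is assumed to fall into the lower bin.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_bin(values, n_cat):
    """Assign each value to one of n_cat equal-frequency bins (0 .. n_cat - 1)."""
    # n_cat - 1 interior cut points at the 1/n_cat, 2/n_cat, ... quantiles.
    cuts = quantiles(values, n=n_cat, method="inclusive")
    # bisect_left sends a value equal to a cut into the lower bin ("<= cut").
    bins = [bisect_left(cuts, v) for v in values]
    return cuts, bins


values = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
cuts, bins = quantile_bin(values, n_cat=2)
print(cuts)  # [10.5]
print(bins)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With `n_cat=2` the single cut is the median, halfway between 8 and 13, so each bin receives five values.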

### Categorical Features: Target Rate Grouping

For categorical values, Pilz groups by **target rate similarity**:


![Diagram](images/feature_categorization_3.svg)

```mermaid
flowchart TB
    subgraph "Step 1: Calculate Target Rate"
        T1["Category: admin, 100 samples, 80 target"] --> TR1[80%]
        T2["Category: technician, 50 samples, 25 target"] --> TR2[50%]
        T3["Category: blue-collar, 100 samples, 10 target"] --> TR3[10%]
    end

    subgraph "Step 2: Sort by Rate"
        S1[Sort: 10%, 50%, 80%]
    end

    subgraph "Step 3: Create n_cat Bins"
        B1["Bin 2: admin (80%)"]
        B2["Bin 1: technician (50%)"]
        B3["Bin 0: blue-collar (10%)"]
    end

    TR1 --> S1
    TR2 --> S1
    TR3 --> S1
    S1 --> B1
    S1 --> B2
    S1 --> B3

    style S1 fill:#e0f0ff
```
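The three steps in the diagram can be sketched as follows. The function name and the input layout (`{category: (samples, targets)}`) are assumptions made for illustration, and the sketch covers only the simple case where the number of categories equals `n_cat`, so every category ends up in its own bin.

```python
def rank_categories_by_target_rate(counts):
    """counts: {category: (n_samples, n_target)} -> (rates, {category: bin index})."""
    # Step 1: target rate per category.
    rates = {cat: n_target / n_samples for cat, (n_samples, n_target) in counts.items()}
    # Step 2: sort categories by ascending target rate.
    ordered = sorted(rates, key=rates.get)
    # Step 3: the lowest rate gets bin 0, the next one bin 1, and so on.
    return rates, {cat: idx for idx, cat in enumerate(ordered)}


counts = {"admin": (100, 80), "technician": (50, 25), "blue-collar": (100, 10)}
rates, bins = rank_categories_by_target_rate(counts)
print(rates)  # {'admin': 0.8, 'technician': 0.5, 'blue-collar': 0.1}
print(bins)   # {'blue-collar': 0, 'technician': 1, 'admin': 2}
```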


## Building Correlation Tables

After binning, Pilz builds correlation tables for feature combinations:


![Diagram](images/feature_categorization_4.svg)

The choice of `n_cat` trades speed against detail:

```mermaid
flowchart LR
    subgraph Coarse
        C["n_cat=2"] --> R1[Fast, general]
    end

    subgraph Medium
        M["n_cat=5"] --> R2[Balanced]
    end

    subgraph Fine
        F["n_cat=10"] --> R3[Detailed, may overfit]
    end
```
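One way to see why a fine binning may overfit: if a correlation table holds one cell per combination of bins, the number of cells grows as `n_cat` raised to the number of combined features, so each cell sees fewer samples. The sketch below only does this arithmetic. The sample count (10,000) and the number of combined features (3) are made-up illustration values, not Pilz defaults.

```python
def cells_and_density(n_samples, n_features, n_cat):
    """Return (number of table cells, average samples per cell)."""
    cells = n_cat ** n_features  # one cell per combination of bins
    return cells, n_samples / cells


for n_cat in (2, 5, 10):
    cells, per_cell = cells_and_density(n_samples=10_000, n_features=3, n_cat=n_cat)
    print(f"n_cat={n_cat}: {cells} cells, about {per_cell:.0f} samples per cell")
# n_cat=2: 8 cells, about 1250 samples per cell
# n_cat=5: 125 cells, about 80 samples per cell
# n_cat=10: 1000 cells, about 10 samples per cell
```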

Quantile binning of a numerical feature, in pseudocode:

    data = …_df[[feature.name, "weight", "target_weight"]].sort(feature.name)

    # Calculate cumulative weights
    data["cum_weight"] = data["weight"].cum_sum()
    total = data["weight"].sum()

    # Find quantile boundaries
    cuts = []
    FOR i IN 1..n_cat:
        target_q = (i / n_cat) * total
        cut = INTERPOLATE(data, target_q)
        cuts.append(cut)
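
The pseudocode above leaves `INTERPOLATE` unspecified. The following is a runnable sketch of one possible reading, under stated assumptions: the function name `weighted_quantile_cuts` is hypothetical, weights are positive, only the `n_cat - 1` interior boundaries are produced, and when a target falls exactly between two rows the cut is placed halfway between their values. Pilz's actual interpolation rule may differ.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_quantile_cuts(values, weights, n_cat):
    """Return n_cat - 1 cut points so each bin holds roughly 1/n_cat of the weight."""
    pairs = sorted(zip(values, weights))            # sort rows by feature value
    xs = [v for v, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))     # cumulative weights
    total = cum[-1]

    cuts = []
    for i in range(1, n_cat):                       # interior boundaries only
        target = i * total / n_cat
        k = bisect_left(cum, target)                # first row reaching the target
        if cum[k] == target and k + 1 < len(xs):
            # Target sits exactly on a row boundary: cut halfway to the next value.
            cuts.append((xs[k] + xs[k + 1]) / 2)
        else:
            # Target falls inside row k: values <= xs[k] go to the lower bin.
            cuts.append(xs[k])
    return cuts


values = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
print(weighted_quantile_cuts(values, [1] * len(values), n_cat=2))  # [10.5]
print(weighted_quantile_cuts([1, 2, 3], [3, 1, 1], n_cat=2))       # [1]
```

With unit weights the result matches the plain median split. In the second call the first row alone carries more than half of the total weight, so the cut lands on its value.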



Example correlation table for two binned features X and Y, holding the target (T) and non-target (NT) counts for each combination of bins:

```mermaid
flowchart TB
    subgraph Binned_Features
        F1[X: Bin 0, Bin 1]
        F2[Y: Bin 0, Bin 1]
    end

    subgraph Correlation_Table
        T1["X=0, Y=0: T=15, NT=85"]
        T2["X=0, Y=1: T=45, NT=55"]
        T3["X=1, Y=0: T=60, NT=40"]
        T4["X=1, Y=1: T=80, NT=20"]
    end

    F1 --> T1 & T2
    F2 --> T1 & T3

    style Correlation_Table fill:#ccffcc
```

![Diagram](images/feature_categorization_5.svg)
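
A table like the one in the diagram can be built by counting target and non-target samples per combination of bin indices. This is a minimal sketch with a hypothetical function name and input layout. It counts unweighted samples, whereas Pilz appears to track per-row weights, so treat it as an illustration of the data structure only.

```python
from collections import defaultdict


def build_correlation_table(rows):
    """rows: iterable of (bins, is_target); bins is a tuple of bin indices."""
    table = defaultdict(lambda: [0, 0])  # bins -> [target count, non-target count]
    for bins, is_target in rows:
        table[bins][0 if is_target else 1] += 1
    return dict(table)


rows = [
    ((0, 0), True),
    ((0, 0), False),
    ((0, 1), True),
    ((1, 1), True),
    ((0, 0), False),
]
table = build_correlation_table(rows)
print(table)  # {(0, 0): [1, 2], (0, 1): [1, 0], (1, 1): [1, 0]}
```

Keying the table by a tuple of bin indices lets the same function handle combinations of two, three, or more features.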

Target-rate grouping of a categorical feature, in pseudocode:

    grouped = grouped.sort("rate")

    # Separate large and small categories
    large = []    # weight > 1/n_cat of the total
    small = []    # weight <= 1/n_cat of the total

    threshold = grouped.weight.sum() / n_cat
    FOR row IN grouped:
        IF row.weight > threshold:
            large.append(row)
        ELSE:
            small.append(row)

    # Assign bins
    mapping = {}
    bin_idx = 0

    # Large categories get their own bin
    FOR cat IN large:
        mapping[cat.value] = bin_idx
        bin_idx += 1

    # Small categories are grouped by target-rate quantiles into the remaining bins
    IF small:
        n_small_bins = n_cat - len(large)
        IF n_small_bins > 0:
            bin_edges = CREATE_QUANTILES(small, n_small_bins)
            FOR cat IN small:
                # Offset by bin_idx so these bins do not collide with the large-category bins
                mapping[cat.value] = bin_idx + ASSIGN_BIN(cat.rate, bin_edges)

    RETURN CategorizedFeature(feature, mapping)
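
The pseudocode above can be turned into a small runnable sketch. Several parts are assumptions rather than Pilz's actual behavior: the function name `categorize` and the input layout are hypothetical, `CREATE_QUANTILES` and `ASSIGN_BIN` are replaced by a simple equal-count split of the rate-sorted small categories, and small-category bins are numbered after the large-category bins.

```python
def categorize(stats, n_cat):
    """stats: {category: (weight, target_weight)} -> {category: bin index}."""
    total = sum(w for w, _ in stats.values())
    threshold = total / n_cat

    # Sort categories by ascending target rate.
    rows = sorted((tw / w, cat, w) for cat, (w, tw) in stats.items())
    large = [cat for _, cat, w in rows if w > threshold]
    small = [cat for _, cat, w in rows if w <= threshold]

    # Each large category gets its own bin.
    mapping = {cat: idx for idx, cat in enumerate(large)}

    # At most n_cat - 1 categories can exceed total / n_cat, so at least one bin remains.
    n_small_bins = n_cat - len(large)
    for pos, cat in enumerate(small):
        # Stand-in for quantile edges: split the rate-sorted list into equal-count chunks.
        mapping[cat] = len(large) + pos * n_small_bins // len(small)
    return mapping


stats = {
    "admin": (150, 120),
    "technician": (50, 25),
    "blue-collar": (100, 10),
    "student": (20, 8),
    "retired": (30, 21),
}
print(categorize(stats, n_cat=3))
# {'admin': 0, 'blue-collar': 1, 'student': 1, 'technician': 2, 'retired': 2}
```

Here `admin` holds more than a third of the total weight and gets bin 0 to itself, while the four small categories are split by target rate across the two remaining bins.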

## Summary

| Feature Type | Binning Method | Example |
|--------------|----------------|---------|
| Numerical | Quantile | `n_cat=2`: split at median |
| Categorical | Target rate grouping | Similar rates grouped together |

## Next Steps