# Feature Categorization
Feature categorization is the first step in Pilz's algorithm. Each feature is binned into n_cat categories to enable multi-dimensional correlation analysis.
## Why Categorization?
Before we can build correlation tables for feature combinations, we need discrete bins:
```mermaid
flowchart TB
    subgraph Raw_Data
        R["Continuous values: 1.2, 3.5, 7.8, 12.4, ..."]
    end
    subgraph Binning
        B[Bin into n_cat categories]
    end
    subgraph Correlation_Ready
        C["Category 0, 1, 2, ..."]
    end
    R --> B
    B --> C
    style B fill:#e0f0ff
```

## Two Types of Binning
### Numerical Features: Quantile Binning
For continuous values, Pilz creates equal-size bins based on quantiles:

```mermaid
flowchart TB
    subgraph Input
        I["Values: 1, 2, 3, 5, 8, 13, 21, 34, 55, 89"]
    end
    subgraph Sorted
        S["1, 2, 3, 5, 8, 13, 21, 34, 55, 89"]
    end
    subgraph Cumulative
        C["1, 3, 6, 11, 19, 32, 53, 87, 142, 231"]
    end
    subgraph Bins_n_cat_2
        M1["Median: 10.5"]
        B1["Bin 0: ≤ 10.5 (5 values)"]
        B2["Bin 1: > 10.5 (5 values)"]
    end
    I --> S
    S --> C
    C --> M1
    M1 --> B1
    M1 --> B2
```
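
The two-bin split of the ten example values can be reproduced with the Python standard library. This is a minimal sketch of equal-frequency binning, not Pilz's own code: the function name `quantile_bins` is hypothetical, and sample weights are ignored.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_bins(values, n_cat):
    """Return (cut points, bin index per value) for equal-frequency binning."""
    # n_cat bins need n_cat - 1 interior cut points.
    cuts = quantiles(values, n=n_cat)
    # A value equal to a cut point falls into the lower bin (value <= cut).
    bins = [bisect_left(cuts, v) for v in values]
    return cuts, bins


values = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
cuts, bins = quantile_bins(values, n_cat=2)
print(cuts)  # [10.5]
print(bins)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

With `n_cat=2` the single cut point is the median, halfway between the fifth and sixth sorted values (8 and 13), so each bin receives five values.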
### Categorical Features: Target Rate Grouping
For categorical values, Pilz groups by **target rate similarity**:

```mermaid
flowchart TB
    subgraph "Step 1: Calculate Target Rate"
        T1["Category: admin<br/>100 samples, 80 target"] --> TR1["80%"]
        T2["Category: technician<br/>50 samples, 25 target"] --> TR2["50%"]
        T3["Category: blue-collar<br/>100 samples, 10 target"] --> TR3["10%"]
    end
    subgraph "Step 2: Sort by Rate"
        S1["Sort: 10%, 50%, 80%"]
    end
    subgraph "Step 3: Create n_cat Bins"
        B1["Bin 2: admin (80%)"]
        B2["Bin 1: technician (50%)"]
        B3["Bin 0: blue-collar (10%)"]
    end
    TR1 --> S1
    TR2 --> S1
    TR3 --> S1
    S1 --> B1
    S1 --> B2
    S1 --> B3
    style S1 fill:#e0f0ff
```
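
The three steps in the diagram can be sketched in a few lines of Python. This is an illustration only: the function name `rank_by_target_rate` is hypothetical, and it assumes there are no more categories than bins, so every category gets its own bin.

```python
def rank_by_target_rate(stats):
    """stats maps category -> (n_samples, n_target).

    Returns (target rate per category, bin index per category), where bins
    are numbered in order of increasing target rate.
    """
    # Step 1: target rate per category.
    rates = {cat: n_target / n for cat, (n, n_target) in stats.items()}
    # Step 2: sort categories by rate.
    ordered = sorted(rates, key=rates.get)
    # Step 3: one bin per category, lowest rate first.
    bins = {cat: i for i, cat in enumerate(ordered)}
    return rates, bins


stats = {"admin": (100, 80), "technician": (50, 25), "blue-collar": (100, 10)}
rates, bins = rank_by_target_rate(stats)
print(rates)  # {'admin': 0.8, 'technician': 0.5, 'blue-collar': 0.1}
print(bins)   # {'blue-collar': 0, 'technician': 1, 'admin': 2}
```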
The value of n_cat sets how coarse or fine the bins are:
```mermaid
flowchart LR
    subgraph Coarse
        C["n_cat=2"] --> R1[Fast, general]
    end
    subgraph Medium
        M["n_cat=5"] --> R2[Balanced]
    end
    subgraph Fine
        F["n_cat=10"] --> R3[Detailed, may overfit]
    end
```
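
One plausible reason finer bins can overfit is the number of bin combinations: assuming every feature uses the same n_cat, a table over k features has n_cat^k cells, so the same data is spread over many more cells as n_cat grows. The helper name `n_cells` is made up for this illustration.

```python
def n_cells(n_cat, n_features):
    """Number of bin combinations when each feature has n_cat bins."""
    return n_cat ** n_features


for n_cat in (2, 5, 10):
    print(n_cat, [n_cells(n_cat, k) for k in (1, 2, 3)])
# 2 [2, 4, 8]
# 5 [5, 25, 125]
# 10 [10, 100, 1000]
```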
## Building Correlation Tables
After binning, Pilz builds correlation tables for feature combinations. Each table cell holds the target (T) and non-target (NT) counts for one combination of bins.

Quantile boundaries for a numerical feature are computed from cumulative weights (pseudocode):
```
data = ..._df[[feature.name, "weight", "target_weight"]].sort(feature.name)

# Calculate cumulative weights
data["cum_weight"] = data["weight"].cum_sum()
total = data["weight"].sum()

# Find quantile boundaries
cuts = []
FOR i IN 1..n_cat:
    target_q = (i / n_cat) * total
    cut = INTERPOLATE(data, target_q)
    cuts.append(cut)
```
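
A runnable Python sketch of the same idea follows. It rests on two assumptions that the pseudocode does not spell out: only the n_cat - 1 interior boundaries are computed, and `INTERPOLATE` is taken to mean linear interpolation between neighbouring (cumulative weight, value) points. The function name `weighted_cuts` is hypothetical.

```python
from itertools import accumulate


def weighted_cuts(values, weights, n_cat):
    """Interior cut points that split the total weight into n_cat equal parts."""
    pairs = sorted(zip(values, weights))
    xs = [v for v, _ in pairs]
    cum = list(accumulate(w for _, w in pairs))  # cumulative weights
    total = cum[-1]

    cuts = []
    for i in range(1, n_cat):  # assumption: interior boundaries only
        target_q = i / n_cat * total
        # First position whose cumulative weight reaches the target.
        k = next(j for j, c in enumerate(cum) if c >= target_q)
        if k == 0:
            cuts.append(xs[0])
            continue
        # Assumed INTERPOLATE: linear between the two neighbouring points.
        frac = (target_q - cum[k - 1]) / (cum[k] - cum[k - 1])
        cuts.append(xs[k - 1] + frac * (xs[k] - xs[k - 1]))
    return cuts


print(weighted_cuts([10, 20, 30, 40], [1, 1, 1, 3], n_cat=2))  # [30.0]
```

In the example the last value carries half of the total weight, so the boundary lands at 30: values up to 30 hold weight 3, and the value 40 holds the other 3.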
Example table for two features, X and Y, with two bins each:
```mermaid
flowchart TB
    subgraph Binned_Features
        F1["X: Bin 0, Bin 1"]
        F2["Y: Bin 0, Bin 1"]
    end
    subgraph Correlation_Table
        T1["X=0, Y=0: T=15, NT=85"]
        T2["X=0, Y=1: T=45, NT=55"]
        T3["X=1, Y=0: T=60, NT=40"]
        T4["X=1, Y=1: T=80, NT=20"]
    end
    F1 --> T1 & T2
    F2 --> T1 & T3
    style Correlation_Table fill:#ccffcc
```
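
A table of this shape can be built by counting samples per bin combination. The sketch below uses plain Python and unit sample weights; the function name `correlation_table` and the toy data are illustrative, not part of Pilz.

```python
from collections import defaultdict


def correlation_table(x_bins, y_bins, targets):
    """Count target (T) and non-target (NT) samples per (x_bin, y_bin) cell."""
    table = defaultdict(lambda: {"T": 0, "NT": 0})
    for x, y, is_target in zip(x_bins, y_bins, targets):
        table[(x, y)]["T" if is_target else "NT"] += 1
    return dict(table)


x_bins = [0, 0, 0, 1, 1, 1]
y_bins = [0, 0, 1, 0, 1, 1]
targets = [0, 1, 1, 1, 1, 0]

table = correlation_table(x_bins, y_bins, targets)
print(table[(0, 0)])  # {'T': 1, 'NT': 1}
print(table[(1, 1)])  # {'T': 1, 'NT': 1}
```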

Categorical features are binned by sorting categories on target rate and then splitting them by weight (pseudocode):
```
grouped.sort("rate")

# Separate large and small categories
large = []  # weight >  1/n_cat of the data
small = []  # weight <= 1/n_cat of the data
threshold = grouped.weight.sum() / n_cat
FOR row IN grouped:
    IF row.weight > threshold:
        large.append(row)
    ELSE:
        small.append(row)

# Assign bins
mapping = {}
bin_idx = 0

# Large categories get their own bin
FOR cat IN large:
    mapping[cat.value] = bin_idx
    bin_idx += 1

# Small categories grouped by target rate quantiles
IF small:
    n_small_bins = n_cat - len(large)
    IF n_small_bins > 0:
        bin_edges = CREATE_QUANTILES(small, n_small_bins)
        FOR cat IN small:
            mapping[cat.value] = ASSIGN_BIN(cat.rate, bin_edges)

RETURN CategorizedFeature(feature, mapping)
```
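
The large/small split can be tried out with the Python sketch below. It is one possible reading of the pseudocode, not Pilz's implementation. Two details are assumptions: bins for small categories are numbered after the bins of the large ones, and `CREATE_QUANTILES` / `ASSIGN_BIN` are approximated by cutting the rate-sorted small categories into groups of near-equal count. The function name `categorize` and the sample data are hypothetical.

```python
def categorize(stats, n_cat):
    """stats maps category -> (weight, target_weight). Returns category -> bin."""
    # Rows of (category, weight, target rate), sorted by target rate.
    grouped = sorted(
        ((cat, w, tw / w) for cat, (w, tw) in stats.items()),
        key=lambda row: row[2],
    )
    threshold = sum(w for _, w, _ in grouped) / n_cat
    large = [row for row in grouped if row[1] > threshold]
    small = [row for row in grouped if row[1] <= threshold]

    # Large categories get their own bin.
    mapping = {cat: i for i, (cat, _, _) in enumerate(large)}

    # Assumption: small categories share the remaining bins, numbered after
    # the large ones, in groups of near-equal count along the rate ordering.
    n_small_bins = n_cat - len(large)
    if small and n_small_bins > 0:
        for pos, (cat, _, _) in enumerate(small):
            mapping[cat] = len(large) + pos * n_small_bins // len(small)
    return mapping


stats = {
    "a": (400, 40),  # rate 0.1, more than a third of the total weight
    "b": (50, 10),   # rate 0.2
    "c": (50, 20),   # rate 0.4
    "d": (50, 30),   # rate 0.6
    "e": (50, 45),   # rate 0.9
}
print(categorize(stats, n_cat=3))
# {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 2}
```

Here category `a` exceeds the threshold of 200 and keeps a bin to itself, while the four small categories are paired up by neighbouring target rates.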
## Summary
| Feature Type | Binning Method | Example |
|---|---|---|
| Numerical | Quantile | n_cat=2: split at median |
| Categorical | Target rate grouping | Similar rates grouped together |
## Next Steps
- How Pilz Works: how splits use these bins
- Multi-Dimensional Splits: building correlation tables