# How Pilz Works
This chapter explains Pilz's algorithm in detail. The key innovation is multi-dimensional feature cuts that naturally capture correlations.
## Core Innovation: Multi-Dimensional Cuts
Unlike traditional trees that split on one feature at a time, Pilz can split on feature combinations directly:
```mermaid
flowchart TB
    subgraph "Traditional Tree"
        A[Data] --> B{X split?}
        B -->|Yes| C[X is high]
        B -->|No| D{X is low}
        D --> E{Y split?}
        E -->|Yes| F[Y is high]
        E -->|No| G[Y is low]
    end
    subgraph "Pilz Multi-Dimensional"
        P[Data] --> Q{"X AND Y<br/>combination?"}
        Q --> R["X=0,Y=0"]
        Q --> S["X=0,Y=1"]
        Q --> T["X=1,Y=0"]
        Q --> U["X=1,Y=1"]
    end
    style Q fill:#ccffcc
```
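To illustrate why a joint cut can capture a correlation that single-feature splits miss, here is a minimal sketch on toy data. The XOR-style rows and the helper `target_rates` are assumptions made for this example and are not part of Pilz.

```python
from collections import defaultdict

# Toy rows: (x_bin, y_bin, is_target). The target follows an XOR pattern,
# so neither feature is informative on its own.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)] * 25


def target_rates(rows, key):
    """Target rate per group, where key((x, y)) selects the grouping value."""
    counts = defaultdict(lambda: [0, 0])  # group -> [target count, total count]
    for x, y, is_target in rows:
        group = key((x, y))
        counts[group][0] += is_target
        counts[group][1] += 1
    return {group: hits / total for group, (hits, total) in counts.items()}


rates_x = target_rates(rows, key=lambda xy: xy[0])  # split on X alone
rates_xy = target_rates(rows, key=lambda xy: xy)    # split on (X, Y) jointly

print(rates_x)   # every X bin has the same target rate
print(rates_xy)  # the joint bins separate target from non-target
```

Splitting on X alone leaves both bins at a 50% target rate, while the four joint bins are pure. A single-feature tree needs two levels to reach the same separation.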

## The Algorithm Step by Step
### Step 1: Feature Binning (Categorization)
Every feature is binned into `n_cat` categories:
```mermaid
flowchart TB
    subgraph Input
        I1[Categorical: job, city, status]
        I2[Numerical: age, balance, score]
    end
    subgraph Process
        B1[Categorize]
    end
    subgraph Output
        O1["Bin 0, Bin 1, ..., Bin (n_cat-1)"]
    end
    I1 --> B1
    I2 --> B1
    B1 --> O1
    style B1 fill:#e0f0ff
```
**Numerical Features:**
- Sort values and calculate quantile boundaries
- With `n_cat=2`, use the median as the single cut point
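The quantile binning described above can be sketched with the Python standard library. This is an unweighted simplification under stated assumptions: the helper names `quantile_cuts` and `assign_bin` are hypothetical, and Pilz's own implementation may weight rows or pick boundaries differently.

```python
import bisect
import statistics


def quantile_cuts(values, n_cat):
    """Return the n_cat - 1 interior quantile boundaries of `values`."""
    if n_cat < 2:
        return []
    return statistics.quantiles(values, n=n_cat)


def assign_bin(value, cuts):
    """Map a value to a bin index in 0 .. len(cuts)."""
    return bisect.bisect_right(cuts, value)


ages = [20, 30, 40, 50, 60]
cuts = quantile_cuts(ages, n_cat=2)          # one cut point: the median
bins = [assign_bin(age, cuts) for age in ages]

print(cuts)
print(bins)
```

With `n_cat=2` the only boundary is the median (40.0 here). In this sketch, values equal to a boundary fall into the upper bin, which is a design choice of the example and not a documented Pilz rule.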
### Step 2: Build Correlation Tables
For each split, Pilz builds tables for feature combinations:
```mermaid
flowchart LR
    subgraph "With n_dims=2"
        F1[Feature X] --> T1["X=0, Y=0<br/>target=15, non-target=85"]
        F2[Feature Y] --> T2["X=0, Y=1<br/>target=45, non-target=55"]
        T1 --> C[Correlation Table]
        T2 --> C
    end
```

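Example for features X and Y with binary bins (counts taken from the diagram above):
| Combination | Target Count | Non-Target Count | Target Rate |
|-------------|--------------|------------------|-------------|
| X=0, Y=0 | 15 | 85 | 15% |
| X=0, Y=1 | 45 | 55 | 45% |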
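A correlation table of this kind can be sketched as a pair of counters keyed by bin combination. The function name `build_correlation_table` and the dictionary layout are assumptions for illustration. The counts reuse the two combinations shown in the diagram above.

```python
from collections import Counter


def build_correlation_table(binned_rows, targets):
    """Count targets and non-targets for every observed bin combination."""
    target_counts, total_counts = Counter(), Counter()
    for combo, is_target in zip(binned_rows, targets):
        total_counts[combo] += 1
        target_counts[combo] += int(is_target)
    table = {}
    for combo, total in total_counts.items():
        hits = target_counts[combo]
        table[combo] = {
            "target": hits,
            "non_target": total - hits,
            "rate": hits / total,
        }
    return table


# Two (X, Y) combinations with the counts from the diagram.
binned_rows = [(0, 0)] * 100 + [(0, 1)] * 100
targets = [1] * 15 + [0] * 85 + [1] * 45 + [0] * 55
table = build_correlation_table(binned_rows, targets)

for combo, row in table.items():
    print(combo, row)
```

Only combinations that actually occur in the data get a row, so the table stays small even when `n_cat` and `n_dims` would allow many more cells.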
### Step 3: Determine Branches
Each combination is scored by its target rate:
```mermaid
flowchart TB
    subgraph Scoring
        S["Target Rate =<br/>target / (target + non-target)"]
    end
    subgraph Classification
        H["High rate > 0.5 + neutral_faktor"] --> R[Right: Likely target]
        L["Low rate < 0.5 - neutral_faktor"] --> L2[Left: Likely non-target]
        M[Medium rate] --> N[Neutral: Uncertain]
    end
    S --> H
    S --> L
    S --> M
    style R fill:#ccffcc
    style L2 fill:#ffcccc
    style N fill:#ffff99
```

**Example:**
- Target rate 80% → **Right** (clearly target)
- Target rate 15% → **Left** (clearly non-target)
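The thresholds in the scoring diagram translate into a small three-way rule. The sketch below assumes `neutral_faktor = 0.1` purely for illustration, since the source does not state a default. The function name `classify` is also hypothetical.

```python
def classify(target_rate, neutral_faktor):
    """Assign a bin combination to the right, left, or neutral branch."""
    if target_rate > 0.5 + neutral_faktor:
        return "right"    # likely target
    if target_rate < 0.5 - neutral_faktor:
        return "left"     # likely non-target
    return "neutral"      # uncertain, keep splitting


NEUTRAL_FAKTOR = 0.1  # assumed value, not taken from Pilz
branches = {rate: classify(rate, NEUTRAL_FAKTOR) for rate in (0.80, 0.15, 0.50)}
print(branches)
```

A larger `neutral_faktor` widens the uncertain band around 0.5, so more combinations are routed to the neutral branch and split again.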

### Step 4: Recurse
Each branch is split again until a node reaches the minimum event count:
```mermaid
flowchart TB
    A[Node] --> B{Minimum events?}
    B -->|No| C[Split again]
    B -->|Yes| D[Create leaf node]
    C --> E[Apply Left filter]
    C --> F[Apply Neutral filter]
    C --> G[Apply Right filter]
    E --> A
    F --> A
    G --> A
    style B fill:#e0f0ff
    style D fill:#ccffcc
```
The minimum-event check is meant to ensure that each leaf has statistically significant data.
### Step 5: Downsampling
Pilz always uses a **balanced downsampled** subset:
```mermaid
flowchart LR
    subgraph Original
        O1["100K rows<br/>Target: 10K, Non-target: 90K"]
    end
    subgraph Downsampled
        D1["10K rows<br/>Target: 5K, Non-target: 5K"]
    end
    O1 -->|"Balance target"| D1
    O1 -->|"Balance non-target"| D1
    style D1 fill:#ffff99
```
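Balanced downsampling can be sketched as drawing the same number of rows from each class. This is one plausible reading of the diagram, not Pilz's actual sampler. The function `balanced_downsample`, its `per_class` parameter, and the fixed seed are assumptions for the example.

```python
import random


def balanced_downsample(rows, is_target, per_class, seed=0):
    """Draw up to `per_class` rows from each class, keeping both classes equal."""
    rng = random.Random(seed)
    targets = [row for row in rows if is_target(row)]
    non_targets = [row for row in rows if not is_target(row)]
    n = min(per_class, len(targets), len(non_targets))
    sample = rng.sample(targets, n) + rng.sample(non_targets, n)
    rng.shuffle(sample)  # avoid leaving the sample sorted by class
    return sample


# 10,000 rows with a 10% target share, mirroring the 1:9 ratio in the diagram.
rows = [{"id": i, "target": i < 1000} for i in range(10_000)]
sample = balanced_downsample(rows, lambda row: row["target"], per_class=500)

print(len(sample), sum(row["target"] for row in sample))
```

Because both classes contribute equally, a target rate of 0.5 in the sample means "no signal", which is what makes the fixed 0.5 midpoint in the branch thresholds meaningful.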

Each split sends the data down one of three branches:
```mermaid
flowchart TD
    A[Data] --> B{"Feature combination<br/>clear discrimination?"}
    B -->|"Yes - Clear"| R["Right Branch<br/>High target rate"]
    B -->|"No - Unclear"| N["Neutral Branch<br/>Continue splitting"]
    B -->|"Yes - Clear"| L["Left Branch<br/>Low target rate"]
    R --> R2["Target Rate > 0.8"]
    N --> N2["Target Rate ~0.5"]
    L --> L2["Target Rate < 0.2"]
    N2 --> R3[Split Again]
    N2 --> L3[Split Again]
    style N fill:#ffff99
    style N2 fill:#ffff99
```
The training procedure in pseudocode:
```text
    ... target = target_class)
    spores = TRAIN_PILZ(target_filter, [], "", settings)
    SAVE(Pilz(spores, target_class), tree_idx)

FUNCTION TRAIN_PILZ(target_filter, path_filters, depth, settings):
    # 1. Read downsampled data
    train_df = READ_DATA(target_filter, path_filters, settings)
    # 2. Stop if minimum events reached
    IF train_df.size < settings.min_eval_fit:
        RETURN [CREATE_LEAF(path_filters, depth, train_df)]
    # ...
    left_filter, neutral_filter, right_filter = best.get_filters()
    IF left_filter IS NULL AND right_filter IS NULL:
        RETURN [CREATE_LEAF(path_filters, depth, train_df)]
    # 6. Recurse on three branches
    left_spores = TRAIN_PILZ(target_filter, path_filters + [left_filter], depth + "l", settings)
    neutral_spores = []
    right_spores = []
    IF neutral_filter:
        neutral_spores = TRAIN_PILZ(target_filter, path_filters + [neutral_filter], depth + "n", settings)
    IF right_filter:
        right_spores = TRAIN_PILZ(target_filter, path_filters + [right_filter], depth + "r", settings)
    RETURN left_spores + neutral_spores + right_spores

FUNCTION CATEGORIZE(feature, train_df, n_cat):
    IF feature.statistical == "categorial":
        RETURN CATEGORIZE_CATEGORICAL(feature, train_df, n_cat)
    ELSE:
        RETURN CATEGORIZE_NUMERICAL(feature, train_df, n_cat)

FUNCTION CATEGORIZE_NUMERICAL(feature, train_df, n_cat):
    sorted = SORT_BY(train_df, feature.name)
    cum_weights = CUMULATIVE_SUM(sorted.weights)
    cuts = []
    # n_cat bins need n_cat - 1 interior cut points (n_cat=2 -> median only)
    FOR i IN 1..(n_cat - 1):
        quantile = i / n_cat
        cut_point = FIND_QUANTILE(sorted, cum_weights, quantile)
        cuts.append(cut_point)
    RETURN CategorizedFeature(feature, cuts)

FUNCTION FIND_BEST_SPLIT(train_df, settings):
    scored_features = []
    # Score individual features
    FOR feature IN train_df.features:
        score = CALCULATE_DISCRIMINATION(feature, train_df)
        scored_features.append((feature, score))
    scored_features = SORT_BY(scored_features, DESC)
    best = scored_features[0]
    # Try combinations if n_dims > 1
    FOR dim IN 2..settings.n_dims:
        FOR combination IN COMBINATIONS(scored_features, dim):
            combined = CREATE_COMBINED_FEATURE(combination)
            # Build correlation table
            table = BUILD_CORRELATION_TABLE(combined, train_df)
            # Calculate discrimination score
            score = CALCULATE_DISCRIMINATION(table)
            IF score > best.score:
                best = (combined, score)
    RETURN best
```
## Key Distinctions from Traditional Trees
| Aspect | Traditional Trees | Pilz |
|--------|------------------|------|
| Feature selection | Single feature per split | Feature combinations |
| Correlation handling | Multiple shallow splits | Single deep cut |
| Correlation table | N/A | Built at each node |
| Downsampling | Sometimes | Always, balanced |
| Neutral branch | No | Yes |
## Model Structure
A Pilz model consists of spores (leaf nodes):
```mermaid
classDiagram
    class Pilz {
        +list~Spore~ spores
        +str target
        +get_sql()
    }
    class Spore {
        +list~str~ cut  # SQL WHERE conditions
        +float score    # Target rate at this leaf
        +str depth      # Path
    }
    Pilz "1" --> "*" Spore
```
Each spore represents a leaf with:
- `cut`: List of conditions that lead here
- `score`: Confidence (target rate)
- `depth`: Path notation (l=left, n=neutral, r=right)
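The class diagram maps naturally onto two small dataclasses. The sketch below is hypothetical: the field names follow the diagram, but the body of `get_sql`, the helper `where_clause`, and the example conditions are guesses at how a list of SQL conditions per leaf might be assembled, not the library's actual output.

```python
from dataclasses import dataclass, field


@dataclass
class Spore:
    cut: list[str]   # SQL WHERE conditions leading to this leaf
    score: float     # target rate at this leaf
    depth: str       # path notation, e.g. "ln" = left, then neutral

    def where_clause(self) -> str:
        """Join the path conditions into a single SQL predicate."""
        return " AND ".join(f"({cond})" for cond in self.cut) or "TRUE"


@dataclass
class Pilz:
    spores: list[Spore] = field(default_factory=list)
    target: str = ""

    def get_sql(self) -> str:
        """Assumed behaviour: one CASE branch per spore, yielding its score."""
        branches = "\n".join(
            f"  WHEN {spore.where_clause()} THEN {spore.score}"
            for spore in self.spores
        )
        return f"CASE\n{branches}\nEND"


model = Pilz(
    spores=[
        Spore(cut=["age_bin = 1", "balance_bin = 0"], score=0.8, depth="r"),
        Spore(cut=["age_bin = 0"], score=0.15, depth="l"),
    ],
    target="churn",
)
sql = model.get_sql()
print(sql)
```

Storing each leaf as a flat list of conditions means a model can be scored entirely inside a database, without walking a tree structure at prediction time.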
## Summary
| Step | What Happens |
|------|--------------|
| 1. Bin features | Categorical: group by target rate; Numerical: quantile bins |
| 2. Build correlation table | For each feature/combination, count target vs non-target |
| 3. Determine branches | High rate → Right, Low rate → Left, Medium → Neutral |
| 4. Recurse | Repeat until min_eval_fit or max_depth |
| 5. Downsample | Always use balanced subset |
## Next Steps
- Feature Categorization: deep dive into binning
- Multi-Dimensional Splits: how `n_dims` works
- Training Internals: complete algorithm