# Training Internals
This chapter explains how Pilz training works internally, from the main loop down to how individual splits and leaves are computed.
## Architecture Overview
```mermaid
flowchart TB
    subgraph Input
        DC[DataCard]
        TS[TrainSettings]
    end
    subgraph Training_Service
        M[Main Loop] --> T[Train Tree]
        T --> C[Categorize]
        T --> CO[Counter]
        T --> R[Recurse]
    end
    subgraph Data_Service
        DW[Darkwing] --> DB[DuckDB]
        DW --> P[Polars]
    end
    DC --> TS
    TS --> M
    M --> DW
    DW --> DB
    DW --> P
    style M fill:#e0f0ff
    style C fill:#ccffcc
    style CO fill:#ffff99
    style R fill:#ffcccc
```

## Data Flow
```mermaid
sequenceDiagram
    participant CLI
    participant Train
    participant Darkwing
    participant DuckDB
    participant Pilz
    CLI->>Train: run()
    Train->>Train: for each target, for n trees
    Train->>Darkwing: read_akt_train(target_filter, path_filters)
    Darkwing->>DuckDB: SELECT WHERE target AND filters
    DuckDB-->>Darkwing: DataFrame
    Darkwing-->>Train: TrainDataframes
    Train->>Train: cater() - categorize features
    Train->>Train: counter() - find best split
    Train->>Train: recurse() - build subtree
    Train->>Pilz: create Pilz object
    Train-->>CLI: save JSON
```
```text
…ee_index):
    CONTINUE
# Create target filter (positive class)
target_filter = Filter(target = target_class)
# Train tree
spores = train_pilz(
    target_filter = target_filter,
    path_filters = [],
    depth = ""
)
# Create and save Pilz
pilz = Pilz(spores = spores, target = target_class)
SAVE_PILZ(pilz, tree_index)
```
## The train_pilz Function
This is the core recursive function:
```text
FUNCTION train_pilz(target_filter, path_filters, depth):
    # Step 1: Read data for this node
    train_df = DARKWING.read_akt_train(
        target_filter = target_filter,
        akt_filters = path_filters
    )

    # Step 2: Check stopping criteria
    IF train_df.is_final_size() OR LENGTH(depth) >= settings.max_depth:
        RETURN [make_spore(path_filters, depth, train_df)]

    # Step 3: Categorize features
    CATER(train_df)

    # Step 4: Find best split
    left_filter, neutral_filter, right_filter = COUNTER(train_df)

    # Step 5: Handle no good split
    IF left_filter IS NULL AND right_filter IS NULL:
        RETURN [make_spore(path_filters, depth, train_df)]

    # Step 6: Recurse (three-way split)
    left_spores = []
    neutral_spores = []
    right_spores = []
    IF left_filter IS NOT NULL:
        left_spores = train_pilz(
            target_filter,
            path_filters + [left_filter],
            depth + "l"
        )
    IF neutral_filter IS NOT NULL:
        neutral_spores = train_pilz(
            target_filter,
            path_filters + [neutral_filter],
            depth + "n"
        )
    IF right_filter IS NOT NULL:
        right_spores = train_pilz(
            target_filter,
            path_filters + [right_filter],
            depth + "r"
        )
    RETURN left_spores + neutral_spores + right_spores
```
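The three-way recursion and the way `depth` encodes the path through the tree can be tried out on in-memory data. The following is a minimal sketch, not the Pilz implementation: `find_split`, `train`, the `min_rows` threshold, and the split on a single numeric column `x` are all assumptions made for illustration. The real function re-reads each node's data through Darkwing using `path_filters`, whereas this sketch filters a Python list.

```python
from typing import Callable, Optional

Row = dict
Predicate = Callable[[Row], bool]


def find_split(rows: list[Row]) -> Optional[tuple[Predicate, Predicate, Predicate]]:
    """Hypothetical stand-in for counter(): split column 'x' into low / middle / high."""
    values = sorted(r["x"] for r in rows)
    lo, hi = values[len(values) // 3], values[2 * len(values) // 3]
    if lo == hi:
        return None  # no useful split found
    return (
        lambda r: r["x"] < lo,         # left
        lambda r: lo <= r["x"] < hi,   # neutral
        lambda r: r["x"] >= hi,        # right
    )


def train(rows: list[Row], depth: str = "", max_depth: int = 2, min_rows: int = 3) -> list[dict]:
    """Return the leaves of a three-way tree; each leaf records its path and row count."""
    # Stopping criteria: node is small enough, or the path is as long as allowed.
    if len(rows) <= min_rows or len(depth) >= max_depth:
        return [{"path": depth, "n": len(rows)}]
    split = find_split(rows)
    if split is None:
        return [{"path": depth, "n": len(rows)}]
    leaves = []
    for label, predicate in zip("lnr", split):
        subset = [r for r in rows if predicate(r)]
        if subset:  # skip empty branches
            leaves += train(subset, depth + label, max_depth, min_rows)
    return leaves


if __name__ == "__main__":
    for leaf in train([{"x": i} for i in range(9)], max_depth=1):
        print(leaf)
```

Because `depth` grows by one character per level, its length doubles as the depth counter, which is why the stopping check can compare `len(depth)` with `max_depth`.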
## The Cater Function
Categorizes all features:
```text
FUNCTION cater(train_df):
    train_df.features = []  # Reset
    FOR feature IN dc.features:
        # Skip target
        IF feature.name == target.name:
            CONTINUE
        # Categorize based on type
        categorized = FEAT_CATER(feature, train_df, settings.n_cat)
        # Only keep if useful
        IF categorized.is_diff_to_low():
            train_df.features.append(categorized)
```
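The internals of `FEAT_CATER` and the usefulness check are not shown here, so the following is only an illustration of the general idea under stated assumptions: a numeric feature is cut into `n_cat` equal-frequency bins, and a feature counts as informative when the target rate differs between bins. The helper names `categorize` and `target_rate_spread` are hypothetical.

```python
def categorize(values: list[float], n_cat: int) -> list[int]:
    """Assign each value to one of n_cat equal-frequency bins (0 .. n_cat - 1)."""
    ranked = sorted(values)
    # Bin edges are the values found at the 1/n_cat, 2/n_cat, ... positions.
    edges = [ranked[len(ranked) * k // n_cat] for k in range(1, n_cat)]
    # A value's bin is the number of edges it reaches or exceeds.
    return [sum(v >= edge for edge in edges) for v in values]


def target_rate_spread(bins: list[int], is_target: list[int]) -> float:
    """Difference between the highest and lowest target rate across bins."""
    rates = []
    for b in sorted(set(bins)):
        flags = [t for bin_id, t in zip(bins, is_target) if bin_id == b]
        rates.append(sum(flags) / len(flags))
    return max(rates) - min(rates)


if __name__ == "__main__":
    bins = categorize([10, 20, 30, 40, 50, 60], n_cat=3)
    print(bins)
    print(target_rate_spread(bins, [0, 0, 0, 1, 1, 1]))
```

A feature whose bins all have the same target rate yields a spread of zero and would carry no information for a split.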
## The Counter Function
Finds the best feature or combination to split on:
```text
FUNCTION counter(train_df):
    # Step 1: Score all features
    scored = []
    FOR feature IN train_df.features:
        diff = feature.calc_diff()
        scored.append((feature, diff))
    scored = SORT_BY(scored, diff, DESC)
    # Each entry is a (feature, diff) pair, so unpack the top one
    best, best_diff = scored[0]

    # Step 2: Try combinations if n_dims >= 2
    IF settings.n_dims >= 2:
        FOR dim IN 2..settings.n_dims:
            n_calcs = 0
            FOR comb IN COMBINATIONS(scored, dim):
                n_calcs = n_calcs + 1
                combined = CombinedCategorizedFeature(comb)
                combined_diff = combined.calc_diff()
                IF combined_diff > best_diff:
                    best = combined
                    best_diff = combined_diff
                # Early exit
                IF settings.calcs_per_dim AND n_calcs >= settings.calcs_per_dim:
                    BREAK

    # Step 3: Return three-way filters
    RETURN best.get_left_right_filter()
```
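The search over feature combinations, including the `calcs_per_dim` cap, can be sketched with `itertools.combinations`. This is a simplified illustration rather than the Pilz code: features are plain strings, `best_split` is a hypothetical name, and `combined_score` is a caller-supplied function standing in for `CombinedCategorizedFeature.calc_diff()`.

```python
from itertools import combinations
from typing import Callable, Optional


def best_split(
    scores: dict[str, float],
    combined_score: Callable[[tuple[str, ...]], float],
    n_dims: int = 2,
    calcs_per_dim: Optional[int] = None,
) -> tuple[Optional[tuple[str, ...]], float]:
    """Return the best single feature or feature combination and its score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    if not ranked:
        return None, 0.0
    best, best_diff = (ranked[0],), scores[ranked[0]]
    for dim in range(2, n_dims + 1):
        # combinations() follows the input order, so combinations built from
        # the highest-scoring features are evaluated first.
        for tried, combo in enumerate(combinations(ranked, dim), start=1):
            diff = combined_score(combo)
            if diff > best_diff:
                best, best_diff = combo, diff
            # Early exit once the per-dimension budget is used up.
            if calcs_per_dim and tried >= calcs_per_dim:
                break
    return best, best_diff


if __name__ == "__main__":
    scores = {"age": 0.30, "balance": 0.20, "city": 0.05}
    print(best_split(scores, lambda combo: 0.9 * sum(scores[f] for f in combo)))
```

Sorting before combining is what makes the cap useful: with a small `calcs_per_dim`, the budget is spent on combinations of the strongest individual features.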
## Data Caching
Darkwing provides caching to avoid reloading data:
```mermaid
flowchart LR
    subgraph "First Call"
        F1[Request] --> L[Load from CSV]
        L --> C[Cache in memory]
    end
    subgraph "Subsequent Calls"
        S1[Request] --> H[Check Cache]
        H -->|"Hit"| R1[Return cached]
        H -->|"Miss"| L
    end
    style C fill:#ccffcc
    style R1 fill:#ccffcc
```
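The hit/miss flow in the diagram corresponds to a simple in-memory cache keyed by file path. The class below is a minimal sketch using only the standard library; `CsvCache` and its `loads` counter are hypothetical names, and Darkwing's actual cache (which works with DuckDB and Polars) may be organized differently.

```python
import csv
from pathlib import Path


class CsvCache:
    """Load each CSV file at most once and serve later requests from memory."""

    def __init__(self) -> None:
        self._cache: dict[str, list[dict]] = {}
        self.loads = 0  # number of times a file was actually read from disk

    def read(self, path: str) -> list[dict]:
        key = str(Path(path).resolve())
        if key not in self._cache:  # miss: load from CSV and remember the rows
            with open(key, newline="") as handle:
                self._cache[key] = list(csv.DictReader(handle))
            self.loads += 1
        return self._cache[key]  # hit: return the cached rows
```

Note that the cached list is returned by reference, so callers that modify it would affect later readers; a real implementation might return a copy or an immutable frame.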

## Score Calculation
The score for each leaf is calculated as:
```text
FUNCTION make_spore(path_filters, depth, train_df):
    target_count = train_df.target_df_count
    non_target_count = train_df.non_target_df_count
    total = target_count + non_target_count
    # Target rate = proportion of
```

## Filter Generation
Filters are created using SymPy for boolean logic:
```mermaid
flowchart TB
    subgraph Python
        P["Feature: balance > 1000"]
    end
    subgraph SymPy
        S["sympy.StrictGreaterThan(balance, 1000)"]
    end
    subgraph SQL
        Q["WHERE balance > 1000"]
    end
    P --> S
    S --> Q
    style P fill:#e0f0ff
    style S fill:#ffff99
    style Q fill:#ccffcc
```
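The translation from a SymPy expression to a SQL condition can be sketched as a small recursive function. In SymPy, `StrictGreaterThan` represents `>` while `GreaterThan` represents `>=`. The `to_sql` helper below is hypothetical and covers only comparisons combined with `And`, `Or`, and `Not`; how Pilz actually renders its filters is not shown in this chapter.

```python
import sympy

# Map SymPy relational classes to SQL comparison operators.
_SQL_OPS = {
    sympy.StrictGreaterThan: ">",
    sympy.GreaterThan: ">=",
    sympy.StrictLessThan: "<",
    sympy.LessThan: "<=",
    sympy.Equality: "=",
    sympy.Unequality: "<>",
}


def to_sql(expr) -> str:
    """Render a SymPy boolean expression as a SQL condition (hypothetical helper)."""
    if isinstance(expr, sympy.And):
        return " AND ".join(f"({to_sql(arg)})" for arg in expr.args)
    if isinstance(expr, sympy.Or):
        return " OR ".join(f"({to_sql(arg)})" for arg in expr.args)
    if isinstance(expr, sympy.Not):
        return f"NOT ({to_sql(expr.args[0])})"
    return f"{expr.lhs} {_SQL_OPS[type(expr)]} {expr.rhs}"


balance, age = sympy.symbols("balance age")
left = sympy.StrictGreaterThan(balance, 1000)  # the condition balance > 1000
right = sympy.Not(left)                        # SymPy rewrites this as balance <= 1000
path = sympy.And(left, age < 30)               # two filters along one tree path

if __name__ == "__main__":
    print("WHERE", to_sql(left))
    print("WHERE", to_sql(right))
    print("WHERE", to_sql(path))
```

Representing filters symbolically means the complement of a split can be derived by negation instead of being written by hand, and several filters along a path can be combined with `And` before a single SQL condition is rendered.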

## Parallel Training
The `max_workers` setting controls how many threads are used to train trees in parallel:
```yaml
# train_settings.yaml
max_workers: 4  # Number of threads
```
```mermaid
flowchart LR
    subgraph Sequential
        S1[Tree 1] --> S2[Tree 2] --> S3[Tree 3]
    end
    subgraph Parallel
        P1[Tree 1]
        P2[Tree 2]
        P3[Tree 3]
        P1 & P2 & P3 --> J[Join]
    end
    style P1 fill:#ccffcc
    style P2 fill:#ccffcc
    style P3 fill:#ccffcc
```
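The parallel branch of the diagram, where independent trees are trained at the same time and then joined, maps naturally onto a thread pool. The sketch below uses Python's standard `concurrent.futures`; `train_tree` and `train_all` are hypothetical placeholders, and whether Pilz uses this exact mechanism is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor


def train_tree(tree_index: int) -> dict:
    """Hypothetical placeholder for training a single tree."""
    return {"tree": tree_index, "status": "trained"}


def train_all(n_trees: int, max_workers: int = 4) -> list[dict]:
    """Train n_trees trees using at most max_workers threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, even if trees finish out of order.
        # Leaving the with-block waits for all workers (the "Join" step).
        return list(pool.map(train_tree, range(n_trees)))


if __name__ == "__main__":
    print(train_all(3))
```

Threads suit this workload when most of the time is spent waiting on the database or inside native libraries that release the interpreter lock; pure-Python work would gain less.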
## Summary
| Component | Description |
|-----------|-------------|
| run() | Main loop: for each target, for n trees |
| train_pilz() | Recursive tree building |
| cater() | Feature categorization |
| counter() | Find best split |
| make_spore() | Create leaf node |
| Darkwing | Data loading and caching |
## Next Steps
- SQL Rules - Deploy models to production
- Settings Reference - All parameters
- Troubleshooting - Common issues