
# Training Internals

This chapter provides a deep dive into how Pilz training works under the hood. If you want to understand every detail of the algorithm, read on.

## Architecture Overview

```mermaid
flowchart TB
    subgraph Input
        DC[DataCard]
        TS[TrainSettings]
    end

    subgraph Training_Service
        M[Main Loop] --> T[Train Tree]
        T --> C[Categorize]
        T --> CO[Counter]
        T --> R[Recurse]
    end

    subgraph Data_Service
        DW[Darkwing] --> DB[DuckDB]
        DW --> P[Polars]
    end

    DC --> TS
    TS --> M
    M --> DW
    DW --> DB
    DW --> P

    style M fill:#e0f0ff
    style C fill:#ccffcc
    style CO fill:#ffff99
    style R fill:#ffcccc
```

![Diagram](images/training_internals_1.svg)

## Data Flow

```mermaid
sequenceDiagram
    participant CLI
    participant Train
    participant Darkwing
    participant DuckDB
    participant Pilz

    CLI->>Train: run()
    Train->>Train: for each target, for n trees

    Train->>Darkwing: read_akt_train(target_filter, path_filters)
    Darkwing->>DuckDB: SELECT WHERE target AND filters
    DuckDB-->>Darkwing: DataFrame
    Darkwing-->>Train: TrainDataframes

    Train->>Train: cater() - categorize features
    Train->>Train: counter() - find best split
    Train->>Train: recurse() - build subtree

    Train->>Pilz: create Pilz object
    Train-->>CLI: save JSON
```

![Diagram](images/training_internals_2.svg)

## The run Function

The entry point loops over every target class and trains the configured number of trees per target:

```
FUNCTION run():
    FOR target_class IN targets:
        FOR tree_index IN 1..settings.n_trees:
            # Skip trees that already exist
            IF PILZ_EXISTS(tree_index):
                CONTINUE

            # Create target filter (positive class)
            target_filter = Filter(target = target_class)

            # Train tree
            spores = train_pilz(
                target_filter = target_filter,
                path_filters = [],
                depth = ""
            )

            # Create and save Pilz
            pilz = Pilz(spores = spores, target = target_class)
            SAVE_PILZ(pilz, tree_index)
```

## The train_pilz Function

This is the core recursive function:

```
FUNCTION train_pilz(target_filter, path_filters, depth):
    # Step 1: Read data for this node
    train_df = DARKWING.read_akt_train(
        target_filter = target_filter,
        akt_filters = path_filters
    )

    # Step 2: Check stopping criteria
    IF train_df.is_final_size() OR LENGTH(depth) >= settings.max_depth:
        RETURN [make_spore(path_filters, depth, train_df)]

    # Step 3: Categorize features
    CATER(train_df)

    # Step 4: Find best split
    left_filter, neutral_filter, right_filter = COUNTER(train_df)

    # Step 5: Handle no good split
    IF left_filter IS NULL AND right_filter IS NULL:
        RETURN [make_spore(path_filters, depth, train_df)]

    # Step 6: Recurse (three-way split)
    left_spores = []
    neutral_spores = []
    right_spores = []

    IF left_filter IS NOT NULL:
        left_spores = train_pilz(
            target_filter,
            path_filters + [left_filter],
            depth + "l"
        )

    IF neutral_filter IS NOT NULL:
        neutral_spores = train_pilz(
            target_filter,
            path_filters + [neutral_filter],
            depth + "n"
        )

    IF right_filter IS NOT NULL:
        right_spores = train_pilz(
            target_filter,
            path_filters + [right_filter],
            depth + "r"
        )

    RETURN left_spores + neutral_spores + right_spores
```
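The recursion above can be sketched in runnable Python. Everything here is illustrative: `find_split` is a hypothetical stand-in for the cater/counter steps that simply splits three ways until the depth limit, so only the path-filter and depth-string bookkeeping mirrors the pseudocode.

```python
# Sketch of the three-way recursion; `find_split` is a toy stand-in
# for CATER + COUNTER and always returns three child filters.
MAX_DEPTH = 2  # stands in for settings.max_depth

def find_split(path_filters):
    # Pretend every node splits three ways until the depth limit.
    return ("left", "neutral", "right")

def train_pilz(path_filters, depth):
    # Stopping criterion: the depth string's length encodes tree depth.
    if len(depth) >= MAX_DEPTH:
        return [{"filters": list(path_filters), "depth": depth}]

    left, neutral, right = find_split(path_filters)

    spores = []
    if left is not None:
        spores += train_pilz(path_filters + [left], depth + "l")
    if neutral is not None:
        spores += train_pilz(path_filters + [neutral], depth + "n")
    if right is not None:
        spores += train_pilz(path_filters + [right], depth + "r")
    return spores

spores = train_pilz([], "")
# Three-way splits at every node with max depth 2 give 9 leaves,
# with depth strings "ll", "ln", "lr", ..., "rr".
print([s["depth"] for s in spores])
```

Note how the depth string doubles as a human-readable path: a spore at depth `"lnr"` went left, then neutral, then right from the root.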

## The Cater Function

Categorizes all features:

```
FUNCTION cater(train_df):
    train_df.features = []  # Reset

    FOR feature IN dc.features:
        # Skip target
        IF feature.name == target.name:
            CONTINUE

        # Categorize based on type
        categorized = FEAT_CATER(feature, train_df, settings.n_cat)

        # Only keep if useful
        IF NOT categorized.is_diff_to_low():
            train_df.features.append(categorized)
```
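A minimal sketch of what categorization could look like for a numeric feature, assuming quantile binning into `n_cat` equal-count bins. `quantile_edges` and `categorize` are illustrative helpers, not the real `FEAT_CATER`, which operates on Polars dataframes:

```python
# Quantile-based categorization sketch: split a numeric feature into
# n_cat roughly equal-count bins (n_cat mirrors settings.n_cat).

def quantile_edges(values, n_cat):
    """Return n_cat - 1 cut points splitting `values` into equal-count bins."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_cat] for i in range(1, n_cat)]

def categorize(values, n_cat):
    edges = quantile_edges(values, n_cat)
    # Each value's category is the number of edges it is >= to,
    # i.e. its bin index from 0 to n_cat - 1.
    return [sum(v >= e for e in edges) for v in values]

values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
cats = categorize(values, n_cat=2)  # median split into bins 0 and 1
```

A categorized feature whose bins barely differ in target rate carries no signal, which is what the `is_diff_to_low()` check filters out.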

## The Counter Function

Finds the best feature or combination to split on:

```
FUNCTION counter(train_df):
    # Step 1: Score all features
    scored = []
    FOR feature IN train_df.features:
        diff = feature.calc_diff()
        scored.append((feature, diff))

    scored = SORT_BY(scored, diff, DESC)

    best = scored[0]
    best_diff = best.diff

    # Step 2: Try combinations if n_dims > 1
    IF settings.n_dims >= 2:
        FOR dim IN 2..settings.n_dims:
            counter = 0

            FOR comb IN COMBINATIONS(scored, dim):
                counter = counter + 1

                combined = CombinedCategorizedFeature(comb)
                combined_diff = combined.calc_diff()

                IF combined_diff > best_diff:
                    best = combined
                    best_diff = combined_diff

                # Early exit
                IF settings.calcs_per_dim AND counter >= settings.calcs_per_dim:
                    BREAK

    # Step 3: Return three-way filters
    RETURN best.get_left_right_filter()
```
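The combination search maps naturally onto `itertools.combinations`. In this sketch the scorer is a toy lookup table standing in for `calc_diff()`; only the shape of the loop (sort single features, widen to `n_dims`, cap the work with `calcs_per_dim`) mirrors the pseudocode:

```python
from itertools import combinations

# Toy single-feature scores standing in for feature.calc_diff().
DIFFS = {"a": 0.30, "b": 0.25, "c": 0.10}

def calc_diff(features):
    # Toy combined score: sum of single-feature diffs, capped at 1.0.
    return min(sum(DIFFS[f] for f in features), 1.0)

def counter(features, n_dims=2, calcs_per_dim=None):
    # Step 1: score and sort single features, best first.
    scored = sorted(features, key=lambda f: DIFFS[f], reverse=True)
    best, best_diff = (scored[0],), DIFFS[scored[0]]

    # Step 2: try combinations of 2..n_dims features.
    for dim in range(2, n_dims + 1):
        n_calcs = 0
        for comb in combinations(scored, dim):
            n_calcs += 1
            diff = calc_diff(comb)
            if diff > best_diff:
                best, best_diff = comb, diff
            # Early exit: bound the combinatorial explosion per dimension.
            if calcs_per_dim and n_calcs >= calcs_per_dim:
                break
    return best, best_diff

best, diff = counter(["a", "b", "c"])
```

Because `scored` is sorted best-first, the `calcs_per_dim` cap preferentially spends its budget on combinations of the strongest single features.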

## Data Caching

Darkwing provides caching to avoid reloading data:

```mermaid
flowchart LR
    subgraph "First Call"
        F1[Request] --> L[Load from CSV]
        L --> C[Cache in memory]
    end

    subgraph "Subsequent Calls"
        S1[Request] --> H[Check Cache]
        H -->|"Hit"| R1[Return cached]
        H -->|"Miss"| L
    end

    style C fill:#ccffcc
    style R1 fill:#ccffcc
```

![Diagram](images/training_internals_3.svg)

## Score Calculation

The score for each leaf is calculated as:

```
FUNCTION make_spore(path_filters, depth, train_df):
    target_count = train_df.target_df_count
    non_target_count = train_df.non_target_df_count
    total = target_count + non_target_count

    # Target rate = proportion of target rows at this leaf
    target_rate = target_count / total

    RETURN Spore(path_filters, depth, target_rate)
```

## Filter Generation

Filters are created using SymPy for boolean logic:

```mermaid
flowchart TB
    subgraph Python
        P["Feature: balance > 1000"]
    end

    subgraph SymPy
        S["sympy.GreaterThan(balance, 1000)"]
    end

    subgraph SQL
        Q["WHERE balance > 1000"]
    end

    P --> S
    S --> Q

    style P fill:#e0f0ff
    style S fill:#ffff99
    style Q fill:#ccffcc
```

![Diagram](images/training_internals_5.svg)
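A dependency-free sketch of this pipeline. The real code goes through SymPy relations (e.g. `sympy.GreaterThan`) before rendering SQL; here a small hypothetical `Filter` dataclass plays that role so the Python-to-SQL step is visible end to end:

```python
# Illustrative filter-to-SQL sketch; the actual implementation uses
# SymPy expressions rather than this hand-rolled dataclass.
from dataclasses import dataclass

@dataclass(frozen=True)
class Filter:
    column: str
    op: str          # e.g. ">", "<=", "="
    value: float

    def to_sql(self):
        return f"{self.column} {self.op} {self.value}"

def where_clause(filters):
    # Path filters along a tree branch are ANDed together,
    # matching the accumulated path_filters in train_pilz.
    return "WHERE " + " AND ".join(f.to_sql() for f in filters)

clause = where_clause([Filter("balance", ">", 1000)])
```

Using a symbolic layer (SymPy in the real code) rather than raw strings means filters can be simplified or negated before they ever reach DuckDB.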

## Parallel Training

The number of worker threads is configured in the train settings:

```yaml
# train_settings.yaml
max_workers: 4  # Number of threads
```

```mermaid
flowchart LR
    subgraph Sequential
        S1[Tree 1] --> S2[Tree 2] --> S3[Tree 3]
    end

    subgraph Parallel
        P1[Tree 1]
        P2[Tree 2]
        P3[Tree 3]

        P1 & P2 & P3 --> J[Join]
    end

    style P1 fill:#ccffcc
    style P2 fill:#ccffcc
    style P3 fill:#ccffcc
```
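A sketch of how `max_workers` could map onto a thread pool; `train_one_tree` is a hypothetical stand-in for the per-tree work (train_pilz plus saving the result), and the `pool.map` call corresponds to the "Join" step in the diagram:

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # mirrors max_workers in train_settings.yaml

def train_one_tree(tree_index):
    # Placeholder for train_pilz(...) + SAVE_PILZ(...).
    return f"pilz_{tree_index}"

# Trees are independent, so they can run concurrently; map() blocks
# until every tree is done and preserves submission order.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pilze = list(pool.map(train_one_tree, range(8)))
```
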

## Summary

| Component | Description |
|-----------|-------------|
| run() | Main loop: for each target, for n trees |
| train_pilz() | Recursive tree building |
| cater() | Feature categorization |
| counter() | Find best split |
| make_spore() | Create leaf node |
| Darkwing | Data loading and caching |

## Next Steps
