
# First Project: Iris Classification

In this tutorial, you'll train your first Pilz model to classify Iris flowers into three species: Setosa, Versicolor, and Virginica.

## What You'll Learn

  • How to prepare a dataset for Pilz

  • What each DataCard field means

  • How to interpret training results

  • How to read ROC curves

## Prerequisites

## Dataset Overview

The Iris dataset is a classic machine learning dataset with:

  • 150 samples (50 per class)

  • 4 features: sepal length, sepal width, petal length, petal width

  • 3 classes: Iris-setosa, Iris-versicolor, Iris-virginica

```mermaid
flowchart TB
    subgraph Features
        F1[sepal_length]
        F2[sepal_width]
        F3[petal_length]
        F4[petal_width]
    end

    subgraph Classes
        C1[Setosa]
        C2[Versicolor]
        C3[Virginica]
    end

    F1 --> C1
    F2 --> C2
    F3 --> C3
    F4 --> C1
```

![Diagram](images/first_project_1.svg)

## Step 1: Prepare the Data

First, ensure your Iris CSV has proper headers:

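```bash
# Check current format
head -3 iris.csv

# If the first line is a different header, replace it.
# tail -n +2 prints from line 2 onward, so the old first line is dropped;
# only run this when the file already starts with a header row.
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_fixed.csv
tail -n +2 iris.csv >> iris_fixed.csv
mv iris_fixed.csv iris.csv
```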

The data should look like:

```csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
```
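Before handing the file to Pilz, it can help to confirm the header and class counts programmatically. The following is a minimal sketch using only the Python standard library; `check_iris_csv` and `EXPECTED_HEADER` are hypothetical names introduced here, not part of Pilz, and the two-row sample stands in for the real file.

```python
import csv
import io
from collections import Counter

# Header this tutorial expects (taken from Step 1).
EXPECTED_HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]


def check_iris_csv(handle):
    """Return (header_ok, class_counts) for an open CSV file object."""
    reader = csv.reader(handle)
    header = next(reader)
    # The species label is the last column; skip empty rows.
    counts = Counter(row[-1] for row in reader if row)
    return header == EXPECTED_HEADER, counts


# Two-row stand-in for iris.csv so the sketch runs on its own.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
header_ok, counts = check_iris_csv(sample)
print(header_ok, dict(counts))
```

With the real file, open it with `open("iris.csv", newline="")` and pass the handle instead; for the full dataset each of the three classes should show a count of 50.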

## Step 2: Create the DataCard

The DataCard tells Pilz about your dataset structure:

```bash
pilz create-dc --src iris.csv --out iris_dc.yaml
```

Let's examine the generated file:

```yaml
# iris_dc.yaml
features:
  - name: sepal_length
    statistical: numerical    # Continuous values
    type: float               # Decimal numbers
  - name: sepal_width
    statistical: numerical
    type: float
  - name: petal_length
    statistical: numerical
    type: float
  - name: petal_width
    statistical: numerical
    type: float
target:
  feature_name: species       # Column to predict
  values:                     # All possible classes
    - Iris-setosa
    - Iris-versicolor
    - Iris-virginica
train_files:
  - iris.csv                  # Training data
test_files:
  - iris.csv                  # Using same file for demo
```

### Understanding the Fields

| Field | Description | Example |
|-------|-------------|---------|
| name | Column header in CSV | sepal_length |
| statistical | Statistical type of the feature | numerical or categorial |
| type | Python type | float, int, string |
| target.feature_name | Column to predict | species |
| target.values | Possible outcomes | [Iris-setosa, Iris-versicolor, Iris-virginica] |

!> Important: Use categorial (not categorical) for categorical features.
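Because the `categorial` spelling is easy to get wrong, a quick check of the DataCard contents can catch it early. This is a sketch under stated assumptions: the DataCard is represented as a plain Python dict (loading the YAML file is left out), `find_datacard_problems` is a hypothetical helper rather than a Pilz function, and the allowed values are only the two named in the table above. The `colour` feature is invented to trigger the check.

```python
# Values for `statistical` named in this tutorial; Pilz may accept others.
ALLOWED_STATISTICAL = {"numerical", "categorial"}


def find_datacard_problems(datacard):
    """Return a list of human-readable problems found in a DataCard dict."""
    problems = []
    for feature in datacard.get("features", []):
        stat = feature.get("statistical")
        if stat not in ALLOWED_STATISTICAL:
            problems.append(f"{feature.get('name')}: unknown statistical value {stat!r}")
    target = datacard.get("target", {})
    if not target.get("feature_name"):
        problems.append("target.feature_name is missing")
    if not target.get("values"):
        problems.append("target.values is empty")
    return problems


card = {
    "features": [
        {"name": "sepal_length", "statistical": "numerical", "type": "float"},
        # Hypothetical feature, misspelled on purpose ("categorical").
        {"name": "colour", "statistical": "categorical", "type": "string"},
    ],
    "target": {
        "feature_name": "species",
        "values": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    },
}
problems = find_datacard_problems(card)
print(problems)
```

Only the misspelled entry is reported; the Iris DataCard from this step would produce an empty list.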

## Step 3: Configure Training

Create train_settings.yaml:

```yaml
# How many trees to build per class
n: 1

# Where to save trained models
out_folder: iris_model

# Maximum tree depth (prevents overfitting)
max_depth: 10

# Feature combinations to try at each split
n_dims: 2

# Categories per feature (bins)
n_cat: 3
```

### Parameters Explained

| Parameter | What it does | Recommended for Iris |
|-----------|--------------|----------------------|
| n | Trees per class | 1 (simple dataset) |
| max_depth | Max splits per tree | 10 (enough for 4 features) |
| n_dims | Features per split | 2 (pairs work well) |
| n_cat | Bins per feature | 3 (few unique values) |
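To get a feel for what `n_dims: 2` implies on four features, the sketch below enumerates every feature pair. It assumes `n_dims` is the number of features considered together at a split, as the table suggests; how Pilz actually selects or scores these combinations is not described in this tutorial.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
n_dims = 2  # value from train_settings.yaml

# Every unordered group of n_dims features.
candidate_sets = list(combinations(features, n_dims))
for combo in candidate_sets:
    print(combo)
print(len(candidate_sets))  # 6 pairs for 4 features
```

Raising `n_dims` to 3 would give 4 triples, so the number of combinations stays small for a dataset of this size.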

## Step 4: Train the Model

```bash
pilz train --datacard iris_dc.yaml --trainsettings train_settings.yaml
```

Expected output:

```text
INFO: Training for target Iris-setosa, tree 0
INFO: Training for target Iris-versicolor, tree 0
INFO: Training for target Iris-virginica, tree 0
INFO: Models saved to iris_model/
```

### What Happened?

```mermaid
flowchart LR
    A[iris.csv] --> B["Filter: class=Setosa"]
    A --> C["Filter: class=Versicolor"]
    A --> D["Filter: class=Virginica"]

    B --> E[Tree 0]
    C --> F[Tree 0]
    D --> G[Tree 0]

    E --> H["iris_model/Iris-setosa/0.json"]
    F --> I["iris_model/Iris-versicolor/0.json"]
    G --> J["iris_model/Iris-virginica/0.json"]
```

![Diagram](images/first_project_2.svg)

or/0.json

iris_model/Iris-virginica/0.json
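The file names follow from the training settings. As a rough sketch, assuming the layout is `<out_folder>/<target>/<tree index>.json` (inferred from the paths shown in this tutorial, not from Pilz documentation), the expected files can be listed like this; `expected_model_files` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def expected_model_files(out_folder, targets, n):
    """List one path per target and tree index, assuming out_folder/target/index.json."""
    return [
        PurePosixPath(out_folder) / target / f"{tree}.json"
        for target in targets
        for tree in range(n)
    ]


paths = expected_model_files(
    "iris_model",
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    n=1,
)
for path in paths:
    print(path)
```

With `n: 1` and three target values this yields three files, one tree per class.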

Let's examine one model:

```bash
cat iris_model/Iris-setosa/0.json
```

```json
{
  "spores": [
    {
      "cut": ["petal_width <= 0.8"],
      "score": 1.0,
      "depth": "l"
    },
    {
      "cut": ["petal_width > 1.75"],
      "score": 1.0,
      "depth": "rr"
    }
  ],
  "target": "Iris-setosa"
}
```

### Reading the Model

Each "spore" is a leaf node in the tree:

  • cut: SQL WHERE conditions

  • score: Prediction confidence (0-1)

  • depth: Tree path (l=left, r=right, n=neutral)
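The three fields can be read with any JSON parser. The sketch below loads the two spores from the example model and prints one line per leaf. It assumes that multiple entries in `cut` are combined with AND and that each character of `depth` is one step along the tree path; both are readings of the description above rather than documented behaviour, and `describe_spore` is a hypothetical helper.

```python
import json

# Inline copy of the example model so the sketch runs without the file.
model_text = """
{
  "spores": [
    {"cut": ["petal_width <= 0.8"], "score": 1.0, "depth": "l"},
    {"cut": ["petal_width > 1.75"], "score": 1.0, "depth": "rr"}
  ],
  "target": "Iris-setosa"
}
"""

STEP_NAMES = {"l": "left", "r": "right", "n": "neutral"}


def describe_spore(spore):
    """Turn one spore into a readable line: condition, score, and tree path."""
    path = " -> ".join(STEP_NAMES[step] for step in spore["depth"])
    condition = " AND ".join(spore["cut"])  # assumption: cuts are conjoined
    return f"{condition} | score={spore['score']} | path: {path}"


model = json.loads(model_text)
lines = [describe_spore(spore) for spore in model["spores"]]
for line in lines:
    print(line)
```

To inspect a trained model instead, replace `json.loads(model_text)` with `json.load(open("iris_model/Iris-setosa/0.json"))`.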

This model says:

"If petal_width ≤ 0.8, predict Setosa (100% confidence)"

"If petal_width > 1.75, NOT Setosa (100% confidence)"

## Step 6: Evaluate the Model

Create eval_settings.yaml:

```yaml
in_folders:
  - iris_model
out_folder: iris_eval
out_file: iris_predictions.csv
```

Run evaluation:

```bash
pilz eval --datacard iris_dc.yaml --evalsettings eval_settings.yaml
```

### Output Files

```mermaid
flowchart TB
    subgraph "iris_eval/"
        A[Iris-setosa_roc.html]
        B[Iris-versicolor_roc.html]
        C[Iris-virginica_roc.html]
        D[all_roc.html]
        E[multi_class_result.html]
        F[iris_predictions.csv]
    end
```

## Step 7: Understanding ROC Curves

Open iris_eval/Iris-setosa_roc.html in your browser. You should see:

```mermaid
flowchart LR
    subgraph FPR
        F1[0.0]
        F2[0.2]
        F3[0.4]
        F4[0.6]
        F5[0.8]
        F6[1.0]
    end

    subgraph TPR
        T1[0.4]
        T2[0.7]
        T3[0.9]
        T4[0.95]
        T5[0.98]
        T6[1.0]
    end

    F1 --> T1
    F2 --> T2
    F3 --> T3
    F4 --> T4
    F5 --> T5
    F6 --> T6
```

The predictions are written to iris_predictions.csv:

```csv
species,Iris-setosa,Iris-setosa_0,Iris-versicolor,Iris-versicolor_0,Iris-virginica,Iris-virginica_0,correct
Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1
Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1
Iris-versicolor,0
```
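An ROC curve pairs a false positive rate (FPR) with a true positive rate (TPR) at each score threshold, which is what the FPR/TPR diagram above depicts. The following sketch shows how such points can be computed from per-sample scores for a single class. The scores and labels are illustrative values, not Pilz output, and `roc_points` and `auc` are hypothetical helpers written for this example rather than the code Pilz uses to produce its HTML reports.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per distinct threshold, from strict to lenient."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Illustrative data: label 1 = Iris-setosa, 0 = any other species.
scores = [1.0, 0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels)
print(points)
print(round(auc(points), 3))
```

A curve that climbs quickly toward TPR = 1.0 while FPR stays low gives an area close to 1; a curve along the diagonal (area near 0.5) means the scores separate the class no better than chance.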

  • real-world-example.md - Try a more complex dataset

  • How Pilz Works - Understand the algorithm

  • Feature Categorization - Learn how features are transformed


Questions? Check the FAQ or Troubleshooting.