# First Project: Iris Classification
In this tutorial, you'll train your first Pilz model to classify Iris flowers into three species: Setosa, Versicolor, and Virginica.
## What You'll Learn
- How to prepare a dataset for Pilz
- What each DataCard field means
- How to interpret training results
- How to read ROC curves
## Prerequisites
- Pilz installed (see Installation)
- Iris dataset downloaded
## Dataset Overview
The Iris dataset is a classic machine learning dataset with:
- 150 samples (50 per class)
- 4 features: sepal length, sepal width, petal length, petal width
- 3 classes: Iris-setosa, Iris-versicolor, Iris-virginica
```mermaid
flowchart TB
    subgraph Features
        F1[sepal_length]
        F2[sepal_width]
        F3[petal_length]
        F4[petal_width]
    end
    subgraph Classes
        C1[Setosa]
        C2[Versicolor]
        C3[Virginica]
    end
    F1 --> C1
    F2 --> C2
    F3 --> C3
    F4 --> C1
```

## Step 1: Prepare the Data
First, ensure your Iris CSV has proper headers:
```bash
# Check current format
head -3 iris.csv
# If the header row uses different names, replace it.
# tail -n +2 skips the existing first line; if your file has no header row,
# use `cat iris.csv` instead so the first sample is not dropped.
echo "sepal_length,sepal_width,petal_length,petal_width,species" > iris_fixed.csv
tail -n +2 iris.csv >> iris_fixed.csv
mv iris_fixed.csv iris.csv
```

The data should look like:

```csv
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
...
```
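The header and class balance can also be checked programmatically. The following is a minimal sketch using only Python's standard library; it is not part of Pilz, and the helper name `summarize_iris` is made up for this example. It reads an in-memory sample here; in practice you would pass an open file handle for `iris.csv`.

```python
import csv
import io
from collections import Counter

EXPECTED_HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def summarize_iris(handle):
    """Return (header_ok, class_counts) for a CSV file-like object."""
    reader = csv.DictReader(handle)
    header_ok = reader.fieldnames == EXPECTED_HEADER
    # .get() keeps the function usable even when the header is wrong
    counts = Counter(row.get("species", "") for row in reader)
    return header_ok, counts

# In-memory stand-in for iris.csv, using the two sample rows shown above
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)
header_ok, counts = summarize_iris(sample)
print(header_ok, dict(counts))
```

On the full dataset, each of the three species should be counted 50 times.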
## Step 2: Create the DataCard
The DataCard tells Pilz about your dataset structure:
Let's examine the generated file:
```yaml
# iris_dc.yaml
features:
  - name: sepal_length
    statistical: numerical  # Continuous values
    type: float             # Decimal numbers
  - name: sepal_width
    statistical: numerical
    type: float
  - name: petal_length
    statistical: numerical
    type: float
  - name: petal_width
    statistical: numerical
    type: float
target:
  feature_name: species  # Column to predict
  values:                # All possible classes
    - Iris-setosa
    - Iris-versicolor
    - Iris-virginica
train_files:
  - iris.csv  # Training data
test_files:
  - iris.csv  # Using same file for demo
```
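A DataCard is only useful if it agrees with the CSV it describes. As a rough illustration, the sketch below cross-checks the two using the standard library only. The DataCard is written as a plain Python dict to avoid a YAML dependency, and `check_datacard` is a hypothetical helper, not a Pilz function.

```python
import csv
import io

# The DataCard from this step as a plain dict (assumption: same structure as the YAML)
datacard = {
    "features": [
        {"name": name, "statistical": "numerical", "type": "float"}
        for name in ("sepal_length", "sepal_width", "petal_length", "petal_width")
    ],
    "target": {
        "feature_name": "species",
        "values": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    },
}

def check_datacard(card, handle):
    """Return a list of problems; an empty list means card and CSV agree."""
    reader = csv.DictReader(handle)
    columns = set(reader.fieldnames or [])
    problems = [
        f"missing feature column: {feature['name']}"
        for feature in card["features"]
        if feature["name"] not in columns
    ]
    target = card["target"]["feature_name"]
    if target not in columns:
        problems.append(f"missing target column: {target}")
        return problems
    allowed = set(card["target"]["values"])
    # Data starts on line 2 because line 1 is the header
    for line_no, row in enumerate(reader, start=2):
        if row[target] not in allowed:
            problems.append(f"line {line_no}: unexpected class {row[target]!r}")
    return problems

sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"  # deliberately wrong label for the demo
)
problems = check_datacard(datacard, sample)
print(problems)
```

A label that is missing from `target.values` shows up as a problem with its line number, which is usually faster to act on than a training failure.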
### Understanding the Fields
| Field | Description | Example |
|-------|-------------|---------|
| `name` | Column header in the CSV | `sepal_length` |
| `statistical` | Statistical type | `numerical` or `categorial` |
| `type` | Python type | `float`, `int`, `string` |
| `target.feature_name` | Column to predict | `species` |
| `target.values` | Possible outcomes | `[Iris-setosa, Iris-versicolor, Iris-virginica]` |

!> Important: Use `categorial` (not `categorical`) for categorical features.
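To make the `statistical` and `type` fields more concrete, here is a small sketch that guesses both from a column's raw string values. The rules are an assumption made for illustration (try `int`, then `float`, otherwise treat the column as `categorial`); Pilz may decide differently, and `infer_field` is a hypothetical helper.

```python
def infer_field(values):
    """Guess DataCard-style `statistical` and `type` entries for one column."""
    def all_parse(cast):
        # True when every value can be converted with `cast`
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return {"statistical": "numerical", "type": "int"}
    if all_parse(float):
        return {"statistical": "numerical", "type": "float"}
    # Note the spelling: the DataCard expects "categorial"
    return {"statistical": "categorial", "type": "string"}

print(infer_field(["5.1", "4.9", "4.7"]))
print(infer_field(["Iris-setosa", "Iris-virginica"]))
```

A heuristic like this treats integer-coded categories as numerical, so review the result by hand for such columns.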
## Step 3: Configure Training
Create train_settings.yaml:
```yaml
# How many trees to build per class
n: 1
# Where to save trained models
out_folder: iris_model
# Maximum tree depth (prevents overfitting)
max_depth: 10
# Feature combinations to try at each split
n_dims: 2
# Categories per feature (bins)
n_cat: 3
```
### Parameters Explained
| Parameter | What it does | Recommended for Iris |
|-----------|--------------|---------------------|
| n | Trees per class | 1 (simple dataset) |
| max_depth | Max splits per tree | 10 (enough for 4 features) |
| n_dims | Features per split | 2 (pairs work well) |
| n_cat | Bins per feature | 3 (few unique values) |
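To get a feel for `n_dims` and `n_cat`, the sketch below enumerates the feature pairs available with `n_dims: 2` and splits one numeric column into `n_cat: 3` bins. Both parts are assumptions for illustration: Pilz's actual search and binning strategy are not described here, equal-frequency binning is just one plausible choice, and the `petal_width` values are made up.

```python
from itertools import combinations

features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
n_dims = 2
n_cat = 3

# Every set of n_dims features that a split could look at
candidate_sets = list(combinations(features, n_dims))
print(len(candidate_sets), candidate_sets[0])

def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 with roughly equal counts per bin."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins

petal_width = [0.2, 0.2, 1.3, 1.5, 2.1, 2.3]  # illustrative values only
print(equal_frequency_bins(petal_width, n_cat))
```

With four features there are six possible pairs, which is why small `n_dims` values stay cheap on Iris. This simple binning can place tied values in different bins, so a production version would need a tie rule.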
## Step 4: Train the Model
Expected output:
```
INFO: Training for target Iris-setosa, tree 0
INFO: Training for target Iris-versicolor, tree 0
INFO: Training for target Iris-virginica, tree 0
INFO: Models saved to iris_model/
```
### What Happened?
```mermaid
flowchart LR
    A[iris.csv] --> B["Filter: class=Iris-setosa"]
    A --> C["Filter: class=Iris-versicolor"]
    A --> D["Filter: class=Iris-virginica"]
    B --> E[Tree 0]
    C --> F[Tree 0]
    D --> G[Tree 0]
    E --> H[iris_model/Iris-setosa/0.json]
    F --> I[iris_model/Iris-versicolor/0.json]
    G --> J[iris_model/Iris-virginica/0.json]
```
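Training one tree per target class resembles a one-vs-rest setup, where each class gets its own binary problem. The sketch below shows that idea in plain Python. Treat it as an assumption about the general technique, not a description of Pilz's internals; `one_vs_rest` is a hypothetical helper.

```python
from collections import Counter

classes = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
# The Iris dataset has 50 samples per class
species = [name for name in classes for _ in range(50)]

def one_vs_rest(labels, positive):
    """Return 1 for samples of the `positive` class and 0 for all others."""
    return [1 if label == positive else 0 for label in labels]

for cls in classes:
    binary = one_vs_rest(species, cls)
    print(cls, Counter(binary))
```

Each class ends up with 50 positive and 100 negative samples, so every per-class tree sees the full dataset with a different target.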
or/0.json
iris_model/Iris-virginica/0.json
Let's examine one model:
```json
{
  "spores": [
    {
      "cut": ["petal_width <= 0.8"],
      "score": 1.0,
      "depth": "l"
    },
    {
      "cut": ["petal_width > 1.75"],
      "score": 1.0,
      "depth": "rr"
    }
  ],
  "target": "Iris-setosa"
}
```
### Reading the Model
Each "spore" is a leaf node in the tree:
- `cut`: SQL WHERE conditions
- `score`: Prediction confidence (0-1)
- `depth`: Tree path (`l` = left, `r` = right, `n` = neutral)
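To see how the `cut` conditions apply to a sample, the sketch below finds the spores whose conditions all hold for one row. It assumes that the conditions in a `cut` list are combined with AND and that each one is a simple `feature operator number` comparison. How Pilz turns matching spores into a final prediction is not covered here, and the helper names are hypothetical.

```python
import json
import operator
import re

MODEL_JSON = """
{
  "spores": [
    {"cut": ["petal_width <= 0.8"], "score": 1.0, "depth": "l"},
    {"cut": ["petal_width > 1.75"], "score": 1.0, "depth": "rr"}
  ],
  "target": "Iris-setosa"
}
"""

OPS = {"<=": operator.le, ">=": operator.ge, "<": operator.lt, ">": operator.gt}
# Two-character operators come first so "<=" is not read as "<"
CONDITION = re.compile(r"^\s*(\w+)\s*(<=|>=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")

def condition_holds(condition, row):
    """Evaluate one 'feature op number' condition against a row dict."""
    match = CONDITION.match(condition)
    if match is None:
        raise ValueError(f"unsupported condition: {condition!r}")
    feature, op, threshold = match.groups()
    return OPS[op](row[feature], float(threshold))

def matching_spores(model, row):
    """Return the spores whose every `cut` condition is true for `row`."""
    return [
        spore for spore in model["spores"]
        if all(condition_holds(cond, row) for cond in spore["cut"])
    ]

model = json.loads(MODEL_JSON)
row = {"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2}
print([spore["depth"] for spore in matching_spores(model, row)])
```

A row with `petal_width` between 0.8 and 1.75 matches neither of the two spores shown.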
This model says:

- "If petal_width ≤ 0.8, predict Setosa (100% confidence)"
- "If petal_width > 1.75, NOT Setosa (100% confidence)"
## Step 6: Evaluate the Model
Create eval_settings.yaml:
Run evaluation:
### Output Files
```mermaid
flowchart TB
    subgraph "iris_eval/"
        A[Iris-setosa_roc.html]
        B[Iris-versicolor_roc.html]
        C[Iris-virginica_roc.html]
        D[all_roc.html]
        E[multi_class_result.html]
        F[iris_predictions.csv]
    end
```
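The predictions CSV lends itself to a quick accuracy check. The sketch below assumes the file has a `correct` column holding `1` for a correct prediction and `0` otherwise, as the sample header in Step 7 suggests. The third row of the demo data is made up, and `accuracy_from_predictions` is a hypothetical helper.

```python
import csv
import io

def accuracy_from_predictions(handle):
    """Return the share of rows whose `correct` column equals 1."""
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no prediction rows found")
    return sum(int(row["correct"]) for row in rows) / len(rows)

# In-memory stand-in for iris_eval/iris_predictions.csv
sample = io.StringIO(
    "species,Iris-setosa,Iris-setosa_0,Iris-versicolor,Iris-versicolor_0,"
    "Iris-virginica,Iris-virginica_0,correct\n"
    "Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1\n"
    "Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1\n"
    "Iris-versicolor,0.0,0.0,0.0,0.0,1.0,1.0,0\n"  # made-up misclassified row
)
accuracy = accuracy_from_predictions(sample)
print(f"accuracy: {accuracy:.2f}")
```

Because this tutorial evaluates on the training file, the resulting number will be optimistic compared with a held-out test set.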
## Step 7: Understanding ROC Curves
Open iris_eval/Iris-setosa_roc.html in your browser. You should see:
```mermaid
flowchart LR
    subgraph FPR
        F1[0.0]
        F2[0.2]
        F3[0.4]
        F4[0.6]
        F5[0.8]
        F6[1.0]
    end
    subgraph TPR
        T1[0.4]
        T2[0.7]
        T3[0.9]
        T4[0.95]
        T5[0.98]
        T6[1.0]
    end
    F1 --> T1
    F2 --> T2
    F3 --> T3
    F4 --> T4
    F5 --> T5
    F6 --> T6
```

The evaluation also writes per-sample predictions to `iris_predictions.csv`, for example:

```csv
species,Iris-setosa,Iris-setosa_0,Iris-versicolor,Iris-versicolor_0,Iris-virginica,Iris-virginica_0,correct
Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1
Iris-setosa,1.0,1.0,0.0,0.0,0.0,0.0,1
Iris-versicolor,0
```
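An ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) as the decision threshold is lowered. The sketch below is a plain-Python illustration, not Pilz code. It builds ROC points from scores and binary labels, then estimates the area under a curve with the trapezoid rule, using the FPR/TPR pairs from the ROC diagram above as example input. It assumes at least one positive and one negative label.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the given (fpr, tpr) points using the trapezoid rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))

# FPR/TPR pairs as listed in the diagram
diagram = [(0.0, 0.4), (0.2, 0.7), (0.4, 0.9), (0.6, 0.95), (0.8, 0.98), (1.0, 1.0)]
print(round(auc(diagram), 3))

# A perfectly separated toy example: positives score higher than negatives
perfect = roc_points([1.0, 0.9, 0.4, 0.1], [1, 1, 0, 0])
print(auc(perfect))
```

The listed points give an area of about 0.846, while a perfectly separated class, which a score of 1.0 on every Setosa row would suggest, gives 1.0. An area near 0.5 would mean the scores are no better than chance.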
- [real-world-example.md](real-world-example.md) - Try a more complex dataset
- **How Pilz Works** - Understand the algorithm
- **Feature Categorization** - Learn how features are transformed
Questions? Check the FAQ or Troubleshooting.