Benchmarking different CF explanation methods
In this notebook, we compare the runtimes of different model-agnostic explanation methods. Currently, we support three model-agnostic explanation methods:
1. Random sampling
2. Genetic algorithm
3. Querying a KD tree
[1]:
import numpy as np
import timeit
import random
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import dice_ml
from dice_ml.utils import helpers # helper functions
from dice_ml import Dice
[2]:
%load_ext autoreload
%autoreload 2
Loading dataset
We use the “adult” income dataset from UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/adult). For demonstration purposes, we transform the data as described in dice_ml.utils.helpers module.
[3]:
dataset = helpers.load_adult_income_dataset()
[4]:
dataset.head()
[4]:
| | age | workclass | education | marital_status | occupation | race | gender | hours_per_week | income |
---|---|---|---|---|---|---|---|---|---|
0 | 28 | Private | Bachelors | Single | White-Collar | White | Female | 60 | 0 |
1 | 30 | Self-Employed | Assoc | Married | Professional | White | Male | 65 | 1 |
2 | 32 | Private | Some-college | Married | White-Collar | White | Male | 50 | 0 |
3 | 20 | Private | Some-college | Single | Service | White | Female | 35 | 0 |
4 | 41 | Self-Employed | Some-college | Married | White-Collar | White | Male | 50 | 0 |
[5]:
d = dice_ml.Data(dataframe=dataset,
                 continuous_features=['age', 'hours_per_week'],
                 outcome_name='income')
Training the ML model
Currently, the genetic algorithm and KD tree methods work only with scikit-learn models. Support for TensorFlow 1 and 2 and for PyTorch will be implemented soon.
[6]:
target = dataset["income"]
# Split data into train and test
datasetX = dataset.drop("income", axis=1)
x_train, x_test, y_train, y_test = train_test_split(datasetX,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=target)
numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
# We create the preprocessing pipelines for both numeric and categorical data.
numeric_transformer = Pipeline(
    steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(
    steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numerical),
        ('cat', categorical_transformer, categorical)])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)
[7]:
m = dice_ml.Model(model=model, backend="sklearn")
Initialize counterfactual generation methods
We now initialize all three counterfactual generation methods.
[8]:
exp_random = Dice(d, m, method="random")
[9]:
exp_genetic = Dice(d, m, method="genetic")
[10]:
exp_KD = Dice(d, m, method="kdtree")
[11]:
# Copy the slice so that later in-place edits do not trigger SettingWithCopyWarning
query_instances = x_train[4:7].copy()
[12]:
query_instances
[12]:
| | age | workclass | education | marital_status | occupation | race | gender | hours_per_week |
---|---|---|---|---|---|---|---|---|
9608 | 27 | Private | School | Single | Blue-Collar | White | Male | 40 |
22027 | 31 | Self-Employed | Some-college | Married | Sales | Other | Male | 60 |
14296 | 26 | Private | HS-grad | Married | Blue-Collar | White | Male | 50 |
Generate Counterfactuals
We now generate counterfactuals for desired_class=0 with all three methods and measure the runtime of each. You can modify the number of loops (num_loops) and the number of diverse counterfactuals to generate (k).
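Before each timed round, the benchmark cell re-draws every feature of the query instances: categorical features are sampled from the values observed in the dataset, and continuous features are drawn uniformly between the observed minimum and maximum. The following is a minimal standard-library sketch of that idea, not the notebook's code. The helper name `randomize_query`, the dict-based rows, and the seeded generator are assumptions made for illustration.

```python
import random


def randomize_query(query, reference_rows, categorical_features, rng=random):
    """Return a new query dict with every feature re-drawn from the reference data."""
    randomized = {}
    for feature in query:
        observed = [row[feature] for row in reference_rows]
        if feature in categorical_features:
            # Categorical: pick one observed value (frequent values are picked more often).
            randomized[feature] = rng.choice(observed)
        else:
            # Continuous: draw uniformly between the observed minimum and maximum.
            randomized[feature] = rng.uniform(min(observed), max(observed))
    return randomized


# Hypothetical miniature stand-in for the adult income dataset.
reference_rows = [
    {"age": 28, "workclass": "Private", "hours_per_week": 60},
    {"age": 30, "workclass": "Self-Employed", "hours_per_week": 65},
    {"age": 20, "workclass": "Private", "hours_per_week": 35},
]
query = {"age": 27, "workclass": "Private", "hours_per_week": 40}

rng = random.Random(0)  # seeded so that repeated runs draw the same values
new_query = randomize_query(query, reference_rows, {"workclass"}, rng)
print(new_query)
```

Because the helper returns a new dict instead of modifying `query`, the original instance stays available for comparison. Note that uniform draws produce non-integer values even for integer-valued features such as age.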
[13]:
num_loops = 2
k = 2
[14]:
elapsed_random = 0
elapsed_kd = 0
elapsed_genetic = 0
for _ in range(num_loops):
    # Randomize every feature of the query instances before each round
    for q in query_instances:
        if q in d.categorical_feature_names:
            # Categorical feature: sample one observed value per row
            query_instances.loc[:, q] = \
                [random.choice(dataset[q].values) for _ in query_instances.index]
        else:
            # Continuous feature: sample uniformly within the observed range
            query_instances.loc[:, q] = \
                [np.random.uniform(dataset[q].min(), dataset[q].max()) for _ in query_instances.index]
    # Time each method on the same randomized query instances
    start_time = timeit.default_timer()
    dice_exp_random = exp_random.generate_counterfactuals(query_instances, total_CFs=k,
                                                          desired_class=0, verbose=False)
    elapsed_random += timeit.default_timer() - start_time
    start_time = timeit.default_timer()
    dice_exp = exp_genetic.generate_counterfactuals(query_instances, total_CFs=k, desired_class=0,
                                                    yloss_type="hinge_loss", verbose=False)
    elapsed_genetic += timeit.default_timer() - start_time
    start_time = timeit.default_timer()
    dice_kd = exp_KD.generate_counterfactuals(query_instances, total_CFs=k, desired_class=0,
                                              verbose=False)
    elapsed_kd += timeit.default_timer() - start_time
# Report the accumulated time per method as whole minutes and seconds
m_random, s_random = divmod(elapsed_random, 60)
print('For Independent random sampling of features: Total time taken to generate %d' % num_loops,
      'sets of %d' % k, 'counterfactuals each: %02d' % m_random, 'min %02d' % s_random, 'sec')
m_kd, s_kd = divmod(elapsed_kd, 60)
print('For querying from a KD tree: Total time taken to generate %d' % num_loops,
      'sets of %d' % k, 'counterfactuals each: %02d' % m_kd, 'min %02d' % s_kd, 'sec')
m_genetic, s_genetic = divmod(elapsed_genetic, 60)
print('For genetic algorithm: Total time taken to generate %d' % num_loops,
      'sets of %d' % k, 'counterfactuals each: %02d' % m_genetic, 'min %02d' % s_genetic, 'sec')
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 2.58it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.13s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.05s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 2.55it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.06s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:02<00:00, 1.11it/s]
For Independent random sampling of features: Total time taken to generate 2 sets of 2 counterfactuals each: 00 min 02 sec
For querying from a KD tree: Total time taken to generate 2 sets of 2 counterfactuals each: 00 min 05 sec
For genetic algorithm: Total time taken to generate 2 sets of 2 counterfactuals each: 00 min 06 sec
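The benchmark cell repeats the same start/stop timing pattern for each method and prints whole seconds only, so a run of 2.9 s is reported as 02 sec. A small helper can remove the repetition and report fractional seconds alongside the rounded-down value. This is a sketch under stated assumptions: `time_call`, `format_elapsed`, and the stand-in workloads are hypothetical names introduced here, and in the notebook the workloads would be the three `generate_counterfactuals` calls.

```python
import timeit


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = timeit.default_timer()
    result = func(*args, **kwargs)
    return result, timeit.default_timer() - start


def format_elapsed(seconds):
    """Format seconds as 'MM min SS sec'; '%02d' drops the fractional part."""
    minutes, secs = divmod(seconds, 60)
    return '%02d min %02d sec' % (minutes, secs)


# Stand-in workloads (assumption): replace with the explainer calls to benchmark.
workloads = {
    "fast": lambda: sum(range(1_000)),
    "slow": lambda: sum(range(100_000)),
}

num_rounds = 2
elapsed = {name: 0.0 for name in workloads}
for _ in range(num_rounds):
    for name, work in workloads.items():
        _, seconds = time_call(work)
        elapsed[name] += seconds

for name, seconds in elapsed.items():
    # Print the raw float as well, since methods may differ by less than a second.
    print('%s: %s (%.3f s total)' % (name, format_elapsed(seconds), seconds))
```

With only two rounds and three query instances, differences of about a second between methods fall within run-to-run variation, so increasing `num_loops` gives a more stable comparison.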