Mozilla is planning to invest more substantially in privacy-preserving machine learning models for applications such as recommending personalized content and detecting malicious behaviour. As such solutions move towards production, it is essential for us to have confidence in the selection of the model and its parameters for a particular dataset, as well as an accurate view into how it will perform in new instances or as the training data evolves.
While the literature contains a broad array of models, evaluation techniques, and metrics, their choice in practice is often guided by convention or convenience, and their sensitivity to different datasets is not always well-understood. Additionally, to our knowledge there is no existing software tool that provides a comprehensive report on the performance of a given model under consideration.
The eventual goal of this project is to build a standard set of tools that Mozilla can use to evaluate the performance of machine learning models in various contexts on the basis of the following principles:
Importing libraries
# Basic Computations
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format='retina'
# Ml Models
from sklearn import metrics
from sklearn.model_selection import train_test_split
# Dynamic Markdowns
from IPython.display import Markdown as md
# Adding module's sub-directory to Python's path
import os
import sys
sys.path.insert(0, os.path.abspath('../elie_wanko/modules'))
import helpers, knn, logreg
Attribute Information:
This research employs a binary variable, C24: default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:
C6 - C11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows:
The measurement scale for the repayment status is:
- -1 = pay duly;
- 1 = payment delay for one month;
- 2 = payment delay for two months;
- . . .;
- 8 = payment delay for eight months;
- 9 = payment delay for nine months and above.
Data Preview
There are 30,000 rows and 25 columns in our DataFrame, all non-null and of numerical type (int64). The sample table below shows the first five rows of our data set, and the next shows some basic information about every column (column index, column name, non-null count, and data type).
df_data = pd.read_csv("../../datasets/defaults.csv")
df_data.head()
df_data.describe()
df_data.info()
Data Cleaning
# Drop column 'id'
df_data = df_data.drop(columns='id')
# Rename C6 from pay_0 to pay_1 for consistency
df_data = df_data.rename(columns={"pay_0": "pay_1"})
df_data.to_csv("defaults_data.csv", index=False)
Observations:
The main feature of the data set is "C24: defaulted", which tells us whether a customer defaulted or not. Variables C1 to C23 will help support our investigation into this feature of interest. Some basic statistical details and questions we can pose include:
Next, we further investigate using various distributions to discover more insights and find out if our assumptions are true.
df_data.describe()
In this section, we investigate distributions of individual variables. We note any unusual points or outliers, clean things up and prepare to look at relationships between variables.
helpers.univ_bar(data=df_data, column='sex', x_title='sex', hue='defaulted',
var_names=['Male', 'Female'], title="Distribution of men and women that defaulted.")
(1 = male; 2 = female)
We can clearly see that women appear more at risk of defaulting than men. Despite this, the ratio of non-defaulters to defaulters in each category is about 7.8:2.2. We will need to investigate further to find the reasons behind this.
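To put a number on this observation, a quick cross-tabulation of the default rate within each sex category can be computed from the same DataFrame (a minimal check reusing df_data and the column names above):
# Share of defaulters within each sex category (1 = male, 2 = female); each row sums to 1
pd.crosstab(df_data['sex'], df_data['defaulted'], normalize='index')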
helpers.univ_bar(data=df_data, column='marriage', x_title='marriage', hue='defaulted', fig_w=15,
title="Distribution of clients that defaulted based on their marital status.")
(1 = married; 2 = single; 3 = others)
From the plot, clients that are single are more at risk of defaulting, closely followed by married clients. We could assume this difference arises because married clients support each other financially in the long run.
helpers.univ_bar(data=df_data, column='education', x_title='education', hue='defaulted', fig_w=15,
title="Distribution of clients that defaulted based on their level of education.")
(1 = graduate school; 2 = university; 3 = high school; 4 = others)
Our graph shows that clients in university, closely followed by those in graduate school, are more likely to default. This could be caused by accumulated loan payments to cover school fees.
helpers.univ_hist(data=df_data, column='age', fig_w=20, bins=20, title="Distribution of clients' age.")
With a right-skewed distribution, we can see that clients in their late 20s and early 30s are most at risk of defaulting, which makes sense, as a similar pattern was observed in the previous graph depicting the default rate per level of education.
helpers.univ_hist(data=df_data, column='limit_bal', fig_w=20, bins=20, title="Distribution of clients' account balance.")
Here the graph shows how defaults are distributed across clients' credit limits. However, it tells us little about who exactly the defaulters are and why they defaulted. To investigate the reasons behind this further, we will explore a few bivariate distributions.
In this section, we investigate relationships between pairs of major variables ("limit_bal", "sex", "education", "marriage", "age") and the financial status ("pay_1", "bill_amt1", "pay_amt1") of clients' accounts in the month of September in our data.
sns.pairplot(df_data, kind='reg', hue="defaulted", vars=['limit_bal', 'sex', 'education', 'marriage', 'age'], plot_kws={'scatter_kws': {'alpha': 0.5}})
sns.pairplot(df_data, kind='reg', hue="defaulted", vars=['pay_1', 'pay_2', 'pay_3', 'pay_4', 'pay_5', 'pay_6',
'bill_amt1', 'bill_amt2', 'bill_amt3', 'bill_amt4', 'bill_amt5', 'bill_amt6',
'pay_amt1', 'pay_amt2', 'pay_amt3', 'pay_amt4', 'pay_amt5', 'pay_amt6'],
plot_kws={'scatter_kws': {'alpha': 0.5}})
We can observe a similar trend across the following pairwise features: history of past payment (pay_*), amount of bill statement (bill_amt*), and amount of previous payment (pay_amt*). This is an indication that we could implement Principal Component Analysis (PCA) as a dimensionality-reduction technique and uncover the variance-covariance structure of this set of variables through linear combinations. PCA would help us improve algorithm performance, reduce overfitting, and aid visualization and understanding of the data in high dimensions. However, caution is needed: reducing the dimensionality too far can make the independent variables less interpretable and result in information loss. Moreover, we must standardize our data before applying PCA; otherwise PCA will not be able to find the optimal principal components.
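As a rough sketch of that idea, the repayment-history, bill-amount, and payment-amount columns could be standardized and passed through a PCA to see how few components capture most of the variance. This is illustrative only and is not used in the models below; the column selection assumes the pay_*, bill_amt*, and pay_amt* names seen in the pairplots above:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Select the 18 repayment-related columns (pay_1-6, pay_amt1-6, bill_amt1-6)
repay_cols = [c for c in df_data.columns if c.startswith(('pay_', 'bill_amt'))]

# Standardize first, otherwise the large bill/payment amounts dominate the components
scaled = StandardScaler().fit_transform(df_data[repay_cols])

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(scaled)
print(components.shape, pca.explained_variance_ratio_.round(3))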
We need to prepare the data in the correct input format for the sklearn models under consideration. As such, we split our data into independent and dependent (target) attributes, and further into training and test subsets. The resulting subsets tuple below consists of four variables, in the order returned by train_test_split: the training features (passed to .fit), the test features (passed to .predict), and the corresponding training and test targets (see the short unpacking sketch after the split below).
# Split independent and target features
independ_attrs, target_attrs = helpers.independ_target_attr_split(df_data)
# Split train and test data subsets
subsets = train_test_split(independ_attrs, target_attrs, test_size=0.3, random_state=1)
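For reference, train_test_split returns its pieces in a fixed order, so the subsets tuple can be unpacked as follows (the variable names here are just illustrative):
# Order returned by train_test_split: training features, test features, training targets, test targets
x_train, x_test, y_train, y_test = subsets
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)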
Now we investigate two of the most popular methods used in machine learning, which can be distinguished by their methodology for analysis (lazy and eager learning).
The K-Nearest Neighbors (KNN) method is the most popular lazy algorithm. The idea is to build a classification model only once a test instance is received, and this model learns only a selection of training patterns, namely those most relevant to the test instance. When a query instance is received, a set of similar patterns is retrieved from the available training set and used to classify the new instance. To select these similar patterns, a distance measure is used that gives nearby points higher relevance. Lazy methods generally work by selecting the k input patterns nearest to the query point in terms of Euclidean distance; the classification or prediction of the new instance is then based on the selected patterns.
Logistic Regression: Most machine learning algorithms are eager methods in the sense that a model is generated from the complete training data set and afterwards used to generalize to new test instances. Logistic regression is a statistical model that, in its basic form, uses a logistic function to model a binary dependent variable. Although many more complex extensions exist, it is one of the most suitable approaches to binary classification in financial services.
After a series of trials and errors, the following choice of hyperparameters seems to give the best possible results.
The fitted classifier then makes predictions for the instances in x_test. However, this alone doesn't provide suitable data to evaluate our metrics, so we include a threshold parameter on the probability of defaulting in our model. The idea here is simple: any client whose probability of defaulting is above the threshold is predicted to default, and vice versa. On our particular data set, the best results were obtained at a probability threshold of 0.35.
# Classifier
knn_pred, knn_true = knn.classfier(
subsets=subsets,
n_neighbors = 7,
threshold = 0.35)
# Metrics
print("accuracy_score : {:.4f}".format(metrics.accuracy_score(knn_true, knn_pred)))
print("precision_score : {:.4f}".format(metrics.precision_score(knn_true, knn_pred)))
print("recall_score : {:.4f}".format(metrics.recall_score(knn_true, knn_pred)))
print("f1_score : {:.4f}".format(metrics.f1_score(knn_true, knn_pred)))
helpers.confusion_matrix(true=knn_true, pred=knn_pred)
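For clarity, the thresholding described above presumably amounts to something like the following sketch built on scikit-learn's KNeighborsClassifier; this is an assumption about what knn.classfier does internally, not its actual implementation:
from sklearn.neighbors import KNeighborsClassifier

x_train, x_test, y_train, y_test = subsets

# Fit a 7-nearest-neighbour model and take the predicted probability of default for each test client
knn_model = KNeighborsClassifier(n_neighbors=7).fit(x_train, y_train)
proba_default = knn_model.predict_proba(x_test)[:, 1]

# Apply the 0.35 threshold: above it we predict a default (1), otherwise not (0)
knn_pred_sketch = (proba_default >= 0.35).astype(int)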
tol: The tolerance value of 0.000014 seems to produce the best results before our training overfits.
# Classifier
lg_pred, lg_true = logreg.classfier(
tol=0.000014,
subsets=subsets,
data=df_data,
solver='liblinear')
# Metrics
print("accuracy_score : {:.4f}".format(metrics.accuracy_score(lg_true, lg_pred)))
print("precision_score : {:.4f}".format(metrics.precision_score(lg_true, lg_pred)))
print("recall_score : {:.4f}".format(metrics.recall_score(lg_true, lg_pred)))
print("f1_score : {:.4f}".format(metrics.f1_score(lg_true, lg_pred)))
helpers.confusion_matrix(true=lg_true, pred=lg_pred)
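Beyond the raw metrics, logistic regression also lets us inspect which features carry the most weight. The sketch below refits a plain scikit-learn LogisticRegression on standardized training features so that coefficient magnitudes are comparable; it assumes x_train is a DataFrame and is not the output of logreg.classfier:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

x_train, x_test, y_train, y_test = subsets

# Standardize so coefficient magnitudes are comparable across features
scaler = StandardScaler().fit(x_train)
lr = LogisticRegression(solver='liblinear').fit(scaler.transform(x_train), y_train)

# Larger absolute coefficients correspond to features with more influence on the log-odds of default
coef = pd.Series(lr.coef_[0], index=x_train.columns).sort_values(key=abs, ascending=False)
print(coef.head(10))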
After this preliminary analysis, we are going to investigate further with the logistic regression (eager learning) model, as it not only gives the best results but can also better explain which features weigh more heavily in our predictions. The choice of hyperparameters was based on the following:
To further improve our model, we could also consider using the Weight of Evidence (WoE) method, which can give us more control over the features we use during training. WoE consists of two steps:
The WoE transformation has the following advantages:
Nonetheless, caution needs to be taken when implementing WoE, since this method also has some drawbacks:
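As a concrete illustration of the WoE idea, each feature is binned and every bin is replaced by ln(share of non-defaulters in the bin / share of defaulters in the bin). Below is a minimal sketch for the age column; the choice of five bins is arbitrary and only for illustration:
# Weight of Evidence per age bin: WoE = ln(distribution of non-defaulters / distribution of defaulters)
age_bins = pd.cut(df_data['age'], bins=5)
counts = pd.crosstab(age_bins, df_data['defaulted'])
dist_good = counts[0] / counts[0].sum()   # share of all non-defaulters falling in each bin
dist_bad = counts[1] / counts[1].sum()    # share of all defaulters falling in each bin
woe = np.log(dist_good / dist_bad)
print(woe)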