- Know how to apply Random Forests to classification tasks
- Understand the performance metrics associated with binary classification
- Gain an introduction to fast.ai's data preprocessing functions
This lesson is inspired by the following textbooks and online courses:
- Chapter 3 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
- Chapter 7 of Data Science for Business by Provost and Fawcett
- Lessons 1 - 4 of Jeremy Howard's fantastic online course Introduction to Machine Learning for Coders
You may also find the following blog post useful:
Figure reference: https://s16353.pcdn.co/wp-content/uploads/2018/06/Churn.png
We will explore IBM's telecommunications dataset and determine which attributes are most informative for predicting whether customers will leave (also known as customer churn). As described by IBM, the problem setting is as follows:
A telecommunications company is concerned about the number of customers leaving their landline business for cable competitors. They need to understand who is leaving. Imagine that you’re an analyst at this company and you have to find out who is leaving and why.
The kind of questions we'd like to find answers to are: Which customers are likely to leave? Which attributes influence customers who leave?
As noted above, in this lesson we will analyse IBM's customer churn dataset:
churn.csv
The dataset includes information about:
- Customers who left within the last month – the column is called Churn
- Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
- Customer account information – how long they’ve been a customer (tenure), contract, payment method, paperless billing, monthly charges, and total charges
- Demographic info about customers – gender, whether they're a senior citizen or not, and if they have partners and dependents
%load_ext autoreload
# reload all modules every time before executing Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# !pip install dslectures --upgrade
import pandas as pd
import numpy as np
from dslectures.core import (
get_dataset,
display_large,
convert_strings_to_categories,
rf_feature_importance,
plot_feature_importance,
plot_dendogram,
)
from dslectures.structured import proc_df
from pathlib import Path
# data viz
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
from sklearn.tree import plot_tree
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
# ml magic
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
import scipy
from scipy.cluster import hierarchy as hc
get_dataset("churn.csv")
We also make use of the pathlib library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
churn_data = pd.read_csv(DATA / "churn.csv")
Sometimes you will find that the dataset has too many columns to be displayed with the standard DataFrame.head() method, which just shows ... for the intermediate columns:
churn_data.head()
To fix this we can configure pandas' display options, which we have wrapped inside a simple helper function:
display_large(churn_data.head())
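For reference, display_large comes from the dslectures package; a minimal version of such a helper might simply widen pandas' display limits, roughly like the sketch below (an assumption about the implementation for illustration, not the package's actual code):

from IPython.display import display

def display_large_sketch(df, max_rows=1000, max_columns=1000):
    """Display a DataFrame without pandas truncating rows or columns."""
    with pd.option_context("display.max_rows", max_rows, "display.max_columns", max_columns):
        display(df)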
Alternatively, you can take the transpose to see all the columns more easily:
churn_data.head().T
len(churn_data)
churn_data.shape
In this case, we see that we have 7043 customers and 21 variables or attributes that describe their telecom subscription. Let's have a look at the columns:
churn_data.columns
Our target variable is the Churn column, and thus we have a classification problem (rather than regression) because the target is a category (Yes or No) rather than a continuous number. Whenever we see an ID column like customerID, it is useful to perform a sanity check that each value is unique. Otherwise it may be possible that you have duplicates in your data that can bias your models and hence your conclusions.
churn_data["customerID"].nunique()
Good! The number of unique IDs matches the number of rows in our DataFrame.
churn_data.dtypes
Hmm, TotalCharges is of type object (i.e. string) even though it is clearly a float. Since null values or NaNs don't produce this behaviour, there are presumably empty strings lurking in this column. Let's test this hypothesis using DataFrame.value_counts():
churn_data["TotalCharges"].value_counts()
We will deal with these empty strings in the preprocessing steps below.
Recall that in our housing analysis we needed to perform four main steps to bring our DataFrame into a form suitable for training a Random Forest:
- Convert strings to categorical data type
- Fill missing values
- Numericalise the DataFrame and create a features matrix $X$ and target vector $y$
- Create train and validation sets
Let's perform each of those steps below.
First we convert all the string columns to pandas' categorical data type:
convert_strings_to_categories(churn_data)
churn_data.dtypes
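Like display_large, convert_strings_to_categories is a helper from dslectures; a minimal version might loop over the object (string) columns and cast them in place, roughly as sketched below (an assumption about the implementation, shown for illustration):

def convert_strings_to_categories_sketch(df):
    """Convert every string (object) column of df to pandas' category dtype, in place."""
    for col in df.columns:
        if pd.api.types.is_object_dtype(df[col]):
            df[col] = df[col].astype("category")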
This is almost correct, although a closer look at SeniorCitizen reveals that it refers to a binary feature and thus should also be categorical:
churn_data["SeniorCitizen"].unique()
We can fix this easily by simply changing the data type:
churn_data["SeniorCitizen"] = churn_data["SeniorCitizen"].astype("category")
churn_data.dtypes
A quick way to test for missing values is to apply the isna method from pandas and calculate the fraction of missing values in each column of our DataFrame:
(churn_data.isna().sum() / len(churn_data)).sort_values(ascending=False)
In this case, it looks like we're lucky and have a pre-cleaned dataset!
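Had there been missing values, a common strategy for numerical columns is to fill them with the column median (shown below purely as a hypothetical illustration, since our dataset has none):

def fill_missing_with_median(df, column):
    """Replace missing values in a numerical column with that column's median."""
    df[column] = df[column].fillna(df[column].median())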
Now that we have done some basic preprocessing, the final step is to numericalise the pandas.DataFrame and create the feature matrix $X$ and target vector $y$. In previous lessons we created some functions to automate these steps. Below we use fast.ai's utility function proc_df to perform all of these steps in a single call:
X, y, nas = proc_df(churn_data, "Churn")
display_large(X.head())
X.shape, y.shape
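For intuition, proc_df roughly replaces each categorical column by its integer codes and splits off the target; a simplified sketch of these steps is shown below (our assumption of the main idea, not fast.ai's actual source, which also handles missing values and returns the nas dictionary):

def numericalise_sketch(df, target_col):
    """Replace category columns by their integer codes and split off the target vector."""
    df = df.copy()
    y = df.pop(target_col).cat.codes  # target as integers, e.g. No=0, Yes=1
    for col in df.columns:
        if str(df[col].dtype) == "category":
            # +1 so that missing values (encoded as -1) become 0
            df[col] = df[col].cat.codes + 1
    return df, y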
For future use we can save our processed quantities:
churn_processed = X.join(pd.Series(y, name="Churn"))
churn_processed.to_csv(DATA / "churn_processed.csv", index=False)
X_train, X_valid, y_train, y_valid = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"{len(X_train)} train rows + {len(X_valid)} valid rows")
Evaluating a classifier is often significantly trickier than evaluating a regressor. One way to do this is to compare the accuracy of each classifier, where
$$ \mbox{accuracy} = \frac{\mbox{Number of correct decisions made}}{\mbox{Total number of decisions made}} $$
In general, however, accuracy is not the preferred performance measure for classifiers, especially when you are dealing with skewed datasets (i.e. when some classes are much more frequent than others). For our churn example, suppose we build a model that achieves 75% accuracy. Is this any good? Let's have a look at the distribution of churn in the data:
sns.countplot(x="Churn", data=churn_data)
plt.show()
churn_data["Churn"].value_counts(normalize=True)
From the plot and numbers we see that the "No Churn" and "Churn" classes appear in approximately a 3:1 ratio. If we built a dumb classifier that just classifies every single customer as "No Churn", then we would be right about 73.5% of the time! In practice skews of 99:1 are common, for which a report of 99% accuracy is somewhat meaningless.
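We can sanity-check this baseline with scikit-learn's DummyClassifier, which always predicts the majority class (a quick check that is not part of the original analysis):

from sklearn.dummy import DummyClassifier

# a "model" that always predicts the most frequent class, i.e. "No Churn"
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy.score(X_valid, y_valid)  # accuracy of the majority-class baseline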
A much better way to evaluate the performance of a classifier is to look at the confusion matrix. Recall that a confusion matrix for a problem involving $n$ classes is an $n\times n$ matrix with the rows labelled by the actual classes and the columns with the predicted classes. Our churn example is a two-class problem ("Churn" vs "No Churn"), so the confusion matrix is $2\times 2$.
If we denote the true classes as $\mathbf{p}$(positive) and $\mathbf{n}$(egative), and the classes predicted by the model as $\mathbf{Y}$(es) and $\mathbf{N}$(o) then the confusion matrix has the form:
| | N | Y |
|---|---|---|
| n | True negatives | False positives |
| p | False negatives | True positives |
The main diagonal contains the counts of correct decisions. The errors of the classifier are the false negatives (positives classified as negative) and false positives (negatives classified as positive).
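Since we already imported confusion_matrix above, here is a small toy example that shows the layout scikit-learn produces, with actual classes along the rows and predicted classes along the columns:

# toy example: 0 = negative ("No Churn"), 1 = positive ("Churn")
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1]
confusion_matrix(y_true, y_pred)
# returns array([[2, 1],   <- [TN, FP]
#                [1, 2]])  <- [FN, TP]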
You should know
In the Data Science for Business textbook, the confusion matrix is given a different layout, namely the columns are labelled by the actual classes and the rows by the predicted classes:
| | p | n |
|---|---|---|
| Y | True positives | False positives |
| N | False negatives | True negatives |
The layout adopted above and in this notebook is the one produced by scikit-learn, since we'd like to make use of the in-built functions included in that library.
Let's begin by creating a baseline Random Forest model to build upon. First we need a scoring function for classifiers, similar to our $R^2$ and RMSE function for regression:
def print_scores(fitted_model):
    # report accuracy and ROC AUC on both the training and validation sets
    res = {
        "Accuracy on train:": accuracy_score(y_train, fitted_model.predict(X_train)),
        "ROC AUC on train:": roc_auc_score(
            y_train, fitted_model.predict_proba(X_train)[:, 1]
        ),
        "Accuracy on valid:": accuracy_score(y_valid, fitted_model.predict(X_valid)),
        "ROC AUC on valid:": roc_auc_score(
            y_valid, fitted_model.predict_proba(X_valid)[:, 1]
        ),
    }
    # include the out-of-bag score if the model was trained with oob_score=True
    if hasattr(fitted_model, "oob_score_"):
        res["OOB accuracy:"] = fitted_model.oob_score_
    for k, v in res.items():
        print(k, round(v, 3))
model = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print_scores(model)
ROC Curves and AUC
Note that in addition to accuracy, we also show a second value, the Area Under the ROC Curve (AUC). The AUC is a good summary statistic of the predictiveness of a binary classifier. It varies from zero to one. A value of 0.5 corresponds to randomness (the classifier cannot distinguish at all between "churn" and "no churn") and a value of 1.0 means that it is perfect.
The "ROC" refers to the Receiver Operating Characteristic (ROC) curve which plots the true positive rate
$$ \mbox{TPR} = \frac{\mbox{TP}}{\mbox{TP} + \mbox{FN}} \,, \qquad \mbox{TP (FN)} = \mbox{number of true positives (false negatives)}\,,$$
against the false positive rate (FPR), where the FPR is the fraction of negative instances that are incorrectly classified as positive. In general there is a tradeoff between these two quantities: the higher the TPR, the more false positives the classifier tends to produce. A good classifier stays as close to the top-left corner of the ROC curve plot as possible.
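For reference, the points on the curve can also be computed directly with scikit-learn's roc_curve, which returns the FPR and TPR at every probability threshold (a small illustration using our fitted baseline model):

from sklearn.metrics import roc_curve

# predicted probability of the positive class ("Yes") on the validation set
valid_probs = model.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(y_valid, valid_probs)
roc_auc_score(y_valid, valid_probs)  # the AUC is the area under the (fpr, tpr) curve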
In scikit-learn we can visualise the ROC curve of an estimator using the plotting API:
plot_roc_curve(model, X_valid, y_valid)
plt.show()
Next we plot the confusion matrix using scikit-learn's plotting API:
class_names = churn_data["Churn"].cat.categories
plot_confusion_matrix(
model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
From the confusion matrix we see that our baseline model correctly identifies churners only 42% of the time, misclassifying the remaining 58% of actual churners as non-churners. We can do better, but first let's inspect how a single decision tree makes decisions on this data.
Now that we have a baseline model, let's make a single tree so we can gain some insight into how the decisions are being made:
model = RandomForestClassifier(
n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1, random_state=42
)
model.fit(X_train, y_train)
feature_names = X_train.columns
# we need to specify the background color because of a bug in sklearn
fig, ax = plt.subplots(figsize=(30, 10), facecolor="k")
# generate tree plot
plot_tree(model.estimators_[0], filled=True, feature_names=feature_names, ax=ax)
plt.show()
The main difference here compared to our housing example is that the splitting criterion is no longer the mean squared error, but instead is something known as the Gini index:
$$ G = 1 - \sum_{i=1}^n p_i^2 $$
where $p_i$ is the probability that a sample in a given node belongs to class $i$ (in our case "Yes" or "No"). For classification tasks, the goal is to choose splits that minimise the Gini index of the resulting segments, which amounts to finding the segments that are most "pure".
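To make the formula concrete, here is a small helper that computes the Gini index from the class counts in a node; for example, a roughly 73:27 split between "No" and "Yes" (similar to the root node) gives $G \approx 0.39$:

def gini_index(class_counts):
    """Compute the Gini index G = 1 - sum(p_i^2) from a list of class counts."""
    p = np.array(class_counts) / np.sum(class_counts)
    return 1 - np.sum(p ** 2)

gini_index([735, 265])  # ~0.39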
From the figure, we can already start seeing some features that might be interesting for predicting churn; e.g. TotalCharges, tenure, and TechSupport seem like good indicators.
Our baseline model has an accuracy of 0.771 and ROC AUC score of 0.801 on the validation set. Let's now examine whether we can improve this by tuning the:
- number of trees in the forest
- minimum number of samples per leaf
- maximum number of features per split
In our previous lessons we manually inspected how the performance evolved as we changed these hyperparameters one at a time. Instead, we can automate this process using scikit-learn's GridSearchCV to search for the best combination of hyperparameter values:
from sklearn.model_selection import GridSearchCV
# define range of values for each hyperparameter
param_grid = [
{
"n_estimators": [10, 20, 40, 80, 100],
"max_features": [0.5, 1.0, "sqrt", "log2"],
"min_samples_leaf": [1, 3, 5, 10, 25],
}
]
# instantiate baseline model
model = RandomForestClassifier(n_estimators=10, n_jobs=-1, random_state=42)
# initialise grid search with cross-validation
grid_search = GridSearchCV(
model, param_grid=param_grid, cv=3, scoring="roc_auc", n_jobs=-1
)
%time grid_search.fit(X, y)
Once the search is finished, we can get the best combination of parameters as follows:
best_params = grid_search.best_params_
best_params
Similarly we can get the best model in the search:
best_model = grid_search.best_estimator_
best_model
Let's see how this model performs on our validation set in terms of metrics and the confusion matrix:
print_scores(best_model)
plot_confusion_matrix(
best_model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
In terms of the AUC score, we see about a 10% boost over our baseline model - not bad! Our confusion matrix has also visibly improved.
As we did in the housing example, we now examine which features were deemed to be important for our Random Forest model:
feature_importance = rf_feature_importance(best_model, X)
# peek at top 10
feature_importance[:10]
plot_feature_importance(feature_importance)
plt.show()
From the plot, it seems that the feature importance flattens out quickly, so let's pick a small threshold to remove some of the uninformative features:
feature_importance_threshold = 0.008
cols_to_keep = feature_importance[
feature_importance["Importance"] > feature_importance_threshold
]["Column"]
len(cols_to_keep)
cols_to_keep
Let's retrain with this subset and see whether the score drops (for reference, we had an accuracy of 0.807 and an AUC of 0.871 on the validation set):
X_keep = X.copy()[cols_to_keep]
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
    min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
The score has not changed significantly, so let's continue and replot the feature importances:
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
We can also examine whether one-hot encoding produces a better representation of the data, and hence a better model. As we did in the housing example, we need to create a new feature matrix and target vector, along with a threshold on the maximum number of categories allowed:
X2, y2, nas = proc_df(churn_data, "Churn", max_n_cat=7)
X_train, X_valid, y_train, y_valid = train_test_split(
X2, y2, test_size=0.2, random_state=42
)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
feature_importance = rf_feature_importance(model, X2)
plot_feature_importance(feature_importance[:30])
plt.show()
This encoding actually gives us a slightly better model and has the added bonus of being slightly more interpretable since we can easily read off which characteristics about the customers are influencing the prediction.
Let's trim the features as before and see if we can improve this model further still:
feature_importance_threshold = 0.007
cols_to_keep = feature_importance[
feature_importance["Importance"] > feature_importance_threshold
]["Column"]
len(cols_to_keep)
X_keep = X2.copy()[cols_to_keep]
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
OK, the model has not improved, but we have managed to reduce the number of features.
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
We still have a lot of features to contend with, so let's see if any of them are redundant. As in the housing example, we use clustering to work out which pairs of features are "close" in some sense:
plot_dendogram(X_keep)
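For reference, plot_dendogram is another dslectures helper; one typical way to build such a plot (a sketch of a possible implementation, not necessarily the package's code) is to cluster the features on their Spearman rank correlations with the scipy.cluster.hierarchy module we imported above as hc:

from scipy.stats import spearmanr
from scipy.spatial.distance import squareform

def plot_dendogram_sketch(df, figsize=(14, 8)):
    """Hierarchically cluster features on Spearman rank correlation and plot a dendrogram."""
    corr = spearmanr(df).correlation            # feature-feature rank-correlation matrix
    dist = squareform(1 - corr, checks=False)   # convert similarities to a condensed distance matrix
    z = hc.linkage(dist, method="average")      # average-linkage agglomerative clustering
    plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation="left", leaf_font_size=14)
    plt.show()

Calling plot_dendogram_sketch(X_keep) should then produce a figure similar to the one above.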
def get_oob(df):
    # refit the tuned Random Forest on a train split of df and return its out-of-bag accuracy;
    # using the same test_size and random_state as before keeps the rows aligned with y_train
    model = RandomForestClassifier(
        n_estimators=best_params["n_estimators"],
        min_samples_leaf=best_params["min_samples_leaf"],
        max_features=best_params["max_features"],
        n_jobs=-1,
        oob_score=True,
        random_state=42,
    )
    X, _ = train_test_split(df, test_size=0.2, random_state=42)
    model.fit(X, y_train)
    return model.oob_score_
get_oob(X_keep)
for c in (
"TechSupport_Yes",
"OnlineSecurity_Yes",
"Contract_Two year",
"tenure",
"OnlineSecurity_No internet service",
"DeviceProtection_No internet service",
"StreamingTV_No internet service",
"InternetService_No",
"TechSupport_No",
"OnlineSecurity_No",
"MonthlyCharges",
"InternetService_Fiber optic",
):
print(c, get_oob(X_keep.drop(c, axis=1)))
cols_to_drop = [
"OnlineSecurity_Yes",
"DeviceProtection_No internet service",
"StreamingTV_No internet service",
"OnlineSecurity_No",
]
get_oob(X_keep.drop(cols_to_drop, axis=1))
Good, our OOB accuracy has not changed and we've managed to remove a few features.
X_keep.drop(cols_to_drop, axis=1, inplace=True)
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=best_params["n_estimators"],
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
One thing we consistently see in our feature importances is the role of the tenure attribute. One way to gain a deeper insight into this attribute is to convert it to a categorical variable. Continuous data is often discretised or separated into "bins" for analysis; this is typically done when you want to group people into discrete age buckets or, in the case of our data, discrete tenure buckets. Let's have a look at the distribution of tenure values to get an idea of how to discretise the data:
sns.distplot(X_keep["tenure"], kde=False)
plt.show()
From the plot, we could reasonably divide the tenure values into bins of 0 to 6, 7 to 20, 21 to 40, 41 to 60, and finally 61 and over. To do so we use the cut function from pandas:
bins = [0, 6, 20, 40, 60, 100]
group_names = ["<=6", "7-20", "21-40", "41-60", "60+"]
X_keep["tenure_categories"] = pd.cut(
X_keep["tenure"], bins, labels=group_names, include_lowest=True
)
Note that the object pandas returns is also a special Categorical object. We can also examine the value counts:
X_keep["tenure_categories"].value_counts()
Finally, let's drop the tenure column since it's redundant:
X_keep.drop(["tenure"], axis=1, inplace=True)
Since tenure_categories is a categorical column, we also need to one-hot encode it like the rest of our features:
X_keep = pd.get_dummies(X_keep)
Finally we train our model:
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestClassifier(
n_estimators=1000,
min_samples_leaf=best_params["min_samples_leaf"],
max_features=best_params["max_features"],
n_jobs=-1,
oob_score=True,
random_state=42,
)
model.fit(X_train, y_train)
print_scores(model)
Although categorising the tenure feature did not boost performance, it does help considerably with the interpretation:
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance)
plt.show()
From the plot we can see that the following three features are the most important indicators of churn:
- Customers on month-to-month contracts
- Customers who have been with the telco for less than 6 months
- Customers who do not have tech support
Based on this analysis, we could advise the client to focus on customers with these characteristics and thereby minimise the risk that they churn.
Let's wrap up by plotting the confusion matrix of our final model:
plot_confusion_matrix(
model,
X_valid,
y_valid,
display_labels=class_names,
cmap=plt.cm.Blues,
normalize="true",
)
plt.grid(None)
From this figure we can conclude that:
- If our model predicts that a customer will churn, then we can expect it to be correct 51% of the time (true positives)
- If our model predicts that a customer will not churn, then we can expect it to be correct 91% of the time (true negatives)
Exercise
Use the techniques in this lesson to solve Kaggle's famous Titanic Challenge. As in the housing challenge from lesson 6, the goal is not to build a perfect model, but to get familiar with building models on new datasets.