- Understand how to interpret feature importance plots for Random Forest models.
- Know how to drop uninformative features to build simpler models.
This lesson is adapted (with permission) from Jeremy Howard's fantastic online course Introduction to Machine Learning for Coders, in particular:
Below are a few relevant articles that may be of general interest:
In this lesson we will analyse the preprocessed table of clean housing data and their addresses that we prepared in lesson 3:
housing_processed.csv
Figure reference: https://bit.ly/3djjWc6
A nice explanation for what it means to interpret a model's predictions is given in the Beware Default Random Forest Importances article:
Training a model that accurately predicts outcomes is great, but most of the time you don't just need predictions, you want to be able to interpret your model. For example, if you build a model of house prices, knowing which features are most predictive of price tells us which features people are willing to pay for.
In this lesson we will focus on one specific aspect of interpretability for Random Forests, namely feature importance: a technique that (with care) can be used to identify the most informative features in a dataset.
%load_ext autoreload
# reload all modules every time before executing Python code
%autoreload 2
# render plots in notebook
%matplotlib inline
# !pip install dslectures --upgrade
import pandas as pd
import numpy as np
from dslectures.core import get_dataset, convert_strings_to_categories, rmse, fill_missing_values_with_median
from pathlib import Path
# data viz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
# ml magic
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
import scipy
import scipy.stats  # needed for spearmanr in the dendrogram plot below
from scipy.cluster import hierarchy as hc
get_dataset('housing_processed.csv')
We also make use of the pathlib library to handle our filepaths:
DATA = Path('../data/')
!ls {DATA}
housing_data = pd.read_csv(DATA/'housing_processed.csv'); housing_data.head()
With the data loaded, we can recreate our train/validation splits as before:
X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
print(f'{len(X_train)} train rows + {len(X_valid)} valid rows')
To simplify the evaluation of our models, we'll reuse our scoring function from lesson 5:
def print_rf_scores(fitted_model):
    """Generates RMSE and R^2 scores from fitted Random Forest model."""
    yhat_train = fitted_model.predict(X_train)
    R2_train = fitted_model.score(X_train, y_train)
    yhat_valid = fitted_model.predict(X_valid)
    R2_valid = fitted_model.score(X_valid, y_valid)
    scores = {
        "RMSE on train:": rmse(y_train, yhat_train),
        "R^2 on train:": R2_train,
        "RMSE on valid:": rmse(y_valid, yhat_valid),
        "R^2 on valid:": R2_valid,
    }
    if hasattr(fitted_model, "oob_score_"):
        scores["OOB R^2:"] = fitted_model.oob_score_
    for score_name, score_value in scores.items():
        print(score_name, round(score_value, 3))
Recall that to make predictions with our Random Forest models, we pass each row of the validation set through every tree and average the resulting leaf-node values across the forest. However, we would also like to estimate our confidence in these predictions - how can we achieve this?
One way to do this is to calculate the standard deviation of the predictions across the trees. Conceptually, if the standard deviation is high, the trees are generating very different predictions for that row, which may indicate that the model has not learnt the most important features of the data.
To get started, let's use our baseline model from the previous lesson:
model = RandomForestRegressor(n_estimators=40, max_features='sqrt', n_jobs=-1, oob_score=True, random_state=42)
model.fit(X_train, y_train)
print_rf_scores(model)
As before, we stack the predictions from each tree into a single array:
preds = np.stack([t.predict(X_valid) for t in model.estimators_])
# calculate the mean and standard deviation of the tree predictions for the first validation row
np.mean(preds[:,0]), np.std(preds[:,0])
preds.shape
Next, let's create a copy of the validation dataset and add the mean predictions and their standard deviation as new columns.
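The code for this step is not shown above, but a minimal sketch (assuming the new columns are named preds_mean and preds_std, as they are referred to below) could look like this:
# sketch, assuming the column names used in the cells below
valid_copy = X_valid.copy()
valid_copy['median_house_value'] = y_valid
valid_copy['preds_mean'] = np.mean(preds, axis=0)
valid_copy['preds_std'] = np.std(preds, axis=0)
valid_copy.head()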
We can use these new columns to drill down into the predictions for each individual categorical feature. Let's examine ocean_proximity_INLAND, which denotes whether a housing district is inland or not:
sns.countplot(y='ocean_proximity_INLAND', data=valid_copy);
We can calculate the average prediction and standard deviation per category by applying a groupby operation in pandas, followed by taking the mean:
cols = ['ocean_proximity_INLAND', 'median_house_value', 'preds_mean', 'preds_std']
preds_quality = valid_copy[cols].groupby('ocean_proximity_INLAND', as_index=False).mean()
preds_quality
From the table, we can see that the predictions and ground truth are close to each other on average, while the standard deviation varies somewhat for each category. We can visualise this table in terms of bar plots as follows:
fig, (ax0, ax1) = plt.subplots(nrows=2, ncols=1, figsize=(12,7), sharex=True)
# plot ground truth
preds_quality.plot('ocean_proximity_INLAND', 'median_house_value', 'barh', ax=ax0)
# put legend outside plot
ax0.legend(loc='upper left', bbox_to_anchor=(1.0, 0.5))
# plot preds
preds_quality.plot('ocean_proximity_INLAND', 'preds_mean', 'barh', xerr='preds_std', alpha=0.6, ax=ax1)
# put legend outside plot
ax1.legend(loc='upper left', bbox_to_anchor=(1.0, 0.5))
plt.show()
The error bars in the plot indicate how confident the model is at predicting each category. Alternatively, we can compare the distribution of values to inspect how closely the predictions match the ground truth. For example, for housing districts where ocean_proximity_INLAND is 0 we have:
sample = valid_copy.copy().loc[valid_copy["ocean_proximity_INLAND"] == 0]
sample_mean = sample['preds_mean'].mean()
sample_std = sample['preds_mean'].std()
lower_bound = sample_mean - sample_std
upper_bound = sample_mean + sample_std
sns.distplot(
sample["median_house_value"], kde=False,
)
sns.distplot(sample["preds_mean"], kde=False)
plt.axvline(
x=sample_mean,
linestyle="--",
linewidth=2.5,
label="Mean of predictions",
c="k",
)
plt.axvline(
x=lower_bound,
linestyle="--",
linewidth=2.5,
label="Lower bound 68% CI",
c="g",
)
plt.axvline(
x=upper_bound,
linestyle="--",
linewidth=2.5,
label="Upper bound 68% CI",
c="purple",
)
plt.legend(bbox_to_anchor=(1.01, 1), loc="upper left");
In general, we expect our models to perform best on categories that are most frequent in the data. One way to validate this hypothesis is by calculating the ratio of the standard deviation of the predictions to the predictions themselves:
preds_quality = valid_copy[cols].groupby("ocean_proximity_INLAND", as_index=True).mean()
(preds_quality["preds_std"] / preds_quality["preds_mean"]).sort_values(ascending=False)
What the above tells us is that our predictions are less confident (i.e. have higher relative variance) for housing districts that are inland - indeed, looking at our count plot above we see this category is under-represented in the data!
In general, confidence intervals serve two main purposes:
- We can identify which categories the model is less confident about and investigate these further.
- We can identify which rows in the data the model is not confident about (see the sketch below). This is particularly important when deploying models to production, where e.g. we need to decide how to evaluate the model's prediction for a single housing district.
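For the second point, a minimal sketch of flagging individual low-confidence predictions could look like this (the 10% threshold is an arbitrary choice for illustration):
# relative uncertainty per validation row: spread of the tree predictions vs. their mean
relative_std = np.std(preds, axis=0) / np.mean(preds, axis=0)
# flag rows where the trees disagree by more than 10% of the predicted value
low_confidence_rows = X_valid[relative_std > 0.1]
len(low_confidence_rows)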
One drawback of the confidence interval analysis is that we need to drill down into each feature to see where the model is making mistakes. In practice, we can get a global view by ranking each feature in terms of its importance to the model's predictions. In scikit-learn, the Random Forest model has an attribute called feature_importances_ that we can use to rank each feature:
def rf_feature_importance(fitted_model, df):
    return pd.DataFrame(
        {"Column": df.columns, "Importance": fitted_model.feature_importances_}
    ).sort_values("Importance", ascending=False)
Let's use this function to calculate the feature importance for our fitted model:
feature_importance = rf_feature_importance(model, X)
# peek at top 10 features
feature_importance[:10]
From the table we see that median_income, ocean_proximity_INLAND, and population_per_household are the most important features - this is not entirely surprising since income and house location seem to be good indicators of house value. We can also plot the feature importance to gain a visual understanding:
def plot_feature_importance(feature_importance):
    return sns.barplot(y="Column", x="Importance", data=feature_importance, color='b')
plot_feature_importance(feature_importance);
In nearly every real-world dataset, this is what the feature importance looks like: a handful of columns are very important, while most are not. The powerful aspect of this approach is that it focuses our attention on which features we should investigate further and which ones we can safely ignore.
From the feature importance plot above, we can see there are only a handful of informative features - let's use this insight to make a simpler model by dropping uninformative columns from our data:
feature_importance_threshold = 0.03
cols_to_keep = feature_importance[
feature_importance['Importance'] > feature_importance_threshold
]['Column']
len(cols_to_keep)
X_keep = X.copy()[cols_to_keep]
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
X_keep.head()
As a sanity check, let's ensure the model's predictions have not gotten worse with the reduced data (recall we had $R^2 = 0.91$ on the validation set):
model = RandomForestRegressor(
n_estimators=40, min_samples_leaf=1, n_jobs=-1, oob_score=True, random_state=42
)
model.fit(X_train, y_train)
print_rf_scores(model)
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance);
We've now got a model that isn't really more predictive than our baseline, but it's much simpler - it has just 9 features instead of 19, and we now know that median_income and ocean_proximity_INLAND are particularly important features to focus on.
Looking at our feature importance plots, we can see that there are some features that seem to be related to each other (e.g. the room-ratio features rooms_per_household and bedrooms_per_room). Features like this are potentially measuring the same thing, so we can use a technique called hierarchical clustering to produce something called a dendrogram that will tell us which pairs of features are similar:
def plot_dendrogram(X):
    """Plots a dendrogram to see which features are related."""
    # calculate the Spearman rank correlation coefficients
    corr = np.round(scipy.stats.spearmanr(X).correlation, 4)
    # convert to a condensed distance matrix
    corr_condensed = hc.distance.squareform(1 - corr)
    # perform hierarchical clustering
    z = hc.linkage(corr_condensed, method="average")
    # plot the dendrogram
    fig = plt.figure(figsize=(16, 10))
    dendrogram = hc.dendrogram(
        z, labels=X.columns, orientation="left", leaf_font_size=16
    )
    plt.show()
plot_dendrogram(X_keep)
From the plot we see that quantities like latitude and postal_code are grouped together and similar (as we might expect). Note that we used Spearman's rank correlation coefficient to calculate these notions of similarity - this is useful for finding non-linear (but monotonic) correlations that may be missed by Pearson's correlation coefficient.
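As a quick illustration of the difference (a toy example, not part of the housing data), consider a relationship that is perfectly monotonic but non-linear:
# toy example: y is a monotonic but non-linear function of x
x = np.arange(1, 100)
y = x ** 3
# Pearson measures linear correlation, so it is high but below 1 ...
print('Pearson: ', scipy.stats.pearsonr(x, y)[0])
# ... while Spearman only looks at the rank ordering, so it is exactly 1
print('Spearman:', scipy.stats.spearmanr(x, y).correlation)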
To examine these correlations a bit deeper, let's create a function that trains a Random Forest on subsets of the data where one of the columns is removed and see in which cases the OOB score does not get worse:
def get_oob(df):
    """Returns the OOB R^2 score of a Random Forest trained on df."""
    model = RandomForestRegressor(
        n_estimators=40, max_features='sqrt', n_jobs=-1, oob_score=True, random_state=42
    )
    # same test_size and random_state as before, so the rows still line up with y_train
    X, _ = train_test_split(df, test_size=0.2, random_state=42)
    model.fit(X, y_train)
    return model.oob_score_
get_oob(X_keep)
for column in (
    "longitude",
    "population_per_household",
    "housing_median_age",
    "bedrooms_per_room",
    "latitude",
    "postal_code",
    "rooms_per_household",
    "median_income",
):
    print(column, get_oob(X_keep.drop(column, axis=1)))
Here we're looking for columns where the OOB score did not drop much, say around the third decimal place. Let's see what happens to our OOB score when we drop these candidate columns:
cols_to_drop = [
'bedrooms_per_room',
"rooms_per_household",
]
get_oob(X_keep.drop(cols_to_drop, axis=1))
The OOB score increases fractionally, which is good since we're looking for a simpler model - let's drop these columns and run the full model again:
X_keep.drop(cols_to_drop, axis=1, inplace=True)
X_train, X_valid = train_test_split(X_keep, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=40, max_features='sqrt', n_jobs=-1, oob_score=True, random_state=42)
model.fit(X_train, y_train)
print_rf_scores(model)
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance);
Now that we have simplified our model, let's create a huge Random Forest to see if we can squeeze out a bit more performance:
model = RandomForestRegressor(
n_estimators=1000, max_features='sqrt', n_jobs=-1, oob_score=True, random_state=42
)
model.fit(X_train, y_train)
print_rf_scores(model)
Our final model achieves around 5% better performance than our naive model with 10 trees from lesson 4, but it is much simpler - we've reduced the number of features from 19 down to 7! Moreover, we've identified which features are most important - let's plot them one final time for a visual comparison:
feature_importance = rf_feature_importance(model, X_keep)
plot_feature_importance(feature_importance);
Finally let's save this model to disk for later use or if we want to deploy it to production:
import pickle
with open(DATA/'housing_model.pkl', mode='wb') as file:
    pickle.dump(model, file)

with open(DATA/'housing_model.pkl', mode='rb') as file:
    model = pickle.load(file)
print_rf_scores(model)
Note that it is good practice to use the with statement to read or write data in Python, as this automatically takes care of closing the file once we have finished with it. The mode argument controls how we want to open the file, where 'rb' (read) and 'wb' (write) correspond to opening in binary mode.
Exercise #3
By this stage you've now learned all the key steps needed to clean data, train a machine learning model, and interpret the results! Try applying the techniques you have learned to Kaggle's House Prices: Advanced Regression Techniques competition. Don't spend too much time trying to build a perfect model - the purpose of this exercise is to get familiar with downloading a new dataset, understanding the evaluation metric, and submitting your predictions to the Kaggle platform. To get started:
When you have your predictions ready, submit them here for evaluation on the public leaderboard!
Hint #1: Kaggle provides 2 datasets called train.csv and test.csv, where the latter is missing the target column SalePrice. Your task is to build a regression model on train.csv and then use that model to make predictions on test.csv.
Hint #2: A closer look at Kaggle's evaluation metric shows that we actually want the RMSE between the logarithms of the predicted and actual house prices:
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
How might you handle this? You may find NumPy's log function to be useful here.
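For example, a minimal sketch of such a metric function (here applied to our housing model's validation predictions, just to illustrate the formula - this is not the Kaggle scorer itself) could look like:
def rmse_of_logs(y_true, y_pred):
    """RMSE between the logarithms of the true and predicted values (the metric quoted above)."""
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# illustrate the formula on our housing validation set
rmse_of_logs(y_valid, model.predict(X_valid))
A common approach is then to train directly on np.log of the target so that minimising ordinary RMSE matches the competition metric, remembering to apply np.exp to the predictions before submitting.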
Hint #3: You may find the convert_strings_to_categories function from lesson 3 to be useful. You can import it from the dslectures library as follows:
from dslectures.core import convert_strings_to_categories
Hint #4: When you use the median to fill missing values in the training set, you should use the same median for the test set. See p.61 of Hands-On Machine Learning with Scikit-Learn and TensorFlow by A. Geron.
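A minimal sketch of this idea with pandas (the toy frames and column names below are just stand-ins for the Kaggle data):
# toy stand-ins for the Kaggle train/test frames - only to illustrate the pattern
train_toy = pd.DataFrame({'LotFrontage': [60.0, np.nan, 80.0], 'GarageArea': [500.0, 300.0, np.nan]})
test_toy = pd.DataFrame({'LotFrontage': [np.nan, 70.0], 'GarageArea': [np.nan, 450.0]})
# compute the medians on the training set only ...
train_medians = train_toy.median()
# ... and use those same values to fill missing entries in both train and test
train_toy = train_toy.fillna(train_medians)
test_toy = test_toy.fillna(train_medians)
test_toy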