Introduction to dimensionality reduction and visualisation.


Learning objectives

In this lecture we will have a look at dimensionality reduction and data visualisation. The feature vector of a dataset can contain many features, and each feature corresponds to its own dimension. In the first part of the notebook we will explore a method to reduce the dimensionality of a dataset. In the second part we will look at an algorithm to extract information from high-dimensional data and display its structure in two dimensions. Both methods fall into the category of unsupervised algorithms. The learning objectives are to be able to answer the following questions:

  • What is dimensionality reduction?
  • When is dimensionality reduction useful?

References

  • Chapter 8: Dimensionality Reduction, Section PCA of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

Homework

As homework read the references, work carefully through the notebook and solve the exercises. This lecture covers several complex topics and it is important that you familiarise yourself with them by experimenting with the notebook.

Principal component analysis

First we have a look at a method to reduce the dimensions of a dataset. There are two main reasons why we would like to reduce the dimensionality of a dataset:

  1. Some machine learning algorithms struggle with high dimensional data.
  2. It is hard to visualize high dimensional data. In practice it is hard to visualise datasets that have more than two or three dimensions.

There is one very common approach to reduce the dimensionality of data: principal component analysis (PCA). The idea is that not all axes contain the same amount of variance and that there are even combinations of axes that contain most of the variance. PCA seeks to find new coordinate axes such that the variance along each axis is maximised and ordered (the first principal component contains the most variance).

Figure: PCA can look intimidating at the beginning.
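To make this idea concrete, here is a minimal sketch on toy data (plain NumPy, independent of the housing dataset used below) showing that the principal components are the eigenvectors of the data's covariance matrix, ordered by how much variance they carry:

import numpy as np

# toy 2D dataset with two correlated features
rng = np.random.default_rng(42)
x = rng.normal(size=500)
data = np.column_stack([x, 0.5 * x + 0.1 * rng.normal(size=500)])

# centre the data and compute the covariance matrix
data_centred = data - data.mean(axis=0)
cov = np.cov(data_centred, rowvar=False)

# eigenvectors of the covariance matrix = principal components,
# eigenvalues = variance along each component
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]            # sort by decreasing variance
print(eigenvalues[order] / eigenvalues.sum())    # explained variance ratio
print(eigenvectors[:, order])                    # one component per column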

Imports

import pandas as pd
import numpy as np
from dslectures.core import get_dataset, convert_strings_to_categories, rmse, fill_missing_values_with_median
from pathlib import Path
from tqdm import tqdm

# data viz
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.image as mpimg
import seaborn as sns

sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))

# ml magic
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate

#dslecture
from dslectures.core import rmse
# !pip install --upgrade dslectures

Load the data

We will use the processed housing dataset for the principal component analysis. First we have to load it.

get_dataset('housing_processed.csv')
Dataset already exists at '../data/housing_processed.csv' and is not downloaded again.
data_path = Path('../data/')
housing_data = pd.read_csv(data_path/'housing_processed.csv')
housing_data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 69 94705 6.984127 1.023810 0.146591 2.555556 0 0 1 0 0
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 620 94611 6.238137 0.971880 0.155797 2.109842 0 0 1 0 0
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 620 94618 8.288136 1.073446 0.129516 2.802260 0 0 1 0 0
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 620 94618 5.817352 1.073059 0.184458 2.547945 0 0 1 0 0
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 620 94618 6.281853 1.081081 0.172096 2.181467 0 0 1 0 0
housing_data.shape
(19443, 20)
housing_data.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value city postal_code rooms_per_household bedrooms_per_household bedrooms_per_room population_per_household ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY ocean_proximity_NEAR OCEAN ocean_proximity_ISLAND
count 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000 19443.000000
mean -119.560363 35.646739 28.435118 2617.678548 538.136964 1442.129970 501.427352 3.675099 191793.406162 541.629224 93030.145605 5.340245 1.091741 0.214812 3.095953 0.331482 0.439953 0.106774 0.121535 0.000257
std 2.002697 2.145335 12.504584 2179.553070 420.168532 1140.254218 383.064222 1.569687 96775.724042 260.704512 1853.684352 2.190405 0.429728 0.056667 10.679036 0.470758 0.496394 0.308833 0.326756 0.016035
min -124.350000 32.550000 1.000000 2.000000 2.000000 3.000000 2.000000 0.499900 14999.000000 1.000000 85344.000000 0.846154 0.333333 0.100000 0.750000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -121.760000 33.930000 18.000000 1438.500000 299.000000 799.000000 282.000000 2.526500 116700.000000 328.000000 91706.000000 4.412378 1.006140 0.177906 2.449692 0.000000 0.000000 0.000000 0.000000 0.000000
50% -118.490000 34.260000 29.000000 2111.000000 436.000000 1181.000000 411.000000 3.446400 173400.000000 545.000000 92860.000000 5.180451 1.048276 0.204545 2.841155 0.000000 0.000000 0.000000 0.000000 0.000000
75% -117.990000 37.730000 37.000000 3119.000000 644.000000 1746.500000 606.000000 4.579750 247100.000000 770.000000 94606.000000 5.963796 1.097701 0.240414 3.308208 1.000000 1.000000 0.000000 0.000000 0.000000
max -114.310000 41.950000 52.000000 39320.000000 6445.000000 35682.000000 6082.000000 15.000100 499100.000000 977.000000 96161.000000 132.533333 34.066667 1.000000 1243.333333 1.000000 1.000000 1.000000 1.000000 1.000000

We drop the two columns that contain categorical data with many categories. We could create an extra column for each category, but since there are many of them this would add a lot of extra features.

housing_data.drop(['city', 'postal_code'], axis=1, inplace=True)

Then we split the data into features and labels as usual:

X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']
feature_labels = X.columns

Baseline

First we train a Random Forest with the settings we tuned in lesson 6 as a baseline for the models that follow.

rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
results = cross_validate(rf, X, y,
                         cv=5,
                         return_train_score=True,
                         scoring='neg_root_mean_squared_error')
rmse_full = -np.mean(results["test_score"])

We get an RMSE of roughly 60'000:

print(f'RMSE: {rmse_full}')
RMSE: 60301.958010516806

Standardisation

PCA works by finding new axes in the dataset that cover the most variance. Since variance is the average squared distance of each sample to the sample mean, it depends on the scale of the feature. To illustrate this, think of two columns that contain the height of a person, one measured in centimetres and one in metres. Although the heights described in the two columns are the same, the variance will be 10'000 times larger in the centimetre column (100^2) and the standard deviation 100 times larger.

rand_values = np.random.randn(1000)
rand_values.var()
1.0254503528227774
(100*rand_values).var()
10254.503528227773

For this reason it is important to scale the features such that they are comparable to each other. A common approach to do this is to transform the data such that the mean is zero and the standard deviation is one. The following formula achieves this:

$x_{std}=\frac{x_{old}-\mu}{\sigma}$

Scikit-learn provides a transformer called StandardScaler that does this for us. We can do it in one line with the method .fit_transform(). Similar to fit_predict(), it combines the fitting step, which calculates the mean and standard deviation of the dataset, with the transformation step, which applies the formula above to the data.

X_std = StandardScaler().fit_transform(X)
X_std.shape
(19443, 17)

We can see that the mean is zero (or close enough) and the standard deviation is one:

X_std.mean(axis=0)
array([-4.44386137e-16,  2.19854194e-15,  9.35549763e-17, -2.48505406e-17,
       -1.02325755e-16,  7.74752147e-17, -1.24252703e-17, -1.46179650e-16,
       -9.94021623e-17,  1.57874022e-16,  7.01662322e-17,  2.92359301e-17,
        1.60797615e-16,  1.43256057e-16,  6.13954532e-17, -3.14286248e-17,
        1.16943720e-17])
X_std.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
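For reference, the standardisation formula can also be applied by hand with NumPy; the result should agree with StandardScaler up to floating-point error (a small illustrative check, not part of the original pipeline):

# manual standardisation: subtract the column mean and divide by the column
# standard deviation (ddof=0, which is what StandardScaler uses internally)
X_manual = (X.values - X.values.mean(axis=0)) / X.values.std(axis=0)
print(np.allclose(X_manual, X_std))  # expect True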

Principal component analysis

Now we apply principal component analysis to the standardised features. After initialising a PCA object we can use .fit_transform() to find the principal components and transform the dataset into the new coordinates.

pca = PCA()
X_pca = pca.fit_transform(X_std)

We can see that the transformed dataset still has the same shape. However, the columns are no longer the original features but the coordinates in the new principal component system.

X_pca.shape
(19443, 17)

Explained variance

The explained variance is an important concept in principal component analysis. It tells us how much of the dataset's total variance is contained along each principal component axis. A component with more variation can encode more information than one with little variation. The explained variance is stored in the pca object and we can use it to visualise the distribution. The explained_variance_ratio_ attribute tells us the fraction of the total variance that each component contains.

pca.explained_variance_ratio_
array([2.32152692e-01, 1.58210508e-01, 1.27146991e-01, 1.00792055e-01,
       7.93230339e-02, 7.26018573e-02, 5.93200071e-02, 5.87781344e-02,
       4.47812090e-02, 3.66692458e-02, 1.63533745e-02, 7.52994128e-03,
       2.65596682e-03, 1.50076877e-03, 1.26555424e-03, 9.18660939e-04,
       1.01541990e-32])
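These ratios can also be recovered by hand from the variance of the transformed coordinates, which makes the definition concrete (a small check; the ratio does not depend on which variance estimator is used):

# variance along each principal component divided by the total variance
var_per_component = X_pca.var(axis=0)
print(var_per_component / var_per_component.sum())  # matches pca.explained_variance_ratio_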
pca_labels = [f'principal component {i+1}' for i in range(len(pca.explained_variance_ratio_))]
barplot = sns.barplot(x=pca_labels, y=pca.explained_variance_ratio_)
barplot.set_xticklabels(pca_labels, rotation=90)
plt.title('explained variance ratio')
plt.show()

We can see that the first three components together already account for roughly 50% of the total variance in the dataset. We can visualise this more systematically by plotting the cumulative sum of the explained variance ratios:

sns.lineplot(x=np.arange(1, len(pca.explained_variance_ratio_) + 1), y=np.cumsum(pca.explained_variance_ratio_))
plt.ylabel('total explained variance')
plt.xlabel('number of included PCA components')
plt.show()

We see that with 8 of the 17 axes we already explain almost 90% of the variance in the dataset and that after 10 we almost reach 100%. That means components 11-17 probably don't contain much information and can be discarded.
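If we already know how much variance we want to keep, scikit-learn can do the truncation for us: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of explained variance (a short sketch; the 90% threshold is an arbitrary choice):

# keep just enough principal components to explain 90% of the total variance
pca_90 = PCA(n_components=0.9)
X_pca_90 = pca_90.fit_transform(X_std)
print(pca_90.n_components_)  # number of components that were kept
print(X_pca_90.shape)        # same number of rows, fewer columns than 17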

PCA vector composition

Naturally, we would like to know what these principal components mean. For example, the first component contains roughly 23% of the total variance. What information is stored in that component? The pca object also stores the components_ attribute, which defines the transformation between the original features and the transformed coordinates. We can print the first few components and interpret them:

def print_tabular(labels, values):
    df = pd.DataFrame(data=values, columns=labels)
    display(df.T)
print('The first three principal components:')
print_tabular(feature_labels, pca.components_[:3])
The first three principal components:
0 1 2
longitude 0.096240 -0.440492 0.279899
latitude -0.093538 0.511034 -0.231581
housing_median_age -0.228990 -0.102574 -0.203651
total_rooms 0.480823 0.110303 0.003569
total_bedrooms 0.481068 0.046098 -0.114091
population 0.466057 -0.002745 -0.121771
households 0.482901 0.023411 -0.149648
median_income 0.083829 0.066170 0.381905
rooms_per_household 0.029542 0.299368 0.519937
bedrooms_per_household 0.006354 0.204508 0.343374
bedrooms_per_room -0.029537 -0.230823 -0.379778
population_per_household -0.002497 -0.003664 0.003476
ocean_proximity_INLAND -0.010714 0.326173 0.037222
ocean_proximity_<1H OCEAN 0.054867 -0.404136 0.152350
ocean_proximity_NEAR BAY -0.067325 0.233822 -0.269103
ocean_proximity_NEAR OCEAN -0.003977 -0.076523 -0.030817
ocean_proximity_ISLAND -0.006251 -0.009071 0.001835

The way to read this table is to think of it as a recipe to construct the first three principal component coordinates from the original features. Given a new data point we can, for example, calculate the first principal component by multiplying the longitude by its weight in column 0, adding the latitude multiplied by its weight in column 0, and so on for all features. This weighted sum of the original features yields the first coordinate in the PCA coordinate system.

$x_{PC,i} = \sum_{j=0}^{n-1} w_{i,j} \cdot x_{feat,j}$

From this we can see that the first component mainly consists of the four features total_rooms, total_bedrooms, population, and households:

$x_{PC,0} = 0.096 \cdot longitude + (-0.094) \cdot latitude + (-0.229) \cdot housing\_median\_age + \dots$
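We can check this recipe numerically: subtracting the feature means stored in the pca object and multiplying by the transposed component matrix reproduces the PCA coordinates (a small sanity check, assuming X_std and X_pca from above):

# the PCA transform is a matrix product with the components after centring;
# since X_std is already standardised, pca.mean_ is essentially zero here
X_manual_pca = (X_std - pca.mean_) @ pca.components_.T
print(np.allclose(X_manual_pca, X_pca))  # expect True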

Visualisation

We can now use the principal components to visualise the data. Let's have a closer look at the first two components.

sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], alpha=0.1)
plt.show()
jplot = sns.jointplot(x=X_pca[:,0], y=X_pca[:,1], kind="hex", color="#4CB391",xlim=[-5,10], ylim=[-5,10])
jplot.set_axis_labels('Principal component 1', 'Principal component 2')

We observe that there seem to be two visible clusters along these two axes. We could now investigate these clusters further and might find some property of the dataset that is useful to us or our customer. Let's do it!

Exercise 1

Create plots of other principal components and investigate if you find any interesting features.

Advanced: Cluster investigation

In this section we have a closer look at the clusters we observed in the previous plot. The clustering seems to happen mainly along the second principal component. Let's plot a fine-grained histogram of that component:

plt.hist(X_pca[:,1], bins=np.linspace(-5, 5, 50))
plt.show()

We can see that the clusters seem to be separated around -0.5. We can create a mask for the data points that lie below that threshold.

left_group = X_pca[:,1]<-0.5

Furthermore, we only want to investigate the features that contribute to the second principal component. Therefore, we set another threshold for the minimum absolute feature coefficient.

pc1_features = list(feature_labels[np.abs(pca.components_[1]) > 0.2])  # large weights in the second principal component (index 1)

These are the features that mainly contribute to the second principal component:

pc1_features
['longitude',
 'latitude',
 'rooms_per_household',
 'bedrooms_per_household',
 'bedrooms_per_room',
 'ocean_proximity_INLAND',
 'ocean_proximity_<1H OCEAN',
 'ocean_proximity_NEAR BAY']

Unfortunately, the StandardScaler transformed the DataFrame X into an array X_std. Let's transform it to a DataFrame again:

X_std = pd.DataFrame(data=X_std, columns=feature_labels)

With loc and the mask we just created we can now slice out the datapoints of the left and right group. We have a look at the statistics of the relevant features of the two groups with .describe().

X_std.loc[left_group, pc1_features].describe()
longitude latitude rooms_per_household bedrooms_per_household bedrooms_per_room ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY
count 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000 8602.000000
mean 0.767273 -0.858163 -0.236650 -0.099011 0.289184 -0.630324 0.729185 -0.345365
std 0.374053 0.273084 0.523087 0.205740 1.184709 0.389128 0.802984 0.034913
min -1.577731 -1.443513 -2.051770 -1.764901 -1.787272 -0.704163 -0.886320 -0.345741
25% 0.619362 -0.902791 -0.611887 -0.208570 -0.519333 -0.704163 1.128260 -0.345741
50% 0.709243 -0.804902 -0.279346 -0.111908 0.105280 -0.704163 1.128260 -0.345741
75% 0.874025 -0.734981 0.105771 -0.012303 0.870705 -0.704163 1.128260 -0.345741
max 2.501871 1.367309 1.901482 2.484505 13.856510 1.420126 1.128260 2.892336
X_std.loc[~left_group, pc1_features].describe()
longitude latitude rooms_per_household bedrooms_per_household bedrooms_per_room ocean_proximity_INLAND ocean_proximity_<1H OCEAN ocean_proximity_NEAR BAY
count 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000 10841.000000
mean -0.608807 0.680926 0.187775 0.078563 -0.229459 0.500143 -0.578586 0.274036
std 0.919116 0.828470 1.223449 1.321400 0.748986 1.052636 0.724778 1.273936
min -2.391655 -1.406222 -1.655426 -1.667938 -2.026123 -0.704163 -0.886320 -0.345741
25% -1.298101 0.299849 -0.246291 -0.191329 -0.724091 -0.704163 -0.886320 -0.345741
50% -0.943570 0.929138 0.056233 -0.092209 -0.309306 1.420126 -0.886320 -0.345741
75% -0.034773 1.143562 0.386952 0.040083 0.146278 1.420126 -0.886320 -0.345741
max 2.621713 2.938200 58.069790 76.736388 6.038055 1.420126 1.128260 2.892336

Just looking at the feature means of the two groups we see that there seems to be a significant difference for almost all features. To get a more detailed view, let's compare the histograms of each feature:

for col in pc1_features:
    bins = np.linspace(-3, 3, 20)
    fig, ax = plt.subplots(1,1, figsize=(10,5))
    X_std.loc[left_group, col].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
    X_std.loc[~left_group, col].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
    plt.legend(loc='best')
    plt.title(col)
    plt.show()

We can see that the most significant difference is between the longitude and latitude. Another interesting observation is that in the first group (blue) the rooms per household seem to be slightly lower and the bedrooms per room slightly higher. We will discuss this further below. First we want to look at the longitude and latitude distribution of the two groups, starting with a hexplot to see the hotspots.

import matplotlib.gridspec as gridspec

print("Group 1")
sns.jointplot(
    kind='hex',
    x="longitude",
    y="latitude",
    cmap='magma',

    data=X.loc[left_group],
)
plt.show()

print("Group 2")
sns.jointplot(
    kind='hex',
    x="longitude",
    y="latitude",
    cmap='magma',
    data=X.loc[~left_group],
)
plt.show()
Group 1
Group 2

We see that in the first group there are mainly two hotspots, while in the second group the hotspots are more widely spread out. Recall that we can plot the longitude and latitude on top of a map. Let's investigate which regions these hotspots correspond to.

california_img=mpimg.imread('images/california.png')
fig, axes = plt.subplots(1, 2, figsize=(12,6))

fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=X.loc[left_group],
    alpha=0.1,
    s=1,
    color="r",
    edgecolor="none", 
    ax=axes[0]
)

axes[0].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[0].set_title("Group 1")
fig = sns.scatterplot(
    x="longitude",
    y="latitude",
    data=X.loc[~left_group],
    alpha=0.05,
    s=3,
    color="r",
    edgecolor="none",
    ax=axes[1]

)
axes[1].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[1].set_title("Group 2")

plt.show()

Comparing to a satellite image of California (see below) we see that the two hotspots in the first group correspond to Los Angeles and San Diego. The second group mainly includes San Francisco, Silicon Valley, Sacramento, Fresno and Bakersfield. What is interesting to observe is that the second group also includes Riverside, which is right next to Los Angeles. That means the second principal component is not just the diagonal along the coast but specifically encodes Los Angeles and San Diego. If one wanted to build a classifier to determine whether a house is in Los Angeles or another region of California, one would only need to look at the second principal component of the dataset.

Circling back to the observation about the rooms per household: it seems that in Los Angeles the apartments have slightly fewer rooms than in the rest of California but a similar number of bedrooms, hence the higher number of bedrooms per room. In other words, in Los Angeles you get fewer extra rooms per household.

Figure: Satellite image of California from Google Maps.

Finally, we can also look at the median value of the houses in the two groups:

bins=50
fig, ax = plt.subplots(1,1, figsize=(10,5))
y.loc[left_group].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
y.loc[~left_group].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
plt.legend(loc='best')
plt.title('median value')
plt.show()

One observes that the median value is higher for the first group (Los Angeles & San Diego). However, the long tail in both groups is quite similar. So housing seems to be more expensive for people with low and medium incomes, while at the upper end of the market the prices seem to be comparable.

Compressed model

Now we want to use PCA to build a model that is more compact than the model that utilizes all features. We add one principal component after the other and measure how well a Random Forest performs with this subset of features. We do this in our standard cross-validation loop.

rmse_valid =[]
for i in tqdm(range(1, 18)):
    rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
    results = cross_validate(rf, X_pca[:, :i], y, cv=5,
                             return_train_score=True,
                             scoring='neg_root_mean_squared_error')
    rmse_valid.append(-np.mean(results["test_score"]))
100%|██████████| 17/17 [04:18<00:00, 15.18s/it]

As an additional baseline we also create a median regressor:

rmse_median = rmse(y, [np.median(y)]*len(y))

Now we can plot the performance for all subsets of components, with the full Random Forest and the median regressor as baselines.

lineplot = sns.lineplot(x=np.arange(len(rmse_valid))+1 , y=rmse_valid, label='pca')
lineplot.axhline(rmse_median, c='y', linestyle='--', label='median')
lineplot.axhline(rmse_full, c='r', linestyle='--', label='full model')
plt.legend(loc='best')
plt.xlabel('number of PCA components')
plt.ylabel('RMSE');

So we can see that already with 4-7 principal components we get results that are quite close to the full model. As mentioned previously this has two advantages:

  1. Visualising 4 components is still hard but feasible compared to 17.
  2. Training a Random Forest with half the features is roughly twice as fast for both training and inference.

Looking at the graph we see that there are several steep steps. It seems that adding these components improved the performance significantly. By taking the components with the biggest steps we can try to build a better selection of features:

features = [0,1,3,6,10]
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
                             
results = cross_validate(rf, X_pca[:, features], y, cv=5,
                         return_train_score=True,
                         scoring='neg_root_mean_squared_error')
rmse_pca = -np.mean(results["test_score"])
rmse_pca
64700.242256282574

With just 5 of the 17 features we get an RMSE that is less than 10% worse than the model that used all features. This model is also 3-4 times faster to train and to make predictions with. In practice there are more systematic ways of finding the best subset of features, such as forward and backward selection (see link).
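For example, recent versions of scikit-learn (0.24 and later) ship a SequentialFeatureSelector that automates forward selection. A rough sketch of how it could be applied here (note that this is slow, since it refits the Random Forest many times):

from sklearn.feature_selection import SequentialFeatureSelector

# greedy forward selection of 5 principal components,
# scored with cross-validated RMSE (this takes a while to run)
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
selector = SequentialFeatureSelector(rf, n_features_to_select=5, direction='forward',
                                     cv=5, scoring='neg_root_mean_squared_error')
selector.fit(X_pca, y)
print(np.where(selector.get_support())[0])  # indices of the selected components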

Exercise 2

Make your own selection of principal components (e.g. the first five) and see if you can beat the selection above.
