Learning objectives
In this lecture we will look at dimensionality reduction and data visualisation. The feature vector of a dataset can contain many features, and each feature corresponds to its own dimension. In the first part of the notebook we will explore a method to reduce the dimensionality of a dataset. In the second part we will look at an algorithm to extract information from high-dimensional data and display its structure in two dimensions. Both of these methods fall into the category of unsupervised algorithms. The learning objectives are to be able to answer the following questions:
- What is dimensionality reduction?
- When is dimensionality reduction useful?
References
- Chapter 8: Dimensionality Reduction (Section on PCA) of Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Homework
As homework, read the references, work carefully through the notebook and solve the exercises. This lecture covers several complex topics, and it is important that you familiarise yourself with them by experimenting with the notebook.
Principal component analysis
First we look at a method to reduce the number of dimensions of a dataset. There are two main reasons why we would like to do this:
- Some machine learning algorithms struggle with high dimensional data.
- It is hard to visualise high-dimensional data; in practice it is difficult to visualise datasets with more than two or three dimensions.
There is one very common approach to reduce the dimensionality of data: principal component analysis (PCA). The idea is that not all axes contain the same amount of variance, and that there are even combinations of axes that contain most of the variance. PCA seeks to find new coordinate axes such that the variance along each axis is maximised and the axes are ordered by variance (the first principal component contains the most variance).
Figure: PCA can look intimidating at the beginning.
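To demystify the method a little, here is a small standalone sketch (not part of this notebook's pipeline) that computes the principal components of a synthetic 2D dataset directly from the eigendecomposition of the covariance matrix; all variable names are illustrative.
import numpy as np

# Synthetic 2D data with two correlated features (illustrative only)
rng = np.random.RandomState(42)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=500)
X_toy = np.column_stack([x1, x2])

# Centre the data and compute the covariance matrix
X_centred = X_toy - X_toy.mean(axis=0)
cov = np.cov(X_centred, rowvar=False)

# The principal components are the eigenvectors of the covariance matrix,
# ordered by their eigenvalues (the variance along each component)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order])      # variance along each principal component
print(eigenvectors[:, order])  # columns are the principal directions
With that intuition in mind, let's load the libraries and the housing dataset.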
import pandas as pd
import numpy as np
from dslectures.core import get_dataset, convert_strings_to_categories, rmse, fill_missing_values_with_median
from pathlib import Path
from tqdm import tqdm
# data viz
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.image as mpimg
import seaborn as sns
sns.set(color_codes=True)
sns.set_palette(sns.color_palette("muted"))
# ml magic
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_validate
#dslecture
from dslectures.core import rmse
# !pip install --upgrade dslectures
get_dataset('housing_processed.csv')
data_path = Path('../data/')
housing_data = pd.read_csv(data_path/'housing_processed.csv')
housing_data.head()
housing_data.shape
housing_data.describe()
We drop the two columns that contain categorical data with many categories. We could create an extra column for each category, but since there are many of them this would add a lot of extra features.
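Before dropping them, we can quickly check how many extra columns one-hot encoding would create, for example with pd.get_dummies (a quick sketch, not part of the modelling pipeline; the column names are the ones dropped in the next cell):
# How many columns would one-hot encoding the two categorical features create?
# (astype(str) makes get_dummies treat the postal codes as categories)
n_dummy_cols = pd.get_dummies(housing_data[['city', 'postal_code']].astype(str)).shape[1]
print(f'One-hot encoding would add {n_dummy_cols} columns')
Now we drop the two columns: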
housing_data.drop(['city', 'postal_code'], axis=1, inplace=True)
Then we split the data into features and labels as usual:
X = housing_data.drop('median_house_value', axis=1)
y = housing_data['median_house_value']
feature_labels = X.columns
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
results = cross_validate(rf, X, y,
cv=5,
return_train_score=True,
scoring='neg_root_mean_squared_error')
rmse_full = -np.mean(results["test_score"])
We get an RMSE of roughly 60'000:
print(f'RMSE: {rmse_full}')
Standardisation
PCA works by finding new axes in the dataset that cover the most variance. Since the variance is the average squared distance of each sample to the sample mean, it depends on the scale of the feature. To illustrate this, think of two columns that contain the height of a person, measured once in centimetres and once in metres. Although the heights described in the two columns are the same, the variance will be 10'000 times larger in the centimetre column (100^2) and the standard deviation 100 times larger.
rand_values = np.random.randn(1000)
rand_values.var()
(100*rand_values).var()
For this reason it is important to scale the features such that they are comparable to each other. A common approach to do this is to transform the data such that the mean is zero and the standard deviation is one. The following formula achieves this:
$x_{std}=\frac{x_{old}-\mu}{\sigma}$
Scikit-learn provides a class that does this for us, called StandardScaler. We can do it in one line with the method .fit_transform(). Similar to .fit_predict(), it combines the process of fitting, which means calculating the mean and standard deviation of the dataset, with transforming the data by applying the formula above.
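As a sanity check (not in the original notebook), we can standardise by hand with the formula above and compare the result with StandardScaler, using the feature matrix X defined earlier:
# Manual standardisation: subtract the column mean, divide by the column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)
print(np.allclose(StandardScaler().fit_transform(X), X_manual.values))
Now we standardise the full feature matrix: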
X_std = StandardScaler().fit_transform(X)
X_std.shape
We can see that the mean is zero (or close enough) and the standard deviation is one:
X_std.mean(axis=0)
X_std.std(axis=0)
Now we apply principal component analysis to the standardised features. After initialising a PCA object we can use .fit_transform() to find the principal components and then transform the dataset into these new coordinates.
Note: never use fit_transform on the test set. Use it on the training set and then apply the learned mean and standard deviation to the test set with transform. This is true for all data preprocessing (a sketch of this pattern follows the next code cell).
pca = PCA()
X_pca = pca.fit_transform(X_std)
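To make the note above concrete, here is a minimal sketch of that pattern with a hypothetical train/test split (the split and the variable names are illustrative, not part of this notebook's analysis):
# Illustrative only: fit the scaler and the PCA on the training data,
# then apply the *fitted* transformations to the test data with transform().
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # fit + transform on the training set
X_test_std = scaler.transform(X_test)         # only transform the test set
pca_split = PCA()
X_train_pca = pca_split.fit_transform(X_train_std)  # fit PCA on the training set
X_test_pca = pca_split.transform(X_test_std)        # reuse the fitted components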
We can see that the transformed dataset still has the same shape. However, the columns are no longer the original features but the coordinates in the new principal component coordinate system.
X_pca.shape
Explained variance
The explained variance is an important concept in principal component analysis. It tells us how much of the total variance of the dataset is contained along each principal component axis. Components with more variation can encode more information than components with little variation. The explained variance is stored in the pca object, and we can use it to visualise its distribution. The attribute explained_variance_ratio_ tells us the percentage of the total variance that each component contains.
pca.explained_variance_ratio_
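As a sanity check (not part of the original notebook), the ratios can be recomputed by hand from the transformed data, which makes their definition explicit:
# The explained variance ratio of a component is the variance along that
# component divided by the total variance summed over all components.
component_var = X_pca.var(axis=0)
print(np.allclose(component_var / component_var.sum(), pca.explained_variance_ratio_))
We can also visualise the distribution of the ratios with a bar plot: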
pca_labels = [f'principal component {i+1}' for i in range(len(pca.explained_variance_ratio_))]
barplot = sns.barplot(x=pca_labels, y=pca.explained_variance_ratio_)
barplot.set_xticklabels(pca_labels, rotation=90)
plt.title('explained variance ratio')
plt.show()
We can see that the first three components together already account for 50% of the total variance in the dataset. We can visualise this more systematically by plotting the cumulative sum of the explained variance ratios:
sns.lineplot(x=np.arange(1, len(pca.explained_variance_ratio_) + 1), y=np.cumsum(pca.explained_variance_ratio_))
plt.ylabel('total explained variance')
plt.xlabel('number of included PCA components')
plt.show()
We see that with 7 of the 17 axes we already explain 90% of the variance in the dataset, and that after 10 components we almost reach 100%. That means that components 11-17 probably don't contain much information and can be discarded.
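Scikit-learn can make this selection for us: if n_components is a float between 0 and 1, PCA keeps just enough components to explain that fraction of the variance. A short sketch (the variable names are illustrative):
# Keep the smallest number of components that explains at least 90% of the variance
pca_90 = PCA(n_components=0.9)
X_pca_90 = pca_90.fit_transform(X_std)
print(f'{pca_90.n_components_} components explain {pca_90.explained_variance_ratio_.sum():.1%} of the variance')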
PCA vector composition
Naturally, we would like to know what these principal components mean. For example, the first component contains 25% of the total variance; what information is stored in that component? The pca object also contains the attribute components_, which defines the transformation between the original dataset and the transformed one. We can print a few of the components and interpret them:
def print_tabular(labels, values):
    df = pd.DataFrame(data=values, columns=labels)
    display(df.T)
print('The first three principal components:')
print_tabular(feature_labels, pca.components_[:3])
The way to read this table is to think of it as a recipe to construct the first three principal component coordinates from the original features. Given a new data point, we can for example calculate the first principal component by multiplying the longitude by its value in column 0, adding the latitude multiplied by its value in column 0, and so on for all features. This weighted sum of the original features yields the first coordinate in the PCA coordinate system.
$x_{PC,i} = \sum_{j=0}^{n-1} w_{i,j} \cdot x_{feat,j}$
From this we can see that the first component mainly consists of the three features total_rooms, total_bedrooms, and households:
$x_{PC,0} = 0.096 \cdot longitude + (-0.093) \cdot latitude + (-0.229) \cdot housing\_median\_age + \dots$
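We can verify this interpretation numerically: each PCA coordinate is the dot product of the (standardised, mean-centred) feature vector with the corresponding row of pca.components_. A small sanity check, assuming X_std and X_pca from above:
# Each PCA coordinate is a weighted sum of the standardised features.
# PCA also subtracts pca.mean_, which is ~0 here because X_std is already standardised.
print(np.allclose((X_std - pca.mean_) @ pca.components_.T, X_pca))
# The first coordinate of the first sample as an explicit weighted sum:
print(np.isclose((X_std[0] - pca.mean_) @ pca.components_[0], X_pca[0, 0]))
With that established, we can plot the data in the plane spanned by the first two principal components: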
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], alpha=0.1)
plt.show()
jplot = sns.jointplot(x=X_pca[:,0], y=X_pca[:,1], kind="hex", color="#4CB391",xlim=[-5,10], ylim=[-5,10])
jplot.set_axis_labels('Principal component 1', 'Principal component 2')
We observe that there seem to be two visible clusters along these two axes. We could now investigate these clusters further and might find some property of the dataset that could be useful to us or our customer. Let's do it!
plt.hist(X_pca[:,1], bins=np.linspace(-5, 5, 50))
plt.show()
We can see that the clusters seem to be separated around -0.5. We can create a mask for the data points that are below that threshold.
left_group = X_pca[:,1]<-0.5
Furthermore, we only want to investigate the features that contribute to the second principal component. Therefore, we set another threshold for the minimum absolute feature coefficient.
pc1_features = list(feature_labels[np.abs(pca.components_[1])>0.2])
These are the features that contribute most to the second principal component:
pc1_features
Unfortunately, the StandardScaler transformed the DataFrame X into an array X_std. Let's turn it into a DataFrame again:
X_std = pd.DataFrame(data=X_std, columns=feature_labels)
With loc and the mask we just created we can now slice out the data points of the left and right groups. We then look at the statistics of the relevant features of the two groups with .describe().
X_std.loc[left_group, pc1_features].describe()
X_std.loc[~left_group, pc1_features].describe()
Note: the ~ operator creates the boolean complement, so True becomes False and vice versa.
Just looking at the feature means of the two groups, we see that there seems to be a significant difference in almost all features. To get a more detailed view, let's compare the histograms of each feature:
for col in pc1_features:
    bins = np.linspace(-3, 3, 20)
    fig, ax = plt.subplots(1, 1, figsize=(10, 5))
    X_std.loc[left_group, col].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
    X_std.loc[~left_group, col].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
    plt.legend(loc='best')
    plt.title(col)
    plt.show()
We can see that the most significant difference is between the longitude and latitude. Another interesting observation is that in the first group (blue) the rooms per household seem to be slightly lower and the bedrooms per room slightly higher. We will discuss this further below. First we want to look at the longitude and latitude distributions of the two groups, starting with a hexplot to see the hotspots.
import matplotlib.gridspec as gridspec
print("Group 1")
sns.jointplot(
kind='hex',
x="longitude",
y="latitude",
cmap='magma',
data=X.loc[left_group],
)
plt.show()
print("Group 2")
sns.jointplot(
kind='hex',
x="longitude",
y="latitude",
cmap='magma',
data=X.loc[~left_group],
)
plt.show()
We see that in the first group there are mainly two hotspots, while in the second group the hotspots are more widely distributed. Recall that we can plot the longitude and latitude on top of a map. Let's investigate what these hot regions correspond to.
california_img=mpimg.imread('images/california.png')
fig, axes = plt.subplots(1, 2, figsize=(12,6))
fig = sns.scatterplot(
x="longitude",
y="latitude",
data=X.loc[left_group],
alpha=0.1,
s=1,
color="r",
edgecolor="none",
ax=axes[0]
)
axes[0].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[0].set_title("Group 1")
fig = sns.scatterplot(
x="longitude",
y="latitude",
data=X.loc[~left_group],
alpha=0.05,
s=3,
color="r",
edgecolor="none",
ax=axes[1]
)
axes[1].imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5)
axes[1].set_title("Group 2")
plt.show()
Comparing with a satellite image of California (see below), we see that the two hotspots in the first group correspond to Los Angeles and San Diego. The second group mainly includes San Francisco, Silicon Valley, Sacramento, Fresno and Bakersfield. What is interesting to observe is that the second group also includes Riverside, which is right next to Los Angeles. That means that the second principal component is not just the diagonal along the coast but specifically encodes Los Angeles and San Diego. If one wanted to build a classifier to determine whether a house is in Los Angeles or in another region of California, one would only need to look at the second principal component of the dataset.
Circling back to the observation about rooms per household: it seems that in Los Angeles the apartments have slightly fewer rooms than in the rest of California but a similar number of bedrooms, hence the higher number of bedrooms per room. In other words, in Los Angeles you get fewer extra rooms per household.
Figure: Satellite image of California from Google Maps.
Finally, we can also look at the median value of the houses in the two groups:
bins=50
fig, ax = plt.subplots(1,1, figsize=(10,5))
y.loc[left_group].hist(ax=ax, bins=bins, color='C0', alpha=0.5, label='Group 1')
y.loc[~left_group].hist(ax=ax, bins=bins, color='C1', alpha=0.5, label='Group 2')
plt.legend(loc='best')
plt.title('median value')
plt.show()
One observes that the median value is higher for the first group (Los Angeles & San Diego). However, the long tail is quite similar in both groups. So living seems to be more expensive for people with low and medium incomes, while at the upper end the prices seem to be comparable.
Finally, let's see how well a Random Forest performs when it is trained on only the first few principal components. We cross-validate the model for every number of included components from 1 to 17:
rmse_valid = []
for i in tqdm(range(1, 18)):
    rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
    results = cross_validate(rf, X_pca[:, :i], y, cv=5,
                             return_train_score=True,
                             scoring='neg_root_mean_squared_error')
    rmse_valid.append(-np.mean(results["test_score"]))
As an additional baseline we also create a median regressor:
rmse_median = rmse(y, [np.median(y)]*len(y))
Now we can plot the performance, with the full Random Forest and the median regressor as baselines, for all subsets of components.
lineplot = sns.lineplot(x=np.arange(len(rmse_valid))+1 , y=rmse_valid, label='pca')
lineplot.axhline(rmse_median, c='y', linestyle='--', label='median')
lineplot.axhline(rmse_full, c='r', linestyle='--', label='full model')
plt.legend(loc='best')
plt.xlabel('number of PCA components')
plt.ylabel('RMSE');
So we can see that already with 4-7 principal components we get results that are quite close to the full model. As mentioned previously this has two advantages:
- Visualising 4 components is still hard but feasible compared to 17.
- A Random Forest trained on half the features is roughly twice as fast for training and inference.
Looking at the graph we see that there are several steep steps. It seems that adding these components improved the performance significantly. By taking the components with the biggest steps we can try to build a better selection of features:
features = [0,1,3,6,10]
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
results = cross_validate(rf, X_pca[:, features], y, cv=5,
return_train_score=True,
scoring='neg_root_mean_squared_error')
rmse_pca = -np.mean(results["test_score"])
rmse_pca
With just 5 of the 17 features we get an RMSE within 10% of the model that used all features. This model is 3-4 times faster to train and to make predictions. In practice there are more systematic ways of finding the best subset of features, such as forward and backward selection (see link).
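As an illustration (not part of the original notebook), forward selection is available in recent versions of scikit-learn (>= 0.24) as SequentialFeatureSelector. The sketch below greedily adds the PCA component that improves the cross-validated score the most until five components are selected; note that this can take a while to run.
from sklearn.feature_selection import SequentialFeatureSelector

# Forward selection over the PCA components with the same model and scoring as above
rf = RandomForestRegressor(n_estimators=100, max_features='sqrt', n_jobs=-1, random_state=42)
selector = SequentialFeatureSelector(rf, n_features_to_select=5, direction='forward',
                                     scoring='neg_root_mean_squared_error', cv=5)
selector.fit(X_pca, y)
print(np.where(selector.get_support())[0])  # indices of the selected principal components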