PCA – Part 2: A look behind the curtain

In Part 1 of this series, I provided a non-technical intuition about what principal component analysis (PCA) is and how we can use it to find lower dimensional projections of our data that conserve the majority of the signal.

In doing so, I have explicitly or implicitly made a number of assertions:

1. PCA is a projection of the data onto a new set of uncorrelated (orthogonal) basis vectors, which turn out to be the eigenvectors of the covariance matrix of the data
2. The new coordinate system is found by looking for the directions of maximum variance in the data
3. The eigenvector associated with the largest eigenvalue is the direction of maximum variance
4. The new basis vectors can be used to project the data onto a lower dimensional subspace that captures the majority of the signal of the data

In the second installment of this series on PCA, I will justify those assertions mathematically. For this I am relying heavily on this excellent resource. Less technical explanations and visual representations can be found in the reference section.

finding the ideal basis by maximizing the variance

We will first investigate how to find a new set of basis vectors that allows us to re-express the data taking into account its structure. For simplicity, we start with the projection of a single data vector $\vec{x}$, which we want to project onto a unit vector $\hat{w}$.

The component of the data vector $\vec{x}$ along the unit vector $\hat{w}$ is given by their dot product $\vec{x} \cdot \hat{w}$. The result of the dot product is a scalar, so the projected vector $\vec{x_{p}}$ along $\hat{w}$ is a scaled version of $\hat{w}$.

$\vec{x_{p}} = (\vec{x} \cdot \hat{w}) \hat{w}$

We can measure the error of the projection as the squared distance of the projected vector $\vec{x_{p}}$ from the original vector $\vec{x}$. This is also called the residual sum of squares (RSS).

$RSS(\hat{w}) = \| \vec{x} - \vec{x_p} \|^2 = \| \vec{x} - (\vec{x} \cdot \hat{w}) \hat{w} \|^2= \| \vec{x} \|^2 - 2 (\vec{x} \cdot \hat{w})^2 + \| \hat{w}\|^2$

The length of the unit vector $\hat{w}$ is 1 by definition, so the equation simplifies to

$RSS(\hat{w}) = \| \vec{x} \|^2 - 2 (\vec{x} \cdot \hat{w})^2 + 1$

If we want to find the unit vector $\hat{w}$ that minimizes the RSS for $n$ vectors $\vec{x}$, we just sum up the errors from the projections of all $\vec{x_{i}}$ onto $\hat{w}$.

$RSS(\hat{w}) = \sum \limits_{i=1}^{n} \| \vec{x_{i}} \|^2 - 2 (\vec{x_{i}} \cdot \hat{w})^2 + 1$

For minimizing RSS with respect to $\hat{w}$ we only care about the components of the equation that depend on $\hat{w}$. Isolation of the components that do depend on $\hat{w}$ simplifies the problem considerably.

$RSS(\hat{w}) = (n + \sum \limits_{i=1}^{n} \| \vec{x_{i}} \|^2) - 2 \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2$

Notice that the component that depends on $\hat{w}$ has a minus sign. This means that in order to minimize RSS, we need to maximize

$\sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2$

Since $n$ does not depend on $\hat{w}$ it is equivalent to maximize

$\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2$

which is the sample mean of the square of the dot product between $\vec{x}$ and $\hat{w}$. As we are not dealing with weighted averages, the sample mean is the same as the expected value (E). Recall that the mean of the square minus the square of the mean equals the variance

$Var(z) = E(z^2) - [E(z)]^2$

If we substitute $(\vec{x_{i}} \cdot \hat{w})$ for $z$ and rearrange, we get

$\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2 = [\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})]^2 + Var(\vec{x_{i}} \cdot \hat{w})$

Now, let’s assume that all the vectors $\vec{x}$ are centered, so that their mean is 0. If you have ever asked yourself why it is necessary to either center or scale the data before using PCA, this is why! This means that their projections onto $\hat{w}$ also sum up to 0 and we are left with

$Var(\vec{x_{i}} \cdot \hat{w}) = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2$

Finding the unit vector $\hat{w}$ that minimizes the projection error of the data vectors $\vec{x_{i}}$ onto $\hat{w}$ can be achieved by maximizing the variance of the dot products of $\vec{x_{i}}$ and $\hat{w}$.

Conclusions

It makes intuitive sense that the dimension associated with the highest variance within the data has the potential to contain the most “information”. To be fair, there is also the most potential for noise. The math tells us that maximizing the variance is equivalent to minimizing the projection error, which makes even more intuitive sense.

the ideal basis for projection is the eigenbasis

The previous section established that we need to maximize the variance of $\vec{x} \cdot \hat{w}$ to find the vector $\hat{w}$ that minimizes the projection error. In this section we will go through the steps of the optimization problem and see that the set of unit vectors $\hat{w}$ that maximize the variance upon projection of the data matrix $\boldsymbol X$ are the eigenvectors of the feature covariance matrix $\boldsymbol Q$ of $\boldsymbol X$.

It is convenient to transition from summation notation to matrix notation at this point. The variance expressed in matrix notation. If we stack up all our data vectors $\vec{x}$ into a $n \times p$ matrix $\boldsymbol X$, the dot product $\boldsymbol X \boldsymbol w$ gives us the projection of the vectors onto $\hat{w}$.

$\sigma_{\hat{w}}^2 = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2=\frac{1}{n} (\boldsymbol X \boldsymbol w)^T \boldsymbol X \boldsymbol w$

The $p \times p$ feature covariance matrix $\boldsymbol Q$ is

$\boldsymbol Q=\frac{1}{n}\boldsymbol X^T\boldsymbol X$

We can see that simplification and rearrangement yields the following expression for the variance.

$\sigma_{\hat{w}}^2=\frac{1}{n}\boldsymbol w^T\boldsymbol X^T\boldsymbol X\boldsymbol w=\boldsymbol w^T\frac{\boldsymbol X^T\boldsymbol X}{n}\boldsymbol w=\boldsymbol w^T\boldsymbol Q\boldsymbol w$

In order to set up the optimization problem, we define a function $f$ we want to maximize. In our case it is just the variance.

$f(\boldsymbol w) = \boldsymbol w^T\boldsymbol Q\boldsymbol w$

As we are not looking for all possible vectors $\boldsymbol w$ but only unit vectors $\boldsymbol w$ we set up the following constraint $\|\boldsymbol w\|^2 = \boldsymbol w^T\boldsymbol w=1$ and define the constraint function $g$.

$g(\boldsymbol w) = \boldsymbol w^T \boldsymbol w = 1$

Rearrangement and addition of the Langrange multiplier $\lambda$ results in the constraint function $u(\boldsymbol w, \lambda) = f(\boldsymbol w) - \lambda(g(\boldsymbol w) - 1)$, which we want to optimize.

$u=\boldsymbol w^T\boldsymbol Q\boldsymbol w - \lambda (\boldsymbol w^T\boldsymbol w - 1)$

Partial differentiation of $u$ with respect to $\boldsymbol w$ results in

$\frac{\partial u}{\partial \boldsymbol w} = 2 \boldsymbol Q\boldsymbol w - 2 \lambda \boldsymbol w = 0$

$\boldsymbol Q\boldsymbol w =\lambda\boldsymbol w$

This is the eigenvector equation for the feature covariance matrix $\boldsymbol Q$.

Conclusions

The desired vector $\boldsymbol w$ is an eigenvector of the feature covariance matrix $\boldsymbol Q$. Among the $p$ eigenvectors, the eigenvector with the largest eigenvalue will be the vector that captures the maximum variance or, equivalently, minimizes the residual sum of squares. It is the first principal component.

Sorting the eigenvector/eigenvalue pairs by eigenvalue from largest to smallest, we obtain all $p$ principal components. By definition, eigenvectors are orthogonal to each other and are a basis of the p-dimensional feature space.

The sum of the eigenvalues is the sum of the variance described by each component

$\sigma_{total}^2 = \sum \limits_{i=1}^p \lambda_{i}$

We can now determine the fraction of variance explained by $q$ principal components where $q \leq p$

$\sigma_{explained}^2 / \sigma_{total}^2 = \sum \limits_{i=1}^q \lambda_{i} / \sum \limits_{i=1}^p \lambda_{i}$

Stacking our eigenvectors $\boldsymbol w$ as columns into a $p \times p$ gives us the weight matrix $\boldsymbol W$. The connection to the eigendecomposition can be seen clearly now.

$\sigma^2\boldsymbol I=\lambda\boldsymbol I=\boldsymbol \Lambda =\boldsymbol W^T\boldsymbol Q\boldsymbol W\propto \boldsymbol W^T \boldsymbol X^T \boldsymbol X \boldsymbol W$

Rearranging yields the classical form of the eigendecomposition of a square matrix.

$\boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Lambda \boldsymbol W^T$

Relationship to singular value decomposition

Software implementations of PCA often use a matrix factorization called the singular value decomposition (SVD) to compute the eigenvector/eigenvalue pairs of a matrix. SVD is closely related to eigendecomposition but offers the advantage that it is more numerically accurate and can operate on matrices that are not square and does not require the computation of the feature correlation matrix $\boldsymbol Q$.

SVD decomposes a $n \times p$ matrix $\boldsymbol X$ into three components: An $n \times n$ matrix $\boldsymbol U$, a diagonal $n \times p$ matrix $\boldsymbol \Sigma$, and a $p \times p$ matrix $\boldsymbol V$. $\boldsymbol U$ and $\boldsymbol V$ are both unitary matrices meaning that their transpose is also their inverse. The columns of the matrix $\boldsymbol U$ contains the “left singular vectors” and the columns of $\boldsymbol V$ the “right singular vectors”. The diagonal elements of $\boldsymbol \Sigma$ are called the “singular values”, which are related to the eigenvalues.

$\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$

We can think of SVD as breaking down the linear transformation performed by $\boldsymbol X$ into a rotation ($\boldsymbol U$), a scaling ($\boldsymbol \Sigma$), and another rotation ($\boldsymbol V$).

The relationship to the eigendecomposition becomes clear when we investigate the SVD of $\boldsymbol X^T \boldsymbol X$

$\boldsymbol X^T \boldsymbol X = (\boldsymbol U \boldsymbol \Sigma \boldsymbol V^T)^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T = \boldsymbol V \boldsymbol \Sigma^T \boldsymbol U^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T= \boldsymbol V \boldsymbol \Sigma^2 \boldsymbol V^T$

Recall the eigendecomposition $\boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Sigma \boldsymbol W^T$. It is evident that the eigenvalues are the squares of the singular values

$\boldsymbol \Lambda = \boldsymbol \Sigma^2$

or, equivalently, the singular values are the square roots of the eigenvalues.

$\boldsymbol \Sigma = \sqrt{\boldsymbol \Lambda}$

Conclusions

Eigendecomposition and SVD are highly related techniques. Both can be used to calculate eigenvectors and eigenvalues of a matrix with the exception that eigendecomposition requires a square matrix and SVD does not.

Projection of the data onto the new basis

One of the promises of PCA was that it finds rotation of the data that is more “natural”. So far we have only identified the new coordinate system (the eigenvectors) and which directions are the most variable (based on the eigenvalues). We have not yet put the data into the new frame of reference. This process is called “projection” or “transformation”.

The projection of the data matrix $\boldsymbol X$ onto its eigenbase $\boldsymbol W$ is given by

$\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times p}$

If we want to project the data on a subspace of $\boldsymbol W$, we just use the first $q$ columns of $\boldsymbol W$. Again we choose $q$ by how much of the variance we want to conserve.

$\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times q}$

Maybe you have determined the eigenvectors and eigenvalues of $\boldsymbol X$ using SVD. If we substitute $\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T$ into the previous equation, we get

$\boldsymbol Y = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol W$

$\boldsymbol V$ and $\boldsymbol W$ are both $p \times p$ matrices of the right eigenvectors of $\boldsymbol X$.

$\boldsymbol Y_{n \times p} = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol V = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times p}$

Alternatively, we could just use $\boldsymbol V$ in place of $\boldsymbol W$

$\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times p}$

Using only $q$ components to project onto a lower dimensional subspace yields

$\boldsymbol Y_{n \times q} = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times q}$

$\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times q}$

Conclusions

When applying PCA in practice, we let the software do the heavy lifting, i.e. let it figure out the eigenvectors and eigenvalues of the data matrix $\boldsymbol X$. Once we have the eigenvector/eigenvalue pairs, obtaining the projection onto the new basis is a simple matrix-matrix product.

Choosing how many components $q$ to use for the projection of your data is the real work for you as a user of PCA. We will face this challenge in the next part of this series, which will cover the practical application of PCA using R.

Mathematics

Lindsay Smith – A Tutorial on Principal Components Analysis

Jeff Jauregui – Principal component analysis with linear algebra

Cosma Shalizi – Principle Components – Mathematics, Example, Interpretation

Visual Explanation

George Dallas – Principle Component Analysis 4 Dummies

Sebastian Raschka – Principle Component Analysis in 3 Simple Steps

PCA series

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets