In Part 1 of this series, I provided a non-technical intuition about what principal component analysis (PCA) is and how we can use it to find lower dimensional projections of our data that conserve the majority of the signal.
In doing so, I have explicitly or implicitly made a number of assertions:
- PCA is a projection of the data onto a new set of uncorrelated (orthogonal) basis vectors, which turn out to be the eigenvectors of the covariance matrix of the data
- The new coordinate system is found by looking for the directions of maximum variance in the data
- The eigenvector associated with the largest eigenvalue is the direction of maximum variance
- The new basis vectors can be used to project the data onto a lower dimensional subspace that captures the majority of the signal of the data
In the second installment of this series on PCA, I will justify those assertions mathematically. For this I am relying heavily on this excellent resource. Less technical explanations and visual representations can be found in the reference section.
finding the ideal basis by maximizing the variance
We will first investigate how to find a new set of basis vectors that allows us to re-express the data taking into account its structure. For simplicity, we start with the projection of a single data vector , which we want to project onto a unit vector .
The component of the data vector along the unit vector is given by their dot product . The result of the dot product is a scalar, so the projected vector along is a scaled version of .
We can measure the error of the projection as the squared distance of the projected vector from the original vector . This is also called the residual sum of squares (RSS).
The length of the unit vector is 1 by definition, so the equation simplifies to
If we want to find the unit vector that minimizes the RSS for vectors , we just sum up the errors from the projections of all onto .
For minimizing RSS with respect to we only care about the components of the equation that depend on . Isolation of the components that do depend on simplifies the problem considerably.
Notice that the component that depends on has a minus sign. This means that in order to minimize RSS, we need to maximize
Since does not depend on it is equivalent to maximize
which is the sample mean of the square of the dot product between and . As we are not dealing with weighted averages, the sample mean is the same as the expected value (E). Recall that the mean of the square minus the square of the mean equals the variance
If we substitute for and rearrange, we get
Now, let’s assume that all the vectors are centered, so that their mean is 0. If you have ever asked yourself why it is necessary to either center or scale the data before using PCA, this is why! This means that their projections onto also sum up to 0 and we are left with
Finding the unit vector that minimizes the projection error of the data vectors onto can be achieved by maximizing the variance of the dot products of and .
It makes intuitive sense that the dimension associated with the highest variance within the data has the potential to contain the most “information”. To be fair, there is also the most potential for noise. The math tells us that maximizing the variance is equivalent to minimizing the projection error, which makes even more intuitive sense.
the ideal basis for projection is the eigenbasis
The previous section established that we need to maximize the variance of to find the vector that minimizes the projection error. In this section we will go through the steps of the optimization problem and see that the set of unit vectors that maximize the variance upon projection of the data matrix are the eigenvectors of the feature covariance matrix of .
It is convenient to transition from summation notation to matrix notation at this point. The variance expressed in matrix notation. If we stack up all our data vectors into a matrix , the dot product gives us the projection of the vectors onto .
The feature covariance matrix is
We can see that simplification and rearrangement yields the following expression for the variance.
In order to set up the optimization problem, we define a function we want to maximize. In our case it is just the variance.
As we are not looking for all possible vectors but only unit vectors we set up the following constraint and define the constraint function .
Rearrangement and addition of the Langrange multiplier results in the constraint function , which we want to optimize.
Partial differentiation of with respect to results in
This is the eigenvector equation for the feature covariance matrix .
The desired vector is an eigenvector of the feature covariance matrix . Among the eigenvectors, the eigenvector with the largest eigenvalue will be the vector that captures the maximum variance or, equivalently, minimizes the residual sum of squares. It is the first principal component.
Sorting the eigenvector/eigenvalue pairs by eigenvalue from largest to smallest, we obtain all principal components. By definition, eigenvectors are orthogonal to each other and are a basis of the p-dimensional feature space.
The sum of the eigenvalues is the sum of the variance described by each component
We can now determine the fraction of variance explained by principal components where
Stacking our eigenvectors as columns into a gives us the weight matrix . The connection to the eigendecomposition can be seen clearly now.
Rearranging yields the classical form of the eigendecomposition of a square matrix.
Relationship to singular value decomposition
Software implementations of PCA often use a matrix factorization called the singular value decomposition (SVD) to compute the eigenvector/eigenvalue pairs of a matrix. SVD is closely related to eigendecomposition but offers the advantage that it is more numerically accurate and can operate on matrices that are not square and does not require the computation of the feature correlation matrix .
SVD decomposes a matrix into three components: An matrix , a diagonal matrix , and a matrix . and are both unitary matrices meaning that their transpose is also their inverse. The columns of the matrix contains the “left singular vectors” and the columns of the “right singular vectors”. The diagonal elements of are called the “singular values”, which are related to the eigenvalues.
We can think of SVD as breaking down the linear transformation performed by into a rotation (), a scaling (), and another rotation ().
The relationship to the eigendecomposition becomes clear when we investigate the SVD of
Recall the eigendecomposition . It is evident that the eigenvalues are the squares of the singular values
or, equivalently, the singular values are the square roots of the eigenvalues.
Eigendecomposition and SVD are highly related techniques. Both can be used to calculate eigenvectors and eigenvalues of a matrix with the exception that eigendecomposition requires a square matrix and SVD does not.
Projection of the data onto the new basis
One of the promises of PCA was that it finds rotation of the data that is more “natural”. So far we have only identified the new coordinate system (the eigenvectors) and which directions are the most variable (based on the eigenvalues). We have not yet put the data into the new frame of reference. This process is called “projection” or “transformation”.
The projection of the data matrix onto its eigenbase is given by
If we want to project the data on a subspace of , we just use the first columns of . Again we choose by how much of the variance we want to conserve.
Maybe you have determined the eigenvectors and eigenvalues of using SVD. If we substitute into the previous equation, we get
and are both matrices of the right eigenvectors of .
Alternatively, we could just use in place of
Using only components to project onto a lower dimensional subspace yields
When applying PCA in practice, we let the software do the heavy lifting, i.e. let it figure out the eigenvectors and eigenvalues of the data matrix . Once we have the eigenvector/eigenvalue pairs, obtaining the projection onto the new basis is a simple matrix-matrix product.
Choosing how many components to use for the projection of your data is the real work for you as a user of PCA. We will face this challenge in the next part of this series, which will cover the practical application of PCA using R.
Lindsay Smith – A Tutorial on Principal Components Analysis
Jeff Jauregui – Principal component analysis with linear algebra
Cosma Shalizi – Principle Components – Mathematics, Example, Interpretation
George Dallas – Principle Component Analysis 4 Dummies
Sebastian Raschka – Principle Component Analysis in 3 Simple Steps
Part 1: An Intuition
Part 2: A Look Behind The Curtain
Part 3: In the Trenches
Part 4: Potential Pitfalls
Part 5: Eigenpets