In the first three parts of this series on principal component analysis (PCA), we have talked about what PCA can do for us, what it is mathematically, and how to apply it in practice. Today, I will briefly discuss some of the potential caveats of PCA.
INformation and Noise
PCA looks for the dimensions with highest variance within the data and assumes that high variance is a proxy for “information”. This assumption is usually warranted otherwise PCA would not be useful.
In cases of unsupervised learning, that is if we have no class labels of the data available, looking for structure within the data based on the data itself is our only choice. In a sense, we cannot tell what parts of the data are information and what parts are noise.
If we have class labels available (supervised learning), we could in principle look for dimensions of variance that optimally separate the classes from each other. PCA does not do that. It is “class agnostic” and thus treats “information”-variance and “noise”-variance the same way.
It is possible that principle components associated with small eigenvalues nevertheless carry the most information. In other words, the size of the eigenvalue and the information content are not necessarily correlated. When choosing the number of components to project our data, we could thus lose important information. Luckily, such situations rarely happen in practice. Or we just never realize …
There are other techniques related to PCA that attempt to find dimensions of the data that optimally separate the data based on class labels. The most famous is Fisher’s “Linear Discriminant Analysis” (LDA) and its non-linear cousins “Quadratic Discriminant Analysis” (QDA).
In Part 3 of this series, we have looked at a data set containing a multitude of motion detection measurements of humans doing various activities. We used PCA to find a lower dimensional representation of those measurements that approximate the data well.
Each of the original measurements were quite tangible (despite their sometimes cryptic names) and therefore interpretable. After PCA, we are left with linear combinations of those original features, which may or may not be interpretable. It is far from guaranteed that the eigenvectors correspond to “real” entities, they may just be convenient summaries of the data.
We will rarely be able to say the first principle component means “X” and the second principle component means “Y”, however tempting it may be based on our preconceived notions of the data. A good example of that is mentioned in Cosma Shalizi’s excellent notes on PCA. Cavalli-Sforza et al. analyzed the distribution of human genes using PCA and interpreted the principal components as patterns of human migration and population expansion. Later, November and Stephens showed that similar patterns could be obtained using simulated data with spatial correlation. As humans are genetically more similar to humans they close to (at least historically), genetic data is necessarily spatially correlated and thus PCA will uncover such structures, even if they do not represent “real” events or are liable to misinterpretation.
Linear algebra tells us that eigenvectors are orthogonal to each other. A set of orthogonal vectors form a basis of an -dimensional subspace. The principle components are eigenvectors of the covariance matrix and the set of principle components form a basis for our data. We also say that the principle components are “uncorrelated”. This becomes obvious when we remember that matrix decomposition is sometimes called “diagonalization”. In the eigendecomposition, the matrix containing the eigenvalues has zeros everywhere but on its diagonal, which contains the eigenvalues.
Variance and covariance are measures in the L2 norm, which means that they involve the second moment or square. Being uncorrelated in the L2 norm does not mean that there is no “correlation” in higher norms, in other words the absence of correlation does not imply independence. In statistics, higher order norms are skew (“tailedness” or third moment) and kurtosis (“peakedness” or fourth moment). Techniques related to PCA such as Independent Component Analysis (IDA) can be used to extract two separate, but convolved signals (“independent components”) from each other based on higher order norms.
The distinction between correlation and independence is a technical point when it comes to the practical application of PCA but certainly worth being aware of.
Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation
Cavalli-Sforza et al. – The History and Geography of Human Genes (1994)
Novembre & Stephens – Interpreting principal component analyses of spatial genetic variation (2008)
Part 1: An Intuition
Part 2: A Look Behind The Curtain
Part 3: In the Trenches
Part 4: Potential Pitfalls
Part 5: Eigenpets