# PCA – Part 1: An intuition

Pee-See-Ay. Just the sound of these three letters inspires awe in an experimental biologist. I am speaking from personal experience. Let’s grab a hammer and chisel and get to work on that pedestal.

In this first post of a series on principal component analysis (PCA), my goal is to avoid any mathematical formalism and provide an intuition about what PCA is and how this technique can be useful when analyzing data.

In follow-up posts, I will explain the mathematical basis of PCA, show practical examples of applying PCA using R, and discuss some points to keep in mind when using PCA.

### When to do PCA?

Most explanations of PCA start out with what it is mathematically and how it can be calculated. We will get to that later. I usually find it more instructive to first understand in which situations a method like PCA should be applied. Once I know what the practical applications are, the mathematics tend to make a lot more sense.

My own work is in genomics, so the first example I can think of is generally related to gene expression. Imagine a number of patient samples for which you have measured the transcript levels of all expressed genes. The number of patient samples (<100) is usually much lower than the number of transcripts you measure (>10000). In other words, the number of observations (samples) is smaller than the number of features (transcripts).

For most types of analyses, especially machine learning algorithms, this is not a favorable starting point. When faced with a large number of features, it is safe to assume that a good number of them contain very little information, such as measurements on genes that are not differentially expressed, or that there is redundancy (co-linearity) between features, e.g. genes that vary together. In any event, too many features increase computation time and colinear features can even produce unstable results in some algorithms.

In such a situation, PCA can be extremely useful because one of its strengths is to optimize the features based on the “hidden” structure of the data. PCA aims to find a new representation of the data by extracting combinations of features (components) from the data that are uncorrelated with each other and ordered by “importance”. This allows us to approximate the data with a reduced number of features. Dimensional reduction makes PCA a valuable tool not only for preparing data for machine learning but also for exploratory data analysis and data visualization.

### What does pca do?

Mathematically, the new features PCA provides us with are linear combinations of the old features. That means each of the old features weighs in – to different extents – to make up the new features. In PCA jargon, the weights of the original features are called “loadings“. It is a little bit like saying, take 1 cup of flour, 1 egg, 1 cup of milk, 3 teaspoons of baking powder, 1 teaspoon of sugar, and 1 teaspoon of salt and call it “pancake”. So our new feature (pancake) is made up of a combination of different amounts (the “loadings”) of the old features (the ingredients).

An important aspect of PCA is that the new features are “orthogonal” to each other, which is a more general way of expressing what I have called “uncorrelated” before. Orthogonality is a concept of linear algebra, which defines two vectors as orthogonal if their dot product is 0. It is related to “perpendicular”, which is commonly understood as having a “right” angle between to entities, such as the sides of a square or the x- and y-axis of our Cartesian coordinate system. In fact, you can think of PCA as a rotation of the data into a new coordinate system that “fits” the data more naturally. The axes of the new coordinate system are the eigenvectors of the data matrix. Eigenvectors are particular (“eigen”) to a given matrix that do not change direction under linear transformation by that matrix. Eigenvectors may, however, change length and/or orientation. The amount of scaling under linear transformation is called the eigenvalue. Each eigenvector of a matrix is associated with its own eigenvalue.

### How does pca work?

In general, a square matrix with $m$ rows and columns has $m$ eigenvector/eigenvalue pairs. How does PCA know which ones are the most important? The key assumption of PCA is that the directions within the data that show the most variance contain the most information and, thus, are likely the most important. We find the eigenvectors with the highest variance along their direction by eigenvector decomposition of the covariance or correlation matrix of the data. The eigenvector with the largest eigenvalue explains the most variance. It is also called the “principal component“. Further components explain progressively less variance and are generally considered less “important”. The sum of all components explain the total variance of the data, so nothing is lost by applying the PCA itself. The original data can be reconstituted by rotating back to the original feature space.

Now we are in a position to understand why PCA can be used for dimensional reduction. If the first couple eigenvalues are much larger than the rest, they capture the majority of the variance, and a projection of the data onto the first couple of components will result in a good approximation of the original data. We implicitly assume that we have concentrated the “signal” in the first couple of components, and the rest capture merely the “noise” of the data. There is no strict cut-off for what constitutes the majority of variance. Depending on the field of research, people use numbers between 60% and 95%.

### Recap

• PCA is a data analysis method that uses linear combinations of existing feature vectors to re-expresses your data in terms of uncorrelated, orthogonal components.
• The components are selected and sorted based on how much variance they explain within the data. The sum of all components explain the total variance.
• PCA can be used to reduce the number of features within a data set by choosing only those components that explain the majority of variance and projecting the data onto a lower dimensional subspace. This can result in a good approximation of the original data.

### PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets