PCA – Part 5: Eigenpets

In this post scriptum to my series on Principal Component Analysis (PCA) I will show how PCA can be applied to image analysis. Given a number of images of faces, PCA decomposes those images into “eigenfaces” that are the basis of some facial recognition algorithms. Eigenfaces are the eigenvectors of the image data matrix. I have shown in Part 3 of this series that the eigenvectors that capture a given amount of variance of the data can be used to obtain an approximation of the original data using fewer dimensions. Likewise, we can approximate the image of a human face by a weighted combination of eigenfaces.

From images to matrices

A digital image is just a matrix of numbers. Fair game for PCA. You might ask yourself, though, how to coerce many image matrices into a single data matrix with samples in the rows and features in the columns. Just stack the columns of each image matrix to get a single long vector and then stack the image vectors to obtain the data matrix. For example, a 64 by 64 image matrix would result in a 4096-element image vector, and 100 such image vectors would be stacked into a 100 by 4096 data matrix.
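As a minimal sketch of this stacking step (using simulated pixel values in place of real images), the conversion could look like this in R:

# minimal sketch: stack 100 simulated 64 by 64 "images" into a data matrix
set.seed(1)
imgs <- replicate(100, matrix(runif(64 * 64), nrow = 64), simplify = FALSE)
# flatten each image matrix column-wise into a 4096-element vector
img_vectors <- lapply(imgs, as.vector)
# stack the image vectors as rows: 100 samples by 4096 features
data_matrix <- do.call(rbind, img_vectors)
dim(data_matrix)  # 100 4096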

The more elements a matrix has, the more computationally expensive it becomes to do the matrix factorization that yields the eigenvectors and eigenvalues. As the number of images (samples) is usually much smaller than the number of pixels (features), it is more efficient to compute the eigenvectors of the transpose of the data matrix with the pixels in the rows and the images in the columns.

Fortunately, there is an easy way to get from the eigenvectors of the covariance matrix \boldsymbol A^T \boldsymbol A to those of the covariance matrix \boldsymbol A \boldsymbol A^T. The eigenvector equation of \boldsymbol A^T \boldsymbol A is

\boldsymbol A^T \boldsymbol A \vec{v} = \lambda \vec{v}

Multiplying by \boldsymbol A on the left on both sides reveals the relationship between the eigenvectors and eigenvalues of the two matrices.

\boldsymbol A \boldsymbol A^T (\boldsymbol A \vec{v}) = \lambda (\boldsymbol A \vec{v})

The nonzero eigenvalues of the covariance matrices \boldsymbol A^T \boldsymbol A and \boldsymbol A \boldsymbol A^T are identical. If \boldsymbol A is an m by n matrix with m < n, there are at most m nonzero eigenvalues, and the larger matrix \boldsymbol A^T \boldsymbol A has at least n-m additional zero eigenvalues.

To get from an eigenvector \vec{v} of \boldsymbol A^T \boldsymbol A to an eigenvector of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left.
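Here is a quick numerical check of this relationship on a small random matrix (the dimensions are arbitrary and only serve as an illustration):

# numerical check on a small random matrix with m = 5 samples and n = 12 features
set.seed(42)
A <- matrix(rnorm(5 * 12), nrow = 5)
e_small <- eigen(A %*% t(A))   # m by m, cheap
e_big   <- eigen(t(A) %*% A)   # n by n, expensive when there are many features
# the nonzero eigenvalues agree; the larger matrix has n - m additional zeros
all.equal(e_small$values, e_big$values[1:5])
# multiplying an eigenvector v of A^T A by A gives an (unnormalized)
# eigenvector of A A^T; compare directions after normalization
v1 <- e_big$vectors[, 1]
u1 <- A %*% v1
u1 <- u1 / sqrt(sum(u1^2))
all.equal(abs(as.vector(u1)), abs(e_small$vectors[, 1]))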

What does an “eigenpet” look like?

For this demonstration, I will not be using images of human faces but, in line with the predominant interests of the internet, faces of cats and dogs. I obtained this data set when I took the course “Computational Methods for Data Analysis” on Coursera and converted the original data to text files to make them more accessible to R. The data can be found on Github.

library(RCurl)
# read data from Github
cats <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/cat.csv"), sep = ",")
dogs <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/dog.csv"), sep = ",")
# combine cats and dogs into single data frame
pets <- cbind(cats, dogs)

The data matrix is already in a convenient format. Each 64 by 64 pixel image has been converted into a 4096-element pixel vector, and the 160 image vectors are stored as columns, giving a data matrix with dimensions 4096 by 160. As you might have guessed, there are 80 cats and 80 dogs in the data set.
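A quick sanity check of the stated dimensions:

# the pixels are in the rows, the images in the columns
dim(pets)                   # 4096 160
c(ncol(cats), ncol(dogs))   # 80 80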

The distinction between what is a sample and what is a feature becomes a little blurred in this analysis. Technically, the 160 images are the samples and the 4096 pixels are the features, so we should do PCA on a 160 by 4096 matrix. However, as discussed above it is more convenient to operate with the transpose of this matrix, which is why image data usually is prepared in its transposed form with features in rows and samples in columns.

As we have seen theoretically in Part 2 of this series, it is necessary to center (and scale) the features before performing PCA. We are dealing with 8-bit greyscale pixel values that are naturally bounded between 0 and 255, so they already are on the same scale. Moreover, all features measure the same quantity (brightness), so it is customary to center the data by subtracting the “average” image from the data.

# compute "average" pet
pet0 <- rowMeans(pets)
cat0 <- rowMeans(pets[, 1:80])
dog0 <- rowMeans(pets[, 81:160])

So what do the average pet, the average cat, and the average dog look like?

# create grey scale color map
greys <- gray.colors(256, start = 0, end = 1)
# convenience function to plot pets
show_image <- function(v, n = 64, col = greys) {
    # transform image vector back to matrix
    m <- matrix(v, ncol = n, byrow = TRUE)
    # invert columns to obtain right orientation
    # plot using "image"
    image(m[, nrow(m):1], col = col, axes = FALSE)
}
# plot average pets
for (i in list(pet0, cat0, dog0)) {
    show_image(i)
}

pca_part5_fig1

As expected, the average pet has features of both cats and dogs, while the average cat and dog are quite recognizable as members of their respective species.

Let’s run PCA on the data set and see what an “eigenpet” looks like.

# subtract average pet
pets0 <- pets - pet0
# run pca
pc <- prcomp(pets0, center = FALSE, scale. = FALSE)

Ordinarily, we would find the eigenvector matrix in the “rotation” object returned to us by “prcomp”. Remember that we did PCA on the transpose of the data matrix, so we have to do some additional work to get to the eigenvectors of the original data. I have shown above that to get from the eigenvectors of \boldsymbol A^T \boldsymbol A to the eigenvectors of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left. \boldsymbol A, in this case, is our centered data matrix “pets0”.

Note that these unscaled eigenvectors (the eigenfaces) are equivalent to the projection of the data onto the eigenvectors of the transposed data matrix. Therefore, we do not have to compute them explicitly but can use the “pc$x” object.

# obtain unscaled eigenfaces (eigenvectors of the original data)
u_unscaled <- as.matrix(pets0) %*% pc$rotation
# this turns out to be the same as the projection of the data 
# stored in "pc$x"
all.equal(u_unscaled, pc$x)

Both “u_unscaled” and “pc$x” are unscaled versions of the eigenvectors, which means that they do not form a unitary matrix. For plotting, this does not matter because the images will be scaled automatically. If scaled eigenvectors are important, it is more convenient to use singular value decomposition and take the left singular vectors stored in the matrix “u”.

# singular value decomposition of data
sv <- svd(pets0)
# left singular vectors are stored in matrix "u"
u <- sv$u

Let’s look at the first couple of “eigenpets” (eigenfaces).

# display first 6 eigenfaces
for (i in seq(6)) {
    show_image(pc$x[, i])
}

pca_part5_fig2

Some eigenfaces are definitely more cat-like and others more dog-like. We also see a common issue with eigenfaces. The directions of highest variance are usually associated with the lighting conditions of the original images. Whether they are predominantly light or dark dominates the first couple of components. In this particular case, it appears that the images of cats have stronger contrasts between foreground and background. If we were to do face recognition, we would either preprocess the images to have comparable lighting conditions or exclude the first couple of eigenfaces.

Reconstruction of Pets using “Eigenpets”

Using the eigenfaces we can approximate or completely reconstruct the original images. Let’s see what this looks like with a couple of different variance cut-offs.

# a number of variance cut-offs
vars <- c(0.2, 0.5, 0.8, 0.9, 0.98, 1)
# calculate cumulative explained variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# get the number of components that explain a given amount of variance
npc <- sapply(vars, function(v) which(var_expl >= v)[1])
# reconstruct four cats and four dogs
for (i in seq(79, 82)) {
    for (j in npc) {
        # project data using "j" principal components
        r <- pc$x[, 1:j] %*% t(pc$rotation[, 1:j])
        show_image(r[, i])
        text(x = 0.01, y = 0.05, pos = 4, labels = paste("v =", round(var_expl[j], 2)))
        text(x = 0.99, y = 0.05, pos = 2, labels = paste("pc =", j))
    }
}

pca_part5_fig3

Every pet starts out looking like a cat due to the dominance of lighting conditions. Somewhere between 80% and 90% of captured variance, every pet is clearly identifiable as either a cat or a dog. Even at 90% of captured variance, we use less than one third of all components for our approximation of the original image. This may be enough for a machine learning algorithm to tell a cat apart from a dog with high confidence. We see, however, that a lot of the detail of the image is contained in the last two thirds of the components.


Reproducibility

The full R code is available on Github.


Further Reading

Scholarpedia – Eigenfaces

Jeff Jauregui – Principal Component Analysis with Linear Algebra


PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


PCA – Part 4: Potential Pitfalls

In the first three parts of this series on principal component analysis (PCA), we have talked about what PCA can do for us, what it is mathematically, and how to apply it in practice. Today, I will briefly discuss some of the potential caveats of PCA.

Information and Noise

PCA looks for the dimensions with highest variance within the data and assumes that high variance is a proxy for “information”. This assumption is usually warranted; otherwise PCA would not be useful.

In cases of unsupervised learning, that is, if no class labels are available for the data, looking for structure within the data based on the data itself is our only choice. In a sense, we cannot tell which parts of the data are information and which parts are noise.

If we have class labels available (supervised learning), we could in principle look for dimensions of variance that optimally separate the classes from each other. PCA does not do that. It is “class agnostic” and thus treats “information”-variance and “noise”-variance the same way.

It is possible that principal components associated with small eigenvalues nevertheless carry the most information. In other words, the size of the eigenvalue and the information content are not necessarily correlated. When choosing the number of components to project our data, we could thus lose important information. Luckily, such situations rarely happen in practice. Or we just never realize …
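To make this point concrete, here is a small simulated example (not part of the pet data) in which the class information sits entirely in a low-variance direction, so the first principal component misses it:

# two classes separated along a low-variance direction
set.seed(7)
n <- 200
class <- rep(c(0, 1), each = n / 2)
x1 <- rnorm(n, mean = 0, sd = 10)           # high variance, no class information
x2 <- rnorm(n, mean = 2 * class, sd = 0.5)  # low variance, carries the class label
pc <- prcomp(cbind(x1, x2), center = TRUE, scale. = FALSE)
# PC1 is dominated by the uninformative x1 direction ...
pc$rotation[, 1]
# ... so the classes overlap on PC1 but separate cleanly on PC2
pc1 <- pc$x[, 1]
pc2 <- pc$x[, 2]
boxplot(pc1 ~ class, main = "PC1 by class")
boxplot(pc2 ~ class, main = "PC2 by class")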

There are other techniques related to PCA that attempt to find dimensions of the data that optimally separate the data based on class labels. The most famous is Fisher’s “Linear Discriminant Analysis” (LDA) and its non-linear cousin “Quadratic Discriminant Analysis” (QDA).

Interpretability

In Part 3 of this series, we have looked at a data set containing a multitude of motion detection measurements of humans doing various activities. We used PCA to find a lower dimensional representation of those measurements that approximates the data well.

Each of the original measurements was quite tangible (despite its sometimes cryptic name) and therefore interpretable. After PCA, we are left with linear combinations of those original features, which may or may not be interpretable. It is far from guaranteed that the eigenvectors correspond to “real” entities; they may just be convenient summaries of the data.

We will rarely be able to say that the first principal component means “X” and the second principal component means “Y”, however tempting it may be based on our preconceived notions of the data. A good example of this is mentioned in Cosma Shalizi’s excellent notes on PCA. Cavalli-Sforza et al. analyzed the distribution of human genes using PCA and interpreted the principal components as patterns of human migration and population expansion. Later, Novembre and Stephens showed that similar patterns can be obtained from simulated data with nothing but spatial correlation. As humans are genetically more similar to the people they live close to (at least historically), genetic data is necessarily spatially correlated, and PCA will uncover such structures even if they do not represent “real” events; the components are thus liable to misinterpretation.

Independence

Linear algebra tells us that the eigenvectors of a symmetric matrix are orthogonal to each other, and a set of n orthogonal vectors forms a basis of an n-dimensional subspace. The principal components are eigenvectors of the covariance matrix, and the set of principal components forms a basis for our data. We also say that the principal components are “uncorrelated”. This becomes obvious when we remember that eigendecomposition is sometimes called “diagonalization”: the resulting matrix has zeros everywhere but on its diagonal, which contains the eigenvalues.

Variance and covariance are measures based on the second moment, which means that they involve squares. Being uncorrelated in this second-moment sense does not mean that there is no dependence visible in higher moments; in other words, the absence of correlation does not imply independence. In statistics, the higher order moments are skewness (“asymmetry”, the third moment) and kurtosis (“tailedness” or “peakedness”, the fourth moment). Techniques related to PCA such as Independent Component Analysis (ICA) can be used to separate convolved signals (“independent components”) from each other based on such higher order statistics.
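A tiny simulated example of the difference: two variables can be completely dependent and still show essentially zero (Pearson) correlation.

# uncorrelated does not imply independent
set.seed(123)
x <- rnorm(10000)
y <- x^2                  # y is completely determined by x
cor(x, y)                 # close to 0: no linear (second-moment) correlation
cor(abs(x), y)            # clearly nonzero: x and y are not independent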

The distinction between correlation and independence is a technical point when it comes to the practical application of PCA but certainly worth being aware of.


Further reading

Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation

Cavalli-Sforza et al. – The History and Geography of Human Genes (1994)

Novembre & Stephens – Interpreting principal component analyses of spatial genetic variation (2008)


PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements of sepal and petal lengths and widths for three species of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.

The “Human Activity Recognition Using Smartphones Data Set” available from the UCI Machine Learning Repository contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements would be sufficient to tell the type of activity of the person. Instead of focusing on this classification problem, we will look at the structure of the data and investigate using PCA whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# "getURL" requires the RCurl package
library(RCurl)
# read data from Github
measurements <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_measurements.txt"))
description <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_description.txt"))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

  • What are the dimensions of the data, i.e. how many rows and columns?
  • What type of features are we dealing with, i.e. categorical, ordinal, continuous?
  • Are there any missing values?

The answer to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
dim(measurements)
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?
any(is.na(measurements))

The data contains 990 samples (rows) with 561 measurements (columns) each. Clearly too many measurements to visualize in a scatterplot. The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variables are not handled well by PCA; those need to be “dummy coded“. We also don’t have to worry about missing values. Strategies for handling missing values are a topic of their own.

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# "image.plot" comes from the "fields" package
library(fields)
# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

pca_part3_fig1

We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower dimensional approximation of this data set.

Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember that we need to at least center the data before using PCA (why? see Part 2). Scaling is technically only necessary if the magnitudes of the features are vastly different. Note that the data appears to be already centered and scaled from the get-go.

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data this may take a couple of seconds.

The call to “prcomp” has constructed new features as linear combinations of the old features and sorted them by the amount of variance they explain. Because the new features are the eigenvectors of the feature covariance matrix, they should be orthogonal, and hence uncorrelated, by definition. Let’s visualize this directly.

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix would be stored as “pc$x”.

# feature correlation after PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image
image.plot(cor_r, axes = FALSE)

pca_part3_fig2

The new features are clearly no longer correlated to each other. As everything seems to be in order, we can now focus on the interpretation of the results.

Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npcs = 10, type = "lines")

pca_part3_fig3

This is about as good as it gets. A large amount of the variance is captured by the first principal component followed by a sharp decline as the remaining components gradually explain less and less variance approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-offs for the percent explained variance depends on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much more shallow. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the standard deviations of the principal components are stored in a vector called “sdev”. As the name implies, these are the square roots of the variances, which in turn are scaled versions of the eigenvalues. We need to square “sdev” to get the variances back.

# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance
plot(c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1), 
     xlab = "Principal Components", ylab = "Variance explained")
# plot number of components needed to for common cut-offs of variance explained
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
    npc <- which(var_expl > v)[1]
    lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
    text(x = npc, y = v - 0.05, labels = npc, pos = 4)
    points(x = npc, y = v)
}

pca_part3_fig4

The first principal component on its own explains more than 50% of the variance and we need only 20 components to get up to 80% of the explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction of complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help to visualize the data graphically.

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the principal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to look at the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare-bones version of the figure shown below. The complete code can be found on Github.

# plot the first 8 principal components against each other
for(p in seq(1, 8, by = 2)) {
  plot(pc$x[, p:(p+1)], pch = 16, 
       col = as.numeric(description$activity_name))
}

pca_part3_fig5

We have seen previously that the first component alone explains about half of the variance and in this figure we see why. It almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job at telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure but distinctions become somewhat less clear. One conclusion we can draw from this visualization is that it will most likely be most difficult to tell “sitting” apart from “standing” as none of the dimensions seems to be able to distinguish red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.

Recap

PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

  1. Are the features numerical or do I have to convert categorical features?
  2. Are there missing values and if yes, which strategy do I apply to deal with them?
  3. What is the correlation structure of the data? Will PCA be effective in this case?
  4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
  5. How much “explained variance” is a good enough approximation of the data? This is usually a compromise between how much potential information I am willing to sacrifice for cutting down computation time of follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.


Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it calculates all the numbers we could possibly want from our PCA analysis in one line. However, it is useful to know how those numbers were generated.

The three most frequently used objects returned by “prcomp” are

  • “rotation”: right eigenvectors (“feature eigenvectors”)
  • “sdev”: square roots of scaled eigenvalues
  • “x”: projection of original data onto the new features

Rotation

In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. In fact, R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvectors of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) <- NULL
# "svd" stores right eigenvectors in matrix "v"
v <- sv$v
# check if the two matrices are equal
all.equal(w, v)

Sdev

Singular values are the square roots of the eigenvalues as we have seen in Part 2. “sdev” stands for standard deviation and thus stores the square roots of the variances. Thus, the squares of “sdev” and the squares of the singular values are directly proportional to each other and the scaling factor is the number of rows of the original data matrix minus 1.

# relationship between singular values and "sdev"
all.equal(sv$d^2/(nrow(sv$u)-1), pc$sdev^2)

x

The projection of the original data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix-matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)

If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent matrix multiplication with 
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])


Reproducibility

The full R code is available on Github.


Further reading

IRIS data set

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps


PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


PCA – Part 2: A look behind the curtain

In Part 1 of this series, I provided a non-technical intuition about what principal component analysis (PCA) is and how we can use it to find lower dimensional projections of our data that conserve the majority of the signal.

In doing so, I have explicitly or implicitly made a number of assertions:

  1. PCA is a projection of the data onto a new set of uncorrelated (orthogonal) basis vectors, which turn out to be the eigenvectors of the covariance matrix of the data
  2. The new coordinate system is found by looking for the directions of maximum variance in the data
  3. The eigenvector associated with the largest eigenvalue is the direction of maximum variance
  4. The new basis vectors can be used to project the data onto a lower dimensional subspace that captures the majority of the signal of the data

In the second installment of this series on PCA, I will justify those assertions mathematically. For this I am relying heavily on this excellent resource. Less technical explanations and visual representations can be found in the reference section.

finding the ideal basis by maximizing the variance

We will first investigate how to find a new set of basis vectors that allows us to re-express the data taking into account its structure. For simplicity, we start with a single data vector \vec{x}, which we want to project onto a unit vector \hat{w}.

The component of the data vector \vec{x} along the unit vector \hat{w} is given by their dot product \vec{x} \cdot \hat{w}. The result of the dot product is a scalar, so the projected vector \vec{x_{p}} along \hat{w} is a scaled version of \hat{w}.

\vec{x_{p}} = (\vec{x} \cdot \hat{w}) \hat{w}

We can measure the error of the projection as the squared distance of the projected vector \vec{x_{p}} from the original vector \vec{x}. This is also called the residual sum of squares (RSS).

RSS(\hat{w}) = \| \vec{x} - \vec{x_p} \|^2 = \| \vec{x} - (\vec{x} \cdot \hat{w}) \hat{w} \|^2 = \| \vec{x} \|^2 - 2 (\vec{x} \cdot \hat{w})^2 + (\vec{x} \cdot \hat{w})^2 \| \hat{w} \|^2

The length of the unit vector \hat{w} is 1 by definition, so the equation simplifies to

RSS(\hat{w}) = \| \vec{x} \|^2 - (\vec{x} \cdot \hat{w})^2

If we want to find the unit vector \hat{w} that minimizes the RSS for n vectors \vec{x}, we just sum up the errors from the projections of all \vec{x_{i}} onto \hat{w}.

RSS(\hat{w}) = \sum \limits_{i=1}^{n} \left[ \| \vec{x_{i}} \|^2 - (\vec{x_{i}} \cdot \hat{w})^2 \right]

For minimizing RSS with respect to \hat{w} we only care about the terms of the equation that depend on \hat{w}. Separating them from the terms that do not simplifies the problem considerably.

RSS(\hat{w}) = \sum \limits_{i=1}^{n} \| \vec{x_{i}} \|^2 - \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Notice that the component that depends on \hat{w} has a minus sign. This means that in order to minimize RSS, we need to maximize

\sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Since n does not depend on \hat{w} it is equivalent to maximize

\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

which is the sample mean of the square of the dot product between \vec{x} and \hat{w}. As we are not dealing with weighted averages, the sample mean is the same as the expected value (E). Recall that the mean of the square minus the square of the mean equals the variance

Var(z) = E(z^2) - [E(z)]^2

If we substitute (\vec{x_{i}} \cdot \hat{w}) for z and rearrange, we get

\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2 = [\frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})]^2 + Var(\vec{x_{i}} \cdot \hat{w})

Now, let’s assume that all the vectors \vec{x} are centered, so that their mean is 0. If you have ever asked yourself why it is necessary to either center or scale the data before using PCA, this is why! This means that their projections onto \hat{w} also sum up to 0 and we are left with

Var(\vec{x_{i}} \cdot \hat{w}) = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2

Finding the unit vector \hat{w} that minimizes the projection error of the data vectors \vec{x_{i}} onto \hat{w} can be achieved by maximizing the variance of the dot products of \vec{x_{i}} and \hat{w}.
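As a numerical sanity check (a sketch on simulated, centered two-dimensional data), we can confirm that the unit vector minimizing the projection error is the same one that maximizes the projection variance, and that it coincides with the first principal component:

# numerical check on simulated, centered 2D data
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2) %*% matrix(c(1, 0.8, 0.8, 1), 2)
X <- scale(X, center = TRUE, scale = FALSE)
rss      <- function(w) sum(rowSums((X - (X %*% w) %*% t(w))^2))
proj_var <- function(w) mean((X %*% w)^2)
# evaluate both criteria on a grid of unit vectors
angles <- seq(0, pi, length.out = 500)
W <- cbind(cos(angles), sin(angles))
rss_vals <- apply(W, 1, rss)
var_vals <- apply(W, 1, proj_var)
# the RSS minimizer and the variance maximizer are the same direction ...
W[which.min(rss_vals), ]
W[which.max(var_vals), ]
# ... and it agrees with the first principal component (up to sign and grid resolution)
prcomp(X)$rotation[, 1]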

Conclusions

It makes intuitive sense that the dimension associated with the highest variance within the data has the potential to contain the most “information”. To be fair, there is also the most potential for noise. The math tells us that maximizing the variance is equivalent to minimizing the projection error, which makes even more intuitive sense.

the ideal basis for projection is the eigenbasis

The previous section established that we need to maximize the variance of \vec{x} \cdot \hat{w} to find the vector \hat{w} that minimizes the projection error. In this section we will go through the steps of the optimization problem and see that the set of unit vectors \hat{w} that maximize the variance upon projection of the data matrix \boldsymbol X are the eigenvectors of the feature covariance matrix \boldsymbol Q of \boldsymbol X.

It is convenient to transition from summation notation to matrix notation at this point. If we stack all our data vectors \vec{x} as rows into an n \times p matrix \boldsymbol X, the product \boldsymbol X \boldsymbol w gives us the projections of the vectors onto \hat{w}, and the variance can be written in matrix notation.

\sigma_{\hat{w}}^2 = \frac{1}{n} \sum \limits_{i=1}^{n} (\vec{x_{i}} \cdot \hat{w})^2=\frac{1}{n} (\boldsymbol X \boldsymbol w)^T \boldsymbol X \boldsymbol w

The p \times p feature covariance matrix \boldsymbol Q is

\boldsymbol Q=\frac{1}{n}\boldsymbol X^T\boldsymbol X

We can see that simplification and rearrangement yields the following expression for the variance.

\sigma_{\hat{w}}^2=\frac{1}{n}\boldsymbol w^T\boldsymbol X^T\boldsymbol X\boldsymbol w=\boldsymbol w^T\frac{\boldsymbol X^T\boldsymbol X}{n}\boldsymbol w=\boldsymbol w^T\boldsymbol Q\boldsymbol w

In order to set up the optimization problem, we define a function f we want to maximize. In our case it is just the variance.

f(\boldsymbol w) = \boldsymbol w^T\boldsymbol Q\boldsymbol w

As we are not looking for all possible vectors \boldsymbol w but only unit vectors \boldsymbol w we set up the following constraint \|\boldsymbol w\|^2 = \boldsymbol w^T\boldsymbol w=1 and define the constraint function g.

g(\boldsymbol w) = \boldsymbol w^T \boldsymbol w = 1

Rearrangement and addition of the Lagrange multiplier \lambda results in the Lagrangian u(\boldsymbol w, \lambda) = f(\boldsymbol w) - \lambda(g(\boldsymbol w) - 1), which we want to optimize.

u=\boldsymbol w^T\boldsymbol Q\boldsymbol w - \lambda (\boldsymbol w^T\boldsymbol w - 1)

Partial differentiation of u with respect to \boldsymbol w results in

\frac{\partial u}{\partial \boldsymbol w} = 2 \boldsymbol Q\boldsymbol w - 2 \lambda \boldsymbol w = 0

\boldsymbol Q\boldsymbol w =\lambda\boldsymbol w

This is the eigenvector equation for the feature covariance matrix \boldsymbol Q.
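A short numerical check (on simulated data) that the leading eigenvector of the feature covariance matrix \boldsymbol Q indeed attains the maximum projection variance among all unit vectors:

# the leading eigenvector of Q maximizes the projection variance
set.seed(2)
X <- scale(matrix(rnorm(500 * 4), ncol = 4) %*% matrix(rnorm(16), 4),
           center = TRUE, scale = FALSE)
Q  <- cov(X)
w1 <- eigen(Q)$vectors[, 1]
var_w1 <- as.numeric(var(X %*% w1))
# no random unit direction achieves a higher projection variance
rand_vars <- replicate(1000, {
    w <- rnorm(4)
    w <- w / sqrt(sum(w^2))
    as.numeric(var(X %*% w))
})
max(rand_vars) <= var_w1   # TRUE
# the maximum variance is the largest eigenvalue of Q
c(var_w1, eigen(Q)$values[1])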

Conclusions

The desired vector \boldsymbol w is an eigenvector of the feature covariance matrix \boldsymbol Q. Among the p eigenvectors, the eigenvector with the largest eigenvalue will be the vector that captures the maximum variance or, equivalently, minimizes the residual sum of squares. It is the first principal component.

Sorting the eigenvector/eigenvalue pairs by eigenvalue from largest to smallest, we obtain all p principal components. Because \boldsymbol Q is symmetric, its eigenvectors are orthogonal to each other and form a basis of the p-dimensional feature space.

The sum of the eigenvalues equals the total variance, i.e. the sum of the variances described by the individual components

\sigma_{total}^2 = \sum \limits_{i=1}^p \lambda_{i}

We can now determine the fraction of variance explained by q principal components where q \leq p

\sigma_{explained}^2 / \sigma_{total}^2 = \sum \limits_{i=1}^q \lambda_{i} / \sum \limits_{i=1}^p \lambda_{i}

Stacking our eigenvectors \boldsymbol w as columns into a p \times p matrix gives us the weight matrix \boldsymbol W. With \boldsymbol W, the diagonal matrix of eigenvalues \boldsymbol \Lambda (whose diagonal holds the variances of the projections) makes the connection to the eigendecomposition clear.

\boldsymbol \Lambda = \boldsymbol W^T\boldsymbol Q\boldsymbol W \propto \boldsymbol W^T \boldsymbol X^T \boldsymbol X \boldsymbol W

Rearranging yields the classical form of the eigendecomposition of a square matrix.

\boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Lambda \boldsymbol W^T

Relationship to singular value decomposition

Software implementations of PCA often use a matrix factorization called the singular value decomposition (SVD) to compute the eigenvector/eigenvalue pairs of a matrix. SVD is closely related to eigendecomposition but offers the advantages that it is numerically more accurate, can operate on matrices that are not square, and does not require the explicit computation of the feature covariance matrix \boldsymbol Q.

SVD decomposes an n \times p matrix \boldsymbol X into three components: an n \times n matrix \boldsymbol U, a diagonal n \times p matrix \boldsymbol \Sigma, and a p \times p matrix \boldsymbol V. \boldsymbol U and \boldsymbol V are both unitary matrices, meaning that their transpose is also their inverse. The columns of the matrix \boldsymbol U contain the “left singular vectors” and the columns of \boldsymbol V the “right singular vectors”. The diagonal elements of \boldsymbol \Sigma are called the “singular values”, which are related to the eigenvalues.

\boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T

We can think of SVD as breaking down the linear transformation performed by \boldsymbol X into a rotation (\boldsymbol U), a scaling (\boldsymbol \Sigma), and another rotation (\boldsymbol V).

The relationship to the eigendecomposition becomes clear when we express \boldsymbol X^T \boldsymbol X in terms of the SVD of \boldsymbol X

\boldsymbol X^T \boldsymbol X = (\boldsymbol U \boldsymbol \Sigma \boldsymbol V^T)^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T = \boldsymbol V \boldsymbol \Sigma^T \boldsymbol U^T \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T= \boldsymbol V \boldsymbol \Sigma^2 \boldsymbol V^T

Recall the eigendecomposition \boldsymbol X^T \boldsymbol X = \boldsymbol W \boldsymbol \Lambda \boldsymbol W^T. It is evident that the eigenvalues are the squares of the singular values

\boldsymbol \Lambda = \boldsymbol \Sigma^2

or, equivalently, the singular values are the square roots of the eigenvalues.

\boldsymbol \Sigma = \sqrt{\boldsymbol \Lambda}
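We can verify this relationship numerically on a small random matrix (a quick sketch, not part of the original post’s data):

# squared singular values equal the eigenvalues of X^T X
set.seed(3)
X  <- matrix(rnorm(50 * 6), ncol = 6)
sv <- svd(X)
ev <- eigen(t(X) %*% X)
all.equal(sv$d^2, ev$values)
# the right singular vectors match the eigenvectors of X^T X (up to sign)
all.equal(abs(sv$v), abs(ev$vectors))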

Conclusions

Eigendecomposition and SVD are highly related techniques. Both can be used to calculate eigenvectors and eigenvalues of a matrix with the exception that eigendecomposition requires a square matrix and SVD does not.

Projection of the data onto the new basis

One of the promises of PCA was that it finds a rotation of the data that is more “natural”. So far we have only identified the new coordinate system (the eigenvectors) and which directions are the most variable (based on the eigenvalues). We have not yet put the data into the new frame of reference. This process is called “projection” or “transformation”.

The projection of the data matrix \boldsymbol X onto its eigenbase \boldsymbol W is given by

\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times p}

If we want to project the data on a subspace of \boldsymbol W, we just use the first q columns of \boldsymbol W. Again we choose q by how much of the variance we want to conserve.

\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol W_{p \times q}

Maybe you have determined the eigenvectors and eigenvalues of \boldsymbol X using SVD. If we substitute \boldsymbol X = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T into the previous equation, we get

\boldsymbol Y = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol W

\boldsymbol V and \boldsymbol W are both p \times p matrices of the right eigenvectors of \boldsymbol X.

\boldsymbol Y_{n \times p} = \boldsymbol U \boldsymbol \Sigma \boldsymbol V^T \boldsymbol V = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times p}

Alternatively, we could just use \boldsymbol V in place of \boldsymbol W

\boldsymbol Y_{n \times p} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times p}

Using only q components to project onto a lower dimensional subspace yields

\boldsymbol Y_{n \times q} = \boldsymbol U_{n \times n} \boldsymbol \Sigma_{n \times q}

\boldsymbol Y_{n \times q} = \boldsymbol X_{n \times p} \boldsymbol V_{p \times q}
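Again, a quick numerical confirmation on simulated data that projecting \boldsymbol X onto its right singular vectors gives the same result as \boldsymbol U \boldsymbol \Sigma, also when only the first q components are kept:

# X V equals U Sigma, for the full projection and for a q-dimensional subspace
set.seed(4)
X  <- scale(matrix(rnorm(100 * 5), ncol = 5), center = TRUE, scale = FALSE)
sv <- svd(X)
all.equal(X %*% sv$v, sv$u %*% diag(sv$d))
q <- 2
all.equal(X %*% sv$v[, 1:q], sv$u[, 1:q] %*% diag(sv$d[1:q]))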

Conclusions

When applying PCA in practice, we let the software do the heavy lifting, i.e. let it figure out the eigenvectors and eigenvalues of the data matrix \boldsymbol X. Once we have the eigenvector/eigenvalue pairs, obtaining the projection onto the new basis is a simple matrix-matrix product.

Choosing how many components q to use for the projection of your data is the real work for you as a user of PCA. We will face this challenge in the next part of this series, which will cover the practical application of PCA using R.


Further reading

Mathematics

Lindsay Smith – A Tutorial on Principal Components Analysis

Jeff Jauregui – Principal component analysis with linear algebra

Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation

Visual Explanation

George Dallas – Principal Component Analysis 4 Dummies

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps


PCA series

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


PCA – Part 1: An intuition

Pee-See-Ay. Just the sound of these three letters inspires awe in an experimental biologist. I am speaking from personal experience. Let’s grab a hammer and chisel and get to work on that pedestal.

In this first post of a series on principal component analysis (PCA), my goal is to avoid any mathematical formalism and provide an intuition about what PCA is and how this technique can be useful when analyzing data.

In follow-up posts, I will explain the mathematical basis of PCA, show practical examples of applying PCA using R, and discuss some points to keep in mind when using PCA.

When to do PCA?

Most explanations of PCA start out with what it is mathematically and how it can be calculated. We will get to that later. I usually find it more instructive to first understand in which situations a method like PCA should be applied. Once I know what the practical applications are, the mathematics tend to make a lot more sense.

My own work is in genomics, so the first example I can think of is generally related to gene expression. Imagine a number of patient samples for which you have measured the transcript levels of all expressed genes. The number of patient samples (<100) is usually much lower than the number of transcripts you measure (>10000). In other words, the number of observations (samples) is smaller than the number of features (transcripts).

For most types of analyses, especially machine learning algorithms, this is not a favorable starting point. When faced with a large number of features, it is safe to assume that a good number of them contain very little information, such as measurements on genes that are not differentially expressed, or that there is redundancy (co-linearity) between features, e.g. genes that vary together. In any event, too many features increase computation time and colinear features can even produce unstable results in some algorithms.

In such a situation, PCA can be extremely useful because one of its strengths is to optimize the features based on the “hidden” structure of the data. PCA aims to find a new representation of the data by extracting combinations of features (components) from the data that are uncorrelated with each other and ordered by “importance”. This allows us to approximate the data with a reduced number of features. Dimensional reduction makes PCA a valuable tool not only for preparing data for machine learning but also for exploratory data analysis and data visualization.

What does PCA do?

Mathematically, the new features PCA provides us with are linear combinations of the old features. That means each of the old features weighs in – to different extents – to make up the new features. In PCA jargon, the weights of the original features are called “loadings“. It is a little bit like saying, take 1 cup of flour, 1 egg, 1 cup of milk, 3 teaspoons of baking powder, 1 teaspoon of sugar, and 1 teaspoon of salt and call it “pancake”. So our new feature (pancake) is made up of a combination of different amounts (the “loadings”) of the old features (the ingredients).

An important aspect of PCA is that the new features are “orthogonal” to each other, which is a more general way of expressing what I have called “uncorrelated” before. Orthogonality is a concept of linear algebra, which defines two vectors as orthogonal if their dot product is 0. It is related to “perpendicular”, which is commonly understood as having a “right” angle between two entities, such as the sides of a square or the x- and y-axes of our Cartesian coordinate system. In fact, you can think of PCA as a rotation of the data into a new coordinate system that “fits” the data more naturally. The axes of the new coordinate system are the eigenvectors of the data matrix. Eigenvectors are vectors particular (“eigen”) to a given matrix that do not change direction under linear transformation by that matrix. Eigenvectors may, however, change length and/or orientation. The amount of scaling under linear transformation is called the eigenvalue. Each eigenvector of a matrix is associated with its own eigenvalue.

How does PCA work?

In general, a square matrix with m rows and columns has m eigenvector/eigenvalue pairs. How does PCA know which ones are the most important? The key assumption of PCA is that the directions within the data that show the most variance contain the most information and, thus, are likely the most important. We find the eigenvectors with the highest variance along their direction by eigendecomposition of the covariance or correlation matrix of the data. The eigenvector with the largest eigenvalue explains the most variance. It is also called the first “principal component”. Further components explain progressively less variance and are generally considered less “important”. The sum of all components explains the total variance of the data, so nothing is lost by applying PCA itself. The original data can be reconstituted by rotating back to the original feature space.

Now we are in a position to understand why PCA can be used for dimensional reduction. If the first couple of eigenvalues are much larger than the rest, they capture the majority of the variance, and a projection of the data onto the first couple of components will result in a good approximation of the original data. We implicitly assume that we have concentrated the “signal” in the first couple of components, and that the rest capture merely the “noise” of the data. There is no strict cut-off for what constitutes the majority of variance. Depending on the field of research, people use numbers between 60% and 95%.

Recap

  • PCA is a data analysis method that uses linear combinations of existing feature vectors to re-expresses your data in terms of uncorrelated, orthogonal components.
  • The components are selected and sorted based on how much variance they explain within the data. The sum of all components explains the total variance.
  • PCA can be used to reduce the number of features within a data set by choosing only those components that explain the majority of variance and projecting the data onto a lower dimensional subspace. This can result in a good approximation of the original data.

PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets


A closer look at the fisherman’s dilemma

In my previous post I defined the dangers of candidate selection in high-throughput screens as the “fisherman’s dilemma”. I have argued that our preconceived notions of how a biological system “should” behave, i.e. our inherent scientific bias, and a disproportional focus on p-values contribute to the frequent failure of high-throughput screens to yield tangible or reproducible results.

be conservative about p-values

Today, I would like to take a closer look at the relationship between the p-value and the positive predictive value (PPV), also known as precision or the posterior probability that a hit is a true hit. Despite its fancy name, the PPV is just the ratio of true positives (TP) to the sum of true positives and false positives (FP):

PPV = TP / (TP + FP)

The number of true positives is determined by the prior probability of there being a hit \pi and the statistical power 1-\beta, which is our ability to detect such a hit. Power is the complement of the type II error rate \beta.

TP = (1 - \beta) \pi

The number of false positives depends on the false positive rate (type I error rate) \alpha and the prior probability of there being no hit 1-\pi.

FP = \alpha (1 - \pi)

Putting these two equations together, we get:

PPV = (1 - \beta) \pi / [ (1 - \beta) \pi + \alpha (1 - \pi) ]

From this equation it is evident that just focusing on the significance level \alpha can lead to vastly different PPVs depending on where on the spectrum of prior probability and power we operate.
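As a minimal sketch, the formula is easy to play with in R (the parameter values below are illustrative assumptions, not estimates from a real screen):

# PPV as a function of significance level, power, and prior probability
ppv <- function(alpha, power, prior) {
    tp <- power * prior
    fp <- alpha * (1 - prior)
    tp / (tp + fp)
}
ppv(alpha = 0.05, power = 0.8, prior = 0.5)   # well-powered, plausible prior: ~0.94
ppv(alpha = 0.05, power = 0.2, prior = 0.1)   # typical screen: ~0.31
ppv(alpha = 0.001, power = 0.2, prior = 0.1)  # stricter cut-off: ~0.96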

For the purpose of illustration, I have plotted the PPV for four commonly used significance levels 0.1, 0.05, 0.01, and 0.001. Green means higher PPV, and red means lower PPV. The black contour line shows where the PPV is 0.5, that is half of our hits are predicted to be false positives. For the optimists among us, half of our hits would likely be true positives.

fishing_part2_fig1

From this figure it is clear that a p-value of 0.05 only works in situations of high prior probability and high power. I have marked the domain of high-throughput screens (HTS) rather generously at up to 0.25 prior probability and 0.25 power. Due to small sample sizes (low power) and the fact that any given perturbation is unlikely to have an effect on the majority of cellular components (low prior), most high-throughput screens operate in a space even closer to the origin in the deeply red area.

On the flip-side, this analysis tells us that if we are a little more conservative in what we call a hit, in other words if we lower the p-value cut-off to let’s say 0.001 or lower, we improve our chances of identifying true positives quite dramatically. Unless the high-throughput screen is plagued by terribly low power and prior probability, we actually have a chance that the majority of hits are true positives.

keep your guard up

In genomics, p-values often originate from statistical tests like t-tests comparing a number of control samples to a number of treatment samples. If the values obtained upon treatment have a low probability of having originated from the control (null) distribution, we say the treatment has a “significant” effect. The t-statistic takes into account the difference of the means between the control and treatment distributions and their variances and combines them into a single value, the t-value, from which the p-value is calculated.

In situations with small sample sizes, such as in high-throughput screens, it can happen that by chance either the control or the treatment values cluster together closely. This results in a very narrow control or treatment distribution. Due to the confounding of effect size and precision in the t-value, even tiny effects can end up being called “significant” as long as the variance is small enough. This is a common problem in microarrays with small sample numbers.

Fortunately, there are quite effective solutions for this problem. Genes with low p-values and small effect sizes can be identified using a volcano plot, which displays effect size against the negative logarithm of the p-value.
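Here is a bare-bones sketch of such a volcano plot on simulated effect sizes and p-values (purely hypothetical numbers, just to show the construction):

# volcano plot on simulated data
set.seed(5)
effect <- rnorm(1000)                                   # simulated log fold changes
pval   <- pmin(1, exp(-3 * abs(effect) + rnorm(1000)))  # loosely tied to effect size
plot(effect, -log10(pval),
     xlab = "effect size (log fold change)", ylab = "-log10(p-value)",
     pch = 16, col = "grey40")
abline(h = -log10(0.05), lty = 2)   # significance threshold
abline(v = c(-1, 1), lty = 3)       # effect size thresholds
# points above the dashed line but between the dotted vertical lines are
# "significant" hits with small effect sizes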

Increasing the sample size and/or using a Bayesian correction of the standard deviation as it is implemented in Bioconductor’s “limma” package for microarray analysis can help to ameliorate this problem.

Possible ways out of the fisherman’s dilemma

  • High-throughput screens in biomedical research usually operate in a domain of low power and low prior probability. Based on your estimate of power and prior probability, use a more conservative p-value cut-off than 0.05.
  • In addition to choosing the significance level \alpha based on the power and prior probability of your study, be wary of low p-values of hits with small effect sizes or apply corrections if possible.
  • Try to increase the power of your experiment by increasing sample size, or better by decreasing measurement error if possible.
  • The prior probability of having an effect is determined by nature and out of our control. We need to be aware of the possibility, however, that the prior probability is very low or even zero. In that case, it would be very hard or impossible to find a true positive.
  • Ditch the p-value and use a Bayesian approach.

Reproducibility

The full R code can be found on Github.



On the dangers of fishing expeditions

Wherever you look, people do high-throughput screens. They are commonly referred to as hypothesis-generating experiments or, somewhat affectionately, “fishing expeditions”. The concept is alluring: Answer a broad scientific question in one fell swoop making use of recent technological advances in mass spectrometry, next-generation sequencing, and automated microscopy.

Months or even years later people present the results of their high-throughput screens and more often than not they struggle to distill concrete findings from their data.

A high-throughput screen may fail to produce interpretable results for a number of reasons.

  • Was the scientific question too broad to be answered? (Did we cast the fishing net too widely?)
  • Was the technology sensitive enough to pick up the expected differences? (Was the mesh of our fishing net too coarse?)
  • Should we have used a counterscreen? (What are all those dolphins doing in our net?)
  • Was the experimental design flawed? (Should we have packed a second net?)
  • Is there something to find out at all? (Are there any fish to catch?)

All of those questions are project-specific and some are beyond our control. Oftentimes, we realize what we should have done post factum.

So what are we supposed to do once the painstakingly acquired data awaits analysis on our hard drives? High-throughput screens typically yield intriguing clues by the bucket but the challenge lies in weeding out the dead ends. Or to stay with our fishing metaphor: What do we call a fish and what an old boot?

The standard approach followed by most people is to rely on a combination of “statistical significance” and “domain knowledge”. This sounds like objective theory married to a healthy dose of good old common sense. What could possibly go wrong?

Despite being appealing in theory, in practice this combination often fails to identify the right candidates for follow-up experiments. Or worse, it prevents you from realizing that the screen itself was a failure and that you should spend your precious time and money on something else.

The reason for this phenomenon is partly to be found in the misuse of statistical theory and partly in the conscious or unconscious application of scientific bias. On top of that, their combination can lead to a reinforcement of wrong choices that quickly sends you on your way to no-man’s land.

The overvalued p-value

John Ioannidis’ piece on “Why most published research findings are false” caused quite a stir in the scientific community and, maybe even more so, in the pharmaceutical industry. His main point of contention is that the post-study probability that a hit is true does not only depend on the type I (false positive) error rate α (the significance level), but also on the type II (false negative) error rate β and the prevalence R of true hits. In Bayesian statistics, the prevalence would be called the prior probability.

Most statistics software packages are p-value generating machines. You feed them data, and they spit out p-values. If the p-value is below a certain threshold, commonly set at 0.05, we accept it as a hit. We all know that. Simple enough.

The p-value is the probability that a value as extreme or more is generated by a statistical model of our choice. The argumentation that a low p-value supports our hypothesis follows the classical straw man fallacy. We construct a straw man called the null hypothesis H0, show that our data is unlikely to be generated from H0 (the p-value reflects the probability), and conclude that by getting rid of our straw man, the alternative hypothesis H1, which happens to be our pet hypothesis, must be true.

The significance level α is the cut-off we somewhat arbitrarily set to 0.05. This means that under the null hypothesis H0 you still obtain a value as extreme or more extreme just by chance in 5% of cases. Even in the best of cases, you would be wrong one out of twenty times. When hundreds or thousands of hypothesis tests are performed, which is ordinarily the case in modern genomics, this problem becomes so severe that it has to be addressed by a multiple testing correction. The fact that the specificity or true negative rate of an experiment is 1 – α further hints at the fact that the significance level has less to do with true hits and more with true “misses”: it tells us how often we catch nothing when there are no fish to catch. On its own it is certainly not a good predictor of whether your hit is true or not.
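As a small illustration of the multiple testing problem mentioned above, consider 1000 simulated tests in which the null hypothesis is true for every single one (a sketch using R’s built-in p.adjust):

# multiple testing on 1000 tests where the null is true for all of them
set.seed(6)
pvals <- runif(1000)                                # p-values under a true null
sum(pvals < 0.05)                                   # roughly 50 "hits" by chance alone
sum(p.adjust(pvals, method = "BH") < 0.05)          # Benjamini-Hochberg: typically 0
sum(p.adjust(pvals, method = "bonferroni") < 0.05)  # Bonferroni: typically 0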

So, what is the function of the other two components that influence the posterior probability that a hit is true?

The complement of the type II error rate β is called the statistical power (1 – β) of a study design. It determines our ability to detect a hit if it is true (a true positive). In other words, it is the probability that you catch the fish if there is one to catch. We traditionally aim for a power of 0.8, which says that we will detect 80% of the true effects and miss the remaining 20%. Ideally, we would want the power to be even closer to 1, but as power depends on the sample number, it is often too expensive or too time consuming to achieve arbitrarily high power. Conversely, if an experiment has low power, the majority of what we call hits are likely to be false positives. Statistical power is related to sensitivity or the true positive rate of the experiment. In machine learning, it is known as recall.

Prevalence is the probability of there being a true hit before you even start the experiment. It is the probability that there are fish where you choose to cast your net. Intuitively, it makes sense that this number could make the difference between success and failure. In the realm of frequentist statistics, prevalence is doomed to live a life in the nether world. The reason for this is that prevalence is not a quantity that can be estimated from the data at hand but must either be derived from experience or “guessed”. However, its influence on the posterior probability that a hit is true can be huge. Even in a situation of relatively high prevalence, let’s say 50%, a p-value of 0.05 corresponds to a posterior probability of about 0.29 that the hit is a false positive. This means that in about 1/3 of the cases called significant, we are dealing with false positives.

How does all of this relate to high-throughput screens? By focusing exclusively on p-values we implicitly assume high power and high prevalence, neither of which is typically true in high-throughput settings in modern biological research. Due to the high costs of such experiments, sample sizes are typically low and the differences we aim to detect are small. Both negatively affect statistical power. The prevalence is typically much less than 50%, more likely in the range of around 10%. We would not necessarily expect that upon some treatment more than 10% of genes are differentially expressed or that more than 10% of the phosphorylation events within a cell change, would we? A prevalence of 10% means that a p-value of 0.05 has an 89% chance of being a false positive. That is scary!

Conscious and unconscious bias

As human beings we all have preformed opinions and preferences that originate from our very own set of experiences. As scientists are humans too, we are no exception. Here is a fun experiment to try out for yourself. Generate a random list of 100 proteins or genes, take it to three principal investigators from different fields of biology, and tell them that they have the list of hits from your latest high-throughput screen fresh from the printer. It is not unlikely that you will walk out of their offices with three coherent but completely different stories of how the result of your screen could be interpreted in their respective fields of research.

Modern biological research has only recently transitioned from a data-poor to a data-rich field. Most of us are trained to make decisions on limited information, fill in the blank spots creatively, and test the resulting hypothesis experimentally. How we frame our hypothesis critically depends on our own experience as scientists and on the beliefs of the field. If a hypothesis coincides with our own experience and with the current thinking in the field, it is usually considered a “good” hypothesis. A value judgment based on subjectivity is the essence of bias. It happens all the time, consciously and unconsciously, and there is not much we can do about it.

In a high-throughput setting, we are very likely to encounter genes or proteins we have heard of before, worked with before, or simply feel sympathetic towards for whatever reason. I would wager that we are more likely to spot them on a list and select them for follow-up experiments, sometimes even despite contrary statistical evidence. It is called having a hunch.

Reality check

If we think about the combination of the shaky (or should I say nonexistent) foundation of the p-value as a predictor of finding a true hit and our intrinsic scientific biases, we should expect nothing else but a lack of tangible results of high-throughput screening. It is like giving a list of quasi-random names to a creative mind and asking for a story. You will get one, but whether it has anything to do with reality is a different question entirely.

If you look at published papers that include some form of high-throughput screens, you typically observe that one or two instances were “cherry picked” from a list and followed-up the old fashioned way. What happened to the great promises of systems biology, the understanding of complex regulatory patterns and emerging properties of biological networks?

It seems to me that this is another instance of “no free lunch”. You can’t have coverage and confidence at the same time. At least not at the moment.

In the meantime, have a close look at what dangles from your fishing rod. It might be an old boot masquerading as a fish. Don’t be fooled!

How to fish safely?

There are ways out of the fisherman’s / fisherwoman’s dilemma and I have listed some of them in a follow-up post. More information can be found in the articles listed below and the references therein.


Further reading

Three links to very accessible articles on the subject of p-values, statistical power, and prevalence:
