# Analyzing quantitative PCR data the tidy way

Previously in this series on tidy data: Taking up the cudgels for tidy data

One of the most challenging aspects of working with data is how easy it is to get lost, even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two-dimensional spreadsheets naturally blurs the distinction between observation and variable. Such data requires constant reformatting, and its structure may not be intuitive to your fellow researchers.

Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:

1. Each variable forms a column
2. Each observation forms a row

A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables are often categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be categorical variables.

An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.

Let’s look at an example of a typical recording of quantitative PCR data in Excel. We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).

Looking at the columns, we see that information on “experiment” and “replicate” is stored in the names of the columns rather than the entries of the columns. This will have to be changed.

There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.

We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.

messy <- read.csv("qpcr_messy.csv", row.names = 1)


This is the original data read into R. Let’s get started.

#### Row names should form their own column

The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.

library(dplyr)
tidy <- data.frame(messy) %>%
  # make row names a column
  mutate(genotype = rownames(messy))


#### What are our variables?

Next, we need to think about what our variables are. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.

library(tidyr)
tidy <- tidy %>%
  # make each row a single measurement
  gather(key = sample, value = measurement, -genotype) %>%
  # make each column a single variable
  separate(col = sample, into = c("experiment", "replicate"), sep = "_")


Here are the first 10 rows of the “tidy” representation of the initial Excel table.

Before we can do statistical tests and visualization, we have to take care of one more thing.
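For readers without the original spreadsheet, the wide-to-long reshaping above can be tried on a made-up table. This is only a sketch: the values are random, the names `toy_messy` and `toy_tidy` are hypothetical, and it uses `pivot_longer()`, the newer tidyr interface that combines the “gather” and “separate” steps in one call (it assumes tidyr 1.0 or later).

```r
library(tidyr)

# Hypothetical wide table: one row per genotype, one column per
# experiment/replicate combination (values are made up).
set.seed(1)
col_names <- paste0("exp", rep(1:3, each = 3), "_rep", rep(1:3, times = 3))
toy_messy <- as.data.frame(matrix(round(runif(27, 0.5, 2), 2), nrow = 3,
                                  dimnames = list(NULL, col_names)))
toy_messy$genotype <- c("control", "mutant1", "mutant2")

# Reshape from wide to long and split the old column names at "_"
# into the two variables they encode.
toy_tidy <- pivot_longer(toy_messy,
                         cols = -genotype,
                         names_to = c("experiment", "replicate"),
                         names_sep = "_",
                         values_to = "measurement")

# One row per measurement: 3 genotypes x 3 experiments x 3 replicates
dim(toy_tidy)
head(toy_tidy)
```

Each of the 27 measurements now sits in its own row, described by “genotype”, “experiment”, and “replicate”.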

#### Untangling implicit domain-specific hierarchies

Remember what we said before about the two different kinds of replicates: only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.

data <- tidy %>%
# calculate mean of technical replicates by genotype and experiment
group_by(genotype, experiment) %>%
summarise(measurement = mean(measurement)) %>%
ungroup()


Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.
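To check what this step does to the shape of the data, here is a small sketch on simulated values. The data frame `toy` and the extra column `n_technical` are hypothetical and only serve as an illustration; the `.groups` argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Made-up long-format data: 3 genotypes x 3 experiments x 3 technical replicates
toy <- expand.grid(genotype = c("control", "mutant1", "mutant2"),
                   experiment = c("exp1", "exp2", "exp3"),
                   replicate = c("rep1", "rep2", "rep3"),
                   stringsAsFactors = FALSE)
set.seed(42)
toy$measurement <- rnorm(nrow(toy), mean = 1, sd = 0.1)

# Average the technical replicates and keep track of how many went into each mean
toy_means <- toy %>%
  group_by(genotype, experiment) %>%
  summarise(measurement = mean(measurement),
            n_technical = n(),
            .groups = "drop")

# 27 rows collapse to 9: one row per biological replicate and genotype
nrow(toy)
nrow(toy_means)
```

Counting the technical replicates per group is a cheap sanity check that no measurement was lost or double-counted.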

Now, the data is ready for analysis.

#### Tidy statistical analysis of quantitative PCR data

The scientific rationale for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way ANOVA followed by Tukey’s post hoc test.

library(broom)
mod <- data %>%
  # set "control" as reference
  mutate(genotype = relevel(factor(genotype), ref = "control")) %>%
  # one-way anova and Tukey's post hoc test
  do(tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))


We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.

Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.
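To see this mismatch concretely, here is a sketch on simulated numbers that uses only base R. The data frame `toy` and its values are made up, and the manual conversion at the end is just one possible way to get a data frame out of the result.

```r
# Simulated stand-in for the averaged data: 3 genotypes x 3 experiments
set.seed(7)
toy <- data.frame(
  genotype = factor(rep(c("control", "mutant1", "mutant2"), each = 3),
                    levels = c("control", "mutant1", "mutant2")),
  measurement = c(rnorm(3, 1.0, 0.1), rnorm(3, 2.0, 0.1), rnorm(3, 1.1, 0.1))
)

# The formula comes first and the data second
fit <- aov(measurement ~ genotype, data = toy)
class(fit)           # includes "aov"; this is not a data frame

# TukeyHSD needs the fitted "aov" object as its input
tukey <- TukeyHSD(fit)
is.data.frame(tukey) # FALSE; the result is a list with one matrix per factor

# Manual conversion of the pairwise comparisons into a data frame
tukey_df <- as.data.frame(tukey$genotype)
tukey_df$comparison <- rownames(tukey_df)
tukey_df
```

Neither intermediate object can be passed down a pipe as a data frame without some form of conversion.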

In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame.

Tukey’s post hoc test indicates that “mutant1” is different from “control” but “mutant2” is not. Let’s visualize the results to get a better idea of what the data looks like.

#### Tidy visualization of quantitative PCR data

We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.

library(ggplot2)
# genotype will be on the x-axis, measurements on the y-axis
ggplot(data, aes(x = genotype, y = measurement, col = experiment)) +
  # plot the mean of each genotype as a cross
  # ("fun" replaces the deprecated "fun.y" as of ggplot2 3.3.0)
  stat_summary(fun = "mean", geom = "point", color = "black", shape = 3, size = 5) +
  # plot the 95% confidence interval for each genotype
  # ("mean_cl_normal" requires the Hmisc package to be installed)
  stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) +
  # add the averaged measurements for each experiment
  geom_point(shape = 16, size = 5) +
  theme_classic()


We can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.

What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the data for each step of the analysis. Learning how to tidy up your data is an important step towards that goal.

The R code can be found on Github.

# Taking up the cudgels for tidy data

The abundance of data has led to a revolution in marketing and advertisement as well as in biomedical research. A decade ago, the emerging field of “systems biology” (no pun intended) promised to take basic research to the next level through the use of high-throughput screens and “big data”. Institutes were built, huge projects were funded, but surprisingly little of substance has been accomplished since.

There are three main reasons, I think, two of which are under our control and one of which is not.

During our first forays into understanding biology as a “system” we underestimated the complexity of even an isolated cell, let alone a multi-cellular organism. A somewhat complete description of even the most basic regulatory mechanisms and pathways remains a dream to this day due to myriads of adaptive mechanisms and cross-talks preventing the formulation of a coherent view. Unfortunately, this problem has to be overcome with improved scientific methodology or analysis and will take time.

There are things we can do right now, however.

Overly optimistic or incorrect interpretation of statistical results and “cherry-picking” of hits of high-throughput screens has led to a surprising number of publications that cannot be replicated or even reproduced. I have written more extensively about this particular problem, which I refer to as the “Fisherman’s dilemma”.

The lack of standards for structuring data is another reason that prevents the use of existing data and makes merging data sets from different studies or sources unnecessarily painful and time-consuming. A common saying is that data science is 80% data cleaning, 20% data analysis. The same is true for bioinformatics, where one needs to wrestle with incomplete and messy datasets or, if it’s your lucky day, just different data formats. This problem is especially prevalent in meta-analysis of scientific data. Arguably, the integration of datasets from different sources is where we would expect to find some of the most important and universal results. Why else spend the time to generate expensive datasets if we don’t use them to compare and cross-reference?

If we were to spend less time on the arduous task of “cleaning” data, we could focus our attention on the question itself and the implementation of the analysis. In recent years, Hadley Wickham and others have developed a suite of R tools that help to “tidy up” messy data and establish and enforce a “grammar of data” that allows easy visualization, statistical modeling, and data merging without the need to “translate” the data for each step of the analysis. Hadley deservedly gets a lot of credit for his dplyr, reshape2, tidyr, and ggplot2 packages, but not nearly enough. At this point David Robinson’s excellent broom package for cleaning results from statistical models should also be mentioned.

The idea of tidy data is surprisingly simple. Here are the two most basic rules (see Hadley’s paper for more details).

1. Each variable forms a column.
2. Each observation forms a row.

Here is an example. This is a standard form of recording biological research data such as data from a PCR experiment with three replicates. At first glance, the data looks pretty tidy. Genes in rows, replicates in columns. What’s wrong here?

In fact, this way of recording data violates both basic rules of a tidy dataset. First, the “replicates” are not distinct variables but instances of the same variable, which violates the first rule. Second, each measurement is a distinct observation and should have its own row. Clearly, this is not the case either.

This is what the same data looks like once cleaned up. Each variable is a column, each observation is a row.

Representing data “the tidy way” is not novel. It has been called the “long” format previously, as opposed to the “wide” (“messy”) format. Although “tidy” and “messy” imply a value judgement, it is important to note that while the tidy/long format has distinct advantages for data analysis, the wide format is often seen as the more intuitive and almost always is the more concise.

The most important advantage of tidy data in data analysis is that there is one way of representing the data in a tidy format, while there are many possible ways of having a messy data structure. Take the example of the messy data from above. Storing replicates in rows and genes in columns (the transpose) would have been an equivalent representation to the one shown above. However, cleaning up both representations results in the same tidy data representation shown. This advantage becomes even more important with datasets that contain multiple variables.
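The claim that two different messy layouts collapse into one tidy table can be checked with a few lines of base R. This is a sketch with made-up numbers; the helper `to_tidy` is hypothetical and relies on the matrix having named dimensions (“gene” and “replicate”).

```r
# Made-up wide table: genes in rows, replicates in columns
m <- matrix(c(1.0, 1.2, 0.9,
              2.1, 2.3, 1.9),
            nrow = 2, byrow = TRUE,
            dimnames = list(gene = c("geneA", "geneB"),
                            replicate = c("rep1", "rep2", "rep3")))

# Turn a matrix with named dimensions into one row per measurement
to_tidy <- function(mat) {
  long <- as.data.frame(as.table(mat), responseName = "measurement",
                        stringsAsFactors = FALSE)
  # fix the column and row order so that the layout of the input does not matter
  long <- long[order(long$gene, long$replicate),
               c("gene", "replicate", "measurement")]
  rownames(long) <- NULL
  long
}

tidy_a <- to_tidy(m)      # genes in rows
tidy_b <- to_tidy(t(m))   # the transpose: replicates in rows

# Both messy layouts end up as the same tidy table
all.equal(tidy_a, tidy_b)
tidy_a
```

Whatever the orientation of the original spreadsheet, downstream code only ever has to deal with the one long table.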

A related, but more technical advantage of the tidy format is that it simplifies the use of loops and vectorized programming (implicit loops) because the “one variable, one column – one observation, one row” structure enforces a “linearization” of the data that is more easily dealt with from a programming perspective.

Having data in a consistent format allows feeding data into visualization and modeling tools without spending time on getting the data in the right shape. Similarly, tidy datasets from different sources can be more easily merged and analyzed together.

While data in marketing is sometimes called “cheap”, research data in science is generally very expensive, both in terms of time and money. Taking the extra step of recording and sharing data in a “tidy” format would make data analysis in biomedical research and clinical trials more effective and potentially more productive.

In a follow-up post, I will cover the practical application of some of the R tools developed to work with tidy data using an example most experimental biologists are familiar with: statistical analysis and visualization of quantitative PCR.

# Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns. Such data violates rule #2 of Hadley Wickham’s definition of tidy data. Not all observations are in separate rows. In order to work with this data set, we generally want to “unnest” those measurements to make each a separate row.

### The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

library(stringr)
# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))


### The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.

library(tidyr)
# use dplyr/magrittr style piping
data_tidy2 <- data %>%
# split "value" into lists
transform(value = str_split(value, ";")) %>%
# unnest magic
unnest(value)


Much tidier! Just split and “unnest”. On top of that, “unnest” nicely fits into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”).

all.equal(data_tidy, data_tidy2)


The results of both methods are equivalent. Your choice, I have made mine!
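For a quick experiment without the original file, the same idea can be tried on a made-up data frame. The sketch below uses `separate_rows()`, another tidyr function that splits and unnests in a single step; the data frame `toy` and its contents are hypothetical.

```r
library(tidyr)

# Hypothetical data frame with several measurements packed into one cell
toy <- data.frame(id = c("a", "b", "c"),
                  value = c("1;2;3", "4", "5;6"),
                  stringsAsFactors = FALSE)

# Split "value" at the delimiter and give each piece its own row
toy_tidy <- separate_rows(toy, value, sep = ";")
toy_tidy
```

The three original rows become six, with “id” repeated once for every nested measurement. Note that “value” stays a character column unless it is converted explicitly.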

### Synonymous fission yeast gene names

This is what the workflow would look like using an ID mapping file of synonymous gene names of the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “Pombase” website.

# read data
raw <- read.delim("sysID2product.tsv", header = FALSE, stringsAsFactors = FALSE)
names(raw) <- c("orf", "symbol", "synonyms", "protein")
# unnest
data <- raw %>%
transform(synonyms = str_split(synonyms, ",")) %>%
unnest(synonyms)


Tidy code is almost as much of a blessing as tidy data.

### Reproducibility

The full R code is available on Github.

# PCA – Part 5: Eigenpets

In this post scriptum to my series on Principal Component Analysis (PCA) I will show how PCA can be applied to image analysis. Given a number of images of faces, PCA decomposes those images into “eigenfaces” that are the basis of some facial recognition algorithms. Eigenfaces are the eigenvectors of the image data matrix. I have shown in Part 3 of this series that the eigenvectors that capture a given amount of variance of the data can be used to obtain an approximation of the original data using fewer dimensions. Likewise, we can approximate the image of a human face by a weighted combination of eigenfaces.

### From images to matrices

A digital image is just a matrix of numbers. Fair game for PCA. You might ask yourself, though, how to coerce many matrices into a single data matrix with samples in the rows and features in the columns? Just stack the columns of the image matrices to get a single long vector and then stack the image vectors to obtain the data matrix. For example, a 64 by 64 image matrix would result in a 4096-element image vector, and 100 such image vectors would be stacked into a 100 by 4096 data matrix.
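The stacking just described takes only a few lines of base R. This is a sketch with tiny random “images” standing in for real ones; the names `images` and `data_matrix` are made up for the illustration.

```r
# Three made-up 4 x 4 "images" (random pixel values stand in for real ones)
set.seed(1)
images <- replicate(3, matrix(runif(16), nrow = 4), simplify = FALSE)

# as.vector() stacks the columns of each matrix into one long vector;
# rbind() then stacks the image vectors into a samples-by-features matrix
data_matrix <- do.call(rbind, lapply(images, as.vector))
dim(data_matrix)   # 3 images (rows) by 16 pixels (columns)

# Reshaping a row recovers the original image
identical(matrix(data_matrix[1, ], nrow = 4), images[[1]])
```

Nothing is lost in the process: as long as the image dimensions are known, every row can be folded back into its image.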

The more elements a matrix has, the more computationally expensive it becomes to do the matrix factorization that yields the eigenvectors and eigenvalues. As the number of images (samples) is usually much smaller than the number of pixels (features), it is more efficient to compute the eigenvectors of the transpose of the data matrix with the pixels in the rows and the images in the columns.

Fortunately, there is an easy way to get from the eigenvectors of the covariance matrix $\boldsymbol A^T \boldsymbol A$ to those of the covariance matrix $\boldsymbol A \boldsymbol A^T$. The eigenvector equation of $\boldsymbol A^T \boldsymbol A$ is $\boldsymbol A^T \boldsymbol A \vec{v} = \lambda \vec{v}$.

Multiplying both sides by $\boldsymbol A$ on the left gives us important clues about the relationship between the eigenvectors and eigenvalues of the two matrices: $\boldsymbol A \boldsymbol A^T (\boldsymbol A \vec{v}) = \lambda (\boldsymbol A \vec{v})$.

The nonzero eigenvalues of the covariance matrices $\boldsymbol A^T \boldsymbol A$ and $\boldsymbol A \boldsymbol A^T$ are identical. If $\boldsymbol A$ is an $m$ by $n$ matrix and $m < n$, the larger $n$ by $n$ matrix $\boldsymbol A^T \boldsymbol A$ has at most $m$ nonzero eigenvalues and at least $n-m$ zero eigenvalues.

To get from an eigenvector $\vec{v}$ of $\boldsymbol A^T \boldsymbol A$ to an eigenvector of $\boldsymbol A \boldsymbol A^T$, we just need to multiply by $\boldsymbol A$ on the left.
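The relationship is easy to verify numerically. The following sketch uses a small random matrix as a stand-in for a real data matrix; the dimensions and variable names are arbitrary choices for the illustration.

```r
set.seed(1)
m <- 3
n <- 5
A <- matrix(rnorm(m * n), nrow = m)   # hypothetical m x n matrix with m < n

small <- eigen(A %*% t(A), symmetric = TRUE)   # m x m problem
large <- eigen(t(A) %*% A, symmetric = TRUE)   # n x n problem

# The m nonzero eigenvalues agree; the remaining n - m are zero
# up to floating-point error
all.equal(small$values, large$values[1:m])
max(abs(large$values[(m + 1):n]))

# Map an eigenvector v of t(A) %*% A to an eigenvector of A %*% t(A)
v <- large$vectors[, 1]
lambda <- large$values[1]
Av <- A %*% v
all.equal(A %*% t(A) %*% Av, lambda * Av)
```

Note that `Av` is not normalized; its length depends on the eigenvalue. It has to be rescaled if unit-length eigenvectors are needed.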

### What does an “eigenpet” look like?

For this demonstration, I will not be using images of human faces but, in line with the predominant interests of the internet, faces of cats and dogs. I obtained this data set when I took the course “Computational Methods for Data Analysis” on Coursera and converted the original data to text files to make them more accessible to R. The data can be found on Github.

library(RCurl)
# read data from Github
cats <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/cat.csv"), sep = ",")
dogs <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/dog.csv"), sep = ",")
# combine cats and dogs into single data frame
pets <- cbind(cats, dogs)


The data matrix already is in a convenient format. Each 64 by 64 pixel image has been converted into a 4096-pixel vector, and the 160 image vectors are placed side by side as columns to obtain a data matrix with dimensions 4096 by 160. As you might have guessed, there are 80 cats and 80 dogs in the data set.

The distinction between what is a sample and what is a feature becomes a little blurred in this analysis. Technically, the 160 images are the samples and the 4096 pixels are the features, so we should do PCA on a 160 by 4096 matrix. However, as discussed above it is more convenient to operate with the transpose of this matrix, which is why image data usually is prepared in its transposed form with features in rows and samples in columns.

As we have seen theoretically in Part 2 of this series, it is necessary to center (and scale) the features before performing PCA. We are dealing with 8-bit greyscale pixel values that are naturally bounded between 0 and 255, so they already are on the same scale. Moreover, all features measure the same quantity (brightness), so it is customary to center the data by subtracting the “average” image from the data.

# compute "average" pet
pet0 <- rowMeans(pets)
cat0 <- rowMeans(pets[, 1:80])
dog0 <- rowMeans(pets[, 81:160])


So what do the average pet, the average cat, and the average dog look like?

# create grey scale color map
greys <- gray.colors(256, start = 0, end = 1)
# convenience function to plot pets
show_image <- function(v, n = 64, col = greys) {
  # transform image vector back to matrix
  m <- matrix(v, ncol = n, byrow = TRUE)
  # invert columns to obtain right orientation
  # plot using "image"
  image(m[, nrow(m):1], col = col, axes = FALSE)
}
# plot average pets
for (i in list(pet0, cat0, dog0)) {
  show_image(i)
}


As expected, the average pet has features of both cats and dogs, while the average cat and dog are quite recognizable as members of their respective species.

Let’s run PCA on the data set and see what an “eigenpet” looks like.

# subtract average pet
pets0 <- pets - pet0
# run pca
pc <- prcomp(pets0, center = FALSE, scale. = FALSE)


Ordinarily, we would find the eigenvector matrix in the “rotation” object returned to us by “prcomp”. Remember that we did PCA on the transpose of the data matrix, so we have to do some additional work to get to the eigenvectors of the original data. I have shown above that to get from the eigenvectors of $\boldsymbol A^T \boldsymbol A$ to the eigenvectors of $\boldsymbol A \boldsymbol A^T$, we just need to multiply by $\boldsymbol A$ on the left. $\boldsymbol A$, in this case, is our centered data matrix “pets0”.

Note that the unscaled eigenvectors of the eigenfaces are equivalent to the projection of the data onto the eigenvectors of the transposed data matrix. Therefore, we do not have to explicitly compute them but can use the “pc$x” object.

# obtain unscaled eigenvectors of eigenfaces
u_unscaled <- as.matrix(pets0) %*% pc$rotation
# this turns out to be the same as the projection of the data
# stored in "pc$x"
all.equal(u_unscaled, pc$x)


Both “u_unscaled” and “pc$x” are unscaled versions of the eigenvectors, which means that their columns do not have unit length. For plotting, this does not matter because the images will be scaled automatically. If scaled eigenvectors are important, it is more convenient to use singular value decomposition and use the left singular vectors stored in the matrix “u”.

# singular value decomposition of data
sv <- svd(pets0)
# left singular vectors are stored in matrix "u"
u <- sv$u
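How the unscaled projections relate to the unit-length vectors from the singular value decomposition can be checked on a small example. This is a sketch on random data, not the pet images; the matrix `X` and its dimensions are made up.

```r
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)   # hypothetical 20 x 5 data matrix
X <- sweep(X, 2, colMeans(X))           # center the columns

pc <- prcomp(X, center = FALSE, scale. = FALSE)
sv <- svd(X)

# The length of each column of pc$x equals the corresponding singular value ...
col_lengths <- sqrt(colSums(pc$x^2))
all.equal(unname(col_lengths), sv$d)

# ... so dividing the singular values out gives the unit-length columns of sv$u
# (compared via abs() because the sign of each vector is arbitrary)
u_scaled <- sweep(pc$x, 2, sv$d, "/")
all.equal(abs(unname(u_scaled)), abs(sv$u))
```

In other words, the projection stored in “x” is the matrix “u” with each column stretched by its singular value.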


Let’s look at the first couple of “eigenpets” (eigenfaces).

# display first 6 eigenfaces
for (i in seq(6)) {
  show_image(pc$x[, i])
}


Some eigenfaces are definitely more cat-like and others more dog-like. We also see a common issue with eigenfaces. The directions of highest variance are usually associated with the lighting conditions of the original images. Whether they are predominantly light or dark dominates the first couple of components. In this particular case, it appears that the images of cats have stronger contrasts between foreground and background. If we were to do face recognition, we would either preprocess the images to have comparable lighting conditions or exclude the first couple of eigenfaces.

### Reconstruction of Pets using “Eigenpets”

Using the eigenfaces we can approximate or completely reconstruct the original images. Let’s see what this looks like with a couple of different variance cut-offs.

# a number of variance cut-offs
vars <- c(0.2, 0.5, 0.8, 0.9, 0.98, 1)
# calculate cumulative explained variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# get the smallest number of components that explains a given amount of variance
# (the small tolerance guards against rounding error at a cut-off of 1)
npc <- sapply(vars, function(v) which(var_expl >= v - 1e-8)[1])
# reconstruct two cats (79, 80) and two dogs (81, 82)
for (i in seq(79, 82)) {
  for (j in npc) {
    # project data using "j" principal components
    r <- pc$x[, 1:j] %*% t(pc$rotation[, 1:j])
    show_image(r[, i])
    text(x = 0.01, y = 0.05, pos = 4, labels = paste("v =", round(var_expl[j], 2)))
    text(x = 0.99, y = 0.05, pos = 2, labels = paste("pc =", j))
  }
}


Every pet starts out looking like a cat due to the dominance of lighting conditions. Somewhere between 80% and 90% of captured variance, every pet is clearly identifiable as either a cat or a dog. Even at 90% of captured variance, we use fewer than one third of all components for our approximation of the original image. This may be enough for a machine learning algorithm to tell apart a cat from a dog with high confidence. We see, however, that a lot of the detail of the image is contained in the last two thirds of the components.
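The link between captured variance and reconstruction quality can be made explicit on a small example. This is a sketch on simulated data rather than the pet images; the helper functions `n_components` and `reconstruct` are hypothetical names introduced for the illustration.

```r
set.seed(1)
# correlated toy data: 30 samples, 6 features
X <- matrix(rnorm(30 * 6), nrow = 30) %*% matrix(rnorm(36), nrow = 6)
X <- sweep(X, 2, colMeans(X))   # center the columns

pc <- prcomp(X, center = FALSE, scale. = FALSE)
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)

# smallest number of components that reaches a given share of the variance
n_components <- function(cutoff) which(var_expl >= cutoff - 1e-8)[1]

# approximation of the data from the first k components
reconstruct <- function(k) {
  pc$x[, 1:k, drop = FALSE] %*% t(pc$rotation[, 1:k, drop = FALSE])
}

# relative squared reconstruction error at the 90% cut-off and with all components
k90 <- n_components(0.9)
err90 <- sum((X - reconstruct(k90))^2) / sum(X^2)
err_full <- sum((X - reconstruct(ncol(X)))^2) / sum(X^2)

# the error is exactly the share of variance that was left out
all.equal(err90, 1 - var_expl[k90])
err_full
```

For centered data, the fraction of squared error left after reconstruction equals the fraction of variance not captured, which is why the images sharpen as the cut-off approaches 1.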
### Reproducibility

The full R code is available on Github.

### Further Reading

- Scholarpedia – Eigenfaces
- Jeff Jauregui – Principal Component Analysis with Linear Algebra

### PCA SERIES

- Part 1: An Intuition
- Part 2: A Look Behind The Curtain
- Part 3: In the Trenches
- Part 4: Potential Pitfalls
- Part 5: Eigenpets

# PCA – Part 4: Potential Pitfalls

In the first three parts of this series on principal component analysis (PCA), we have talked about what PCA can do for us, what it is mathematically, and how to apply it in practice. Today, I will briefly discuss some of the potential caveats of PCA.

### Information and Noise

PCA looks for the dimensions with highest variance within the data and assumes that high variance is a proxy for “information”. This assumption is usually warranted, otherwise PCA would not be useful.

In cases of unsupervised learning, that is if we have no class labels of the data available, looking for structure within the data based on the data itself is our only choice. In a sense, we cannot tell what parts of the data are information and what parts are noise.

If we have class labels available (supervised learning), we could in principle look for dimensions of variance that optimally separate the classes from each other. PCA does not do that. It is “class agnostic” and thus treats “information”-variance and “noise”-variance the same way.

It is possible that principal components associated with small eigenvalues nevertheless carry the most information. In other words, the size of the eigenvalue and the information content are not necessarily correlated. When choosing the number of components to project our data, we could thus lose important information. Luckily, such situations rarely happen in practice. Or we just never realize …

There are other techniques related to PCA that attempt to find dimensions of the data that optimally separate the data based on class labels.
The most famous is Fisher’s “Linear Discriminant Analysis” (LDA) and its non-linear cousin “Quadratic Discriminant Analysis” (QDA).

### Interpretability

In Part 3 of this series, we have looked at a data set containing a multitude of motion detection measurements of humans doing various activities. We used PCA to find a lower dimensional representation of those measurements that approximates the data well.

Each of the original measurements was quite tangible (despite their sometimes cryptic names) and therefore interpretable. After PCA, we are left with linear combinations of those original features, which may or may not be interpretable. It is far from guaranteed that the eigenvectors correspond to “real” entities; they may just be convenient summaries of the data.

We will rarely be able to say the first principal component means “X” and the second principal component means “Y”, however tempting it may be based on our preconceived notions of the data. A good example of that is mentioned in Cosma Shalizi’s excellent notes on PCA. Cavalli-Sforza et al. analyzed the distribution of human genes using PCA and interpreted the principal components as patterns of human migration and population expansion. Later, Novembre and Stephens showed that similar patterns could be obtained using simulated data with spatial correlation. As humans are genetically more similar to the humans they live close to (at least historically), genetic data is necessarily spatially correlated and thus PCA will uncover such structures, even if they do not represent “real” events or are liable to misinterpretation.

### Independence

Linear algebra tells us that the eigenvectors of a symmetric matrix, such as a covariance matrix, are orthogonal to each other. A set of $n$ orthogonal vectors forms a basis of an $n$-dimensional subspace. The principal components are eigenvectors of the covariance matrix and the set of principal components forms a basis for our data. We also say that the principal components are “uncorrelated”.
This becomes obvious when we remember that matrix decomposition is sometimes called “diagonalization”. In the eigendecomposition, the matrix containing the eigenvalues has zeros everywhere but on its diagonal, which contains the eigenvalues. In the basis of the principal components, all covariances (the off-diagonal elements) are therefore zero.

Variance and covariance are second-order statistics, which means that they involve the second moment or square. Being uncorrelated at second order does not mean that there is no dependence in higher-order moments; in other words, the absence of correlation does not imply independence. In statistics, the higher-order moments are skewness (asymmetry, third moment) and kurtosis (“tailedness”, fourth moment).

Techniques related to PCA such as Independent Component Analysis (ICA) can be used to extract two separate but mixed signals (“independent components”) from each other based on higher-order statistics.

The distinction between correlation and independence is a technical point when it comes to the practical application of PCA but certainly worth being aware of.

### Further reading

- Cosma Shalizi – Principal Components: Mathematics, Example, Interpretation
- Cavalli-Sforza et al. – The History and Geography of Human Genes (1994)
- Novembre & Stephens – Interpreting principal component analyses of spatial genetic variation (2008)

# PCA – Part 3: In the Trenches

Now that we have an intuition of what principal component analysis (PCA) is and understand some of the mathematics behind it, it is time we make PCA work for us.

Practical examples of PCA typically use Ronald Fisher’s famous “Iris” data set, which contains four measurements (sepal and petal lengths and widths) of three species of Iris flowers. To mix things up a little bit, I will use a data set that is closer to what you would encounter in the real world.
The “Human Activity Recognition Using Smartphones Data Set” available from the UCI Machine Learning Repository contains a total of 561 triaxial acceleration and angular velocity measurements of 30 subjects performing different movements such as sitting, standing, and walking. The researchers collected this data set to ask whether those measurements would be sufficient to tell the type of activity of the person. Instead of focusing on this classification problem, we will look at the structure of the data and use PCA to investigate whether we can express the information contained in the 561 different measurements in a more compact form.

I will be using a subset of the data containing the measurements of only three subjects. As always, the code used for the pre-processing steps of the raw data can be found on GitHub.

### Step 1: Explore the data

Let’s first load the pre-processed subset of the data into our R session.

# getURL() is provided by the RCurl package
library(RCurl)
# read data from GitHub
measurements <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_measurements.txt"))
description <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/pca_part3_description.txt"))

It’s always a good idea to check for a couple of basic things first. The big three I usually check are:

• What are the dimensions of the data, i.e. how many rows and columns?
• What type of features are we dealing with, i.e. categorical, ordinal, continuous?
• Are there any missing values?

The answers to those three questions will determine the amount of additional data munging we have to do before we can use the data for PCA.

# what are the dimensions of the data?
dim(measurements)
# what type of data are the features?
table(sapply(measurements, class))
# are there missing values?
any(is.na(measurements))

The data contains 990 samples (rows) with 561 measurements (columns) each. Clearly too many measurements for visualizing on a scatterplot.
The measurements are all of type “numeric”, which means we are dealing with continuous variables. This is great because categorical and ordinal variables are not handled well by PCA. Those need to be “dummy coded”. We also don’t have to worry about missing values. Strategies for handling missing values are a topic of their own.

Before we run PCA on the data, we should look at the correlation structure of the features. If there are features, i.e. measurements in our case, that are highly correlated (or anti-correlated), there is redundancy within the data set and PCA will be able to find a more compact representation of the data.

# image.plot() is provided by the fields package
library(fields)
# feature correlation before PCA
cor_m <- cor(measurements, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_m <- cor_m[upper.tri(cor_m)]
# plot correlations as histogram
hist(upt_m, prob = TRUE)
# plot correlations as image
image.plot(cor_m, axes = FALSE)

The code was simplified for clarity. The full version can be found in the script.

We see in the histogram on the left that there is a considerable number of highly correlated features, most of them positively correlated. Those features show up as yellow in the image representation to the right. PCA will likely be able to provide us with a good lower dimensional approximation of this data set.

### Step 2: Run PCA

After all the preparation, running PCA is just one line of code. Remember that we need to at least center the data before using PCA (Why? See Part 2). Scaling is technically only necessary if the magnitudes of the features are vastly different. Note that the data appears to be already centered and scaled from the get-go.

# run PCA
pc <- prcomp(measurements, center = TRUE, scale. = TRUE)

Depending on your system and the number of features of your data, this may take a couple of seconds.
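To make the effect of centering and scaling tangible, here is a minimal sketch. It assumes R’s built-in `iris` data instead of the activity data, and all object names are illustrative. It checks that letting “prcomp” standardize the features gives the same standard deviations as standardizing by hand with `scale`, and it compares the share of variance per component with and without scaling.

```r
# built-in iris data: four numeric measurements per flower (stand-in data set)
X <- as.matrix(iris[, 1:4])

# let prcomp center and scale the features
pc_auto <- prcomp(X, center = TRUE, scale. = TRUE)

# standardize by hand, then run prcomp without further preprocessing
pc_manual <- prcomp(scale(X), center = FALSE, scale. = FALSE)

# both routes should give the same standard deviations
same_sdev <- all.equal(pc_auto$sdev, pc_manual$sdev)

# without scaling, features with large variance weigh in more heavily
pc_unscaled <- prcomp(X, center = TRUE, scale. = FALSE)
share_scaled <- pc_auto$sdev^2 / sum(pc_auto$sdev^2)
share_unscaled <- pc_unscaled$sdev^2 / sum(pc_unscaled$sdev^2)

print(same_sdev)
print(round(rbind(share_scaled, share_unscaled), 3))
```

On data whose features are already on a comparable scale, the two rows of the printed table would be close to each other; the further apart they are, the more the scaling decision matters.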
The call to “prcomp” has constructed new features by linear combinations of the old features and sorted them by the amount of variance they explain. Because the new features are the eigenvectors of the feature covariance matrix, they should be orthogonal, and hence uncorrelated, by definition. Let’s visualize this directly.

The new representation of the data is stored as a matrix named “x” in the list object we get back from “prcomp”. In our case, the matrix is stored as “pc$x”.

# feature correlation after PCA
cor_r <- cor(pc$x, method = "pearson")
# use only upper triangular matrix to avoid redundancy
upt_r <- cor_r[upper.tri(cor_r)]
# plot correlations as histogram
hist(upt_r, prob = TRUE)
# plot correlations as image (image.plot() is from the fields package)
image.plot(cor_r, axes = FALSE)

The new features are clearly no longer correlated with each other. As everything seems to be in order, we can now focus on the interpretation of the results.

### Step 3: Interpret the results

The first thing you will want to check is how much variance is explained by each component. In PCA speak, this can be visualized with a “scree plot”. R conveniently has a built-in function to draw such a plot.

# draw a scree plot
screeplot(pc, npcs = 10, type = "lines")

This is about as good as it gets. A large amount of the variance is captured by the first principal component, followed by a sharp decline as the remaining components gradually explain less and less variance, approaching zero.

The decision of how many components we should use to get a good approximation of the data has to be made on a case-by-case basis. The cut-off for the percentage of explained variance depends on the kind of data you are working with and its inherent covariance structure. The majority of the data sets you will encounter are not nearly as well behaved as this one, meaning that the decline in explained variance is much more shallow. Common cut-offs range from 80% to 95% of explained variance.

Let’s look at how many components we would need to explain a given amount of variance. In the R implementation of PCA, the variances explained by each principal component are stored in a vector called “sdev”. As the name implies, these are standard deviations, i.e. the square roots of the variances, which in turn are scaled versions of the eigenvalues. We will need to take the squares of “sdev” to get back the variances.
# calculate explained variance as cumulative sum
# sdev are the square roots of the variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# plot explained variance, starting at zero components
plot(x = 0:length(var_expl), y = c(0, var_expl), type = "l", lwd = 2, ylim = c(0, 1),
     xlab = "Principal Components", ylab = "Variance explained")
# plot number of components needed for common cut-offs of variance explained
vars <- c(0.8, 0.9, 0.95, 0.99)
for (v in vars) {
  # first component at which the cumulative variance exceeds the cut-off
  npc <- which(var_expl > v)[1]
  lines(x = c(0, npc, npc), y = c(v, v, 0), lty = 3)
  text(x = npc, y = v - 0.05, labels = npc, pos = 4)
  points(x = npc, y = v)
}

The first principal component on its own explains more than 50% of the variance and we need only 20 components to get up to 80% of the explained variance. Fewer than 30% of the components (162 out of 561) are needed to capture 99% of the variance in the data set. This is a dramatic reduction of complexity. Being able to approximate the data set with a much smaller number of features can greatly speed up downstream analysis and can help to visualize the data graphically.

Finally, let’s investigate whether “variance” translates to “information”. In other words, do the principal components associated with the largest eigenvalues discriminate between the different human activities?

If the class labels (“activities” in our case) are known, a good way to look at the “information content” of the principal components is to look at scatter plots of the first couple of components and color-code the samples by class label. This code gives you a bare bones version of the figure shown below. The complete code can be found on GitHub.

# plot the first 8 principal components against each other
for (p in seq(1, 8, by = 2)) {
  # factor() makes the color coding work whether the labels are strings or factors
  plot(pc$x[, p:(p+1)], pch = 16,
       col = as.numeric(factor(description$activity_name)))
}

We have seen previously that the first component alone explains about half of the variance and in this figure we see why. It almost perfectly separates non-moving “activities” (“laying”, “sitting”, “standing”) from moving activities (various types of “walking”). The second component does a reasonable job at telling the difference between walking and walking upstairs. As we move down the list, there remains visible structure but distinctions become somewhat less clear. One conclusion we can draw from this visualization is that it will likely be most difficult to tell “sitting” apart from “standing” as none of the dimensions seems to be able to distinguish red and green samples. Oddly enough, the fifth component does a pretty good job of separating “laying” from “sitting” and “standing”.

### Recap

PCA can be a powerful technique to obtain low dimensional approximations of data with lots of redundant features. The “Human Activity Recognition Using Smartphones Data Set” used in this tutorial is a particularly good example of that. Most real data sets will not be reduced to a few components so easily while retaining most of the information. But even cutting the number of features in half can lead to considerable time savings when using machine learning algorithms.

Here are a couple of useful questions when approaching a new data set to apply PCA to:

1. Are the features numerical or do I have to convert categorical features?
2. Are there missing values and if yes, which strategy do I apply to deal with them?
3. What is the correlation structure of the data? Will PCA be effective in this case?
4. What is the distribution of variances after PCA? Do I see a steep or shallow decline in explained variance?
5. How much “explained variance” is a good enough approximation of the data?
This is usually a compromise between how much potential information I am willing to sacrifice and how much computation time I want to save in follow-up analyses.

In the final part of this series, we will discuss some of the limitations of PCA.

### Addendum: Understanding “prcomp”

The “prcomp” function is very convenient because it calculates all the numbers we could possibly want from our PCA analysis in one line. However, it is useful to know how those numbers were generated.

The three most frequently used objects returned by “prcomp” are

• “rotation”: right eigenvectors (“feature eigenvectors”)
• “sdev”: square roots of scaled eigenvalues
• “x”: projection of original data onto the new features

#### Rotation

In Part 2, I mentioned that software implementations of PCA usually compute the eigenvectors of the data matrix using singular value decomposition (SVD) rather than eigendecomposition of the covariance matrix. R’s “prcomp” is no exception.

“Rotation” is a matrix whose columns are the right eigenvectors of the original data. We can reconstruct “rotation” using SVD.

# perform singular value decomposition on centered and scaled data
sv <- svd(scale(measurements))
# "prcomp" stores right eigenvectors in "rotation"
w <- pc$rotation
dimnames(w) <- NULL
# "svd" stores right eigenvectors in matrix "v"
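A minimal sketch of how such a comparison might be completed, using R’s built-in `iris` data as a stand-in for “measurements”. The object names and the choice of data are assumptions made for illustration. Singular vectors are only defined up to their sign, so the sketch compares absolute values to stay on the safe side. It also shows one way in which “sdev” relates to the singular values.

```r
# stand-in data: centered and scaled iris measurements (assumption, not the activity data)
X <- scale(as.matrix(iris[, 1:4]))

pc_demo <- prcomp(X)   # centering again has no effect on already centered data
sv_demo <- svd(X)

# "prcomp" stores the right singular vectors in "rotation"
w <- pc_demo$rotation
dimnames(w) <- NULL
# "svd" stores the right singular vectors in matrix "v"
v <- sv_demo$v

# compare up to sign
rotation_matches <- all.equal(abs(w), abs(v))

# "sdev" corresponds to the singular values divided by sqrt(n - 1)
sdev_matches <- all.equal(pc_demo$sdev, sv_demo$d / sqrt(nrow(X) - 1))

print(rotation_matches)
print(sdev_matches)
```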


#### x

The projection of the original data (“measurements”) onto its eigenbasis is automatically calculated by “prcomp” through its default argument “retx = TRUE” and stored in “x”. We can manually recreate the projection using matrix-matrix multiplication.

# manual projection of data
all.equal(pc$x, scale(measurements) %*% pc$rotation)


If we wanted to obtain a projection of the data onto a lower dimensional subspace, we just determine the number of components needed and subset the columns of matrix “x”. For example, if we wanted to get an approximation of the original data preserving 90% of the variance, we take the first 52 columns of “x”.

# projection of original data preserving 90% of variance
y90 <- pc$x[, 1:52]
# note that this is equivalent to matrix multiplication with
# the first 52 eigenvectors
all.equal(y90, scale(measurements) %*% pc$rotation[, 1:52])
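Going the other way, a reduced projection can be rotated back into the original feature space to obtain an approximation of the data. The following is a minimal sketch under stated assumptions: it uses the built-in `iris` data rather than “measurements”, keeps an arbitrary two components, and all names are illustrative.

```r
# stand-in data: centered and scaled iris measurements
X <- scale(as.matrix(iris[, 1:4]))
pc_demo <- prcomp(X, center = FALSE)

# projection onto the first k components
k <- 2
y_k <- pc_demo$x[, 1:k]

# rotate back into the original feature space using the same k eigenvectors
X_approx <- y_k %*% t(pc_demo$rotation[, 1:k])

# using all components recovers the data up to floating point error
X_full <- pc_demo$x %*% t(pc_demo$rotation)

# mean squared reconstruction error
mse_k <- mean((X - X_approx)^2)
mse_full <- mean((X - X_full)^2)

print(c(mse_k = mse_k, mse_full = mse_full))
```

The error of the reduced reconstruction comes entirely from the discarded components, which is why a steep decline in explained variance translates into a good approximation.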


### Reproducibility

The full R code is available on GitHub.

##### IRIS data set

Sebastian Raschka – Principal Component Analysis in 3 Simple Steps


# PCA – Part 1: An intuition

Pee-See-Ay. Just the sound of these three letters inspires awe in an experimental biologist. I am speaking from personal experience. Let’s grab a hammer and chisel and get to work on that pedestal.

In this first post of a series on principal component analysis (PCA), my goal is to avoid any mathematical formalism and provide an intuition about what PCA is and how this technique can be useful when analyzing data.

In follow-up posts, I will explain the mathematical basis of PCA, show practical examples of applying PCA using R, and discuss some points to keep in mind when using PCA.

### When to do PCA?

Most explanations of PCA start out with what it is mathematically and how it can be calculated. We will get to that later. I usually find it more instructive to first understand in which situations a method like PCA should be applied. Once I know what the practical applications are, the mathematics tend to make a lot more sense.

My own work is in genomics, so the first example I can think of is generally related to gene expression. Imagine a number of patient samples for which you have measured the transcript levels of all expressed genes. The number of patient samples (<100) is usually much lower than the number of transcripts you measure (>10000). In other words, the number of observations (samples) is smaller than the number of features (transcripts).

For most types of analyses, especially machine learning algorithms, this is not a favorable starting point. When faced with a large number of features, it is safe to assume that a good number of them contain very little information, such as measurements on genes that are not differentially expressed, or that there is redundancy (collinearity) between features, e.g. genes that vary together. In any event, too many features increase computation time and collinear features can even produce unstable results in some algorithms.

In such a situation, PCA can be extremely useful because one of its strengths is to optimize the features based on the “hidden” structure of the data. PCA aims to find a new representation of the data by extracting combinations of features (components) from the data that are uncorrelated with each other and ordered by “importance”. This allows us to approximate the data with a reduced number of features. Dimensional reduction makes PCA a valuable tool not only for preparing data for machine learning but also for exploratory data analysis and data visualization.

### What does PCA do?

Mathematically, the new features PCA provides us with are linear combinations of the old features. That means each of the old features weighs in – to different extents – to make up the new features. In PCA jargon, the weights of the original features are called “loadings”. It is a little bit like saying, take 1 cup of flour, 1 egg, 1 cup of milk, 3 teaspoons of baking powder, 1 teaspoon of sugar, and 1 teaspoon of salt and call it “pancake”. So our new feature (pancake) is made up of a combination of different amounts (the “loadings”) of the old features (the ingredients).
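The recipe analogy can be made concrete with a few lines of R. In this minimal sketch (built-in `iris` data, illustrative names), the value of one sample on the first new feature is computed by hand as the weighted sum of its standardized old features, with the weights taken from the first column of the “rotation” matrix returned by `prcomp`.

```r
# stand-in data: centered and scaled iris measurements
X <- scale(as.matrix(iris[, 1:4]))
pc_demo <- prcomp(X, center = FALSE)

# one weight ("loading") per original feature for the first component
loadings_pc1 <- pc_demo$rotation[, 1]

# the old features of the first sample (the "ingredients")
sample1 <- X[1, ]

# new feature = sum of old features times their weights
score_by_hand <- sum(sample1 * loadings_pc1)

# the same value as computed by prcomp
score_prcomp <- unname(pc_demo$x[1, 1])

print(loadings_pc1)
print(c(by_hand = score_by_hand, prcomp = score_prcomp))
```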

An important aspect of PCA is that the new features are “orthogonal” to each other, which is a more general way of expressing what I have called “uncorrelated” before. Orthogonality is a concept of linear algebra, which defines two vectors as orthogonal if their dot product is 0. It is related to “perpendicular”, which is commonly understood as having a “right” angle between two entities, such as the sides of a square or the x- and y-axes of our Cartesian coordinate system. In fact, you can think of PCA as a rotation of the data into a new coordinate system that “fits” the data more naturally. The axes of the new coordinate system are the eigenvectors of the covariance matrix of the data. Eigenvectors are vectors particular (“eigen”) to a given matrix that do not change direction under linear transformation by that matrix. Eigenvectors may, however, change length and/or orientation. The amount of scaling under linear transformation is called the eigenvalue. Each eigenvector of a matrix is associated with its own eigenvalue.
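Both properties can be checked numerically. The sketch below is an illustration on the built-in `iris` data with made-up object names: it computes the eigenvectors of a covariance matrix with base R’s `eigen`, verifies that the dot product of two of them is zero, and verifies that multiplying the matrix with an eigenvector only rescales it by its eigenvalue.

```r
# covariance matrix of the standardized stand-in data
X <- scale(as.matrix(iris[, 1:4]))
C <- cov(X)

# eigendecomposition of the symmetric covariance matrix
eig <- eigen(C)
v1 <- eig$vectors[, 1]
v2 <- eig$vectors[, 2]
lambda1 <- eig$values[1]

# orthogonality: the dot product of two eigenvectors is (numerically) zero
dot12 <- sum(v1 * v2)

# eigenvector property: C %*% v1 points in the same direction as v1,
# scaled by the eigenvalue lambda1
Cv1 <- as.vector(C %*% v1)
is_scaled_copy <- all.equal(Cv1, lambda1 * v1)

print(dot12)
print(is_scaled_copy)
```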

### How does PCA work?

In general, a square matrix with $m$ rows and columns has $m$ eigenvector/eigenvalue pairs. How does PCA know which ones are the most important? The key assumption of PCA is that the directions within the data that show the most variance contain the most information and, thus, are likely the most important. We find the eigenvectors with the highest variance along their direction by eigendecomposition of the covariance or correlation matrix of the data. The eigenvector with the largest eigenvalue explains the most variance. It is also called the first “principal component”. Further components explain progressively less variance and are generally considered less “important”. Together, all components explain the total variance of the data, so nothing is lost by applying PCA itself. The original data can be reconstituted by rotating back to the original feature space.
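As a small numerical illustration of these statements, the following sketch (built-in `iris` data, illustrative names) compares the eigenvalues of the covariance matrix with the variances reported by `prcomp` and checks that the variances of all components add up to the total variance of the original features.

```r
# stand-in data: centered and scaled iris measurements
X <- scale(as.matrix(iris[, 1:4]))

# eigendecomposition of the covariance matrix
eig <- eigen(cov(X))

# PCA on the same data
pc_demo <- prcomp(X, center = FALSE)

# eigenvalues are the variances along the principal components
eigen_is_variance <- all.equal(eig$values, pc_demo$sdev^2)

# total variance before and after the rotation
total_var_features <- sum(apply(X, 2, var))
total_var_components <- sum(eig$values)

print(eigen_is_variance)
print(c(features = total_var_features, components = total_var_components))
```

Because each standardized feature has a variance of 1, the total variance here equals the number of features.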

Now we are in a position to understand why PCA can be used for dimensional reduction. If the first couple of eigenvalues are much larger than the rest, they capture the majority of the variance, and a projection of the data onto the first couple of components will result in a good approximation of the original data. We implicitly assume that we have concentrated the “signal” in the first couple of components, and the rest capture merely the “noise” of the data. There is no strict cut-off for what constitutes the majority of variance. Depending on the field of research, people use numbers between 60% and 95%.
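How such a cut-off translates into a number of components can be sketched in a few lines. The example assumes the built-in `iris` data and an arbitrarily chosen cut-off of 95%; both are illustrative choices rather than recommendations.

```r
# PCA on the standardized stand-in data
pc_demo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# share of variance explained by each component and its running total
var_share <- pc_demo$sdev^2 / sum(pc_demo$sdev^2)
var_cum <- cumsum(var_share)

# smallest number of components that reaches the chosen cut-off
cutoff <- 0.95
n_needed <- which(var_cum >= cutoff)[1]

print(round(var_cum, 3))
print(n_needed)
```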

### Recap

• PCA is a data analysis method that uses linear combinations of existing feature vectors to re-express your data in terms of uncorrelated, orthogonal components.
• The components are selected and sorted based on how much variance they explain within the data. Together, all components explain the total variance.
• PCA can be used to reduce the number of features within a data set by choosing only those components that explain the majority of variance and projecting the data onto a lower dimensional subspace. This can result in a good approximation of the original data.
