# Analyzing quantitative PCR data the tidy way

Previously in this series on tidy data: Taking up the cudgels for tidy data

One of the most challenging aspects of working with data is how easy it is to get lost. Even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two dimensional spreadsheets naturally leads to blurring of the distinction between observation and variable. Such data requires constant reformatting and its structure may not be intuitive to your fellow researcher.

Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:

1. Each variable forms a column
2. Each observation forms a row

A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables often are categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be a categorical variables.

An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.

Let’s look at an example of a typical recording of quantitative PCR data in Excel.

We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).

Looking at the columns, we see that information on “experiment” and “replicate” are stored in the names of the columns rather than the entries of the columns. This will have to be changed.

There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.

We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.

```messy <- read.csv("qpcr_messy.csv", row.names = 1)
```

This is the original data read into R. Let’s get started.

#### Row names should form their own column

The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.

```tidy <- data.frame(messy) %>%
# make row names a column
mutate(genotype = rownames(messy))
```

#### What are our variables?

Next, we need to think about what are our variables. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.

```tidy <- tidy %>%
# make each row a single measurement
gather(key = sample, value = measurement, -genotype) %>%
# make each column a single variable
separate(col = sample, into = c("experiment", "replicate"), sep = "_")
```

Here are the first 10 columns of the “tidy” representation of the initial Excel table. Before we can do statistical tests and visualization, we have to take care of one more thing.

#### Untangling implicit Domain specific hierarchies

Remember what we said before about the two different kind of replicates. Only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.

```data <- tidy %>%
# calculate mean of technical replicates by genotype and experiment
group_by(genotype, experiment) %>%
summarise(measurement = mean(measurement)) %>%
ungroup()
```

Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.

Now, the data is ready for analysis.

#### TIdy Statistical analysis of quantitative pcr data

The scientific rational for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way anova followed by Tukey’s post-hoc test.

```mod <- data %>%
# set "control" as reference
mutate(genotype = relevel(factor(genotype), ref = "control")) %>%
# one-way anova and Tukey's post hoc test
do(tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))
```

We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.

Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.

In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame.

Tukey’s post hoc test thinks “mutant1” is different from “control” but “mutant2” is not. Let’s visualize the results to get a better idea of how the data looks like.

#### tidy Visualization of quantitative PCR data

We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.

```# genotype will be on the x-axis, measurements on the y-axis
ggplot(data, aes(x = genotype, y = measurement, col = experiment)) +
# plot the mean of each genotype as a cross
stat_summary(fun.y = "mean", geom = "point", color = "black", shape = 3, size = 5) +
# plot the 95% confidence interval for each genotype
stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) +
# we we add the averaged measurements for each experiment
geom_point(shape = 16, size = 5) +
theme_classic()
```

We can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.

What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the the data for each step of the analysis. Learning about how to tidy up your data is an important step towards that goal.

The R code can be found on Github.

# Taking up the cudgels for tidy data

The abundance of data has led to a revolution in marketing and advertisement as well as in biomedical research. A decade ago, the emerging field of “systems biology” (no pun intended) promised to take basic research to the next level through the use of high-throughput screens and “big data”. Institutes were built, huge projects were funded, but surprisingly little of substance has been accomplished since.

There are three main reasons, I think, two of which are under our control and one is not.

During our first forays into understanding biology as a “system” we underestimated the complexity of even an isolated cell, let alone a multi-cellular organism. A somewhat complete description of even the most basic regulatory mechanisms and pathways remains a dream to this day due to myriads of adaptive mechanisms and cross-talks preventing the formulation of a coherent view. Unfortunately, this problem has to be overcome with improved scientific methodology or analysis and will take time.

There are things we can do right now, however.

Overly optimistic or incorrect interpretation of statistical results and “cherry-picking” of hits of high-throughput screens has led to a surprising number of publications that cannot be replicated or even be reproduced. I have written more extensively about this particular problem I refer to as the “Fisherman’s dilemma“.

The lack of standards for structuring data is another reason that prevents the use of existing data and makes merging data sets from different studies or sources unnecessary painful and time consuming. A common saying is that data science is 80% data cleaning, 20% data analysis. The same is true for bioinformatics, where one needs to wrestle with incomplete and messy datasets or, if it’s your lucky day, just different data formats. This problem is especially prevalent in meta-analysis of scientific data. Arguably, the integration of datasets from different sources is where we would predict to find some of the most important and universal results. Why else spend the time to generate expensive datasets if we don’t use them to compare and cross-reference?

If we were to spent less time on the arduous task of “cleaning” data, we could focus our attention on the question itself and the implementation of the analysis. In recent years, Hadley Wickham and others have developed a suite of R tools that help to “tidy up” messy data and establish and enforce a “grammar of data” that allows easy visualization, statistical modeling, and data merging without the need to “translate” the data for each step of the analysis. Hadley deservedly gets a lot of credit for his dplyr, reshape2, tidyr, and ggplot2 packages, but not nearly enough. At this point David Robinson’s excellent broom package for cleaning results from statistical models should also be mentioned.

The idea of tidy data is surprisingly simple. Here are the two most basic rules (see Hadley’s paper for more details).

1. Each variable forms a column.
2. Each observation forms a row.

Here is an example.

This is a standard form of recording biological research data such as data from a PCR experiment with three replicates. At first glance, the data looks pretty tidy. Genes in rows, replicates in columns. What’s wrong here?

In fact, this way of recording data violates both basic rules of a tidy dataset. First, the “replicates” are not distinct variables but instances of the same variable, which violates the first rule. Second, each measurement is a distinct observation and should have its own row. Clearly, this is not the case either.

This is how the same data looks like once cleaned-up.

Each variable is a column, each observation is a row.

Representing data “the tidy way” is not novel. It has been called the “long” format previously, as opposed to the “wide” (“messy”) format. Although “tidy” and “messy” imply a value judgement, it is important to note that while the tidy/long format has distinct advantages for data analysis, the wide format is often seen as the more intuitive and almost always is the more concise.

The most important advantage of tidy data in data analysis is that there is one way of representing the data in a tidy format, while there are many possible ways of having a messy data structure. Take the example of the messy data from above. Storing replicates in rows and genes in columns (the transpose) would have been an equivalent representation to the one shown above. However, cleaning up both representations results in the same tidy data representation shown. This advantage becomes even more important with datasets that contain multiple variables.

A related, but more technical advantage of the tidy format is that it simplifies the use of loops and vectorized programming (implicit loops) because the “one variable, one column – one observation, one row” structure enforces a “linearization” of the data that is more easily dealt with from a programming perspective.

Having data in a consistent format allows feeding data into visualization and modeling tools without spending time on getting the data in the right shape. Similarly, tidy dataset from different sources can be more easily merged and analyzed together.

While data in marketing is sometimes called “cheap”, research data in science often is generally very expensive, both in terms of time and money. Taking the extra step of recording and sharing data in a “tidy” format, would make data analysis in biomedical research and clinical trials more effective and potentially more productive.

In a follow-up post, I will cover the practical application of some of the R tools developed to work with tidy data using an example most experimental biologists are familiar with: statistical analysis and visualization of quantitative PCR.

# Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns.

Such data violates rule #2 of Hadley Wickam’s definition of tidy data. Not all observations are in separate rows. In order to work with this data set, we generally want to “unnest” those measurements to make each a separate row.

### The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

```library(stringr)
# split "value" into lists
values <- str_split(data\$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy\$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))
```

### The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.

```library(tidyr)
# use dplyr/magrittr style piping
data_tidy2 <- data %>%
# split "value" into lists
transform(value = str_split(value, ";")) %>%
# unnest magic
unnest(value)
```

Much tidier! Just split and “unnest”. On top of that, “unnest” nicely fits into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”)

```all.equal(data_tidy, data_tidy2)
```

The results of both methods are equivalent. Your choice, I have made mine!

### Synonymous fission yeast gene names

This is how the workflow would look like using an ID mapping file of synonymous gene names of the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “Pombase” website.

```# read data