Previously in this series on tidy data: Taking up the cudgels for tidy data
One of the most challenging aspects of working with data is how easy it is to get lost. Even if the data sets are small. Multiple levels of hierarchy and grouping quickly confuse our human brains (at least mine). Recording such data in two dimensional spreadsheets naturally leads to blurring of the distinction between observation and variable. Such data requires constant reformatting and its structure may not be intuitive to your fellow researcher.
Here are the two main rules about tidy data as defined in Hadley Wickham’s paper:
- Each variable forms a column
- Each observation forms a row
A variable is an “attribute” of a given data point that describes the conditions when it was taken. Variables often are categorical (but they don’t need to be). For example, the gene tested or the genotype associated with a given measurement would be a categorical variables.
An observation is a measurement associated with an arbitrary number of variables. There are no measurements that are taken under identical conditions. Each observation is uniquely described by the variables and should form its own row.
Let’s look at an example of a typical recording of quantitative PCR data in Excel.
We have measurements for three different genotypes (“control”, “mutant1”, “mutant2”) from three separate experiments (“exp1”, “exp2”, “exp3”) with three replicates each (“rep1”, “rep2”, “rep3”).
Looking at the columns, we see that information on “experiment” and “replicate” are stored in the names of the columns rather than the entries of the columns. This will have to be changed.
There clearly are multiple measurements per row. More precisely, it looks like we have a set of nine measurements for each genotype. But this is not entirely true. Experiments are considered statistically independent as they are typically performed at different times and with different cells. They capture the full biological variability and we call them “biological replicates”. The repeated measurements done in each experiment are not statistically independent because they come from the same sample preparation and thus only capture sources of variance that originate from sample handling or instrumentation. We call them “technical replicates”. Technical replicates cannot be used for statistical inference that requires “statistical independence”, such as a t-test. As you can see, we have an implicit hierarchy in our data that is not expressed in the structure of the data representation shown above.
We will untangle all those complications one by one using R tools developed by Hadley Wickham and others to represent the same data in a tidy format suitable for statistical analysis and visualization. For details about how the code works, please consult the many excellent tutorials on dplyr, tidyr, ggplot2, and broom.
messy <- read.csv("qpcr_messy.csv", row.names = 1)
This is the original data read into R. Let’s get started.
Row names should form their own column
The “genotype” information is recorded as row names. “Genotype” clearly is a variable, so we should make “genotype” a full column.
tidy <- data.frame(messy) %>% # make row names a column mutate(genotype = rownames(messy))
Next, we need to think about what are our variables. We have already identified “genotype” but what are the other ones? The way we do this is to ask ourselves what kind of information we would need to uniquely describe each observation. The experiment and replicate number are essential to differentiate each quantitative PCR measurement, so we need to create separate columns for “experiment” and “replicate”. We will do this in two steps. First we use “gather” to convert tabular data from wide to long format (we could have also used the more general “melt” function from the “reshape2” package). The former column names (e.g. “exp1_rep1”) are saved into a temporary column called “sample”. As this column contains information about two variables (“experiment” and “replicate”), we need to separate it into two columns to conform with the “each variable forms a column” rule. To do this, we use “separate” to split “sample” into the two columns “experiment” and “replicate”.
tidy <- tidy %>% # make each row a single measurement gather(key = sample, value = measurement, -genotype) %>% # make each column a single variable separate(col = sample, into = c("experiment", "replicate"), sep = "_")
Untangling implicit Domain specific hierarchies
Remember what we said before about the two different kind of replicates. Only data from biological replicates (“experiments”) are considered statistically independent samples, while technical replicates (“replicate”) are not. One common approach is to average the technical replicates (“replicate”) before any statistical test is applied. With tidy data, this is simple.
data <- tidy %>% # calculate mean of technical replicates by genotype and experiment group_by(genotype, experiment) %>% summarise(measurement = mean(measurement)) %>% ungroup()
Having each variable as its own column makes the application of the same operation onto different groups straightforward. In our case, we calculate the mean of technical replicates for each genotype and experiment combination.
Now, the data is ready for analysis.
TIdy Statistical analysis of quantitative pcr data
The scientific rational for a quantitative PCR experiment is to find out whether the number of transcripts for a given gene is different between two or more conditions. We have measurements for one transcript in three distinct genotypes (“control”, “mutant1”, “mutant2”). Biological replicates are considered independent and measurements are assumed to be normally distributed around a “true” mean value. A t-test would be an appropriate choice for the comparison of two genotypes. In this case, we have three genotypes, so we will use one-way anova followed by Tukey’s post-hoc test.
mod <- data %>% # set "control" as reference mutate(genotype = relevel(factor(genotype), ref = "control")) %>% # one-way anova and Tukey's post hoc test do(tidy(TukeyHSD(aov(measurement ~ genotype, data = .))))
We generally want to compare the effect of a genetic mutation to a “control” condition. We therefore set the reference of “genotype” to “control”.
Using base R statistics functions like “aov” and “TukeyHSD” in a tidy data analysis workflow can pose problems because they were not created with the idea of “dplyr”-style piping (“%>%”) in mind. Piping requires that the input and output of each function is a data frame and that the input is the first argument of the function. The “aov” function neither takes the input data frame as its first argument, nor does it return a data frame but a specialized “aov” object. To add insult to injury, the “TukeyHSD” function only works with such a specialized “aov” object as input.
In situations like this, the “do” function comes in handy. Within the “do” function, the input of the previous line is accessible through the dot character, so we can use an arbitrary function within “do” and just refer to the input data at the appropriate place with “.”. As a final clean-up, the “tidy” function from the “broom” package makes sure that the output of the line is a data frame.
tidy Visualization of quantitative PCR data
We are dealing with few replicates, three in our case, so a bar graph is not the most efficient representation of our data. Plotting the individual data points and the confidence intervals gives us more information using less ink. We will use the “ggplot2” package because it is designed to work with data in the tidy format.
# genotype will be on the x-axis, measurements on the y-axis ggplot(data, aes(x = genotype, y = measurement, col = experiment)) + # plot the mean of each genotype as a cross stat_summary(fun.y = "mean", geom = "point", color = "black", shape = 3, size = 5) + # plot the 95% confidence interval for each genotype stat_summary(fun.data = "mean_cl_normal", geom = "errorbar", color = "black", width = 0.1) + # we we add the averaged measurements for each experiment geom_point(shape = 16, size = 5) + theme_classic()
We can see why the first mutant is different from the “control” sample and the second is not. More replicates would be needed to test whether the small difference in means between “control” and “mutant2” is a true difference or not.
What I have shown here is just the tip of the iceberg. There are many more tools and functions to discover. The more data analysis you do, the more you will realize how important it is not to waste time formatting and reformatting the the data for each step of the analysis. Learning about how to tidy up your data is an important step towards that goal.
The R code can be found on Github.