Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns.

data_untidySuch data violates rule #2 of Hadley Wickam’s definition of tidy data. Not all observations are in separate rows. In order to work with this data set, we generally want to “unnest” those measurements to make each a separate row.

data_tidy

The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

library(stringr)
# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))

The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.

library(tidyr)
# use dplyr/magrittr style piping
data_tidy2 <- data %>% 
  # split "value" into lists
  transform(value = str_split(value, ";")) %>%
  # unnest magic
  unnest(value)

Much tidier! Just split and “unnest”. On top of that, “unnest” nicely fits into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”)

all.equal(data_tidy, data_tidy2)

The results of both methods are equivalent. Your choice, I have made mine!

Synonymous fission yeast gene names

This is how the workflow would look like using an ID mapping file of synonymous gene names of the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “Pombase” website.

# read data
raw <- read.delim("sysID2product.tsv", header = FALSE, stringsAsFactors = FALSE)
names(raw) <- c("orf", "symbol", "synonyms", "protein")
# unnest
data <- raw %>% 
  transform(synonyms = str_split(synonyms, ",")) %>% 
  unnest(synonyms)

Tidy code is almost as much of a blessing as tidy data.


Reproducibility

The full R code is available on Github.

Advertisements
Tidy unnesting

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s