At least once a week, I have to work with a data set that has “nested” measurements in one of its columns.
Such data violates rule #2 of Hadley Wickham’s definition of tidy data: not all observations are in separate rows. To work with this data set, we generally want to “unnest” those measurements so that each one gets its own row.
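To make the problem concrete, here is a minimal sketch of what such a data set might look like. The data frame and its “id” column are made up for illustration; only the “value” column name is taken from the code in this post, and I am assuming the nested measurements are separated by semicolons.

```r
# Hypothetical toy data: each "value" cell packs several measurements,
# separated by semicolons (an assumption for this sketch).
data <- data.frame(
  id = c("a", "b", "c"),
  value = c("1;2;3", "4", "5;6"),
  stringsAsFactors = FALSE
)

# Count how many measurements are hidden in each cell.
n_measurements <- lengths(strsplit(data$value, ";", fixed = TRUE))
n_measurements
# [1] 3 1 2
```

Three rows hold six observations, so a tidy version of this toy table should end up with six rows.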
The “untidy” way
Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.
library(stringr)

# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))
The “tidy” way
Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.
library(tidyr)

# use dplyr/magrittr style piping
data_tidy2 <- data %>%
  # split "value" into lists
  transform(value = str_split(value, ";")) %>%
  # unnest magic
  unnest(value)
Much tidier! Just split and “unnest”. On top of that, “unnest” fits nicely into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”).
The results of both methods are equivalent. The choice is yours; I have made mine!
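If you would like to check the equivalence yourself, the following is a rough sketch on made-up data rather than the data set from this post. The toy data frame and its “id” column are assumptions, and I have simplified both routes slightly: base “strsplit” stands in for “str_split”, and the list column is assigned directly instead of via “transform”.

```r
library(tidyr)

# Hypothetical toy data with semicolon-separated measurements.
data <- data.frame(
  id = c("a", "b", "c"),
  value = c("1;2;3", "4", "5;6"),
  stringsAsFactors = FALSE
)

# Split every cell into a character vector (one list element per row).
values <- strsplit(data$value, ";", fixed = TRUE)

# "Untidy" route: replicate each row once per measurement,
# then overwrite the column with the flattened measurements.
n <- lengths(values)
data_tidy <- data[rep(seq_len(nrow(data)), n), ]
data_tidy$value <- unlist(values)
rownames(data_tidy) <- NULL

# "Tidy" route: store the split values as a list column and unnest it.
nested <- data
nested$value <- values
data_tidy2 <- as.data.frame(unnest(nested, value))

# Compare the two results column by column.
same <- identical(data_tidy$id, data_tidy2$id) &&
  identical(data_tidy$value, data_tidy2$value)
same
```

Comparing column by column sidesteps differences in row names and in the class of the returned object, which do not matter for the point being made here.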
Synonymous fission yeast gene names
# read data
raw <- read.delim("sysID2product.tsv", header = FALSE, stringsAsFactors = FALSE)
names(raw) <- c("orf", "symbol", "synonyms", "protein")

# unnest
data <- raw %>%
  transform(synonyms = str_split(synonyms, ",")) %>%
  unnest(synonyms)
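As an aside, “tidyr” also offers “separate_rows”, which does the split and the unnest in a single call. The sketch below uses made-up placeholder rows in the same four-column layout, not real gene records, and assumes the synonyms are comma-separated as in the code above. Recent “tidyr” releases mark “separate_rows” as superseded by “separate_longer_delim”, so treat this as one possible variant rather than the recommended one.

```r
library(tidyr)

# Hypothetical placeholder rows mimicking the four-column layout;
# none of these identifiers are real gene names.
raw <- data.frame(
  orf = c("orf1", "orf2"),
  symbol = c("symA", "symB"),
  synonyms = c("syn1,syn2,syn3", "syn4"),
  protein = c("protein A", "protein B"),
  stringsAsFactors = FALSE
)

# Split "synonyms" on commas and give each synonym its own row.
data <- separate_rows(raw, synonyms, sep = ",")
data
```

The other columns are repeated for every synonym, so the first placeholder gene ends up on three rows and the second on one.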
Tidy code is almost as much of a blessing as tidy data.
The full R code is available on GitHub.