# Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns.

Such data violates rule #2 of Hadley Wickham’s definition of tidy data (each observation forms a row): not all observations are in separate rows. To work with such a data set, we generally want to “unnest” those measurements so that each one sits in its own row.
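To make this concrete, here is a minimal sketch of such a data set. The values are made up, and the column names `id` and `value` as well as the `;` separator are assumptions chosen to line up with the code further down, not the actual data I work with.

```r
# Toy data set (made-up values): each entry in "value" packs several
# measurements into one string, separated by ";"
data <- data.frame(
  id    = c("a", "b", "c"),
  value = c("1;2;3", "4", "5;6"),
  stringsAsFactors = FALSE  # keep "value" as character, also on older R versions
)

# three rows, but six measurements hidden inside "value"
print(data)
```

Once unnested, this toy data set should have six rows, one per measurement.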

### The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

```r
library(stringr)
# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
# (SIMPLIFY = FALSE keeps the result a list even if all counts are equal)
row_rep <- unlist(mapply(rep, rownames(data), n, SIMPLIFY = FALSE))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))
```

### The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out it is not. Enter the “unnest” function from the “tidyr” package.

```r
library(tidyr)
library(stringr)  # provides str_split()
# use dplyr/magrittr style piping
data_tidy2 <- data %>%
  # split "value" into lists
  transform(value = str_split(value, ";")) %>%
  # unnest magic
  unnest(value)
```

Much tidier! Just split and “unnest”. On top of that, “unnest” fits nicely into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”).

```r
# compare the results of both approaches
all.equal(data_tidy, data_tidy2)
```

The results of both methods are equivalent. The choice is yours; I have made mine!
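To illustrate how “unnest” slots into a longer “dplyr” pipeline, here is a small self-contained sketch. The toy data and the names `toy`, `toy_tidy` and `toy_summary` are made up for illustration, and I use “dplyr”'s `mutate` for the split step here instead of `transform`; treat it as one possible arrangement rather than the definitive one.

```r
library(dplyr)
library(tidyr)
library(stringr)

# toy data with nested measurements (made-up values)
toy <- tibble(
  id    = c("a", "b"),
  value = c("1;2;3", "4")
)

toy_tidy <- toy %>%
  # split "value" into a list column
  mutate(value = str_split(value, ";")) %>%
  # one row per measurement
  unnest(value) %>%
  # the split values are still character, so convert them
  mutate(value = as.numeric(value))

# downstream steps now work per measurement, e.g. a summary per id
toy_summary <- toy_tidy %>%
  group_by(id) %>%
  summarise(n = n(), total = sum(value))
```

Because every measurement has its own row after `unnest`, grouping and summarising need no special handling of the separator.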

### Synonymous fission yeast gene names

This is what the workflow looks like with an ID mapping file of synonymous gene names for the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “PomBase” website.

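```r
# read data (assumes `raw` already holds the ID mapping file
# as a data frame with four columns)
names(raw) <- c("orf", "symbol", "synonyms", "protein")
# unnest: one row per synonym
data <- raw %>%
  transform(synonyms = str_split(synonyms, ",")) %>%
  unnest(synonyms)
```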
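One thing the unnested table makes easy is looking up a gene by any of its synonyms. The sketch below uses entirely made-up identifiers (`ORF0001`, `abc1`, `bar3` and so on are placeholders, not real fission yeast genes), and it again uses `mutate` rather than `transform` for the split step.

```r
library(dplyr)
library(tidyr)
library(stringr)

# stand-in for the mapping file, with placeholder identifiers
raw_toy <- tibble(
  orf      = c("ORF0001", "ORF0002"),
  symbol   = c("abc1", "xyz2"),
  synonyms = c("foo1,bar3", "baz7"),
  protein  = c("P00001", "P00002")
)

# one row per synonym
genes <- raw_toy %>%
  mutate(synonyms = str_split(synonyms, ",")) %>%
  unnest(synonyms)

# look up the gene behind a given synonym with a plain filter
hit <- genes %>%
  filter(synonyms == "bar3") %>%
  select(orf, symbol)
```

With the nested version, the same lookup would need pattern matching against the comma-separated string, which is easy to get subtly wrong when one synonym is a substring of another.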

Tidy code is almost as much of a blessing as tidy data.

### Reproducibility

The full R code is available on GitHub.