Tidy unnesting

At least once a week, I have to work with a data set that has “nested” measurements in one of its columns.

[Figure: data_untidy — a data set with several “;”-separated measurements nested in the “value” column]

Such data violates rule #2 of Hadley Wickham’s definition of tidy data: not all observations are in separate rows. In order to work with this data set, we generally want to “unnest” those measurements to make each a separate row.

[Figure: data_tidy — the same data after unnesting, one measurement per row]
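
The screenshots above stand in for the actual data frame, so “data” is never defined in the code below. A minimal stand-in with made-up values (not the original data) could look like this:

# hypothetical stand-in for the untidy data shown above:
# several measurements per row, separated by ";" in the "value" column
data <- data.frame(
  sample = c("A", "B", "C"),
  value = c("1;2;3", "4;5", "6"),
  stringsAsFactors = FALSE
)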

The “untidy” way

Ordinarily, I would convert the untidy data to tidy data using a cumbersome sequence of messy commands that involve splitting the entries in “value” into lists, then counting the elements of each list, and replicating each row of the data frame accordingly.

library(stringr)
# split "value" into lists
values <- str_split(data$value, ";")
# count number of elements for each list
n <- sapply(values, length)
# replicate rownames of data based on elements for each list
row_rep <- unlist(mapply(rep, rownames(data), n))
# replicate rows of original data
data_tidy <- data[row_rep, ]
# replace nested measurements with unnested measurements
data_tidy$value <- unlist(values)
# reformat row names
rownames(data_tidy) <- seq(nrow(data_tidy))

The “tidy” way

Is it really necessary to go through all those (untidy!) steps to tidy up that data set? It turns out, it is not. In comes the “unnest” function in the “tidyr” package.

library(tidyr)
# use dplyr/magrittr style piping
data_tidy2 <- data %>% 
  # split "value" into lists
  transform(value = str_split(value, ";")) %>%
  # unnest magic
  unnest(value)

Much tidier! Just split and “unnest”. On top of that, “unnest” nicely fits into a “dplyr”-style data processing workflow using “magrittr” piping (“%>%”).

all.equal(data_tidy, data_tidy2)

The results of both methods are equivalent. Your choice, I have made mine!
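
As an aside, newer versions of “tidyr” also provide “separate_rows”, which collapses the split-and-unnest into a single step. This is not part of the workflow above, just an alternative worth knowing about:

# split "value" on ";" and unnest in one step
data_tidy3 <- data %>% separate_rows(value, sep = ";")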

Synonymous fission yeast gene names

This is what the workflow would look like using an ID mapping file of synonymous gene names of the fission yeast Schizosaccharomyces pombe. The file can be obtained from the “PomBase” website.

# read data
raw <- read.delim("sysID2product.tsv", header = FALSE, stringsAsFactors = FALSE)
names(raw) <- c("orf", "symbol", "synonyms", "protein")
# unnest
data <- raw %>% 
  transform(synonyms = str_split(synonyms, ",")) %>% 
  unnest(synonyms)

Tidy code is almost as much of a blessing as tidy data.


Reproducibility

The full R code is available on Github.


PCA – Part 5: Eigenpets

In this post scriptum to my series on Principal Component Analysis (PCA) I will show how PCA can be applied to image analysis. Given a number of images of faces, PCA decomposes those images into “eigenfaces” that are the basis of some facial recognition algorithms. Eigenfaces are the eigenvectors of the image data matrix. I have shown in Part 3 of this series that the eigenvectors that capture a given amount of variance of the data can be used to obtain an approximation of the original data using fewer dimensions. Likewise, we can approximate the image of a human face by a weighted combination of eigenfaces.

From images to matrices

A digital image is just a matrix of numbers. Fair game for PCA. You might ask yourself, though, how to coerce many matrices into a single data matrix with samples in the rows and features in the columns? Just stack the columns of the image matrices to get a single long vector and then stack the image vectors to obtain the data matrix. For example, a 64 by 64 image matrix would result in a 4096-element image vector, and 100 such image vectors would be stacked into a 100 by 4096 data matrix.
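
As a rough sketch of that stacking step (the images “img1” and “img2” are hypothetical placeholders, not part of the data set used below):

# flatten each image matrix into a vector by stacking its columns,
# then stack the image vectors as rows of the data matrix
img1 <- matrix(runif(64 * 64), nrow = 64)
img2 <- matrix(runif(64 * 64), nrow = 64)
data_matrix <- rbind(as.vector(img1), as.vector(img2))
dim(data_matrix)  # 2 samples (rows) by 4096 pixels (columns)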

The more elements a matrix has, the more computationally expensive it becomes to do the matrix factorization that yields the eigenvectors and eigenvalues. As the number of images (samples) is usually much smaller than the number of pixels (features), it is more efficient to compute the eigenvectors of the transpose of the data matrix with the pixels in the rows and the images in the columns.

Fortunately, there is an easy way to get from the eigenvectors of the covariance matrix \boldsymbol A^T \boldsymbol A to those of the covariance matrix \boldsymbol A \boldsymbol A^T. The eigenvector equation of \boldsymbol A^T \boldsymbol A is

\boldsymbol A^T \boldsymbol A \vec{v} = \lambda \vec{v}

Multiplying both sides by \boldsymbol A on the left reveals the relationship between the eigenvectors and eigenvalues of the two matrices.

\boldsymbol A \boldsymbol A^T (\boldsymbol A \vec{v}) = \lambda (\boldsymbol A \vec{v})

The nonzero eigenvalues of the covariance matrices \boldsymbol A^T \boldsymbol A and \boldsymbol A \boldsymbol A^T are identical. If \boldsymbol A is an m by n matrix with m < n (and full rank), \boldsymbol A^T \boldsymbol A has m nonzero eigenvalues and n-m zero eigenvalues.

To get from an eigenvector \vec{v} of \boldsymbol A^T \boldsymbol A to an eigenvector of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left.
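
A quick numerical check of this relationship on a small random matrix (not the image data) confirms it:

# small random example: m = 3 samples, n = 4 features
set.seed(1)
A <- matrix(rnorm(12), nrow = 3)
e_small <- eigen(t(A) %*% A)   # n by n
e_big <- eigen(A %*% t(A))     # m by m
# the three nonzero eigenvalues agree; t(A) %*% A has one additional zero eigenvalue
round(e_small$values, 6)
round(e_big$values, 6)
# A %*% v is an eigenvector of A %*% t(A) with the same eigenvalue
v <- e_small$vectors[, 1]
(A %*% t(A)) %*% (A %*% v) / (A %*% v)   # every element equals the first eigenvalue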

What does an “eigenpet” look like?

For this demonstration, I will not be using images of human faces but, in line with the predominant interests of the internet, faces of cats and dogs. I obtained this data set when I took the course “Computational Methods for Data Analysis” on Coursera and converted the original data to text files to make them more accessible to R. The data can be found on Github.

library(RCurl)
# read data from Github
cats <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/cat.csv"), sep = ",")
dogs <- read.table(text = getURL("https://raw.githubusercontent.com/bioramble/pca/master/dog.csv"), sep = ",")
# combine cats and dogs into single data frame
pets <- cbind(cats, dogs)

The data matrix already is in a convenient format. Each 64 by 64 pixel image has been converted into a 4096 pixel vector, and the 160 image vectors are stored as columns, giving a data matrix with dimensions 4096 by 160. As you might have guessed, there are 80 cats and 80 dogs in the data set.
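
A quick check confirms those dimensions:

# 4096 pixels in rows, 160 images (80 cats followed by 80 dogs) in columns
dim(pets)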

The distinction between what is a sample and what is a feature becomes a little blurred in this analysis. Technically, the 160 images are the samples and the 4096 pixels are the features, so we should do PCA on a 160 by 4096 matrix. However, as discussed above it is more convenient to operate with the transpose of this matrix, which is why image data usually is prepared in its transposed form with features in rows and samples in columns.

As we have seen theoretically in Part 2 of this series, it is necessary to center (and scale) the features before performing PCA. We are dealing with 8-bit greyscale pixel values that are naturally bounded between 0 and 255, so they already are on the same scale and do not need further scaling. Moreover, all features measure the same quantity (brightness), so it is customary to center the data by subtracting the “average” image from the data.

# compute "average" pet
pet0 <- rowMeans(pets)
cat0 <- rowMeans(pets[, 1:80])
dog0 <- rowMeans(pets[, 81:160])

So what do the average pet, the average cat, and the average dog look like?

# create grey scale color map
greys <- gray.colors(256, start = 0, end = 1)
# convenience function to plot pets
show_image <- function(v, n = 64, col = greys) {
    # transform image vector back to matrix
    m <- matrix(v, ncol = n, byrow = TRUE)
    # invert columns to obtain right orientation
    # plot using "image"
    image(m[, nrow(m):1], col = col, axes = FALSE)
}
# plot average pets
for (i in list(pet0, cat0, dog0)) {
    show_image(i)
}
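
(If you want the three averages next to each other rather than in successive plots, you can set a 1-by-3 layout before running the loop above; this is just standard base graphics, not something the “show_image” helper does for you.)

# optional: run before the plotting loop to arrange the averages side by side
par(mfrow = c(1, 3), mar = rep(0.5, 4))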

[Figure: pca_part5_fig1 — the average pet, cat, and dog]

As expected, the average pet has features of both cats and dogs, while the average cat and dog are quite recognizable as members of their respective species.

Let’s run PCA on the data set and see what an “eigenpet” looks like.

# subtract average pet
pets0 <- pets - pet0
# run pca
pc <- prcomp(pets0, center = FALSE, scale. = FALSE)

Ordinarily, we would find the eigenvector matrix in the “rotation” object returned to us by “prcomp”. Remember that we did PCA on the transpose of the data matrix, so we have to do some additional work to get to the eigenvectors of the original data. I have shown above that to get from the eigenvectors of \boldsymbol A^T \boldsymbol A to the eigenvectors of \boldsymbol A \boldsymbol A^T, we just need to multiply by \boldsymbol A on the left. \boldsymbol A, in this case, is our centered data matrix “pets0”.

Note that these unscaled eigenfaces (the eigenvectors of the original data matrix) are equivalent to the projection of the data onto the eigenvectors of the transposed data matrix. Therefore, we do not have to compute them explicitly but can use the “pc$x” object.

# obtain unscaled eigenfaces (eigenvectors of the original data matrix)
u_unscaled <- as.matrix(pets0) %*% pc$rotation
# this turns out to be the same as the projection of the data 
# stored in "pc$x"
all.equal(u_unscaled, pc$x)

Both “u_unscaled” and “pc$x” are unscaled versions of the eigenvectors, which means that they are not unitary matrices. For plotting, this does not matter because the images will be scaled automatically. If scaled eigenvectors are important, it is more convenient to use singular value decomposition and take the left singular vectors stored in the matrix “u”.

# singular value decomposition of data
sv <- svd(pets0)
# left singular vectors are stored in matrix "u"
u <- sv$u
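
A quick sanity check (up to possible sign flips of individual components) shows that the scaled eigenvectors are simply the unscaled ones divided by the corresponding singular values:

# columns of pc$x divided by the singular values should match the columns of u
all.equal(abs(sweep(pc$x, 2, sv$d, "/")), abs(u), check.attributes = FALSE)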

Let’s look at the first couple of “eigenpets” (eigenfaces).

# display first 6 eigenfaces
for (i in seq(6)) {
    show_image(pc$x[, i])
}

[Figure: pca_part5_fig2 — the first six eigenfaces]

Some eigenfaces are definitely more cat-like and others more dog-like. We also see a common issue with eigenfaces. The directions of highest variance are usually associated with the lighting conditions of the original images. Whether they are predominantly light or dark dominates the first couple of components. In this particular case, it appears that the images of cats have stronger contrasts between foreground and background. If we were to do face recognition, we would either preprocess the images to have comparable lighting conditions or exclude the first couple of eigenfaces.

Reconstruction of Pets using “Eigenpets”

Using the eigenfaces we can approximate or completely reconstruct the original images. Let’s see what this looks like with a couple of different variance cut-offs.

# a number of variance cut-offs
vars <- c(0.2, 0.5, 0.8, 0.9, 0.98, 1)
# calculate cumulative explained variance
var_expl <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
# get the number of components that explain a given amount of variance
npc <- sapply(vars, function(v) which(var_expl >= v)[1])
# reconstruct four cats and four dogs
for (i in seq(79, 82)) {
    for (j in npc) {
        # project data using "j" principal components
        r <- pc$x[, 1:j] %*% t(pc$rotation[, 1:j])
        show_image(r[, i])
        text(x = 0.01, y = 0.05, pos = 4, labels = paste("v =", round(var_expl[j], 2)))
        text(x = 0.99, y = 0.05, pos = 2, labels = paste("pc =", j))
    }
}
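
Note that “r” approximates the centered data “pets0”. Because “image” rescales pixel values for display, the plots look fine as they are, but if you want a reconstruction on the original pixel scale you can add the average pet back, for example:

# optional: full reconstruction on the original pixel scale
# (add the average pet back to the approximation of the centered data)
r_full <- pc$x %*% t(pc$rotation) + pet0
show_image(r_full[, 82])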

[Figure: pca_part5_fig3 — reconstructions of four pets at increasing variance cut-offs]

Every pet starts out looking like a cat due to the dominance of lighting conditions. Somewhere between 80% and 90% of captured variance, every pet is clearly identifiable as either a cat or a dog. Even at 90% of captured variance, we use less than one third of all components for our approximation of the original image. This may be enough for a machine learning algorithm to tell apart a cat from a dog with high confidence. We see, however, that a lot of the detail of the image is contained in the last two thirds of the components.


Reproducibility

The full R code is available on Github.


Further Reading

Scholarpedia – Eigenfaces

Jeff Jauregui – Principal Component Analysis with Linear Algebra


PCA SERIES

Part 1: An Intuition

Part 2: A Look Behind The Curtain

Part 3: In the Trenches

Part 4: Potential Pitfalls

Part 5: Eigenpets
