tidyr::nest

Sep 30, 2018 in R
favouritethings rstats

Originally posted 2018-05-02 to Blogger.

When working with dplyr, I often want to group_by() some columns, apply a function to the subtables defined by that grouping, and then dissolve the grouping again. The function applied to the subtables may return:

a single row for each group - in which case you’d use dplyr::summarise() or summarize();
a row for each row in the subtable - in which case there may be an appropriate window function;
any number of rows for each subtable.

In the last case, you can use tidyr::nest().
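For contrast, here is a minimal sketch of the first two cases. The `toy` table and its `group` and `value` columns are made up for illustration and are not the data used in the rest of this post.

```r
library(dplyr)

# Hypothetical toy data, only for illustrating the first two cases
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  value = c(3, 1, 5, 2, 4)
)

# Case 1: the function returns a single row per group
per_group <- toy %>%
  group_by(group) %>%
  summarise(mean_value = mean(value))

# Case 2: the function returns one row per input row (a window function)
per_row <- toy %>%
  group_by(group) %>%
  mutate(rank_in_group = min_rank(desc(value))) %>%
  ungroup()
```

Neither approach helps when the number of rows returned per group is arbitrary, which is the case the rest of this post deals with.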

Suppose I’ve got a data-frame with gene-expression data for three different tissues. The columns are tissue, gene_id and expression. For each tissue, I want to find the top-10 most highly expressed genes (according to the value of expression).

suppressPackageStartupMessages(
  library(tidyverse)
)
set.seed(1)

df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)

The following function would extract the top 10 genes for a given tissue if there were only one tissue in the data-frame:

get_top10 <- function(.df){
  .df %>% 
    dplyr::arrange(
      dplyr::desc(expression)
    ) %>%
    head(10)
}

I could split the data-frame into a list based on the tissue values, order the subtables by expression, take the head of those subtables and join them back up. Something like:

df %>% 
  split(f = df$tissue) %>% 
  map_df(get_top10)

## # A tibble: 30 x 3
##    tissue gene_id expression
##    <chr>  <chr>        <dbl>
##  1 A      d            1.60 
##  2 A      k            1.51 
##  3 A      o            1.12 
##  4 A      r            0.944
##  5 A      u            0.919
##  6 A      s            0.821
##  7 A      v            0.782
##  8 A      h            0.738
##  9 A      y            0.620
## 10 A      t            0.594
## # ... with 20 more rows

Which works perfectly fine.
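To make the intermediate step concrete, here is a sketch of the same idea with the list kept in a variable. The names `pieces` and `top10` are mine, and I have swapped map_df() for lapply() plus dplyr::bind_rows(), which I'd expect to give the same rows.

```r
library(dplyr)

set.seed(1)
df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)

get_top10 <- function(.df) {
  .df %>%
    arrange(desc(expression)) %>%
    head(10)
}

# split() returns a named list with one subtable per tissue
pieces <- split(df, df$tissue)
names(pieces)

# Apply the function to each subtable, then stack the results
top10 <- bind_rows(lapply(pieces, get_top10))
```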

Or I could use nest():

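df %>%
  group_by(tissue) %>%
  nest() %>%
  mutate(data = map(data, get_top10)) %>%
  # tidyr >= 1.0.0 expects the column to unnest to be named explicitly
  unnest(cols = data) %>%
  # drop the grouping that nest() carries over from group_by()
  ungroup()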

## # A tibble: 30 x 3
##    tissue gene_id expression
##    <chr>  <chr>        <dbl>
##  1 A      d            1.60 
##  2 A      k            1.51 
##  3 A      o            1.12 
##  4 A      r            0.944
##  5 A      u            0.919
##  6 A      s            0.821
##  7 A      v            0.782
##  8 A      h            0.738
##  9 A      y            0.620
## 10 A      t            0.594
## # ... with 20 more rows
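It can help to look at what nest() hands to map() before anything is unnested. This is a small sketch on the same data; the variable name `nested` is mine.

```r
library(dplyr)
library(tidyr)

set.seed(1)
df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)

nested <- df %>%
  group_by(tissue) %>%
  nest()

# One row per tissue; `data` is a list-column holding one subtable each
nrow(nested)
is.list(nested$data)

# Each subtable keeps every column except the grouping column
names(nested$data[[1]])
dim(nested$data[[1]])
```

Because each element of `data` is an ordinary tibble, any function that takes a data-frame and returns a data-frame can be mapped over it.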

The value of nest() over split() is that you can group_by() several columns at once, and you don’t have to do that ugly df$my_column stuff.
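As a sketch of the multi-column case, suppose the data also had a `condition` column. That column and the `df2` table are hypothetical and not part of the example above, and the unnest(cols = data) call assumes tidyr 1.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
# Hypothetical extension of the example data: `condition` is made up
df2 <- tibble(
  tissue = rep(LETTERS[1:3], each = 52),
  condition = rep(rep(c("control", "treated"), each = 26), 3),
  gene_id = rep(letters, 6),
  expression = rnorm(156)
)

get_top10 <- function(.df) {
  .df %>%
    arrange(desc(expression)) %>%
    head(10)
}

# Top 10 genes for every tissue/condition combination
top10_by_group <- df2 %>%
  group_by(tissue, condition) %>%
  nest() %>%
  mutate(data = map(data, get_top10)) %>%
  unnest(cols = data) %>%
  ungroup()
```

Only the group_by() call changes; the function being mapped and the rest of the pipeline stay the same.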