Originally posted 2018-05-02 to Blogger.
In dplyr, I often want to group_by() some columns, apply a function to the subtables defined by the grouping, and then dissolve away the grouping. The function applied to the subtables may return:
a single row for each group - in which case you’d use dplyr::summarise() or summarize();
a row for each row in the subtable - in which case there may be an appropriate window function;
any number of rows for each subtable.
In the last case, you can use tidyr::nest().
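As a rough sketch of the first two cases, here is a toy example. The table `toy` and its columns `grp` and `x` are made up for illustration and are not part of the data used in this post; `min_rank()` is just one of several window functions that could fit.

```r
library(dplyr)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  grp = c("a", "a", "b", "b", "b"),
  x   = c(3, 1, 5, 2, 4)
)

# Case 1: one row per group, so summarise() is enough
per_group <- toy %>%
  group_by(grp) %>%
  summarise(max_x = max(x))

# Case 2: one row per input row, so use a window function inside mutate()
per_row <- toy %>%
  group_by(grp) %>%
  mutate(rank_x = min_rank(desc(x))) %>%
  ungroup()
```

Here `per_group` has one row for each value of `grp`, while `per_row` keeps all five rows and adds a within-group rank.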
Suppose I’ve got a data-frame with gene-expression data for three different tissues. The columns are tissue, gene_id and expression. For each tissue, I want to find the top-10 most highly expressed genes (according to the value of expression).
suppressPackageStartupMessages(
  library(tidyverse)
)
set.seed(1)
df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)
The following function would extract the top 10 for a given tissue if there were only one tissue in the data-frame:
get_top10 <- function(.df) {
  # Sort by decreasing expression and keep the first 10 rows
  .df %>%
    dplyr::arrange(
      dplyr::desc(expression)
    ) %>%
    head(10)
}
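To see what the function does on its own, here is a minimal sketch that filters down to a single tissue first and then applies it. It rebuilds the same data and function so that it runs standalone; the name `top10_A` is just an illustrative choice.

```r
library(dplyr)

# Same simulated data as in the post
set.seed(1)
df <- tibble(
  tissue = rep(LETTERS[1:3], each = 26),
  gene_id = rep(letters, 3),
  expression = rnorm(78)
)

get_top10 <- function(.df) {
  .df %>%
    arrange(desc(expression)) %>%
    head(10)
}

# Restrict to one tissue, then take its 10 most highly expressed genes
top10_A <- df %>%
  filter(tissue == "A") %>%
  get_top10()
```

Calling `get_top10(df)` directly would instead rank all 78 rows together, ignoring tissue, which is why the data needs to be divided by tissue first.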
I could split() the data-frame into a list based on the tissue values, order the subtables by expression, take the head of each subtable and join them back up. Something like:
df %>%
  split(f = df$tissue) %>%
  map_df(get_top10)
## # A tibble: 30 x 3
## tissue gene_id expression
## <chr> <chr> <dbl>
## 1 A d 1.60
## 2 A k 1.51
## 3 A o 1.12
## 4 A r 0.944
## 5 A u 0.919
## 6 A s 0.821
## 7 A v 0.782
## 8 A h 0.738
## 9 A y 0.620
## 10 A t 0.594
## # ... with 20 more rows
That works perfectly fine.
Or, I could use nest():
df %>%
  group_by(tissue) %>%
  nest() %>%
  mutate(data = map(data, get_top10)) %>%
  # Name the list-column explicitly; newer tidyr versions warn if it is omitted
  unnest(data)
## # A tibble: 30 x 3
## tissue gene_id expression
## <chr> <chr> <dbl>
## 1 A d 1.60
## 2 A k 1.51
## 3 A o 1.12
## 4 A r 0.944
## 5 A u 0.919
## 6 A s 0.821
## 7 A v 0.782
## 8 A h 0.738
## 9 A y 0.620
## 10 A t 0.594
## # ... with 20 more rows
The value of nest() over split() is that you can group_by() multiple columns at the same time, and you don’t have to do that ugly df$my_column stuff.
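A minimal sketch of the multi-column case follows. The `condition` column, the smaller table `df2` and the helper `get_top2()` are all made up for illustration. The trailing `ungroup()` is there because, as far as I know, tidyr 1.0.0 and later keep the grouping after `nest()`; on older versions it is harmless.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
# Hypothetical data: `condition` is an invented second grouping column
df2 <- tibble(
  tissue     = rep(c("A", "B"), each = 8),
  condition  = rep(rep(c("ctrl", "treated"), each = 4), times = 2),
  gene_id    = rep(letters[1:4], times = 4),
  expression = rnorm(16)
)

# Illustrative helper: same idea as get_top10(), but keeping 2 rows
get_top2 <- function(.df) {
  .df %>%
    arrange(desc(expression)) %>%
    head(2)
}

# Group by two columns at once; no df2$column indexing needed
top2 <- df2 %>%
  group_by(tissue, condition) %>%
  nest() %>%
  mutate(data = map(data, get_top2)) %>%
  unnest(data) %>%
  ungroup()
```

Each of the four tissue/condition combinations contributes its two most highly expressed genes, so `top2` ends up with eight rows.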