Statistics, Science, Random Ramblings

A blog about data and other interesting things

Use tidyr::spread on mulitple columns

Posted at — Aug 12, 2019

Something I came across today was that it is not directly possible to use tidyr::spread() on multiple columns.

In short, you need to gather everything together first, before you can then spread. Let’s do a quick walk-through to clarify:

library("tidyverse")
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

For this example we assume that we want to group the data by cyl and gear and then calculate the mean and standard deviation within the group.

mt_summary <- mtcars %>% 
    group_by(cyl, gear) %>% 
    summarise(mean_hp = mean(hp), sd_hp = sd(hp))
mt_summary
## # A tibble: 8 x 4
## # Groups:   cyl [3]
##     cyl  gear mean_hp sd_hp
##   <dbl> <dbl>   <dbl> <dbl>
## 1     4     3     97  NA   
## 2     4     4     76  20.1 
## 3     4     5    102  15.6 
## 4     6     3    108.  3.54
## 5     6     4    116.  7.51
## 6     6     5    175  NA   
## 7     8     3    194. 33.4 
## 8     8     5    300. 50.2

Now let’s also assume that we want to transform the data that we have a column for mean sd for each level of gear. This sounds like a job for tidyr::spread(), but as mentioned it can only handle a single key/value pair, so we could spread either mean or sd.

Fortunately there is a workaround. First we need to gather the mean and sd columns:

mt_transformed <- gather(mt_summary, var, value, -c(cyl, gear))
mt_transformed
## # A tibble: 16 x 4
## # Groups:   cyl [3]
##      cyl  gear var      value
##    <dbl> <dbl> <chr>    <dbl>
##  1     4     3 mean_hp  97   
##  2     4     4 mean_hp  76   
##  3     4     5 mean_hp 102   
##  4     6     3 mean_hp 108.  
##  5     6     4 mean_hp 116.  
##  6     6     5 mean_hp 175   
##  7     8     3 mean_hp 194.  
##  8     8     5 mean_hp 300.  
##  9     4     3 sd_hp    NA   
## 10     4     4 sd_hp    20.1 
## 11     4     5 sd_hp    15.6 
## 12     6     3 sd_hp     3.54
## 13     6     4 sd_hp     7.51
## 14     6     5 sd_hp    NA   
## 15     8     3 sd_hp    33.4 
## 16     8     5 sd_hp    50.2

Now we need to create our column names for each group:

mt_transformed <- unite(mt_transformed, tempcol, var, gear)
mt_transformed
## # A tibble: 16 x 3
## # Groups:   cyl [3]
##      cyl tempcol    value
##    <dbl> <chr>      <dbl>
##  1     4 mean_hp_3  97   
##  2     4 mean_hp_4  76   
##  3     4 mean_hp_5 102   
##  4     6 mean_hp_3 108.  
##  5     6 mean_hp_4 116.  
##  6     6 mean_hp_5 175   
##  7     8 mean_hp_3 194.  
##  8     8 mean_hp_5 300.  
##  9     4 sd_hp_3    NA   
## 10     4 sd_hp_4    20.1 
## 11     4 sd_hp_5    15.6 
## 12     6 sd_hp_3     3.54
## 13     6 sd_hp_4     7.51
## 14     6 sd_hp_5    NA   
## 15     8 sd_hp_3    33.4 
## 16     8 sd_hp_5    50.2

And from here it is pretty straightforward:

mt_transformed <- spread(mt_transformed, tempcol, value)
mt_transformed
## # A tibble: 3 x 7
## # Groups:   cyl [3]
##     cyl mean_hp_3 mean_hp_4 mean_hp_5 sd_hp_3 sd_hp_4 sd_hp_5
##   <dbl>     <dbl>     <dbl>     <dbl>   <dbl>   <dbl>   <dbl>
## 1     4       97        76       102    NA      20.1     15.6
## 2     6      108.      116.      175     3.54    7.51    NA  
## 3     8      194.       NA       300.   33.4    NA       50.2

Of course this can be chained using pipes, but to illustrate this workaround I did this step by step.

Future versions of tidyr will include pivot functions which appear to be more flexible than the gather and spread pair that is currently available.

Concluding remarks

  • You can not use tidyr::spread on multiple columns.
  • However, using a workaround you can still achieve the same goal.
  • Future versions of tidyr will include more flexible pivot functions.