Something I came across today was that it is not directly possible to use
tidyr::spread()
on multiple columns.
In short, you need to gather
everything together first, before you can
then spread.
Let’s do a quick walk-through to clarify:
library("tidyverse")
data(mtcars)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
For this example we assume that we want to group the data by cyl
and gear
and then calculate the mean and standard deviation within the group.
mt_summary <- mtcars %>%
group_by(cyl, gear) %>%
summarise(mean_hp = mean(hp), sd_hp = sd(hp))
mt_summary
## # A tibble: 8 x 4
## # Groups: cyl [3]
## cyl gear mean_hp sd_hp
## <dbl> <dbl> <dbl> <dbl>
## 1 4 3 97 NA
## 2 4 4 76 20.1
## 3 4 5 102 15.6
## 4 6 3 108. 3.54
## 5 6 4 116. 7.51
## 6 6 5 175 NA
## 7 8 3 194. 33.4
## 8 8 5 300. 50.2
Now let’s also assume that we want to transform the data that we have a
column for mean sd for each level of gear.
This sounds like a job for tidyr::spread()
, but as mentioned it can
only handle a single key/value pair, so we could spread either mean or sd.
Fortunately there is a workaround. First we need to gather the mean and sd columns:
mt_transformed <- gather(mt_summary, var, value, -c(cyl, gear))
mt_transformed
## # A tibble: 16 x 4
## # Groups: cyl [3]
## cyl gear var value
## <dbl> <dbl> <chr> <dbl>
## 1 4 3 mean_hp 97
## 2 4 4 mean_hp 76
## 3 4 5 mean_hp 102
## 4 6 3 mean_hp 108.
## 5 6 4 mean_hp 116.
## 6 6 5 mean_hp 175
## 7 8 3 mean_hp 194.
## 8 8 5 mean_hp 300.
## 9 4 3 sd_hp NA
## 10 4 4 sd_hp 20.1
## 11 4 5 sd_hp 15.6
## 12 6 3 sd_hp 3.54
## 13 6 4 sd_hp 7.51
## 14 6 5 sd_hp NA
## 15 8 3 sd_hp 33.4
## 16 8 5 sd_hp 50.2
Now we need to create our column names for each group:
mt_transformed <- unite(mt_transformed, tempcol, var, gear)
mt_transformed
## # A tibble: 16 x 3
## # Groups: cyl [3]
## cyl tempcol value
## <dbl> <chr> <dbl>
## 1 4 mean_hp_3 97
## 2 4 mean_hp_4 76
## 3 4 mean_hp_5 102
## 4 6 mean_hp_3 108.
## 5 6 mean_hp_4 116.
## 6 6 mean_hp_5 175
## 7 8 mean_hp_3 194.
## 8 8 mean_hp_5 300.
## 9 4 sd_hp_3 NA
## 10 4 sd_hp_4 20.1
## 11 4 sd_hp_5 15.6
## 12 6 sd_hp_3 3.54
## 13 6 sd_hp_4 7.51
## 14 6 sd_hp_5 NA
## 15 8 sd_hp_3 33.4
## 16 8 sd_hp_5 50.2
And from here it is pretty straightforward:
mt_transformed <- spread(mt_transformed, tempcol, value)
mt_transformed
## # A tibble: 3 x 7
## # Groups: cyl [3]
## cyl mean_hp_3 mean_hp_4 mean_hp_5 sd_hp_3 sd_hp_4 sd_hp_5
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4 97 76 102 NA 20.1 15.6
## 2 6 108. 116. 175 3.54 7.51 NA
## 3 8 194. NA 300. 33.4 NA 50.2
Of course this can be chained using pipes, but to illustrate this workaround I did this step by step.
Future versions of tidyr will include pivot
functions which appear
to be more flexible than the gather
and spread
pair that is currently
available.
tidyr::spread
on multiple columns.