Statistics, Science, Random Ramblings

A blog about data and other interesting things

If you use readr you should also use col_types

Posted at — Jul 30, 2021

Most of you are probably familiar with the tidyverse and the associated readr package for reading all kinds of rectangular, tabular data.

Consider the following example from a GTFS dataset (the July 2021 GTFS data from Open Data ÖPNV provided by the VRR):

library("readr")
calendar <- read_csv("calendar.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   service_id = col_character(),
##   monday = col_double(),
##   tuesday = col_double(),
##   wednesday = col_double(),
##   thursday = col_double(),
##   friday = col_double(),
##   saturday = col_double(),
##   sunday = col_double(),
##   start_date = col_double(),
##   end_date = col_double()
## )

This is a relatively small file with 1476 rows, so it is read quickly and readr makes the right guess about the data types in the columns.

Now, consider a much larger file.

stops <- read_csv("stops.txt")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   stop_id = col_character(),
##   stop_code = col_logical(),
##   stop_name = col_character(),
##   stop_lat = col_double(),
##   stop_lon = col_double(),
##   stop_url = col_logical(),
##   location_type = col_double(),
##   parent_station = col_character(),
##   wheelchair_boarding = col_double(),
##   platform_code = col_logical()
## )
## Warning: 100 parsing failures.
##  row           col           expected actual                                  file
## 1975 platform_code 1/0/T/F/TRUE/FALSE      2 '../../../input_data/readr/stops.txt'
## 2009 platform_code 1/0/T/F/TRUE/FALSE      4 '../../../input_data/readr/stops.txt'
## 2012 platform_code 1/0/T/F/TRUE/FALSE      3 '../../../input_data/readr/stops.txt'
## 2013 platform_code 1/0/T/F/TRUE/FALSE      2 '../../../input_data/readr/stops.txt'
## 2019 platform_code 1/0/T/F/TRUE/FALSE      2 '../../../input_data/readr/stops.txt'
## .... ............. .................. ...... .....................................
## See problems(...) for more details.

This file has 32883 rows, so it takes much longer to read and we see some parsing failures. You can imagine that things become even more problematic as the data gets larger, a few ten-thousand rows are nowhere near large when it comes to data.

You might want to try increasing the value for the guess_max option to avoid the parsing failures. This will work out as expected, but at the same time reading the data becomes awfully slow.

system.time(read_csv("stops.txt"))
system.time(read_csv("stops.txt", guess_max = 35000))
##    user  system elapsed 
##   0.077   0.005   0.082
##    user  system elapsed 
##   0.249   0.031   0.280

You can clearly see that the second approach takes considerably more time than the first. And now imagine the situation with a few million rows.

So, increasing guess_max might be an option for small-ish datasets, but for large data we need something else. This is where col_types comes in. You probably have noted the verbose column specification readr prints every time you read some data? You can copy that and use it for the col_types argument and it will both speed up reading of data and remove that message. Of course, you can provide your own definitions of column types if you know what to expect. For example, for our GTFS stops data:

stops_cols <- cols(
  stop_id = col_character(),
  stop_code = col_logical(),
  stop_name = col_character(),
  stop_lat = col_double(),
  stop_lon = col_double(),
  stop_url = col_logical(),
  location_type = col_double(),
  parent_station = col_character(),
  wheelchair_boarding = col_double(),
  platform_code = col_character()
)

And then we can use it like so:

stops <- readr::read_csv("stops.txt", col_types = stops_cols)

Now, no messages were printed.

system.time(read_csv("stops.txt", col_types = stops_cols))
##    user  system elapsed 
##   0.047   0.001   0.048

And it turns out providing col_types speeds up the reading of data quite a bit compared ot what we got above without column specifications.

In conclusion, you should use column specifications when reading data with readr, especially if the data is large. This will speed things up quite dramatically and will also avoid parsing failures due to incorrectly guessed column types. In addition, you will gain even more safety because unexpected data will now lead to parsing failures, which you can then investigate.