Most of you are probably familiar with the tidyverse and the associated
readr
package for reading all kinds of rectangular, tabular data.
Consider the following example from a GTFS dataset (the July 2021 GTFS data from Open Data ÖPNV provided by the VRR):
library("readr")
calendar <- read_csv("calendar.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## service_id = col_character(),
## monday = col_double(),
## tuesday = col_double(),
## wednesday = col_double(),
## thursday = col_double(),
## friday = col_double(),
## saturday = col_double(),
## sunday = col_double(),
## start_date = col_double(),
## end_date = col_double()
## )
This is a relatively small file with 1476 rows, so it is
read quickly and readr
makes the right guess about the data types in
the columns.
Now, consider a much larger file.
stops <- read_csv("stops.txt")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## stop_id = col_character(),
## stop_code = col_logical(),
## stop_name = col_character(),
## stop_lat = col_double(),
## stop_lon = col_double(),
## stop_url = col_logical(),
## location_type = col_double(),
## parent_station = col_character(),
## wheelchair_boarding = col_double(),
## platform_code = col_logical()
## )
## Warning: 100 parsing failures.
## row col expected actual file
## 1975 platform_code 1/0/T/F/TRUE/FALSE 2 '../../../input_data/readr/stops.txt'
## 2009 platform_code 1/0/T/F/TRUE/FALSE 4 '../../../input_data/readr/stops.txt'
## 2012 platform_code 1/0/T/F/TRUE/FALSE 3 '../../../input_data/readr/stops.txt'
## 2013 platform_code 1/0/T/F/TRUE/FALSE 2 '../../../input_data/readr/stops.txt'
## 2019 platform_code 1/0/T/F/TRUE/FALSE 2 '../../../input_data/readr/stops.txt'
## .... ............. .................. ...... .....................................
## See problems(...) for more details.
This file has 32883 rows, so it takes much longer to read and we see some parsing failures. You can imagine that things become even more problematic as the data gets larger; a few tens of thousands of rows are nowhere near large when it comes to data.
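As the warning suggests, the affected rows can be inspected with problems(); for example (output omitted here):
# Inspect which rows and columns failed to parse
problems(stops)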
You might want to try increasing the value of the guess_max
option to
avoid the parsing failures.
This works as expected, but at the same time it makes reading the data
noticeably slower.
system.time(read_csv("stops.txt"))
##    user  system elapsed
##   0.077   0.005   0.082
system.time(read_csv("stops.txt", guess_max = 35000))
##    user  system elapsed
##   0.249   0.031   0.280
You can clearly see that the second approach takes considerably more time than the first. And now imagine the situation with a few million rows.
So, increasing guess_max
might be an option for small-ish datasets, but
for large data we need something else.
This is where col_types
comes in.
You have probably noticed the verbose column specification readr
prints every time you read some data.
You can copy that specification and pass it to the col_types
argument; this both speeds up reading the data and removes that message.
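If you do not want to copy the specification from the console by hand, readr can also hand it to you after the fact; a small sketch, assuming a readr version that exports spec():
# Extract the column specification readr guessed while reading
stops_spec <- spec(stops)
stops_spec  # prints a cols(...) call you can copy and adjust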
Of course, you can provide your own definitions of column types if you know
what to expect.
For example, for our GTFS stops data:
stops_cols <- cols(
  stop_id = col_character(),
  stop_code = col_logical(),
  stop_name = col_character(),
  stop_lat = col_double(),
  stop_lon = col_double(),
  stop_url = col_logical(),
  location_type = col_double(),
  parent_station = col_character(),
  wheelchair_boarding = col_double(),
  platform_code = col_character()
)
And then we can use it like so:
stops <- readr::read_csv("stops.txt", col_types = stops_cols)
Now, no messages were printed.
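As an aside, readr also accepts a compact string form for col_types, with one character per column (c = character, d = double, l = logical). Assuming the ten stops columns in the order shown above, the following should be equivalent:
# Same specification as stops_cols, written as a compact string
stops <- read_csv("stops.txt", col_types = "clcddldcdc")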
system.time(read_csv("stops.txt", col_types = stops_cols))
##    user  system elapsed
##   0.047   0.001   0.048
And it turns out that providing col_types
speeds up reading the data quite a bit
compared to what we got above without column specifications.
In conclusion, you should use column specifications when reading data
with readr, especially if the data is large.
This will speed things up quite dramatically and will also avoid parsing
failures due to incorrectly guessed column types.
In addition, you will gain even more safety because unexpected data will
now lead to parsing failures, which you can then investigate.
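If you want to be strict about it, you can even turn leftover parsing problems into a hard error right after reading; a sketch, assuming your readr version still exports stop_for_problems():
stops <- read_csv("stops.txt", col_types = stops_cols)
# Abort instead of silently continuing if any value did not match the spec
stop_for_problems(stops)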