Statistics, Science, Random Ramblings

A blog about data and other interesting things

The listr package

Posted at — Jan 27, 2022

Recently, I did a lot of work with data stored in lists. Lists in R are useful because they let you store data of different types in a single object, with practically no limits on what kinds of data you put in there. In my case it was mostly large data frames resulting from a series of processing steps on external data. Ultimately, the data needed for specific analyses ended up in data frames. On the way to building those data frames, however, I noticed that I applied the same set of patterns to lists over and over again, for example putting the names of the list elements into the data frames as a column, or binding some of those data frames together.
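To make those two patterns concrete, here is a minimal base R sketch. It uses the built-in iris data rather than my actual data, and the column name `source` is an arbitrary choice for illustration.

```r
# Hypothetical example data: a named list of data frames built from the
# built-in iris dataset (not the data used in my analyses).
dfs <- split(iris, iris$Species)

# Pattern 1: store each element's name in a new column of its data frame.
# The column name "source" is made up for this example.
with_names <- Map(
    function(df, nm) cbind(source = nm, df, stringsAsFactors = FALSE),
    dfs,
    names(dfs)
)

# Pattern 2: bind some of the data frames together by row.
bound <- do.call("rbind", with_names[c("setosa", "versicolor")])
rownames(bound) <- NULL  # drop the "setosa.1"-style row names

head(bound, 3)
```

Neither step is hard on its own, but typing them out many times per project gets tedious.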

Thus, I decided to write a package to make my life a bit easier. Is it the first package for doing things with lists? Probably not, and there is, for example, some overlap with purrr. But it was fun to write, and aiming to build a package that others could use as well was a nice learning experience. This included putting some effort into writing documentation and not making too many assumptions about the use cases.

The result is the listr package, which at the moment is only available from my GitLab page as version 0.0.1. As the version number suggests, there are still rough edges and most likely bugs, but the core functionality is there. I aim to submit the package to CRAN at some point in the future when it is more polished.

During the last two months I used the available version in the real world and found it quite useful, so I am confident that I will continue to expand the package in the future.

A quick tour of listr

library("listr")
data("penguins", package = "palmerpenguins")
p <- split(penguins, penguins$island)

One of the main ideas of listr is using tidyselect for interacting with lists, so you can apply all the nifty selection patterns you know from tidyverse functions.

p |> 
    list_rename(i1 = Biscoe, i2 = Dream) |> 
    list_select(starts_with("i"))
## $i1
## # A tibble: 168 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 158 more rows, and 2 more variables: sex <fct>, year <int>
## 
## $i2
## # A tibble: 124 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Dream            39.5          16.7               178        3250
##  2 Adelie  Dream            37.2          18.1               178        3900
##  3 Adelie  Dream            39.5          17.8               188        3300
##  4 Adelie  Dream            40.9          18.9               184        3900
##  5 Adelie  Dream            36.4          17                 195        3325
##  6 Adelie  Dream            39.2          21.1               196        4150
##  7 Adelie  Dream            38.8          20                 190        3950
##  8 Adelie  Dream            42.2          18.5               180        3550
##  9 Adelie  Dream            37.6          19.3               181        3300
## 10 Adelie  Dream            39.8          19.1               184        4650
## # … with 114 more rows, and 2 more variables: sex <fct>, year <int>

That is of course a toy example, but it should demonstrate the use well enough.

The other main idea behind the package is that it is pipe-friendly. Pipes have become quite popular in R, so it was not a hard decision to make all functions in the package work with pipes: every function expects the data to work on as its first argument.

p |> 
    list_insert("penguins are cool", 2, name = "random_text") |> 
    list_select(1, 2)
## $Biscoe
## # A tibble: 168 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 158 more rows, and 2 more variables: sex <fct>, year <int>
## 
## $random_text
## [1] "penguins are cool"

I probably need to find a good dataset that I can bundle with the package to build better examples.
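The data-first convention is easy to illustrate without listr. The following is only a sketch: `insert_at()` is a hypothetical helper written for this post, not the implementation of `list_insert()`, and its argument names are assumptions.

```r
# Hypothetical helper, not part of listr: insert `value` into list `x` so that
# it ends up at position `position`, optionally under the name `name`.
# The list comes first, which is all the native pipe needs.
insert_at <- function(x, value, position, name = NULL) {
    new <- list(value)
    names(new) <- name
    append(x, new, after = position - 1)
}

x <- list(a = 1:3, b = letters)

# Because the data is the first argument, the call chains with |> (R >= 4.1).
res <- x |>
    insert_at("penguins are cool", 2, name = "random_text")

names(res)
```

Any function written this way can be dropped into a pipeline without placeholders or anonymous functions.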

The package also contains wrappers around do.call("rbind", ...) and do.call("cbind", ...), two calls that I use very often with lists.

p |> 
    list_bind(1, 2, name = "biscoe_and_dream")
## $biscoe_and_dream
## # A tibble: 292 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##  * <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 282 more rows, and 2 more variables: sex <fct>, year <int>
## 
## $Torgersen
## # A tibble: 52 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 42 more rows, and 2 more variables: sex <fct>, year <int>

The default is to bind rows and to keep the bound result as an element of the list, next to the elements that were not selected. I often found myself applying a pattern like the following:

p |> 
    list_bind(everything()) |> 
    list_extract(1)
## # A tibble: 344 × 8
##    species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##  * <fct>   <fct>           <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Biscoe           37.8          18.3               174        3400
##  2 Adelie  Biscoe           37.7          18.7               180        3600
##  3 Adelie  Biscoe           35.9          19.2               189        3800
##  4 Adelie  Biscoe           38.2          18.1               185        3950
##  5 Adelie  Biscoe           38.8          17.2               180        3800
##  6 Adelie  Biscoe           35.3          18.9               187        3800
##  7 Adelie  Biscoe           40.6          18.6               183        3550
##  8 Adelie  Biscoe           40.5          17.9               187        3200
##  9 Adelie  Biscoe           37.9          18.6               172        3150
## 10 Adelie  Biscoe           40.5          18.9               180        3950
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

This pattern should probably get its own wrapper function in a future version of listr.
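For comparison, the binding behaviour shown above can be approximated in base R. This is a rough sketch and not listr's code: `bind_selected()` is a hypothetical name, selection is by position only, and placing the result where the first selected element was is an assumption based on the printed output.

```r
# Hypothetical helper, not listr's code: row-bind the elements of `x` at
# positions `which`, store the result under `name` at the position of the
# first selected element, and keep all elements that were not selected.
bind_selected <- function(x, which, name) {
    bound <- do.call("rbind", unname(x[which]))
    rownames(bound) <- NULL
    x[[which[1]]] <- bound
    names(x)[which[1]] <- name
    # Drop the remaining selected elements; setdiff() also copes with a
    # selection of length one, where nothing has to be dropped.
    keep <- setdiff(seq_along(x), which[-1])
    x[keep]
}

dfs <- split(iris, iris$Species)  # built-in data, used for illustration only
res <- bind_selected(dfs, 1:2, name = "setosa_and_versicolor")
names(res)

# Binding everything and extracting the single result is just one call:
all_rows <- do.call("rbind", unname(dfs))
nrow(all_rows)
```

The sketch leaves out the tidyselect-based selection and the column-binding variant.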

What’s next?

The next version will probably focus on mostly cosmetic changes. One issue is the error messages from tidyselect, which currently refer to columns, for example when a selected element does not exist in the list. Most likely I will also introduce a class for nicer and more compact printing of lists, similar to how a tbl_df prints more nicely than a plain data frame.

In the meantime, feel free to try the package with devtools::install_gitlab("choh/listr").