Data analysis projects as R packages: a practical primer

Posted on Jan 24, 2020

From time to time you hear and read that it might be a good idea to put your data analysis projects into an R package. The advantages of this approach are proper documentation, clear dependencies and easy automation of your analysis. However, how to adapt the structure of a package to a standalone data analysis is not exactly obvious. I recently gave the data-analysis-as-a-package approach a try and found it extremely useful, especially for larger projects, so here is an overview of how to do it.

The structure of an R package

If you have never built an R package yourself you might wonder what the structure of a package looks like. As this has been covered extensively elsewhere, I will just summarise the most important points; see Hadley Wickham's excellent (and free to read online) book "R Packages" for everything you need to know about packages and their development. The parts of a package's structure that matter most when using it for your analysis are:

+ DESCRIPTION
+ NAMESPACE
+ R/
+ man/
+ data/
+ vignettes/

The files DESCRIPTION and NAMESPACE contain metadata for your package. The former holds things like its name, version, licence and dependencies; the latter lists the symbols your package exports (the functions that can be called directly after loading it with library()) and imports (functions from other packages that you make directly available to your package). Packages designed to support package development, such as roxygen2 and usethis, do a lot of the work here for you. In fact, if you use roxygen2 for inline documentation throughout your project, you will never have to touch the NAMESPACE file yourself. Additionally, when you create a project in RStudio and specify that it is a package, the basic structure is laid out for you.
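To give a feel for how little of this you need to write by hand, here is a minimal sketch using usethis and devtools; the package name my.analysis is a made-up placeholder.

```r
# Lay out a new package skeleton (DESCRIPTION, NAMESPACE, R/ directory).
usethis::create_package("my.analysis")   # "my.analysis" is a hypothetical name

# Record metadata and dependencies in DESCRIPTION.
usethis::use_mit_license()               # fills in the License field
usethis::use_package("dplyr")            # adds dplyr to Imports

# Regenerate NAMESPACE and the man/ pages from roxygen2 comments.
devtools::document()
```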

The R directory contains all your functions. Source files in here should not contain top-level code, as such code is executed when the package is built, but not when it is used.

It is good practice to document your functions, and the documentation ends up in the man directory. R uses a LaTeX-like language for formatting documentation, but if you use roxygen2 to generate it from specially formatted comments you will not need to worry about that, as the work is done automatically. So, as with NAMESPACE, you will not have to touch these files or bother with the format unless you want to.
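As a sketch of what this looks like in practice, here is a made-up function in R/ with its roxygen2 documentation block:

```r
#' Summarise a numeric variable by group
#'
#' @param data A data frame containing the variables below.
#' @param group Name of the grouping column, as a string.
#' @param var Name of the numeric column to summarise, as a string.
#'
#' @return A data frame with one row per group and the group means.
#' @export
summarise_by_group <- function(data, group, var) {
  # aggregate() comes from the stats package, so the call works
  # without attaching anything via library().
  stats::aggregate(
    data[[var]],
    by = list(group = data[[group]]),
    FUN = mean,
    na.rm = TRUE
  )
}
```

Calling devtools::document() then generates the corresponding .Rd file in man/ and the export entry in NAMESPACE.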

Data you bundle with your package goes into data. Here you should have a single .rda (or .RData) file per dataset you want to bundle with your package. You can then load the data with the data() function. Datasets can and should be documented as well.
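Dataset documentation also works via roxygen2, in an .R file under R/; the sketch below assumes a hypothetical bundled dataset called my_data with two made-up columns.

```r
#' Cleaned example data for the analysis
#'
#' @format A data frame with one row per observation and two columns:
#' \describe{
#'   \item{group}{Treatment group of the observation.}
#'   \item{value}{The measured value.}
#' }
"my_data"
```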

Finally, there is the vignettes directory. You might be familiar with vignettes: they are usually longer articles that demonstrate how to use a package or how to achieve certain goals. At the end of the day, vignettes are just HTML files that you build from RMarkdown with knitr. Perhaps you can see where this is going, but more on that later.

There are more standard directories that can appear in a package, but for now we will ignore them; the ones covered here are all you need to put your analysis project into a package. You still might not know exactly how to lay out a data analysis project as a package, but I will get to that now.

So, how does this all translate to data analysis projects?

Given the structure above, the data directory is for the data you bundle with your package, the vignettes directory is where your RMarkdown reports go and the R directory is where your functions go.

Now, you might be wondering where the data wrangling happens and whether to bundle all your raw data with the package. While in theory nothing stops you from putting 200 lines of data cleaning into a function and bundling messy data with your package, that is not exactly elegant.

So what to do instead? Simple: create a data-raw directory where you put the scripts you need to tidy up your data, together with the raw data itself. If you create it with usethis::use_data_raw(), the directory is added to .Rbuildignore, so its contents are not bundled into the package. The idea is to run a set of scripts that produce a clean dataset you can work with, which you then bundle with your package.

What I like to do here is to split the process into several smaller scripts and then create one meta-script that sources all the others. The nice thing is that you can use things that do not work inside packages, such as library() calls or relative paths, because these scripts will probably only ever be run on your machine. Just call usethis::use_data(my_data, overwrite = TRUE) for each fully processed dataset you want to bundle and use in your analysis. Note that I added overwrite = TRUE so the bundled data stays up to date when you change the processing or add new or updated data.
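A minimal sketch of such a meta-script, with made-up script and dataset names, could look like this:

```r
# data-raw/prepare_data.R
# Each sourced script is a hypothetical cleaning step; together they
# produce the object `my_data` in the global environment.
source("data-raw/01_read_raw.R")   # reads the raw files into `raw_data`
source("data-raw/02_clean.R")      # turns `raw_data` into the tidy `my_data`

# Write data/my_data.rda so the processed dataset ships with the package.
usethis::use_data(my_data, overwrite = TRUE)
```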

If you then use the Install and Restart tool in RStudio (or build and install the package manually), you can load your data whenever you need it with data("my_data", package = "my.analysis"), or omit the package argument if you have called library("my.analysis") beforehand.

Vignettes, the place for actual analyses

You are probably used to writing reports with RMarkdown and using them to communicate your results. When you put your analysis into a package, you can use the vignette system to keep those reports organised and build them automatically.
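Scaffolding a vignette is a one-liner with usethis; the vignette name below is only an example and matches the header shown further down.

```r
# Creates vignettes/linear_regression.Rmd, adds knitr to Suggests and
# sets the VignetteBuilder field in DESCRIPTION.
usethis::use_vignette("linear_regression")
```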

If you use usethis::use_vignette(), many things are done automatically, but for completeness' sake let's look at the header of a vignette in the data-analysis-as-a-package context.

title: "Linear Regression"
output: 
  rmarkdown::html_document:
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{linear_regression}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\VignetteDepends{dplyr, tibble}

There are a few noteworthy things here. First of all, I replaced the default output format, html_vignette, with rmarkdown::html_document so that I could use a floating table of contents. You can use whatever format you like here as long as it is supported by the specified VignetteEngine. Note, however, that CRAN will probably not be happy with anything other than html_vignette, because the output file size inflates dramatically with other formats. But since we are talking about data analysis projects, it is unlikely that you will submit them to CRAN, so feel free to use whatever you like.

The VignetteDepends field specifies which packages are required to build the vignette, which is a nice way of keeping builds from failing halfway through. Note that this does not load the packages for you; it only ensures that they are installed before the build is attempted. You still need to load them yourself (or refer to their namespace explicitly with ::). You can also load packages not listed in VignetteDepends, but if there is a mechanism in place that ensures everything can run properly, you should probably use it.

Generally you will want to start the analysis within a vignette by loading the packages you need (including the one you are creating for the analysis) and then the data you bundled with your package. This keeps independent parts of your analysis self-contained. From there on, everything is just like a regular RMarkdown report; just make sure to put every function you create into the R directory and to document and export it.
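A sketch of what the first chunk of such a vignette might contain, reusing the hypothetical package, dataset and helper function from the examples above:

```r
# Load the packages used in this vignette, including the analysis package itself.
library(dplyr)
library(my.analysis)

# Load the dataset bundled with the package.
data("my_data", package = "my.analysis")

# From here on it is a regular analysis, using the packaged helper functions.
my_data %>%
  summarise_by_group(group = "group", var = "value")
```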

R CMD check

One more advantage of using a package for an analysis is that you are more or less required to document your functions and datasets; at the very least, R CMD check will complain if documentation is missing. And you should run R CMD check from time to time, as it helps you spot potential problems with your code and ensures that you will be able to reproduce your analyses in the future. There is no need for a perfect result if you do not plan to submit to CRAN, but the overall assessment of issues is certainly useful.
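If you prefer to stay in the R console, devtools wraps the same checks:

```r
# Build the package and run R CMD check on it, printing the results.
devtools::check()
```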

Downsides

Obviously, as with everything, there are upsides and downsides to turning a data analysis project into a package. Outside of a package structure you can throw some code together and be done with it, but creating a stand-alone package requires much more care. For example, you will need to make sure all your dependencies are listed in your DESCRIPTION file, and you will have to be explicit with namespaces in your functions (e.g. calling dplyr::filter() instead of filter(), unless you want to fill your NAMESPACE with import directives).
Additionally, you cannot make many assumptions about the environment the code runs in; relying on relative paths, for example, is out. And since large parts of this post were about structure, you will have to follow a more or less rigid and standardised way of structuring your code, whether you like it or not.

Conclusions

Putting your data analysis projects into packages provides many benefits, but requires a bit more care when organising the project. Nonetheless, benefits like documentation, portability and reproducibility far outweigh the downsides, and this is especially true the more complex your project is.

I never considered the analysis-as-a-package approach until one of my recent projects at work became difficult to organise because of the large number of analyses involved. However, giving it a try for the first time on a project of my own really changed my mind, and I would now consider it for all but the smallest projects.