From time to time you hear and read that it might be a good idea to put your data analysis projects into an R package. The advantages of this approach are proper documentation, clear dependencies and easy automation of your analysis project. However, how to adapt the structure of a package to a standalone data analysis is not exactly obvious. I recently gave the data-analysis-as-a-package approach a try and found it extremely useful, especially for larger projects, so here is an overview of how to do it.
If you have never built an R package yourself, you might wonder what the structure of an R package looks like. As this has been covered extensively elsewhere, I will just summarise the most important points; see Hadley Wickham’s excellent (and free to read online) book “R packages” for everything you need to know about packages and their development. The parts of a package’s structure that matter most when using it for your analysis are:
+ `DESCRIPTION`
+ `NAMESPACE`
+ `R/`
+ `man/`
+ `data/`
+ `vignettes/`
The files `DESCRIPTION` and `NAMESPACE` are metadata for your package. The former contains things like its name, version, licence and dependencies; the latter lists all symbols your package exports – the functions that can be called directly after loading it with `library()` – and imports – functions from other packages you make directly available to your package. Packages made to help with package development, like `roxygen2` and `usethis`, do a lot of this work for you. In fact, if you use `roxygen2` for inline documentation throughout your project, you won’t have to touch the `NAMESPACE` file yourself. Additionally, RStudio will lay out the basic structure for you when you create a project and specify that it is a package.
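For instance, a single `usethis` call lays out this skeleton (the package name is just a placeholder here and throughout this post):

```r
# Creates a new package project with DESCRIPTION, NAMESPACE and R/
usethis::create_package("myanalysis")
```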
The directory `R/` contains all your functions. Your source files here should not include top-level code, as this will be executed when building the package, but not when using it.
It is good practice to document your functions, and the documentation ends up in the `man/` directory. R uses a LaTeX-like language for formatting documentation, but if you use `roxygen2` to generate the documentation from specially formatted comments, you will not need to worry about it, as the work is done automatically. So, as with `NAMESPACE`, you won’t have to touch these files, nor bother with the format unless you want to.
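As a minimal sketch, a documented and exported function in `R/` could look like this (the function itself is made up for illustration):

```r
#' Summarise a numeric vector
#'
#' Returns the mean and standard deviation of a numeric vector.
#'
#' @param x A numeric vector.
#' @return A named numeric vector with elements `mean` and `sd`.
#' @export
summarise_vector <- function(x) {
  c(mean = mean(x), sd = stats::sd(x))
}
```

Running `devtools::document()` then regenerates `NAMESPACE` and the `.Rd` files in `man/` from these comments.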
Data you bundle with your package goes into `data/`. Here you should have a single `.rda` (or `.RData`) file per dataset you want to bundle with your package. You can then load the data with the `data()` command. Datasets can and should be documented as well.
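Dataset documentation also lives in `R/`, as a `roxygen2` block above a string containing the dataset’s name; a minimal sketch for a hypothetical `my_data`:

```r
#' Cleaned example data
#'
#' A hypothetical processed dataset bundled with the package.
#'
#' @format A data frame with one row per observation.
"my_data"
```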
Finally, there is the `vignettes/` directory. You might be familiar with vignettes, which usually contain longer articles that demonstrate how to use a package or how to achieve certain goals. At the end of the day, vignettes are just HTML files, and you can build them using RMarkdown and `knitr`. Perhaps you see where this is going, but more on that later.
There are more standard directories possible in a package, but for now we will just ignore them. The ones covered here are all you need to put your analysis project in a package. If you have read this far, you still might not know exactly how to lay out your data analysis project as a package, but I will get to that now.
Given the structure above, the `data/` directory is for data you bundle with your package, the `vignettes/` directory is where your reports made with RMarkdown go, and the `R/` directory is where your functions go.
Now, you might be wondering where you do your data wrangling and whether to bundle all your raw data with the package. While in theory nothing stops you from putting 200 lines of data cleaning into a function and bundling messy data with your package, that is not exactly elegant.
So what to do instead? Simple: create a `data-raw/` directory, where you put all the scripts you need to tidy up your data, along with the raw data itself. This way neither is bundled into the package. The idea here is to run a set of scripts that lead to a clean dataset you can work from, which you then bundle with your package.
What I like to do here is split the process up into several smaller scripts and then create one meta-script sourcing all the others. The nice thing is that you can use things that have no place inside a package, like `library()` calls or relative paths, since these scripts will probably only ever run on your machine. Just call `usethis::use_data(my_data, overwrite = TRUE)` for each fully processed dataset you want to bundle and use in your analysis.
Note that I added `overwrite = TRUE` to make sure the data stays up to date when you change the processing or add new or updated data.
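As a sketch, such a meta-script could look like this (the file and object names are made up; `usethis::use_data_raw()` creates the `data-raw/` directory and excludes it from the built package for you):

```r
# data-raw/prepare_data.R -- run by hand, never bundled with the package
library(dplyr)  # library() calls are fine here, unlike in R/

# Hypothetical helper scripts that read the raw files and tidy them up,
# leaving a clean data frame `my_data` in the global environment
source("data-raw/01_import.R")
source("data-raw/02_clean.R")

# Store the processed dataset as data/my_data.rda
usethis::use_data(my_data, overwrite = TRUE)
```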
If you then use the Install and Restart tool in RStudio (or do the process manually), you can load your data whenever you need it with `data("my_data", package = "myanalysis")` (or omit the `package` argument if you have called `library("myanalysis")` beforehand).
You are probably used to writing reports with RMarkdown and using them to communicate your results. When you put your analysis into a package, you can use the vignette system to keep your reports organised and build them automatically.
If you use `usethis::use_vignette()`, many things are done automatically, but for completeness’ sake, let’s look at the header of a vignette in a data-analysis-as-package context.
title: "Linear Regression"
output:
rmarkdown::html_document:
toc: true
toc_float: true
vignette: >
%\VignetteIndexEntry{linear_regression}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{dplyr, tibble}
There are a few noteworthy things here. First of all, I replaced the default output format, `html_vignette`, with `rmarkdown::html_document` to get a floating table of contents. You can use whatever you like here as long as it is supported by the specified `VignetteEngine`. Also note that while you can change the output format to whatever you like, CRAN probably won’t accept anything other than `html_vignette`, because the output file size inflates dramatically with other formats. However, as we are talking about data analysis projects here, it is unlikely that you will submit them to CRAN, so feel free to use whatever you like.
The `VignetteDepends` field specifies which packages are required to build the vignette, which is a nice way to keep builds from failing halfway through. Note that this does not load the packages for you; it just ensures they are installed before the build is attempted. You still need to load them (or refer to their namespace explicitly using `::`). You can also load packages not listed in `VignetteDepends`, but if there is a mechanism in place that ensures everything can be run properly, you should probably use it.
Generally, you will want to start the analysis within a vignette by loading the packages you use (including the one you create for your analysis) and then the data you bundled with your package. This keeps independent parts of your analysis self-contained.
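In practice, the first chunk of such a vignette might look like this (using the placeholder names from above):

```r
library(myanalysis)  # the hypothetical analysis package itself
library(dplyr)       # declared in VignetteDepends above

data("my_data", package = "myanalysis")
```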
From there on, everything is just like a regular RMarkdown report; just make sure to put whatever functions you create into the `R/` directory and to document and export them.
One more advantage of using a package for an analysis is that you are more or less required to document your functions and datasets; at the very least, running `R CMD check` will complain if documentation is missing. And you should be running `R CMD check` from time to time, as it helps you spot potential problems with your code, ensuring that you will be able to reproduce your analyses in the future. There is no need for a perfect result if you do not plan to submit to CRAN, but the overall assessment of issues is certainly useful.
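If you use `devtools`, the check is one call away from the R console:

```r
# Wraps R CMD check: builds the package (including vignettes) and flags
# missing documentation, undeclared dependencies and other problems
devtools::check()
```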
Obviously, as with everything, there are upsides and downsides to turning a data analysis project into a package. Outside of a package structure you can throw some code together and be done with it, but creating a standalone package requires much more care. For example, you will need to make sure all your dependencies are listed in your `DESCRIPTION` file, and you will have to be explicit about namespaces in your functions (e.g. calling `dplyr::filter` instead of `filter`, unless you want to pollute your `NAMESPACE` with `importFrom()` directives).
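Here, too, `usethis` helps with the bookkeeping; for example, to declare `dplyr` as a dependency:

```r
# Adds dplyr to the Imports field of DESCRIPTION
usethis::use_package("dplyr")
```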
Additionally, you cannot make too many assumptions about the environment the code runs in, for example by relying on relative paths. And since large parts of this post were about structure, you will obviously have to follow a more or less rigid, standardised way of structuring your code, whether you like it or not.
Putting your data analysis projects into packages provides many benefits but requires more care when organising the project. Nonetheless, benefits like documentation, portability and reproducibility far outweigh the downsides, and this is especially true the more complex your project is.
Until one of my recent projects at work became difficult to organise due to the large number of analyses involved, I had never considered the analysis-as-a-package approach. However, giving it a try on a project of my own really changed my mind, and I would now consider it for all but the smallest projects.