Testing code in RMarkdown documents with knitr

Over the last few months, Literate Programming  has proved to be a huge help to me in documenting my exploratory code in R. Writing Rmarkdown documents and building them with knitr not only provides me a greater opportunity to clarify my code in plain English, it also allows me to rationalise why I did something in the first place.

While this is really useful, this has come at the expense of writing careful, well unit tested code. For instance, last week I discovered that a relatively simple function that I wrote to take the average values from multiple data frames was completely wrong.

As such, I wanted to find a way that let me continue to write Rmarkdown while also testing my code directly  by using a common unit testing framework like testthat.

Here is one solution to that problem: if we isolate our key functions from our Rmarkdown document and place them in a separate R file, we can test them with testthat and include them in our Rmarkdown document by using knitr’s get_chunk function.

Prime Numbers

As an example, let’s create a document which shows our function for finding out if a number if prime or not:

This Rmarkdown document lists the function and tests to see whether 1,000 is prime or not. After rendering the HTML document, I am relieved to see is.prime does indeed yield FALSE for is.prime(1000). If I wanted to introduce more tests, I could simply add more lines to the test-is-prime chunk and, if they passed, comment that chunk out of the file.

This isn’t ideal for a number of reasons, one reason being that in this case, I’m not using a testing framework which would allow me to automatically check if I had broken my code.

I solved this by moving the is-prime-function chunk into a separate R file called is_prime.R:

Knitr interprets ‘## —-‘ at the start of a line as an indicator for a label for a chunk of code that can be reused later. The chunk of code associated with this label goes until the end of the file or until another label is encountered.

To include this code, as before,I just have to change a few small things:

You will note the extra line to the setup chunk which includes a read_chunk function in knitr which includes the path to the newly created R file. To include the function is.prime in the document again, an empty chunk with the same name as the label in is_prime.R has to be created. When I use knitr to create the document, knitr will inject the is.prime function into the external-code chunk and is.prime(1000) will execute successfully.

Now, testing is.prime with testthat is relatively easy by just creating test/test_is_prime.R and writing a few test cases:

And to run our tests in RStudio, we just have to type this into the console:

It’s fairly simple, clean and common sense. An added bonus is that I can now inject is_prime into any other Rmarkdown document by following the same method.

The accompanying source code for this blog can be found at https://github.com/hiraethus/How-to-unit-test-RMarkdown.

Using Packrat with Bioconductor in RStudio

As an R programmer, you may not be familiar with the development processes involved in programming Java. For those of you who have written some production Java code, you may have found that the barrier to entry can seem quite high. With so many tools you need to grok in order to have a basic level of proficiency, particularly if you are thrown in the deep end on a mature code base. Furthermore, soon even more complexity will be added to the Java developer’s toolbox with the additions presented in Java JDK 9 such as the module system that has burst forth from Project Jigsaw.

With this said, one thing tool that has matured in Java software development is the use of dependency management. While in years past, Java libraries (packages) would be committed to repositories along with a project’s source code, it is now common practice to define a pom.xml file which contains a listing of all libraries and their versions. When a developer clones a copy of the source code, she will run ‘mvn build’ to build the project and simultaneously download all library dependencies to her machine if they are not already present. This means that all developers who build their project using maven will be using the same versions of libraries while testing their code.

Packrat provides this very same level of convenience to R programmers. In particular, packrat works by creating a subfolder in your R project which stores a file that specifies all the packages and their versions you used in your project as well as a repository of packages that is used privately by your project.

As this blog addresses using R with Bioconductor, I will discuss what I do in order to set up a project with packrat using RStudio. You will note that I mix using the Graphical User Interface and a handful of packrat commands in the console which I find to be most useful.

The easiest way to set up RStudio to use Packrat is, when creating a new project, is to ensure you choose the ‘Use packrat with this project’ option.

Ensure you choose 'Use packrat with this project'

Now we have an R project with its own package library inside the packrat subfolder. If using a version control system like Git, it’s tempting to commit the whole packrat directory along with the project to your git repository. The consequence of this is that you are potentially installing binary files to the repository making it a much larger repository for others to download if they want to run your code.

To avoid committing R packages to git, packrat provides a function that modifies your .gitignore file. Run this inside your RStudio console:

You can now commit everything to your git repository as an initial commit.

You should now be able to install all packages using install.packages to retrieve packages from CRAN. After installing a package, it will be saved in your private packrat repository. However, in order to update the packrat list of packages (which is described in the packrat/packrat.lock file), you should perform packrat::snapshot() after each package you install in order to avoid any surprises later on.

Finally, one issue I had with using packrat was how to install packages from Bioconductor. In my experience, the easiest way to do this is through setting the available repositories interactively by typing

into the console. This presents you with a text-based prompt:

Select all BioC repositories and then you can simply install all required Bioconductor packages using install.packages. Packrat will keep track of the version of Bioconductor currently being used.

Having done all this, when someone wants to use your code elsewhere, they need only clone your project and load it in to RStudio. RStudio will automatically restore all the packages that are missing into the project by downloading them from the relevant repositories.

Packrat is by no means perfect, for instance, packrat will endeavour to download binary packages on Windows as it lacks a toolchain for compiling any C/C++ code. Some packages in Bioconductor are only available as source and, as such, packrat is unable to find these packages.

I really appreciate the work done to make packrat work with R and it will, I’m sure become increasingly important in the future to make sure that R code that is written is more stable and predictable by keeping R packages consistent across all computers using a particular R project.