Working Directories and RMarkdown

Discussed elsewhere, I organise my bioinformatics projects like this:

    - <jobname>/
        - conf/
        - data/
        - doc/
            - notebook.Rmd
            - some_other_script.Rmd
        - .here
        - lib/
        - results/
        - requirements.txt
        - scripts/
        - subjobs/
            - <nested_structure>
        - Snakefile

Here, the top level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled / knitted to .pdfs by the packages rmarkdown and knitr.

When working on the project <jobname>, my working directory is ./jobs/<jobname> and, in the simple case when a given project has no subjobs, this working directory shouldn’t be changed by any of the scripts or documents. knitr’s a bit of a bugger though. Unless you specifically tell it otherwise (eg, using the root.dir option of opts_knit), knitr will set the working directory to match the directory where your script is stored, rather than where you called it from (see below).

To compile a single *.Rmd notebook, I have the following snakemake recipe. It converts doc/some_file.Rmd into doc/some_file.pdf or doc/some_file.docx depending on the required output filename1.

rule compile_markdown:
        Compile R-markdown to .pdf or .docx in job `{params.job}`

        - Use `*.Rmd` as first input argument and put any additional
        dependencies into `rmarkdown_prereqs`
        - The output `.pdf` or `.docx` is the first output argument, and any
          additionally-constructed files should be implicitly named in
          `rmarkdown_result_regexes`. Each entry in the latter should have a
          {{doc_name}} and a {{ext}} wildcard in their definition.

        ["doc/{doc_name}.Rmd"] + rmarkdown_prereqs

        ["doc/{doc_name}.{ext,pdf|docx}"] + rmarkdown_result_regexes

        job = os.path.basename(os.getcwd())

            doctype <- list(pdf  = "pdf_document",
                            docx = "word_document"
            rmd.script <- "{input[0]}"
                   output_format = doctype,
                   output_file   = "{output[0]}",
                   output_dir    = "doc",
                   quiet = TRUE)

What does knitr do to the working directory?

Assume I’m sitting in the working directory “~/HERE” on the command line.

Let’s write a simple R markdown script (doc/temp.Rmd) that just prints out the current working directory:

title: temp.Rmd
author: Russ Hyde

# Print the current working directory

… and then render that into a .pdf:

~/HERE> Rscript -e "library(rmarkdown); render('doc/temp.Rmd')"

This prints out the working directory as ~/HERE/doc (where the script is stored) rather than the directory ~/HERE, where I called Rscript from.

Note that if I put a similar .R script in ./doc, that prints out the current working directory, this doesn’t happen. Running the following using Rscript prints out the user’s working directory (ie, ~/HERE).

# temp.R
# end of temp.R

What’s wrong with monkeying with the working directory?

There’s a few reasons that I don’t like the working directory being changed within a script. When you want to read some data from a file into a script that you’re running, you can either

    1. write or compute the filepath for your data within the script; or
    1. you can provide the script with the filepath as an argument (or in a config file).

Let’s suppose you had two different scripts that access the same file, but where one of those scripts changes the working directory. Then:

In case i: you’d have to write the filepath differently for the two scripts. Here, if I wanted to access ./data/my_counts.tsv from within ./doc/temp.R and ./doc/temp.Rmd, I’d have to write the filepath as ./data/my_counts.tsv within the former, and ../data/my_counts.tsv within the latter.

In case ii: you’d have to similarly mangle the filepaths. A script that changes the working directory should be provided filepaths relative to the working directory chosen by that script, so when you’re passing in file-paths you have to think as if you’re in that directory used by that script at the time the file is accessed (NO!!!!); or use absolute paths (NO!!!!!!).

I know it seems trivial, and described as above it only seems like a mild inefficiency to have to write different scripts in slightly different ways. But if you get into the habit of using scripts that modify the working directory you limit the portability / reproducibility of your work.

My old approach was to set root.dir in opts_knit during knitting the document. It was pretty scruffy but it worked and is described in an old version of this entry at my old blog.


There is another path however.

Suppose you’re writing an Rmarkdown script myproject/doc/notebook.Rmd, and you want to read in some file myproject/data/important-datas.tsv.

By the default knitr settings you’d read this file in relative to the Rmarkdown’s position:

# no!
my_data <- read_tsv("../data/important-datas.tsv")

That works, but suppose you move the Rmarkdown into doc/initial_scripts/. Any filepaths that were written relative to doc/ are no longer valid - and your script will no longer knit.

The alternative path is as follows:

  • Install the CRAN package here - this allows you to write file-paths relative to a directory of your choosing;

  • Stick an empty file called .here in the main directory of your project (this anchors your relative-paths);

  • At the start of all of your Rmarkdown files load the here package using library(here)

  • Then specify all your file-paths relative to the project’s working directory using the here function, instead of relative to your Rmarkdown’s position.

my_data <- read_tsv(here("data/important-datas.tsv"))

  1. Using R within snakemake is frowned upon now, it’s better to use external scripts, so I’ll probably rewrite this rule soon. See here.