Discussed elsewhere, I organise my bioinformatics projects like this:
./jobs/
- <jobname>/
- conf/
- data/
- doc/
- notebook.Rmd
- some_other_script.Rmd
- .here
- lib/
- results/
- requirements.txt
- scripts/
- subjobs/
- <nested_structure>
- Snakefile
Here, the top level snakemake script controls the running of all scripts and the compiling of all documents. My labbooks are stored as R-markdown documents and get compiled / knitted to .pdf
s by the packages rmarkdown
and knitr
.
When working on the project <jobname>
, my working directory is ./jobs/<jobname>
and, in the simple case when a given project has no subjobs, this working directory shouldn’t be changed by any of the scripts or documents. knitr
’s a bit of a bugger though. Unless you specifically tell it otherwise (eg, using the root.dir
option of opts_knit
), knitr
will set the working directory to match the directory where your script is stored, rather than where you called it from (see below).
To compile a single *.Rmd
notebook, I have the following snakemake
recipe. It converts doc/some_file.Rmd
into doc/some_file.pdf
or doc/some_file.docx
depending on the required output filename1.
rule compile_markdown:
message:
"""
Compile R-markdown to .pdf or .docx in job `{params.job}`
- Use `*.Rmd` as first input argument and put any additional
dependencies into `rmarkdown_prereqs`
- The output `.pdf` or `.docx` is the first output argument, and any
additionally-constructed files should be implicitly named in
`rmarkdown_result_regexes`. Each entry in the latter should have a
{{doc_name}} and a {{ext}} wildcard in their definition.
"""
input:
["doc/{doc_name}.Rmd"] + rmarkdown_prereqs
output:
["doc/{doc_name}.{ext,pdf|docx}"] + rmarkdown_result_regexes
params:
job = os.path.basename(os.getcwd())
run:
R("""
library(rmarkdown)
library(tools)
doctype <- list(pdf = "pdf_document",
docx = "word_document"
)[["{wildcards.ext}"]]
rmd.script <- "{input[0]}"
render(rmd.script,
output_format = doctype,
output_file = "{output[0]}",
output_dir = "doc",
quiet = TRUE)
""")
What does knitr
do to the working directory?
Assume I’m sitting in the working directory “~/HERE” on the command line.
Let’s write a simple R markdown script (doc/temp.Rmd
) that just prints out the current working directory:
---
title: temp.Rmd
author: Russ Hyde
---
```{r}
# Print the current working directory
print(getwd())
```
… and then render that into a .pdf
:
~/HERE> Rscript -e "library(rmarkdown); render('doc/temp.Rmd')"
This prints out the working directory as ~/HERE/doc
(where the script is stored) rather than the directory ~/HERE
, where I called Rscript from.
Note that if I put a similar .R
script in ./doc
, that prints out the current working directory, this doesn’t happen. Running the following using Rscript
prints out the user’s working directory (ie, ~/HERE
).
# temp.R
print(getwd())
# end of temp.R
What’s wrong with monkeying with the working directory?
There’s a few reasons that I don’t like the working directory being changed within a script. When you want to read some data from a file into a script that you’re running, you can either
- write or compute the filepath for your data within the script; or
- you can provide the script with the filepath as an argument (or in a config file).
Let’s suppose you had two different scripts that access the same file, but where one of those scripts changes the working directory. Then:
In case i: you’d have to write the filepath differently for the two scripts. Here, if I wanted to access ./data/my_counts.tsv
from within ./doc/temp.R
and ./doc/temp.Rmd
, I’d have to write the filepath as ./data/my_counts.tsv
within the former, and ../data/my_counts.tsv
within the latter.
In case ii: you’d have to similarly mangle the filepaths. A script that changes the working directory should be provided filepaths relative to the working directory chosen by that script, so when you’re passing in file-paths you have to think as if you’re in that directory used by that script at the time the file is accessed (NO!!!!); or use absolute paths (NO!!!!!!).
I know it seems trivial, and described as above it only seems like a mild inefficiency to have to write different scripts in slightly different ways. But if you get into the habit of using scripts that modify the working directory you limit the portability / reproducibility of your work.
My old approach was to set root.dir
in opts_knit
during knitting the document. It was pretty scruffy but it worked and is described in an old version of this entry at my old blog.
here::here()
There is another path however.
Suppose you’re writing an Rmarkdown script myproject/doc/notebook.Rmd
, and you want to read in some file myproject/data/important-datas.tsv
.
By the default knitr settings you’d read this file in relative to the Rmarkdown’s position:
```{r}
# no!
my_data <- read_tsv("../data/important-datas.tsv")
```
That works, but suppose you move the Rmarkdown into doc/initial_scripts/
. Any filepaths that were written relative to doc/
are no longer valid - and your script will no longer knit.
The alternative path is as follows:
Install the CRAN package
here
- this allows you to write file-paths relative to a directory of your choosing;Stick an empty file called
.here
in the main directory of your project (this anchors your relative-paths);At the start of all of your Rmarkdown files load the
here
package usinglibrary(here)
Then specify all your file-paths relative to the project’s working directory using the
here
function, instead of relative to your Rmarkdown’s position.
```{r}
library(here)
my_data <- read_tsv(here("data/important-datas.tsv"))
```