Singularity container for an R module on a research cluster for VEuPathDB

Introduction

To use this technology, we need Singularity installed in the target environment by root, which is possibly the largest obstacle for us.
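A quick way to check whether a target environment is ready is to look for the binary (a minimal sketch; nothing here is specific to our setup):

# Is Singularity installed and on PATH?
command -v singularity || echo "singularity is not installed - ask the cluster admins"
singularity --version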

Writing a development container

Here is a container for analysing 16S rRNA data in MicrobiomeDB, which requires the R package DADA2. DADA2 releases to Bioconductor, but we want the ability to load an arbitrary commit for development.

We can build on somebody's container that already has r-devtools, and add our libraries:

Bootstrap: docker
From: zamora/r-devtools

%post
    R --version

    # Helper packages from CRAN
    R --slave -e 'install.packages("data.table", repos="https://cran.rstudio.com/")'
    R --slave -e 'install.packages("optparse", repos="https://cran.rstudio.com/")'

    # DADA2 from GitHub - ref can be a release tag or an arbitrary commit
    R --slave -e 'library("devtools"); devtools::install_github("benjjneb/dada2", ref="v1.14")'

%test
    R --slave -e 'packageVersion("dada2")'

This is how the container gets built and used:

# Building requires root; <the file above> is the definition file we just wrote
sudo singularity build ./our-container.simg <the file above>
singularity exec ./our-container.simg R --slave -e 'packageVersion("dada2")'

singularity exec ./our-container.simg Rscript test.R

It works well enough for testing. The container is 574 MB, so one can build it on a laptop, send it off to a cluster, and run it.
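For instance, moving the freshly built image to the cluster and running it there could look like this (the hostname and target directory are hypothetical):

# Copy the image from the laptop to the cluster (hypothetical host and path)
scp ./our-container.simg user@cluster.example.org:/project/eupathdblab/containers/
# Then, in a cluster session
singularity exec /project/eupathdblab/containers/our-container.simg R --slave -e 'packageVersion("dada2")'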

Problems

This isn't quite enterprise-ready:

  • what version of R is it? This actually depends on when zamora built their r-devtools container, and happens to be 3.6.0 at the time of writing.
  • where do I keep this file?
  • where do I keep the container? Ideally it should be made once and then available to everyone who wants to use it.

Integrating with SingularityHub

SingularityHub is a public resource that can build containers for us if we add the Singularity files to GitHub.

Naming

SingularityHub has a convention: https://singularityhub.github.io/singularityhub-docs/docs/getting-started/naming Our group also has a convention - we keep all code that runs in distributed environments on https://github.com/VEuPathDB/DJob.

This suggests the above file should go somewhere like https://github.com/VEuPathDB/DJob/tree/master/DistribJobTasks/lib/containers. The name of the file needs to start with Singularity. It is going to be how our pipelines will refer to the container, so it should probably include:

  • name of the VEuPathDB project, if it's for a single project
  • name of analysis the container is used for
  • something about the technology inside the container (here, R)

Perhaps: Singularity.MicrobiomeDB-16srRNA-R.

We can integrate SingularityHub with our repositories, so that pushing the file to GitHub will build a container for us. Then we could use our containers like we can already use public containers - compare with the Rocker project's base container:

singularity pull docker://rocker/r-base:3.6.2
singularity exec ./r-base-3.6.2.simg R # A quick R session

singularity pull --name VEuPathDB-MicrobiomeDB-16srRNA-R.simg shub://VEuPathDB/DJob:MicrobiomeDB-16srRNA-R
singularity exec ./VEuPathDB-MicrobiomeDB-16srRNA-R.simg R # an R session including our libraries

Organising images on the cluster

Currently, we can use the cluster from our workflows elsewhere - the workflow software knows how to connect to the cluster, copy files to and from it, and orchestrate jobs. It requires an environment that needs to be prepared as follows:

  • make a user account, install SSH keys, etc.
  • make sure the PATH of our user includes /project/eupathdblab/workflow-software/bin and source code locations like GUS_HOME
  • install third party software
  • copy source code to the cluster

/project/eupathdblab/workflow-software is managed by the whole project. Our code uses the tools there by assuming they are on PATH.
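For example, the cluster user's shell profile might arrange this along the following lines (the GUS_HOME location is hypothetical):

# In the cluster user's ~/.bash_profile
export GUS_HOME=$HOME/gus_home   # hypothetical source code location
export PATH=/project/eupathdblab/workflow-software/bin:$GUS_HOME/bin:$PATH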

If we want our code to call a container through Singularity, we need to satisfy some assumptions (a preflight check is sketched after this list):

  • Singularity is installed and on PATH
  • The right container is somewhere on the cluster, and its location can be known
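A preflight check for these two assumptions could be as simple as this sketch (the image location is hypothetical, and the error handling is ours):

# Fail early if either assumption doesn't hold
command -v singularity >/dev/null || { echo "singularity not on PATH" >&2; exit 1; }
container=/project/eupathdblab/containers/VEuPathDB-MicrobiomeDB-16srRNA-R.simg   # hypothetical location
test -f "$container" || { echo "container image missing: $container" >&2; exit 1; }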

Using a registry

There is a program called sregistry, a registry for Singularity containers: https://singularityhub.github.io/sregistry-cli/

We can install it in /project/eupathdblab/workflow-software and make sure sregistry is on PATH. This would possibly be the last program we need to install. :)
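sregistry is distributed as a Python package, so installing it could look like this sketch (pip install sregistry is the standard route; putting it under workflow-software, and the Python version in the path, are our assumptions):

# Install the sregistry client into the shared software area (a virtualenv would also work)
pip install --prefix /project/eupathdblab/workflow-software sregistry
# A --prefix install also needs the matching site-packages on PYTHONPATH
export PYTHONPATH=/project/eupathdblab/workflow-software/lib/python3.6/site-packages:$PYTHONPATH
export PATH=/project/eupathdblab/workflow-software/bin:$PATH

Once it is on PATH, the registry is used through pull, add, and get: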

# Get a container corresponding to a Singularity file we added to GitHub, and add it to the registry
sregistry pull shub://VEuPathDB/DJob:MicrobiomeDB-16srRNA-R
# Add a local image to the registry
sregistry add --name VEuPathDB-MicrobiomeDB-16srRNA-R dev-container.simg

# From the pipeline code - `sregistry get` prints the local path of a stored image
singularity exec $(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R) Rscript dada2-filterAndTrimFastqs.R $workflowDir/$projectName/input/fastqs

All the moving pieces summarised

  • we write container files, and publish them to GitHub
  • we configure SingularityHub, which builds the containers for us as a service
  • we keep a registry on the cluster, which keeps track of which container names correspond to which image files
  • when we want the code to run a new or different container image, we interact with the registry through pull/add
  • the code we write refers to container names
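Put together, the life cycle of one container might look like this sketch (paths and names follow the examples above):

# 1. Publish the recipe; SingularityHub notices the push and builds the image
git add DistribJobTasks/lib/containers/Singularity.MicrobiomeDB-16srRNA-R
git commit -m 'Add 16S rRNA container recipe'
git push

# 2. On the cluster, fetch the freshly built image into the registry
sregistry pull shub://VEuPathDB/DJob:MicrobiomeDB-16srRNA-R

# 3. Pipeline code only ever refers to the container by name
singularity exec $(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R) Rscript test.R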

Possible future applications and directions

Containers for development

Making a container for a project has some immediate benefits even if the project can't be deployed as one - it keeps us on top of what needs to get installed, and where.

All the third party tools called through containers

If we containerise everything, our commitment to any particular cluster environment can go away completely - switching to a different cluster or cloud provider would be as simple as installing Singularity and sregistry, and fetching the needed containers.
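A new environment could then be bootstrapped with a short script along these lines (a sketch; containers.txt is hypothetical):

# On the new cluster: install the client, then fetch every container we use
pip install --user sregistry
while read -r tag; do
    sregistry pull "shub://VEuPathDB/DJob:$tag"
done < containers.txt   # hypothetical list, one container tag per line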

Containers combined with other forms of dependency management

Perl modules on CPAN and elsewhere frequently have a cpanfile listing modules that need to be installed for the project to work. If we were to containerise our workflows completely, listing Perl modules that need to be installed for each one would be a prerequisite.
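Inside a container recipe's %post, that could look like this sketch (the modules listed are just examples; cpanm --installdeps is the standard way to install from a cpanfile):

# Write a cpanfile listing a workflow's Perl dependencies
cat > cpanfile <<'EOF'
requires 'DBI';
requires 'JSON', '>= 2.90';
EOF
# Install everything it lists
cpanm --installdeps .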

Containers integrated into snakemake rules

A rule can call any script, so it can also call singularity exec $(sregistry get VEuPathDB-MicrobiomeDB-16srRNA-R) Rscript buildErrorModelsFromFastqsFolderUsingDADA2.R $workflowDir/$projectName/input/fastqs or something similar. There's also a bit of built-in support to simplify the syntax: https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#running-jobs-in-containers

I'm not sure how this can work with sregistry, or how snakemake runs the containers. We do want to stay in charge of which container images will be used, and we want to minimise the number of container pulls: pulls are slow, and SingularityHub has a limit - so if snakemake pulls the containers before running the rule, this wouldn't work for us.
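One way to stay in charge is a tiny wrapper that only resolves images already in the registry, which a snakemake rule can call like any other script (a sketch; the wrapper itself is hypothetical):

#!/bin/bash
# run-in-container.sh: run a command inside a container already known to the registry
# Usage: run-in-container.sh <container-name> <command...>
set -e
name=$1; shift
image=$(sregistry get "$name")   # never pulls, just resolves a local path
exec singularity exec "$image" "$@"

A rule's shell command then becomes run-in-container.sh VEuPathDB-MicrobiomeDB-16srRNA-R Rscript ..., and no pull can happen at rule runtime.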
