Please see the usage docs for general usage instructions on running the mag pipeline. This document contains information specific to the Nucleic Acid Observatory (NAO).
Running the pipeline on the MIT Engaging cluster
For simplicity, there is a `slurm_submit` script that can be used to submit pipeline jobs on the Engaging cluster. It automatically loads the required modules and submits the pipeline as a SLURM job. After making any required modifications, the script can be used as follows:
```bash
sbatch slurm_submit
```
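For reference, a submission script of this kind might look something like the sketch below. The module names, job name, and resource requests are illustrative assumptions; check the actual `slurm_submit` script in the repo for the real values.

```bash
#!/bin/bash
#SBATCH --job-name=mag
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=2-00:00:00

# Hypothetical module names -- check the real slurm_submit script and `module avail`
module load singularity
module load java

# ...followed by the `nextflow run` command shown below
```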
If you look into the SLURM submission script (see here), you will see that it is running the pipeline using something like the following command:
```bash
nextflow run main.nf -params-file params/illumina.json -profile engaging -resume
```
Here, the `-params-file` option specifies the file containing all of the (non-default) input parameters, and `-profile` selects the pipeline profile to use. In this case the `engaging` profile is used, which runs the pipeline on the Engaging cluster using Singularity and the SLURM executor to submit jobs. The `-resume` option resumes a previous run from the point where it failed, which is useful if you want to change pipeline parameters and re-run without repeating the steps that already completed.
Specifying input parameters and running the pipeline is described in the usage docs. In the example above, all input parameters are specified in a parameters JSON file; however, there are other ways to specify them, including on the command line (where they take precedence). Similarly, the input samples can be specified in a samplesheet file or on the command line. I would generally recommend using files for both the parameters and the samplesheet, and storing them in this GitHub repo, as this makes it easier to reproduce the results and to share pipeline runs with others.
Therefore, to run the pipeline, specify these two input files (illustrative sketches of both follow this list):

- Parameters (`-params-file`): contains the input parameters for the pipeline (see `illumina.json` for an example).
- Samplesheet (`--input`): contains paths to the input FASTQ files for each sample (see `exp4.006_samplesheet.csv` for an example). Either local paths or remote URLs/S3 paths can be used. Remote files are downloaded to the local work directory (using the defined AWS credentials if required) before being processed by the pipeline.
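As a rough illustration only, the two files look something like the sketches below. The parameters file will normally set many more options than the two shown here, and the samplesheet columns follow the upstream nf-core/mag convention and may differ in this fork, so treat `illumina.json` and `exp4.006_samplesheet.csv` in the repo as the authoritative references.

```json
{
    "input": "data/exp4.006_samplesheet.csv",
    "outdir": "results/exp4.006"
}
```

```csv
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,s3://my-bucket/sample1_R1.fastq.gz,s3://my-bucket/sample1_R2.fastq.gz,
sample2,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,
```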
Sidenote: parameters with a single dash (`-`) are Nextflow parameters, whereas parameters with a double dash (`--`) are pipeline input parameters.
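For example, in the command below `-profile` and `-resume` are consumed by Nextflow itself, whereas `--input` and `--outdir` are passed to the pipeline as input parameters (`--outdir` is the standard nf-core output-directory parameter, used here purely for illustration):

```bash
nextflow run main.nf -profile engaging -resume --input data/exp4.006_samplesheet.csv --outdir results
```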
The pipeline has not yet been tested on AWS, but it should be possible to run it in a similar way to that described above, using the `docker` profile instead of `engaging`. The pipeline can also be run on AWS Batch using the `awsbatch` profile. See the following documentation for an example of how this could be done using Nextflow Tower.
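A hypothetical AWS Batch invocation might look like the following; the S3 paths are placeholders, and the `awsbatch` profile will additionally need a Batch queue and AWS region configured (check the pipeline's configuration files for the exact settings):

```bash
nextflow run main.nf \
    -params-file params/illumina.json \
    -profile awsbatch \
    -work-dir s3://<your-bucket>/mag-work \
    --outdir s3://<your-bucket>/mag-results
```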
For some useful tips on debugging nf-core pipelines, see the nf-core troubleshooting docs. One of the most useful is to go to the `work` directory of a failed process and look at its `.command.{out,err,log,run,sh}` files: they contain the task's output and the exact command that was run, which makes it quick to replicate and fix the error iteratively.
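For example, given a failed task whose work-directory hash is reported in the Nextflow error message (the `ab/123456` path below is illustrative):

```bash
cd work/ab/123456*        # work directory of the failed task
cat .command.err          # stderr from the task
cat .command.sh           # the exact command that was executed
bash .command.run         # re-run the task (including environment/container setup)
```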
For debugging, it is recommended to use the `-resume` parameter described above (personally, I use it for all runs).
For all major runs, the input parameters and samplesheets can be found in the `params` and `data` directories respectively. These currently include the following runs:
| Experiment | Description | Samplesheet | Parameters | AWS S3 Results |
|---|---|---|---|---|
| exp4.006 | Initial NAO-generated Illumina data | `exp4.006_samplesheet.csv` | `illumina.json` | `s3://nao-illumina-private/exp4.006/mag_results` |
| Rothman HTP | Public wastewater dataset from Rothman et al. for unenriched samples from the HTP site | `rothman_htp_samplesheet.csv` | `rothman_htp.json` | `s3://nao-phil-public/mag/results_rothman_htp` |
Nextflow pipelines consist of two main file types:

- `.nf` files that contain the pipeline code:
  - `main.nf` is the main pipeline file that is executed when the pipeline is run; it can be thought of as a wrapper script/entry point.
  - `mag.nf` contains the main workflow and logic for the pipeline, including loading the processes, specifying the order in which they are executed, and the channels that load the input data and pass it between processes.
  - `modules/**.nf` contain the individual modules that are used in the pipeline.
- `.config` files that contain the pipeline configuration:
  - `nextflow.config` contains the default pipeline configuration for all parameters and profiles.
  - `conf/**.config` contain additional pipeline configuration:
    - `base.config` contains the configuration for the base profile (enabled by default), which specifies the resources and error strategy for each process.
    - `modules.config` contains the per-module configuration, including extra arguments for the tools, and specifies which output files get copied to the results directory (see the sketch after this list).
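For illustration, a `modules.config` entry typically follows the standard nf-core pattern sketched below; the process name, extra argument, and publish path are examples, not necessarily what this pipeline uses.

```groovy
process {
    withName: 'FASTQC' {
        // Extra arguments appended to the tool's command line
        ext.args = '--quiet'
        // Which outputs get copied to the results directory, and where
        publishDir = [
            path: { "${params.outdir}/fastqc" },
            mode: params.publish_dir_mode,
            pattern: '*.html'
        ]
    }
}
```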
To view the differences between the original nf-core/mag pipeline and the NAO version, you can compare the two GitHub repos. The main changes are also described in the CHANGELOG.
See the geNomad PR for an example of how to add a new module to the pipeline.
In summary, you need to:
- Create a new module file in the `modules` directory (a minimal sketch follows this list)
- Import the module in the `mag.nf` workflow and connect the input and output channels
- Add the module configuration to the `modules.config` configuration file
- Add any additional parameters to the `nextflow.config` file (and the `nextflow_schema.json` file if they are input parameters)
- Add tests for the module to the `ci.yml` file (as well as an additional test profile to the relevant files if necessary)
- Update the documentation
- Update the CHANGELOG
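As a rough sketch of what a new module file might look like (the tool name, container, and input/output files below are hypothetical; see the geNomad PR and the existing files under `modules/` for the conventions this pipeline actually follows):

```groovy
process MYTOOL {
    tag "$meta.id"
    label 'process_medium'

    // Hypothetical container -- use the appropriate biocontainer for the real tool
    container 'quay.io/biocontainers/mytool:1.0.0'

    input:
    tuple val(meta), path(assembly)

    output:
    tuple val(meta), path("*.tsv"), emit: results
    path "versions.yml"           , emit: versions

    script:
    def args = task.ext.args ?: ''
    """
    mytool run $args --input $assembly --output ${meta.id}.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        mytool: \$(mytool --version)
    END_VERSIONS
    """
}
```

The module would then be imported into `mag.nf` with an `include` statement and its input/output channels connected, as described in the list above.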