Skip to content

Analysis pipeline for capture-based viral metagenomics projects.

License

Notifications You must be signed in to change notification settings

CompGenLabUB/CAPTVRED

Repository files navigation

CAPTVRED

CAPTVRED PIPELINE is designed to analyze viral metagenomics datasets from Target Enrichment Sequencing (or Capture-based Metagenomics). This pipeline provides an analysis for viral identification through alignment, assembly, and taxonomic classification of the sequenced reads. The analyses focus on the set of species of interest, for which the dataset has been enriched, and other related sequences from the same taxonomic family. In the following lines we will refer to the set of genomic sequences of interest as Viral Candidates.

Getting started:

Before running the pipeline, the file system must be prepared as follows:

Prepare the files

A) Viral Candidates fasta:

A fasta file containing the sequences of viral candidates. It is assumed that capture probes were designed based on this set of genomic sequences, however, any set of sequences of interest will be appropriate to include. It must be a gzipped fasta. The sequence headers must contain the identifier code followed by a space, after the space any other information can be added if desired.

B) Samples description tabular file:

A template for this tabular file is provided (samples_definition_template.sh). It must be completed entering one sample per row. Additional metadata fields can be added if appropriate.

Samples description details This tabular file has two required fields:
  • Sample ID: This is the identifier that the pipeline will use to name all files and in the final report.

  • IlluminaID: This is the file name prefix containing raw data. Suffix of the samples indicating R1 and R2 files can be modified in the nextflow.config file or in the commandline using --R1 [default: R1_001] and --R2 [default: R2_001]. The rest of the fields proposed in the template are recommendations and will probably be used in future versions of the workflow.

C) Sequenced fastq files:

All sequenced fastq files must be placed (or linked) in the same directory, the IDs must correspond to the ones in the first column of the sample description files.

Get the pipeline

Pipeline can be downloaded via github clone repository:

git clone https://github.com/CompGenLabUB/CAPTVRED.git

or via nextflow pull command:

nextflow pull CompGenLabUB/CAPTVRED

Conda environment:

When running the pipeline, Nextflow will automatically create and activate a conda environment containing all programs and features required. If Conda software is not installed, follow the installation instructions provided in the documentation.

Important

Conda can be considerably slow in resolving the package dependencies. Make sure to install 22.11 or later version of conda and to activate the libmamba package as solver ( conda config --set solver libmamba) to resolve the environment notably faster. For more information see: https://www.anaconda.com/blog/a-faster-conda-for-a-growing-community

Details of the provided conda environment The main programs installed in conda environment are described here:
Program Version Channel
perl latest defaults
python 3.9.2 defaults
biopython latest conda-forge
bbmap latest bioconda
fastqc latest bioconda
multiqc latest bioconda
bowtie2 latest bioconda
samtools latest bioconda
seqkit latest bioconda
megahit latest bioconda
spades latest bioconda
blast latest bioconda
gawk latest conda-forge
kaiju 1.9.0 bioconda
r-base 4.0.5 r
r-ggplot2 latest r
r-tidyverse latest r
r-plyr latest r
r-gridExtra latest r
bioconductor-rtracklayer latest bioconda
bioconductor-GenomicFeatures latest bioconda
bioconductor-Rsamtools latest bioconda
bioconductor-GenomicAlignments latest bioconda
bioconductor-VariantAnnotation latest bioconda
bioconductor-ggbio latest bioconda

Create a project directory

The pipeline and all related files will be placed in this location.

mkdir -vp MYPROJECT
cd MYPROJECT

Set Up

nextflow -C init_nextflow.config                        \
         run  init_main.nf                              \
         --set_seqs /path/to/Viral_candidates_fasta.gz  \
         --setname "MY_VIRAL_CANDIDATES"

Note

Some considerations:

  • Please ensure to lauch this command from the project directory.
  • This step might take some time since it needs to download and process the reference database.
More details about the references setup

This module prepares the file setup and databases to run the pipeline afterward. In the case of databases, the main steps are:

  1. Database download: Viral reference database (RVDB) most recent version is downloaded. If the database is already downloaded in the desired location it will not be downloaded again. This step can be forced by using the flag: --db_update.
  2. Merge database with Viral Candidate sequences: In this step, sequences from the viral candidates set that are not present in the reference database are included. If the merge has been done previously it will not be repeated. The merge can be forced again by using the flag --merge_update.
  3. Split datbase: The full database is split into a subset containing only the sequences classified in the families of interest. Another subset containing the rest of the species is created at the same time. If the subset is already created it will not be created again. It can be forced by using the flag --dbsplit_update .

Note that by redoing any of the described flags, all the downstream steps will be repeated as well. Thus, by activating the flag --db_update, --merge_update is automatically activated; and by activating --merge_update, --dbsplit_update is activated as well. If the run is interrupted for any reason, remember that the -resume nextflow option will restart the pipeline from where it left off in the previous execution.

Run CAPTVRED:

cd CAPTVRED
nextflow main.nf
          --samp /path/to/samples_definition.tbl       \
          --fastq_dir /path/to/fastq/files/directory   \
          --runID "RUN_ID"

Optional parameters of the CAPTVRED pipeline:

--assembler upper case string. Available options are MEGAHIT(default) and METASPADES.
--NCPUS integer. Default value is 32.

All available options to modify pipeline parameters are described in the documentation.

Optional nextflow parameters of interest:

-resume Execute the script using the cached results, useful to continue executions that were stopped by an error.
-entry  Entry workflow name to be executed.

All allowed commands can be found in: Nextflow documentation (https://www.nextflow.io/docs/latest/index.html) > Command line inteface(CLI) > Commands > run

Run with SLURM

If you are interested in running the pipeline in a HPC cluster, a SLURM profile template (user_slurm.config) is available in the repository. The template can be modified according to your cluster characteristics. To run the pipeline using SLURM maneger -profile slurm and -params-file user_slurm.config can be added to both commands, setup and main pipeline.

Output:

CAPTVRED produces an HTML report summarizing key findings to facilitate the visualization and interpretation of the results. From this page, the user can access the quality and the computational performance reports. It also includes summary tables for metadata and sequence recovery. For each sample three tables are generated for viral assignations:

  • Read level:
READ_ID TAG READ_LENGTH BESTHIT_LEN BESTHIT_COVERAGE REFSEQ_ID KAIJU_SCORE NREADS_MAPED TAXONID SPECIES FAMILY
k127_80 B 2189 2189 100.000 NC_002023.1 NA 3178 11320 Influenza_A_virus Orthomyxoviridae
k127_67 B 5731 5731 100.000 AF208067.1 NA 10819 694005 Murine_coronavirus Coronaviridae
k127_95 B 10571 10571 100.000 NC_001474.2 NA 19942 12637 Dengue_virus Flaviviridae
k127_19 B 1608 1608 100.000 FJ390061.2 NA 2016 11320 Influenza_A_virus Orthomyxoviridae
k127_23 B 1983 1983 100.000 KX377335.1 NA 3323 64320 ZIKV Flaviviridae
k127_111 B 2081 2081 100.000 KJ633807.1 NA 2962 11320 Influenza_A_virus Orthomyxoviridae
k127_74 B 425 425 100.000 KJ633811.1 NA 275 11320
Fields description:     → READ_ID: Uniq identifier for each read/contig.
    → TAG: B for blastn, T for tblastx and K for kaiju.
    → READ_LENGTH: Read or contig length in bp.
    → BESTHIT_LEN: Length of the best hit.
    → BESTHIT_COVERAGE: Coverage of the best hit.
    → REFSEQ_ID: Assignation specie sequence id.
    → KAIJU_SCORE: Score reported by kaiju NA if blastn (default) option is running.
    → NREADS_MAPED: Number of raw reads mapped to this seqid.
    → TAXONID: Sequence taxon id.
    → SPECIES: Species name.
    → FAMILY: Species family taxonomic classification.
  • Sequence level:
SEQUENCE_ID TAXON_ID SPECIES FAMILY SEQ_LENGTH NUCS_ALN COVERAGE_PCT BHIT_IDENTITY BESTHSP_COUNT NREADS_MAPED
NC_001474.2 12637 Dengue_virus Flaviviridae 10723 10571 98.582 100.000 1 19942
KJ633811.1 11320 Influenza_A_virus Orthomyxoviridae 1027 850 82.765 100.000 2 550
NC_007357.1 11320 Influenza_A_virus Orthomyxoviridae 2341 2189 93.507 100.000 1 3178
DQ415901.1 290028 Human_CoV/HKU1 Coronaviridae 30097 29945 99.495 99.588 2 58330
OK017853.1 2833184 Sarbecovirus_sp. Coronaviridae 29369 29369 100.000 95.796 1 57998
AF255742.1 11320 Influenza_A_virus Orthomyxoviridae 1565 1411 90.160 99.929 1 1610
  • Species level:
TAXON_ID SPECIES FAMILY MIN_COV MAX_COV MEAN_COV MIN_PID MAX_PID MEAN_PID N_SEQS N_CONTIGS NREADS_MAPED INFO
1335626 Middle_East_respiratory_syndrome-related_coronavirus Coronaviridae 99.393 99.393 99.393 99.349 99.349 99.349 1 1 58734 SEQ:MK129253.1,LEN:30150,NUCALN:29967,COV:99.393,BHIDENT:99.349,N:1
694009 Severe_acute_respiratory_syndrome-related_coronavirus Coronaviridae 99.492 99.492 99.492 99.946 99.946 99.946 1 1 116308 SEQ:NC_045512.2,LEN:29903,NUCALN:29751,COV:99.492,BHIDENT:99.946,N:1
694000 Miniopterus_bat_coronavirus_1 Coronaviridae 99.463 99.463 99.463 100.000 100.000 100 1 1 55148 SEQ:NC_010437.1,LEN:28326,NUCALN:28174,COV:99.463,BHIDENT:100.000,N:1
37124 Chikungunya_virus Togaviridae 97.966 97.966 97.966 99.734 99.734 99.734 1 1 22148 SEQ:MG280943.1,LEN:11896,NUCALN:11654,COV:97.966,BHIDENT:99.734,N:1
694014 Avian_coronavirus Coronaviridae 99.352 99.352 99.352 99.993 99.993 99.993 1 1 53712 SEQ:AJ311317.1,LEN:27635,NUCALN:27456,COV:99.352,BHIDENT:99.993,N:1
2496529 Mengla_virus Filoviridae 99.169 99.169 99.169 100.000 100.000 100 1 2 35000 SEQ:NC_055510.1,LEN:18300,NUCALN:18148,COV:99.169,BHIDENT:100.000,N:2

Data:

The data for the test set is provided from CAPTVRED_testset.tar.gz file (direct link from table below). The folder contains 15 test samples (3 real metagenomics samples and 12 synthetic samples), and the script used for data generation.

The data used for the assessment of the PANDEVIR capture panel is available from PANDEVIR_assess_testset.tar.gz file (direct link from table below). The folder contains the raw reads for all the samples and a tabular file with the samples name relation.

FILE Size MD5SUM
CAPTVRED_testset.tar.gz 13G 1c8a0ca35740e4e705a85455fd8225b8
PANDEVIR_assess_testset.tar.gz 93G ce96f699e8d9fc08cf2169b3fad281c5

The linked files are quite big, thus we recommend to download them using wget command as shown in the following example, or any other download specialized tool (like, for instance, curl or FileZilla).

wget https://compgen.bio.ub.edu/datasets/CAPTVRED/CAPTVRED_testset.tar.gz
wget https://compgen.bio.ub.edu/datasets/CAPTVRED/>PANDEVIR_assess_testset.tar.gz

We also provide the corresponding md5sum checksums to ensure that those files were properly downloaded from the repository.