One question that came up in a recent team meeting about the refactored v2 pipeline is how higher-level taxa are handled during human-virus identification.
+As a reminder, the process for HV identification currently looks like this:
+-
+
- Preprocessed read pairs are mapped to a database of HV genomes with Bowtie2, retaining any read that meets a fairly low alignment-score threshold. +
- Surviving read pairs are mapped to human, cow, and assorted other contaminant sequences with Bowtie2 and BBMap, discarding any read that maps to any contaminant. +
- Surviving read pairs are merged into single sequences with BBMerge, then passed to Kraken2, which performs taxonomic assignment using the Standard database. +
- Based on the Kraken2 assignments, reads are classified into those that are (1) assigned to a human-infecting virus taxon with Kraken, (2) assigned to a non-HV taxon with Kraken, or (3) not assigned to any taxon. All reads in category (2) are filtered out. +
- Surviving reads are assigned HV status if they (a) are given HV assignments by both Bowtie2 and Kraken2, or (b) are unassigned by Kraken and align to an HV taxon with Bowtie2 with an alignment score above a user-specified threshold (typically 20). +
This all works well for reads that are assigned to a taxon that is entirely composed of human-infecting viruses. The question is, what about reads that Kraken assigns to taxa that are only partially human-infecting? For example, the virus family Coronaviridae (taxid 11118) contains several coronaviruses that infect humans (e.g. SARS-CoV-2) and many that do not (e.g. assorted bat coronaviruses). What would happen to a read that was assigned by Kraken2 to this taxid?
+The key process here is PROCESS_KRAKEN_HV
(modules/local/processKrakenHV/main.nf
), which calls bin/process_kraken_hv.py
on the output of Kraken2. This script:
-
+
- Imports a TSV of human-infecting virus taxa (generated by
FINALIZE_HUMAN_VIRUS_DB
in the index workflow) as well as the NCBI taxonomy tree structure (generated byEXTRACT_NCBI_TAXONOMY
in the index workflow). Importantly, this initial HV TSV does not include higher taxa.
+ - Iterates line-by-line over the readwise Kraken2 output as follows:
+
-
+
- Extract the name and taxid assignment for the read. +
- Check if the taxid is in the list of HV taxids; if so, return
True
.
+ - If not, get the taxid’s parent taxid from the tree structure and go to (b). +
- Repeat (b) and (c) until an HV taxid is found or the taxid being screened is 0 (unassigned), 1 (root) or 2 (Bacteria); in the latter case, return
False
.
+
+ - Save the read’s HV status, along with other information, to an output file. +
As such, the script checks whether the assigned taxid or any of its ancestors are HV taxa, but not whether any of its descendents are. Reads that are assigned to higher-level taxa will thus be treated as though they were assigned to a non-HV taxa, and filtered out during HV read identification.
+This seems suboptimal. That said, it’s not obvious what the correct behavior is here; treating reads assigned to partially-HV taxa as HV comes with its own problems1. Someone should think more about what the right approach is here.
+ + + + + + +Footnotes
+ +-
+
In particular, since the Bowtie2 database used for initial screening only contains HV genomes, this approach would likely lead to closely-related non-human-infecting virus reads being classified as HV.↩︎
+