v2.5.2

  • Changes to default read filtering (see the first sketch after this list):
    • Relaxed FASTP quality filtering (--cut_mean_quality and --average_qual reduced from 25 to 20).
    • Relaxed BBDUK viral filtering (a read now needs a single 24-mer match rather than three 21-mer matches).
  • Overhauled BLAST validation functionality:
    • BLAST now runs on forward and reverse reads independently.
    • BLAST output filtering no longer assumes specific filename suffixes.
    • Paired BLAST output includes more information.
    • RUN_VALIDATION can now take FASTA files directly as input, instead of a virus read DB.
    • Fixed issues with publishing BLAST output under the new Nextflow version.
  • Implemented nf-test for end-to-end testing of pipeline functionality (see the second sketch after this list):
    • Implemented test suite in tests/main.nf.test.
    • Reconfigured INDEX workflow to enable generation of miniature index directories for testing.
    • Added GitHub Actions workflow in .github/workflows/end-to-end.yml.
    • Pull requests will now fail if any of INDEX, RUN, or RUN_VALIDATION crashes when run on test data.
    • Generated the first version of a new, curated test dataset for testing the RUN workflow. The samplesheet and config file are available in test-data; the previous test dataset in test has been removed.
  • Implemented S3 auto-cleanup (see the third sketch after this list):
    • Added tags to published files to facilitate S3 auto-cleanup.
    • Added an S3 lifecycle configuration file to ref, along with a script in bin that applies it to an S3 bucket.
  • Minor changes:
    • Added logic to check that the grouping variable in nextflow.config matches the input samplesheet; if it doesn't, the pipeline throws an error.
    • Externalized resource specifications to resources.config, removing hardcoded CPU/memory values.
    • Renamed index-params.json to params-index.json to avoid a clash with GitHub Actions.
    • Removed a redundant subsetting statement from the TAXONOMY workflow.
    • Added a --group_across_illumina_lanes option to generate_samplesheet.
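
For the read-filtering change above, here is a minimal sketch of the new settings, assuming standalone fastp and BBDuk invocations; the file names and remaining flags are illustrative, not the pipeline's actual module code:

```bash
# Relaxed quality filtering: both thresholds lowered from 25 to 20.
fastp --in1 reads_1.fastq.gz --in2 reads_2.fastq.gz \
      --out1 filt_1.fastq.gz --out2 filt_2.fastq.gz \
      --cut_right --cut_mean_quality 20 --average_qual 20

# Relaxed viral filtering: a read now needs a single 24-mer match to the
# viral reference (previously three 21-mer matches) to be retained.
bbduk.sh in=filt_1.fastq.gz in2=filt_2.fastq.gz \
         outm=viral_1.fastq.gz outm2=viral_2.fastq.gz \
         ref=viral-genomes.fasta k=24 minkmerhits=1
```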
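
The new test suite can be run locally with the nf-test CLI; this sketch assumes nf-test is installed and mirrors what the GitHub Actions workflow presumably invokes:

```bash
# Run the end-to-end suite (exercises INDEX, RUN, and RUN_VALIDATION
# against the curated test data in test-data).
nf-test test tests/main.nf.test

# Or run every test nf-test can discover in the repository.
nf-test test
```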
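
For S3 auto-cleanup, the lifecycle configuration can be attached to a bucket roughly as follows; the bucket name and the path ref/lifecycle.json are placeholders standing in for the actual file in ref and the helper script in bin:

```bash
# Attach lifecycle rules (which expire tagged intermediate files)
# to the bucket that receives pipeline output.
aws s3api put-bucket-lifecycle-configuration \
    --bucket my-pipeline-bucket \
    --lifecycle-configuration file://ref/lifecycle.json
```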

v2.5.1

  • Enabled extraction of the raw reads corresponding to BBDuk's putatively host-viral subset, for downstream chimera detection.
  • Added back viral read fields that were accidentally being discarded by COLLAPSE_VIRUS_READS.

v2.5.0

  • Reintroduced user-specified sample grouping and concatenation (e.g. across sequencing lanes) for deduplication in PROFILE and EXTRACT_VIRAL_READS.
  • Generalized the pipeline to detect viruses infecting arbitrary host taxa (not just human-infecting viruses), as specified by ref/host-taxa.tsv and config parameters (see the sketch after this list).
  • Configured the index workflow to enable hard-exclusion of specific virus taxa (primarily phages) from being marked as infecting host taxa of interest.
  • Updated pipeline output code to match changes made in the latest Nextflow update (24.10.0).
  • Created a new script bin/analyze-pipeline.py to analyze pipeline structure and identify unused workflows and modules.
  • Cleaned up unused workflows and modules made obsolete in this and previous updates.
  • Moved module scripts from bin to module directories.
  • Modified trace filepath to be predictable across runs.
  • Removed addParams calls when importing dependencies (deprecated in the latest Nextflow update).
  • Switched from nt to core_nt for BLAST validation.
  • Reconfigured QC subworkflow to run FASTQC and MultiQC on each pair of input files separately (fixes bug arising from allowing arbitrary filenames for forward and reverse read files).
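
As a sketch of the new host-taxon configuration: the column layout of ref/host-taxa.tsv below is hypothetical (the real file ships with the repo and may differ), but the idea is to list the host taxa, by NCBI taxid, whose infecting viruses the pipeline should flag:

```bash
# Hypothetical contents of ref/host-taxa.tsv (tab-separated):
# one host taxon per line, identified by name and NCBI taxid.
printf 'name\ttaxid\nhuman\t9606\nbovine\t9913\n' > ref/host-taxa.tsv
```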

v2.4.0

  • Created a new output directory, logging, for pipeline log files.
  • Added Nextflow's trace file to the logging directory; it records CPU usage, memory usage, runtime, and other per-process information. After running the pipeline, plot-timeline-script.R can be used to generate a summary plot of the runtime of each process in the pipeline.
  • Removed CONCAT_GZIPPED.
  • Replaced the sample input format with an nf-core-style samplesheet.csv. This new input file can be generated using the script generate_samplesheet.sh (see the first sketch after this list).
  • Now run deduplication on paired-end reads using Clumpify in the taxonomic workflow (see the second sketch after this list).
  • Added fragment length and deduplication analyses:
    • BBTools: extract the fragment length and the number of duplicates from the taxonomic workflow and add them to hv_hits_putative_collapsed.tsv.gz.
    • Bowtie2: conduct a duplication analysis on the aligned reads, then add the number of duplicates and the fragment length to hv_hits_putative_collapsed.tsv.gz.
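
A sketch of the new samplesheet format: since it is described as nf-core-like, the columns below follow the common nf-core convention (sample, fastq_1, fastq_2) and are an assumption rather than the pipeline's documented schema:

```bash
# Hypothetical samplesheet.csv in the nf-core style; in practice this file
# is produced by generate_samplesheet.sh rather than written by hand.
cat <<'EOF' > samplesheet.csv
sample,fastq_1,fastq_2
sample1,raw/sample1_R1.fastq.gz,raw/sample1_R2.fastq.gz
sample2,raw/sample2_R1.fastq.gz,raw/sample2_R2.fastq.gz
EOF
```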
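
And a minimal sketch of the Clumpify deduplication step, assuming BBTools is on the PATH; the file names are illustrative and the pipeline's actual flags may differ:

```bash
# Deduplicate paired-end reads. dedupe removes duplicates; addcount
# appends the copy count to each surviving read's name, which downstream
# steps can fold into hv_hits_putative_collapsed.tsv.gz.
clumpify.sh in=reads_1.fastq.gz in2=reads_2.fastq.gz \
            out=dedup_1.fastq.gz out2=dedup_2.fastq.gz \
            dedupe addcount
```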

v2.3.3

  • Added a workflow for post-hoc BLAST validation of putative HV reads.

v2.3.2

  • Fixed subsetReads to run on all reads when the number of reads per sample is below the set threshold.

v2.3.1

  • Clarifications to documentation (in README and elsewhere)
  • Re-added "joined" status marker to reads output by join_fastq.py

v2.3.0

  • Restructured run workflow to improve computational efficiency, especially on large datasets
    • Added preliminary BBDuk masking step to HV identification phase
    • Added read subsampling to profiling phase
    • Deleted ribodepletion and deduplication from preprocessing phase
    • Added riboseparation to profiling phase
    • Restructured profiling phase output
    • Added addcounts and passes flags to deduplication in HV identification phase
  • Parallelized key bottlenecks in index workflow
  • Added custom suffix specification for raw read files
  • Assorted bug fixes

v2.2.1

  • Added specific container versions to containers.config
  • Added version & time tracking to workflows
  • Added index reference files (params, version) to run output
  • Minor changes to default config files

v2.2.0

  • Major refactor
  • Start of changelog