Adding single-read functionality to RAW and CLEAN #80

simonleandergrimm · 2024-10-28T16:45:36Z

This PR adds support for single-read (single-end) sequencing data to the RAW and CLEAN stages of the pipeline while maintaining existing paired-end functionality, so both kinds of data can be processed with the same workflow infrastructure.

Key Changes

Amended generate_samplesheet.sh so that it also accepts single-read data.
Added the run_dev_se.nf workflow, where single-read functionality will be developed until every step of the pipeline supports single-read data. At that point, run_dev_se.nf can replace run.nf.
Added a read_type parameter ("single_end" or "paired_end") in run_dev_se.config to control pipeline behavior.
Split the FASTP process into FASTP_SINGLE and FASTP_PAIRED variants.
Split the TRUNCATE_CONCAT process into _SINGLE and _PAIRED variants.
Edited the RAW, CLEAN, QC, and HV_SCREEN subworkflows to call either the single-end or the paired-end version of each process.
- HV_SCREEN was edited only because it would otherwise fail to identify the fastp process correctly.
Updated bin/summarize-multiqc-single.R so that it takes a read-type variable, which triggers if/else branches throughout the script to change data processing accordingly.
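
The read-type branching described above can be illustrated with a small, hedged sketch. This is not the R code from bin/summarize-multiqc-single.R: the function name `summarize_read_counts`, the row fields `sample`, `read_pair`, and `n_reads`, and the choice to sum counts across mates are all assumptions made for illustration. Only the two `read_type` values come from the PR description.

```python
from typing import Dict, List

VALID_READ_TYPES = ("single_end", "paired_end")


def summarize_read_counts(rows: List[dict], read_type: str) -> Dict[str, int]:
    """Return a per-sample read count, branching on read_type.

    Hypothetical input: one dict per FASTQ file with keys "sample" and
    "n_reads" (plus "read_pair" for paired-end data). These field names
    are illustrative assumptions, not the real script's columns.
    """
    if read_type not in VALID_READ_TYPES:
        raise ValueError(f"read_type must be one of {VALID_READ_TYPES}, got {read_type!r}")

    totals: Dict[str, int] = {}
    for row in rows:
        sample = row["sample"]
        if read_type == "single_end":
            # Single-end: exactly one file per sample, nothing to combine.
            if sample in totals:
                raise ValueError(f"duplicate row for single-end sample {sample!r}")
            totals[sample] = row["n_reads"]
        else:
            # Paired-end: combine values across the two mates of a sample.
            totals[sample] = totals.get(sample, 0) + row["n_reads"]
    return totals
```

Keeping one function with an explicit branch, rather than two copies of the script, mirrors the decision in this PR to amend the existing summary script instead of maintaining separate single and paired versions.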

Testing

I added test directories with example data for both the single-end and paired-end cases:

test-single-read/ - Contains single-end test data and configuration
test-paired-end/ - Contains paired-end test data and configuration

I validated the pipeline changes in this notebook: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/

… pair information for single-read data. Also dropped some code that combines values across read pairs, for single-read data. I dropped the renaming of tab_tsv to tab_tsv_2 for paired-end data, so I didn't have to create two different versions of the combine step at the end of the subscript:

```
tab <- tab_json %>% inner_join(tab_tsv, by="sample")
```

harmonbhasin

Given that we'll have new tests, the test directory shouldn't matter too much, but I would restructure it for now, or just wait until we push the new tests.

simonleandergrimm · 2024-11-24T02:19:11Z

I've addressed the comments. @harmonbhasin please take another look. FWIW, after a positive review by @willbradshaw, I'd be keen to pull this into dev. Keeping this feature branch alongside dev for prolonged periods creates additional work, as I need to update the PR whenever dev changes.

The testing directory for the single-end work needs to be kept separate for now. One can't run run.nf on single-end data right now as the processing steps after RAW/CLEAN and PROFILE aren't ready to handle single-end data and will fail. So, as a temporary measure, until full single-end functionality is implemented, I've created separate run_dev_se.config and run_dev_se.nf files that handle the pipeline from RAW to PROFILE only. Since we have this separate run_dev_se, we also need a separate testing directory with a config file that executes the workflow. @harmonbhasin happy to discuss how this fits with your incoming testing regime.

All of this doesn't affect dev runs using run.nf, as explained in this notebook, which I updated to show results from the newest version of the branch: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/

harmonbhasin · 2024-11-25T19:02:51Z

bin/generate_samplesheet.sh

I'm not sure if you're interested in this, but if you want to turn this script into python, I wouldn't be mad lol
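
As a starting point for that idea, here is a minimal Python sketch of samplesheet generation that handles both read types. It is not a port of bin/generate_samplesheet.sh: the function names, the `single_end` flag, the file suffix defaults, and the column names (`sample`, `fastq`, `fastq_1`, `fastq_2`) are assumptions for illustration, and the real script's options and output format may differ.

```python
import csv
import io
import os
from typing import Iterable, List


def build_samplesheet(paths: Iterable[str], single_end: bool,
                      forward_suffix: str = "_1.fastq.gz",
                      reverse_suffix: str = "_2.fastq.gz",
                      single_suffix: str = ".fastq.gz") -> List[dict]:
    """Build samplesheet rows from a list of FASTQ paths.

    Suffixes and column names are hypothetical defaults.
    """
    paths = sorted(paths)
    rows: List[dict] = []
    if single_end:
        # One FASTQ file per sample.
        for path in paths:
            if path.endswith(single_suffix):
                sample = os.path.basename(path)[: -len(single_suffix)]
                rows.append({"sample": sample, "fastq": path})
        return rows

    # Paired-end: every forward file must have a matching reverse file.
    available = set(paths)
    for path in paths:
        if not path.endswith(forward_suffix):
            continue
        mate = path[: -len(forward_suffix)] + reverse_suffix
        if mate not in available:
            raise ValueError(f"missing reverse read file for {path}")
        sample = os.path.basename(path)[: -len(forward_suffix)]
        rows.append({"sample": sample, "fastq_1": path, "fastq_2": mate})
    return rows


def to_csv(rows: List[dict]) -> str:
    """Serialize rows to CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_csv(build_samplesheet(["raw/B.fastq.gz"], single_end=True))` yields a two-line CSV with a `sample,fastq` header. Listing the files (for instance from S3) is deliberately left out so the pairing logic can be tested without network access.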

configs/run.config

test-dev-se/nextflow.config

test-dev-se/paired-end-samplesheet.csv

test/nextflow.config

harmonbhasin

This looks good to me. I'm currently testing out some last things with the test workflow, but if things work out well, you should be able to integrate the testing framework tomorrow.

simonleandergrimm · 2024-11-29T02:42:49Z

@willbradshaw Pending your review this is ready to go in.

simonleandergrimm added 30 commits October 21, 2024 19:19

Adding single read option to raw/main.nf

15354f6

Adding WIP version of run.nf to enable testing raw and clean versions…

ad2115d

… for single read analyses.

Created separate versions of summarize-multiqc-single.R and summarize…

03ee37a

…-multiqc-paired.R

Split processes in fastp to a single read and paired-end read version.

b517340

Split processes in MultiQC to a single read and paired-end read versi…

01ea0c5

…on. Renamed Multiqc to not be confusing regardings its naming as "Single"

Deleted summarizeMultiqcSingle, which was superseded by summarizeMultiqc

ad8faf9

Split processes in truncateConcat to a single read and paired-end rea…

ef0e9c8

…d version

Created a single_end if clause in Clean to either use the single read…

2535ccd

… or paired-end read version of fastp

Created a single_end if clause in hv_screen to either use the single …

cbcb109

…read or paired-end read version of fastp

Created a single_end if clause in qc to either use the single read or…

c7f8c83

… paired-end read version of SUMMARIZE_MULTIQC

Renamed test dir to test-paired-end. Added clause in nextflow.config …

ff0a8be

…for paired-end vs single read runs.

Edited gitignore to leave out test-paired-end and test-single-read ru…

6048dd3

…n outputs

Fixed name of test-single-end dir to test-single-read

92270e5

Created a version of test dir that allows the run of single-read data.

b13ac94

Added script to quickly download the s3 output of test single read an…

dff2302

…d paired end read runs

Added nextflow config for test paired and test single read.

64bb7f4

Fixed if clause in main.nf

5bd1aec

Updated gen samplesheet scripts to pull in data from s3://nao-mgs-sim…

c8fd3ac

…on/*/raw

Updated gitignore

578fde0

Activated CLEAN subworkflow in run.nf

59218b9

Starting to adapt Will's https://data.securebio.org/wills-public-note…

fd9dc1e

…book/notebooks/2024-10-17_crits-christoph-2-4-0.html to create analyses of single and paired-end read data.

Adding ignoring mgs-results to gitignore

81ff0ba

Adding Will's auxiliary scripts to run his quarto notebooks.

590b2c3

Merge branch 'master' into single-read-raw

6a650b4

Amended qmd somewhat so data imports work.

9f1eb03

Added a flag to summarize-multiqc-single.R that provides info on the …

9622004

…samples being paired-end or not.

Deleting seperate version of summarize-multiqc I created for paired r…

f8d9c28

…ead data, as I instead amended the existing script to be able to handle both single read paired end data.

Revert "Split processes in MultiQC to a single read and paired-end re…

8e1c7b5

…ad version. Renamed Multiqc to not be confusing regardings its naming as "Single"" This reverts commit 01ea0c5.

Revert "Deleted summarizeMultiqcSingle, which was superseded by summa…

0ba0552

…rizeMultiqc" This reverts commit ad8faf9.

harmonbhasin requested changes Nov 21, 2024

simonleandergrimm added 6 commits November 23, 2024 19:14

Adding improved configs

8e201e7

dropped single end definition in run file.

591138d

Adding params to single end variable invocation

27244bd

removed whitespace

517961f

updating nextflow.config of test

c28749f

fixed single_end config in normal run workflow

e132ec4

simonleandergrimm mentioned this pull request Nov 23, 2024

Once new testing regime is on dev, update testing setup in single-read branches #107

Open

simonleandergrimm added 6 commits November 23, 2024 22:49

make single-end variable logical.

51b9cf3

Reverted to old gitignore structure.

12c3fdd

Changed test dirs to only have one dir for run_dev_se.

4fd3ce6

Adding WIP progress

d460813

Merge branch 'dev' into single-read-raw-clean

f412b07

Fixing single_end being unbound.

3d10bb0

simonleandergrimm requested a review from harmonbhasin November 24, 2024 02:19

simonleandergrimm removed their assignment Nov 24, 2024