Adding single-read functionality to RAW and CLEAN #80

Open
wants to merge 77 commits into
base: dev
from

simonleandergrimm
Collaborator

@simonleandergrimm simonleandergrimm commented Oct 28, 2024

This PR adds support for single-read (single-end) sequencing data to the RAW and CLEAN stages of the pipeline while maintaining existing paired-end functionality. This allows the pipeline to process both single-end and paired-end sequencing data using the same workflow infrastructure.

Key Changes

  • Amended generate_samplesheet.sh so it can also take in single-read data.
  • Added the run_dev_se.nf workflow, which will host single-read development until every step of the pipeline supports single-read data. At that point, we can replace run.nf with run_dev_se.nf.
  • Added a read_type parameter ("single_end" or "paired_end") in run_dev_se.config to control pipeline behavior.
  • Split the FASTP process into FASTP_SINGLE and FASTP_PAIRED variants.
  • Split the TRUNCATE_CONCAT process into _SINGLE and _PAIRED variants.
  • Edited the RAW, CLEAN, QC, and HV_SCREEN subworkflows to call either the single_end or the paired_end version of the relevant processes.
    • Edited HV_SCREEN only because it would otherwise fail due to not identifying the fastp process correctly.
  • Updated bin/summarize-multiqc-single.R so it takes in a read-type variable, which triggers if/else branches throughout the script to change data processing accordingly.
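
As a rough illustration of the read-type branching described in the list above, here is a minimal shell sketch of a samplesheet generator that handles both cases. It is not the actual generate_samplesheet.sh: the function name `make_samplesheet`, the CSV column names, and the `_1`/`_2` filename suffixes are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Sketch only: the column names and the _1/_2 filename suffixes are assumptions,
# not the actual behaviour of generate_samplesheet.sh.

# Print a CSV samplesheet for the FASTQ files in a directory.
#   $1: read type, "single_end" or "paired_end"
#   $2: directory holding *.fastq.gz files
make_samplesheet() {
    local read_type="$1" dir="$2" f sample
    case "$read_type" in
        single_end)
            echo "sample,fastq"
            for f in "$dir"/*.fastq.gz; do
                [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
                sample="$(basename "$f" .fastq.gz)"
                echo "$sample,$f"
            done
            ;;
        paired_end)
            echo "sample,fastq_1,fastq_2"
            for f in "$dir"/*_1.fastq.gz; do
                [ -e "$f" ] || continue
                sample="$(basename "$f" _1.fastq.gz)"
                # Only emit a row when the mate file exists.
                if [ -e "$dir/${sample}_2.fastq.gz" ]; then
                    echo "$sample,$f,$dir/${sample}_2.fastq.gz"
                else
                    echo "warning: no mate found for $f" >&2
                fi
            done
            ;;
        *)
            echo "error: read type must be single_end or paired_end, got '$read_type'" >&2
            return 1
            ;;
    esac
}
```

Calling `make_samplesheet single_end some_dir` would list one file per row, while `make_samplesheet paired_end some_dir` would list one pair per row. Rejecting any other read type up front mirrors the two values the read_type parameter accepts.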

Testing

I added test directories with example data for both the single-end and paired-end cases:

  • test-single-read/ - Contains single-end test data and configuration
  • test-paired-end/ - Contains paired-end test data and configuration

I validated the pipeline changes in this notebook: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/

…on. Renamed Multiqc to not be confusing regarding its naming as "Single"
… paired-end read version of SUMMARIZE_MULTIQC
… pair information for single read data. Also dropped some code which combines values across read pairs, for single read data.

I dropped the renaming of tab_tsv to tab_tsv_2 for paired-end data so that I didn't have to create two different versions of the combine step at the end of the subscript.
```
 tab <- tab_json %>% inner_join(tab_tsv, by="sample")
```
…ead data, as I instead amended the existing script to be able to handle both single-read and paired-end data.
…ad version. Renamed Multiqc to not be confusing regarding its naming as "Single""

This reverts commit 01ea0c5.
Collaborator

@harmonbhasin harmonbhasin left a comment

Given that we'll have new tests, the test directory shouldn't matter too much, but I would restructure it for now, or just wait until we push the new tests.

@simonleandergrimm
Collaborator Author

simonleandergrimm commented Nov 24, 2024

I've addressed the comments. @harmonbhasin, please take another look. FWIW, after a positive review by @willbradshaw, I'd be keen to merge this into dev. Keeping this feature branch alongside dev for a prolonged period creates additional work, as I need to update the PR whenever dev changes.

The testing directory for the single-end work needs to stay separate for now. run.nf can't currently be run on single-end data, because the processing steps after RAW/CLEAN and PROFILE aren't ready to handle it and will fail. So, as a temporary measure until full single-end functionality is implemented, I've created separate run_dev_se.config and run_dev_se.nf files that run the pipeline from RAW to PROFILE only. Because run_dev_se is separate, it also needs its own testing directory with a config file that executes the workflow. @harmonbhasin, happy to discuss how this fits with your incoming testing regime.

All of this doesn't affect dev runs using run.nf, as explained in this notebook, which I updated to show results from the newest version of the branch: https://data.securebio.org/simons-notebook/posts/2024-10-24-mgs-single-read-eval/
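
To make the temporary split described in this comment concrete, here is a small shell sketch of how a wrapper might pick the workflow file by read type. `select_workflow` is a hypothetical helper that does not exist in the repository; it only encodes the rule that single-end data goes through run_dev_se.nf while existing paired-end dev runs keep using run.nf.

```shell
#!/usr/bin/env bash
# Sketch only: select_workflow is a hypothetical helper, not part of the repository.

# Print the workflow file to launch for a given read type.
#   $1: read type, "single_end" or "paired_end"
select_workflow() {
    case "$1" in
        single_end) echo "run_dev_se.nf" ;;   # covers RAW to PROFILE only for now
        paired_end) echo "run.nf" ;;          # existing dev runs are unchanged
        *)
            echo "error: unknown read type '$1'" >&2
            return 1
            ;;
    esac
}
```

Once every pipeline step supports single-read data and run_dev_se.nf replaces run.nf, a helper like this would no longer be needed.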

Collaborator

I'm not sure if you're interested in this, but if you want to turn this script into Python, I wouldn't be mad lol

Collaborator Author

👀

configs/run.config (outdated, resolved)
test/nextflow.config (outdated, resolved)
Collaborator

@harmonbhasin harmonbhasin left a comment

This looks good to me. I'm currently testing some last things with the test workflow, but if things work out well, you should be able to integrate the testing framework tomorrow.

@simonleandergrimm
Collaborator Author

@willbradshaw Pending your review, this is ready to go in.

3 participants