Bin QC Improvements #707

Open
wants to merge 31 commits into
base: dev

Conversation

dialvarezs
Contributor

@dialvarezs dialvarezs commented Oct 27, 2024

This PR adds:

  • CheckM2 as an alternative tool for bin QC
  • Updated CheckM and GUNC modules
  • A new BIN_QC subworkflow, integrating CheckM, CheckM2, BUSCO and GUNC

Closes #607.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

- Update modules
- Update integration in mag and with other tools (bin_summary, gtdb-tk)
- Update test
- Update schema
@dialvarezs dialvarezs marked this pull request as draft October 27, 2024 21:31
@jfy133
Member

jfy133 commented Oct 27, 2024

@nf-core-bot fix linting

@jfy133
Member

jfy133 commented Oct 27, 2024

Before you continue (sorry this is a bit late):

I generally don't like to deprecate old versions of tools for a while, but rather keep them as alternative tools.

In some cases people want to stick with the original version for compatibility with previous runs.

Could you 'revert' (or reinstall) the old checkm module and wrap it in an if/else statement (but within the subworkflow :) )?

@muabnezor did a similar thing when adding porechop_ABI here: #674
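
The if/else wrapping requested above could look roughly like the following. This is only a sketch under stated assumptions: `RUN_CHECKM` and `RUN_CHECKM2` are placeholder processes rather than the real nf-core modules, and `params.bins` and the channel names are hypothetical. Only `--binqc_tool` and its values `checkm` and `checkm2` come from this thread.

```nextflow
// Sketch only. RUN_CHECKM and RUN_CHECKM2 are placeholder processes,
// not the real nf-core modules; params.bins and channel names are assumptions.
params.binqc_tool = 'checkm2'
params.bins       = 'bins/*.fa'

process RUN_CHECKM {
    input:
    tuple val(meta), path(bin)

    output:
    tuple val(meta), path("*.checkm.tsv"), emit: summary

    script:
    """
    echo "placeholder CheckM result" > ${meta.id}.checkm.tsv
    """
}

process RUN_CHECKM2 {
    input:
    tuple val(meta), path(bin)

    output:
    tuple val(meta), path("*.checkm2.tsv"), emit: summary

    script:
    """
    echo "placeholder CheckM2 result" > ${meta.id}.checkm2.tsv
    """
}

workflow BIN_QC {
    take:
    ch_bins // channel: [ val(meta), path(bin) ]

    main:
    // The older tool stays available; the switch lives inside the subworkflow.
    if (params.binqc_tool == 'checkm') {
        RUN_CHECKM(ch_bins)
        ch_qc_summary = RUN_CHECKM.out.summary
    }
    else {
        RUN_CHECKM2(ch_bins)
        ch_qc_summary = RUN_CHECKM2.out.summary
    }

    emit:
    qc_summary = ch_qc_summary
}

workflow {
    // Build [meta, bin] tuples from whatever matches the hypothetical glob
    ch_bins = Channel.fromPath(params.bins).map { bin -> [[id: bin.baseName], bin] }
    BIN_QC(ch_bins)
    BIN_QC.out.qc_summary.view()
}
```

Because both branches emit the same `qc_summary` channel, callers of the subworkflow would not need to know which tool ran.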

@dialvarezs
Contributor Author

@jfy133 that makes sense, I will revert the CheckM removal. My bad for not asking first 😅.

@dialvarezs dialvarezs changed the title from "feat: Replace from CheckM to CheckM2" to "feat: Add CheckM2" Oct 27, 2024
@dialvarezs dialvarezs marked this pull request as ready for review October 28, 2024 10:30
@dialvarezs
Contributor Author

dialvarezs commented Oct 28, 2024

Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

@jfy133
Member

jfy133 commented Oct 29, 2024

> Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?

Yes that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!!

@dialvarezs dialvarezs force-pushed the dev-checkm2 branch 2 times, most recently from a4f42ef to da52285 Compare November 1, 2024 00:27
@prototaxites
Contributor

I think this looks good, though I don't have the latitude to run any manual tests of it myself at the moment - so probably wait for someone else to sign off on it!

My only question is whether it might be good to move all the binqc database preparation steps (e.g. CHECKM2_DATABASEDOWNLOAD, and the bits that initialise the files/channels) inside the BINQC workflow where they are consumed? I think keeping conceptually related code together will probably help with maintenance down the line, and the main mag.nf workflow is already over-full. Plus, there are fewer subworkflow inputs and outputs to keep track of. Might be a job for another PR though!

@dialvarezs
Contributor Author

@prototaxites Absolutely, that makes sense to me. That would help to simplify the mag workflow a bit. What do you think about this @jfy133 ?

@jfy133
Member

jfy133 commented Nov 12, 2024

I've pondered that a few times, however many of these download steps take a very long time. From a user PoV I think it makes sense to trigger them right at the beginning of the pipeline, so that the databases are ready to go by the time assembly and binning are done, rather than getting all the way to binning and then having to wait the same length of time again before the bin QC can start.

In my mind, to clear up the code I would rather have a single DB download subworkflow for all DB downloads. Maybe from a related-module PoV it's not as efficient, but functionally they are related.

Any counter arguments?

@dialvarezs
Contributor Author

In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?

@prototaxites
Contributor

prototaxites commented Nov 12, 2024

> In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?

Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).

Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.
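
To make the two points above concrete, here is a minimal sketch of a database-preparation step that lives inside its own subworkflow. `DOWNLOAD_DB`, `PREPARE_DB`, `params.some_db` and `params.db_url` are hypothetical names, and the URL is a dummy. Note that in real Nextflow code the empty-channel factory is a method call, `Channel.empty()`; this sketch swaps that branch for a download to show that the download task has no upstream channel dependency.

```nextflow
// Sketch only. DOWNLOAD_DB, PREPARE_DB, params.some_db and params.db_url
// are hypothetical names; the URL is a dummy value.
params.some_db = null
params.db_url  = 'https://example.org/db.tar.gz'

process DOWNLOAD_DB {
    input:
    val url

    output:
    path "db.tar.gz", emit: db

    script:
    """
    curl -L -o db.tar.gz '${url}'
    """
}

workflow PREPARE_DB {
    main:
    if (params.some_db) {
        // User supplied a local database: wrap it in a value channel
        ch_db = Channel.value(file(params.some_db, checkIfExists: true))
    }
    else {
        // The only input is a parameter, so this task does not wait
        // for any other process, however deeply the workflow is nested
        DOWNLOAD_DB(params.db_url)
        ch_db = DOWNLOAD_DB.out.db
    }

    emit:
    db = ch_db
}

workflow {
    PREPARE_DB()
    PREPARE_DB.out.db.view { db -> "database ready: ${db}" }
}
```

Keeping the `if`/`else` next to the code that consumes `db` means the main workflow only sees a single output channel per database.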

@jfy133
Member

jfy133 commented Nov 13, 2024

> > In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?
>
> Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...).

Hmm fair. I might be feeling overly defensive due to the huge number of conditions mag has... so maybe this is indeed the case.

> Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like "if(params.some_db) { db = Channel.value(file(path)) } else { db = Channel.empty }" - there are a lot of these input databases, often just a zip or fasta file, and building these in a single WF would make something with a lot of outputs. Hence why I think it may be more maintainable to keep that code near to where the outputs are consumed (set the PhiX fasta in the "short read preprocessing" subWF, lambda in the "long read preprocessing", etc.). But either way cleaning up the main WF should help - this all just may be beyond the scope of this PR.

Fair point.

Ok - if you're feeling up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworkflow(s) :)

Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead.

I added the separate CheckM CI tests originally due to unstable downloads when connecting to the servers in Australia, but as these databases are now on Zenodo this should rarely happen.

@dialvarezs
Contributor Author

@jfy133 I'll get back to this on Friday or Saturday.

> Ok - if you're feeling up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworkflow(s) :)

I think this is a step in the right direction to improve the modularity of mag, so I will.

> Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead.

Got it, I will remove the CheckM2 CI test I added. Should I remove the CheckM CI test as well?

@jfy133
Member

jfy133 commented Nov 14, 2024

> @jfy133 I'll get back to this on Friday or Saturday.
>
> > Ok - if you're feeling up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworkflow(s) :)
>
> I think this is a step in the right direction to improve the modularity of mag, so I will.
>
> > Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead.
>
> Got it, I will remove the CheckM2 CI test I added. Should I remove the CheckM CI test as well?

Yes you probably can! Thank you!

Member

@jfy133 jfy133 left a comment

This is looking good to me!

I just completed a comparison test between checkm/checkm2 and just need to compare the results.

Then I'll run with BUSCO/GUNC next week (have to stay home with the kiddo tomorrow as well, unfortunately) to make sure they aren't broken, and then I think we are good to go.

Do you want to include the moving of the database downloading in this PR @dialvarezs or in a follow up one?

conf/modules.config (outdated, resolved)
docs/output.md (outdated, resolved)
docs/output.md (outdated, resolved)
modules/local/bin_summary.nf (outdated, resolved)
nextflow.config (outdated, resolved)
nextflow_schema.json (outdated, resolved)
subworkflows/local/bin_qc.nf (outdated, resolved)

BUSCO(ch_input_bins_for_qc, ch_db_for_busco)

BUSCO_SUMMARY(
Member

The versions file for this process isn't being mixed into ch_versions. SAVE_DOWNLOAD doesn't have one either, although given that process is entirely empty, shouldn't it be replaced by a publishDir specification in modules.config? 🤔

(I know you didn't write this, but might be worth fixing this now :) )
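
For readers unfamiliar with the convention being referred to, mixing a process's versions file into `ch_versions` usually looks something like the sketch below. `SUMMARISE`, `params.reports`, and the tool name and version written to `versions.yml` are placeholders, not the actual BUSCO_SUMMARY module from this PR.

```nextflow
// Sketch only. SUMMARISE and params.reports are placeholders; the tool name
// and version string written to versions.yml are dummy values.
params.reports = 'reports/*.tsv'

process SUMMARISE {
    input:
    path reports

    output:
    path "summary.tsv" , emit: summary
    path "versions.yml", emit: versions

    script:
    """
    cat ${reports} > summary.tsv

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        placeholder_tool: 0.0.0
    END_VERSIONS
    """
}

workflow BIN_QC {
    take:
    ch_reports // channel: path(report)

    main:
    ch_versions = Channel.empty()

    // Gather all reports into a single task
    SUMMARISE(ch_reports.collect())

    // Every process that emits a versions file gets mixed in here,
    // so the subworkflow exposes one combined versions channel
    ch_versions = ch_versions.mix(SUMMARISE.out.versions)

    emit:
    summary  = SUMMARISE.out.summary
    versions = ch_versions
}

workflow {
    BIN_QC(Channel.fromPath(params.reports))
    BIN_QC.out.versions.view()
}
```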

Contributor Author

@dialvarezs dialvarezs Nov 27, 2024

According to the comment on that module, the problem with simply using publishDir is that the same files would get written multiple times. I couldn't think of a more elegant solution, so I will add the version, unless you have a better idea.

subworkflows/local/bin_qc.nf (outdated, resolved)
subworkflows/local/bin_qc.nf (resolved)
@jfy133
Member

jfy133 commented Nov 21, 2024

Note to self: Ah crap I didn't update the directory name, will have to run the two checkm tests again 🤦

  nextflow run ../main.nf -profile test,docker --outdir ./results_checkm --binqc_tool checkm
  nextflow run dialvarezs/mag -r dev-checkm2 -profile test,docker --outdir ./results_checkm --binqc_tool checkm2

@dialvarezs
Contributor Author

Hi @jfy133, sorry for the delay, I didn't have enough time for this last weekend. But I will address your comments shortly, and I plan to include the database downloads in this PR.

@jfy133
Member

jfy133 commented Nov 22, 2024

No worries!

No stress if you can't make it ATM, but I think we are close to a release once this is in!

@dialvarezs
Contributor Author

If you have time, can you review this PR? nf-core/modules#6920
I also plan to update the GUNC/CHECKM modules in this PR, and the GUNC PR should also solve the problem with the GUNC outputs.

@jfy133
Member

jfy133 commented Nov 25, 2024

> If you have time, can you review this PR? nf-core/modules#6920 I also plan to update the GUNC/CHECKM modules in this PR, and the GUNC PR should also solve the problem with the GUNC outputs.

Done!

@jfy133
Member

jfy133 commented Nov 25, 2024

My kid is sick again 🙄 but if you can update GUNC in this PR before then, then I will be working on mag Thursday so can finish reviewing this PR again :) (but no stress!)

@dialvarezs
Contributor Author

I made all the important updates. I will run a complete test with every QC tool on our HPC to check that everything is alright, and with that done it should be ready.

@dialvarezs
Contributor Author

It should be ready now. There is a last minor issue that should be solved by this PR: nf-core/modules#7119
