Bin QC Improvements #707
- Update modules
- Update integration in mag and with other tools (bin_summary, gtdb-tk)
- Update test
- Update schema
@nf-core-bot fix linting |
Before you continue (sorry this is a bit late): I generally don't like to deprecate old versions of tools for a while, but rather keep them as alternative tools. In some cases people want to stick with the original version for compatibility with previous runs. Could you 'revert' (or reinstall) the old checkm module and wrap it in an if/else statement (but within the subworkflow :) )? @muabnezor did a similar thing when adding porechop_ABI here: #674
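For illustration, a minimal sketch of the if/else pattern requested above. The parameter names (binqc_tool, bins), their values, and the placeholder processes are assumptions made for this example; they are not the code from this PR or the real nf-core modules.

```nextflow
// Hypothetical parameters, assumed for this sketch only
params.binqc_tool = 'checkm2'      // 'checkm' or 'checkm2'
params.bins       = 'bins/*.fa'

// Placeholder standing in for the original CheckM module
process CHECKM_PLACEHOLDER {
    input:
    path bin

    output:
    path "${bin.baseName}.checkm.tsv", emit: summary

    script:
    """
    echo "placeholder CheckM result for ${bin}" > ${bin.baseName}.checkm.tsv
    """
}

// Placeholder standing in for the CheckM2 module
process CHECKM2_PLACEHOLDER {
    input:
    path bin

    output:
    path "${bin.baseName}.checkm2.tsv", emit: summary

    script:
    """
    echo "placeholder CheckM2 result for ${bin}" > ${bin.baseName}.checkm2.tsv
    """
}

workflow BIN_QC {
    take:
    ch_bins

    main:
    ch_summary = Channel.empty()

    // Both tools stay available; the user picks one via a parameter
    if (params.binqc_tool == 'checkm') {
        CHECKM_PLACEHOLDER(ch_bins)
        ch_summary = CHECKM_PLACEHOLDER.out.summary
    }
    else {
        CHECKM2_PLACEHOLDER(ch_bins)
        ch_summary = CHECKM2_PLACEHOLDER.out.summary
    }

    emit:
    summary = ch_summary
}

workflow {
    BIN_QC(Channel.fromPath(params.bins))
}
```

Keeping the branch inside the subworkflow means the calling workflow sees a single summary output regardless of which tool ran.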
@jfy133 that makes sense, I will revert the CheckM removal. Bad for me for not asking before 😅. |
c63e084
to
b1b6518
Compare
Sorry, I didn't catch your last comment about including both tools in a single workflow. With that in mind, would it make sense to include BUSCO as well, and just make a "bin_qc" subworkflow?
Also, simplify bin_summary regarding bin qc
Yes that would be perfect! We need to subworkflow the sh*t out of this monster 😅 thank you!!! |
a4f42ef
to
da52285
Compare
4007932
to
0eb167a
Compare
I think this looks good, though I don't have the latitude to run any manual tests of it myself at the moment - so probably wait for someone else to sign off on it! My only question is whether it might be good to move all the binqc database preparation steps (e.g. CHECKM2_DATABASEDOWNLOAD, and the bits that initialise the files/channels) inside the BINQC workflow where they are consumed? I think keeping conceptually-related code together will probably help with maintenance down the line, and the main mag.nf workflow is already over-full. Plus, there's less subworkflow inputs and outputs to keep track of. Might be a job for another PR though! |
@prototaxites Absolutely, that makes sense to me. That would help to simplify the mag workflow a bit. What do you think about this @jfy133 ? |
I've pondered that a few times, however many of these download steps take a very long time, and thus from a user PoV I think it makes sense to have it triggered right at the beginning of the pipeline, so by the time assembly and binning is done it's ready to go, rather than getting all the way to binning and then having to wait the same length of time again before you can start the binning QC. In my mind, to clear up the code I would rather have a DB download subworkflow for all DB downloads. Maybe from a related module PoV it's not as efficient, but functionally they are related. Any counter arguments?
In most cases, database downloads don’t depend on anything, so, if I'm not wrong, the processes should start immediately regardless of where they are placed in the code. Or am I missing something?
Yes, that's my understanding - Nextflow processes kick off as soon as any inputs are available, so processes beginning with file/URL input from parameters should start as soon as the pipeline begins, no matter how many subworkflows deep they are (there might be a latency hit if you go 10,000 subworkflows deep...). Also not opposed to a "database download" subWF - but that seems a little more limited in scope. In particular I'm thinking about all the steps that are like |
Hmm fair. I might be feeling over defensive due to the huge number of conditions mag has... so maybe this is indeed the case.
Fair point. Ok - if you're feeling up to it @dialvarezs go ahead and move the relevant database downloads to the BINQC subworkflow(s) :) Note you don't need a separate CheckM download CI check -> I would rather you just include an UNTAR step in the pipeline instead. I added the separate CheckM CI tests originally due to unstable downloads when connecting to the servers in Australia, but as these databases are now on Zenodo this should rarely happen.
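The scheduling point discussed above can be sketched as follows: a download process whose only input comes from a parameter has nothing to wait for, so it can be submitted at launch even when it is called inside a subworkflow. The URL, parameter names and process bodies here are hypothetical placeholders, not the modules used in mag.

```nextflow
// Hypothetical parameters, assumed for this sketch only
params.qc_db_url = 'https://example.org/qc_db.tar.gz'
params.bins      = 'bins/*.fa'

// Placeholder download step; its input does not depend on any upstream process
process DB_DOWNLOAD_PLACEHOLDER {
    input:
    val url

    output:
    path 'db.tar.gz', emit: db

    script:
    """
    curl -L -o db.tar.gz '${url}'
    """
}

// Placeholder QC step that needs both a bin and the database
process QC_PLACEHOLDER {
    input:
    path bin
    path db

    output:
    path "${bin.baseName}.qc.txt", emit: summary

    script:
    """
    echo "${bin} checked against ${db}" > ${bin.baseName}.qc.txt
    """
}

workflow BIN_QC {
    take:
    ch_bins

    main:
    // Value-channel input taken from params: ready as soon as the run starts
    DB_DOWNLOAD_PLACEHOLDER(Channel.value(params.qc_db_url))

    // Waits for bins to arrive; the download may already be finished by then
    QC_PLACEHOLDER(ch_bins, DB_DOWNLOAD_PLACEHOLDER.out.db)

    emit:
    summary = QC_PLACEHOLDER.out.summary
}

workflow {
    // In a real pipeline ch_bins would come from assembly and binning
    BIN_QC(Channel.fromPath(params.bins))
}
```

Because the download is fed by a value channel, its output is also a value channel, so the same database is reused for every bin.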
@jfy133 I'll get back to this on Friday or Saturday.
I think this is a step in the right direction to improve the modularity of mag, so I will.
Got it, I will remove the checkm2 ci test I added. Should I remove the checkm ci test as well? |
Yes you probably can! Thank you! |
This is looking good to me!
I just completed a comparison test between checkm/checkm2 and just need to compare the results.
Then I'll run with BUSCO/GUNC next week (have to stay home with the kiddo tomorrow as well unfortunately) to make sure they aren't broken and then I think we are good to go.
Do you want to include the moving of the database downloading in this PR @dialvarezs or in a follow up one?
BUSCO(ch_input_bins_for_qc, ch_db_for_busco)
BUSCO_SUMMARY(
Missing versions file being mixed into ch_versions for this process, the SAVE_DOWNLOAD doesn't have one, although given that process is entirely empty, shouldn't that be replaced by a publishDir specification in modules.conf 🤔
(I know you didn't write this, but might be worth fixing this now :) )
By the comment on that module, the problem with simply using publishDir is that the same files would get written multiple times. I couldn't think of a more elegant solution, so I will add the version, unless you have a better idea.
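For reference, a minimal sketch of how a process's versions file is typically mixed into ch_versions in nf-core-style subworkflows. The process below is a hypothetical placeholder (name, output and version string are assumptions), not the BUSCO or SAVE_DOWNLOAD module from this PR.

```nextflow
// Hypothetical parameter, assumed for this sketch only
params.bins = 'bins/*.fa'

// Placeholder QC process that reports a versions.yml alongside its result
process QC_TOOL_PLACEHOLDER {
    input:
    path bin

    output:
    path "${bin.baseName}.qc.txt", emit: summary
    path 'versions.yml'          , emit: versions

    script:
    """
    touch ${bin.baseName}.qc.txt

    cat <<-END_VERSIONS > versions.yml
    "${task.process}":
        qc_tool: 0.0.0
    END_VERSIONS
    """
}

workflow BIN_QC {
    take:
    ch_bins

    main:
    ch_versions = Channel.empty()

    QC_TOOL_PLACEHOLDER(ch_bins)

    // One versions file per process is enough, hence first()
    ch_versions = ch_versions.mix(QC_TOOL_PLACEHOLDER.out.versions.first())

    emit:
    summary  = QC_TOOL_PLACEHOLDER.out.summary
    versions = ch_versions
}

workflow {
    BIN_QC(Channel.fromPath(params.bins))
}
```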
Note to self: Ah crap I didn't update the directory name, will have to run the two checkm tests again 🤦
Hi @jfy133, sorry for the delay, I didn't get enough time on last weekend for this. But I will address your comments shortly, and I plan to include the database downloads in this PR. |
No worries! No stress if you can't make it ATM, but I think we are close to a release once this is in! |
If you have time, can you review this PR? nf-core/modules#6920 |
Done! |
My kid is sick again 🙄 but if you can update GUNC in this PR before then, then I will be working on mag Thursday so can finish reviewing this PR again :) (but no stress!) |
I made all the important updates. I will run a complete test with every QC tool on our HPC to check if everything is alright, and with that done it should be ready.
It should be ready now. There is a last minor issue that should be solved by this PR: nf-core/modules#7119 |
This PR adds:
- BIN_QC subworkflow, integrating CheckM, CheckM2, BUSCO and GUNC

Closes #607.

PR checklist
- nf-core pipelines lint
- nextflow run . -profile test,docker --outdir <OUTDIR>
- nextflow run . -profile debug,test,docker --outdir <OUTDIR>
- docs/usage.md is updated.
- docs/output.md is updated.
- CHANGELOG.md is updated.
- README.md is updated (including new tool citations and authors/contributors).