KMC 3 stops during stage 2 when using BFC-corrected reads #42

Closed
flopezo opened this issue Oct 30, 2017 · 6 comments

Comments

@flopezo

flopezo commented Oct 30, 2017

Hello,

I'm trying to use KMC v3 with reads previously corrected with BFC. However, KMC stops during stage 2, there is no warning or error message, and the stats table shows only 0s. I ran Jellyfish v2 with the same corrected reads without a problem. Below are the commands that I'm using.

Correct reads
bash -c "bfc -s 200m -k33 -t 16 <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) | gzip -1 > bfc-corrected.fastq.gz"

Count k-mers
kmc -k21 -ci2 -m100 -t12 -v bfc-corrected.fastq.gz bfc-corrected_kmc3 ./tmp

This is an example of a read pair after BFC correction:
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:104_0_3:0_0
aTAACATATAATGTTTTTAAATAAATTTTAATTTAATTGGAATACTTATTTATTCAATAAAATTATTAACAATAATTTACCTCTATTTTGGTTTCAATTAAATAAATTTATAgAGAAATAaTAAATAAATAAAGCTTCTAACTTTATAATA
+
&???????????????????????????????????????+??????+??????++??+++???+???????????????+???+?????+????++??+++?+++??????%++++???%++?????+?+???+????+?++????++??
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:103_0_3:0_0
aTATATTTTTGTTTATTATTTTAAGTATAGGTTAATTGAAGAATTATTTAATTTATTAAAATTAGATTATTTTGTTTATTATAAAATATTTTATTTTTTTTTTATAATTATAATTTTTTATTATTTTTTATTTgATTAAAATaTATGAATA
+
&?????????????????????????????????????????????????????++????????????????????+??????????+?????????????????++?++++++?????++????????++??%+??++++?#???+++?+

I would really appreciate any help.

@marekkokot
Contributor

Hello,

Thanks for reporting that issue.
Are your input files publicly available? If so, could you point me to where I can get them? If not, could you at least tell me the size of the input files?
On your short example, KMC works fine on my machine.
By "stats table", do you mean something like this:

1st stage: 0.390857s
2nd stage: 1.96024s
Total    : 2.3511s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :          262
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :          262
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :          262
   Total no. of reads                 :            2
   Total no. of super-k-mers          :           37

Or do you mean some other "stats table" (which one)? I am asking because this table should be printed after stage 2 finishes.
BTW, KMC usually does not need that much memory, unless you have really big files.

@flopezo
Author

flopezo commented Oct 30, 2017

Hi Marek,

The input files are not publicly available. The input includes 211,646,643 interleaved read pairs, and the size of the gzipped FASTQ file is approximately 36GB.

Yes, I meant the table printed after finishing stage 2. I used this command kmc3 -k21 -ci2 -t12 -v bfc-corrected.fastq.gz bfc-corrected_kmc3 ./tmp, and the output is below:

******* Stage 1 configuration: *******

No. of bins                  : 512
Bin part size                : 65536
Input buffer size            : 16777216

No. of readers               : 1
No. of splitters             : 11

Max. mem. size               : 12000MB
Max. mem. per storer         :  6088MB
Max. mem. for single package :    23MB

Max. mem. for PMM (bin parts):  9367MB
Max. mem. for PMM (FASTQ)    :  1819MB
Max. mem. for PMM (reads)    :     2MB
Max. mem. for PMM (b. reader):   805MB

Stage 1: 100%

******* Stage 2 configuration: *******
No. of threads               : 12

Max. mem. for 2nd stage      :    16MB

Stage 2: 100%
1st stage: 586.464s
2nd stage: 1.68055s
Total    : 588.144s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            0
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            0
   Total no. of reads                 :            0
   Total no. of super-k-mers          :            0

I usually use the default setting for memory (-m12), but I was trying different settings and forgot to change that one. Thank you.

@marekkokot
Contributor

Oh, OK, you mean 0s as zeros, not as zero seconds :)
Hmm, for now, I don't know the reason for this behavior :(
Did you compile KMC yourself from the latest commit on GitHub, or are you using our precompiled version?
I will try to generate some files to reproduce this behavior, but if you see it again on public files, or on a smaller example that you could send me, that would be really helpful.

@flopezo
Author

flopezo commented Oct 30, 2017

I tried with both KMC v2 installed with Conda and your KMC v3 pre-compiled version. I don't understand what might be the problem. KMC works fine when I use the raw FASTQ files; that is, without prior error correction with BFC. I will send you an e-mail and attach a few thousand reads.

@marekkokot
Contributor

Hi,
OK, I know the reason. There are tabs ('\t') in the read headers. I am not sure whether the FASTQ format allows tabs in headers; you are the first person to report a problem related to them (even the first published version of KMC assumed no tabs in headers).

I am not sure if we should allow tabs; what do you think? On the other hand, at first look it seems that only a small change is required in the KMC code, so maybe I will do it in the next couple of days.

Anyway, thanks for reporting that bug and using KMC.
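
For readers who want to check whether their own file is affected, here is a minimal sketch that counts header lines containing a tab. It assumes strict four-line FASTQ records (no wrapped sequence or quality lines); the function name `count_tab_headers` is made up for this illustration and is not part of KMC or BFC.

```shell
# Count FASTQ header lines (lines 1, 5, 9, ...) that contain a tab character.
# Assumption: every record is exactly four lines long.
count_tab_headers() {
    awk 'NR % 4 == 1 && /\t/ { n++ } END { print n + 0 }'
}

# Possible usage with the file name from this thread:
# gzip -dc bfc-corrected.fastq.gz | count_tab_headers
```

A non-zero count would suggest the file has the same kind of headers described above.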

@flopezo
Author

flopezo commented Oct 31, 2017

Thank you for your help! I also think that FASTQ headers are not supposed to have tabs. At least in Illumina reads, a space should precede the read number element. I have read that the format of reads corrected with BFC might cause parsing problems in other tools, such as SGA and khmer.

However, BFC-corrected reads have been used with various genome assemblers, and I have used them without any prior reformatting in assemblies with SPAdes.

Anyway, I will replace tabs with spaces and try again.
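
One way the tab replacement could be done is sketched below. It rewrites only the header line of each record and leaves sequence, separator, and quality lines untouched. It assumes strict four-line FASTQ records; the function name `fix_fastq_headers` and the output file name in the usage comment are hypothetical.

```shell
# Replace every tab with a single space on FASTQ header lines only
# (lines 1, 5, 9, ...); all other lines are printed unchanged.
# Assumption: every record is exactly four lines long.
fix_fastq_headers() {
    awk 'NR % 4 == 1 { gsub(/\t/, " ") } { print }'
}

# Possible usage (output name is illustrative):
# gzip -dc bfc-corrected.fastq.gz | fix_fastq_headers | gzip -1 > bfc-corrected.notabs.fastq.gz
```

Restricting the substitution to header lines is a cautious choice: it avoids touching quality strings, even though a tab should not normally appear there.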
