KMC 3 stops during stage 2 when using BFC-corrected reads #42

flopezo · 2017-10-30T17:20:46Z

Hello,

I'm trying to use KMC v3 with reads previously corrected with BFC. However, KMC stops during stage 2, there is no warning or error message, and the stats table shows only 0s. I ran Jellyfish v2 with the same corrected reads without a problem. Below are the commands that I'm using.

Correct reads
bash -c "bfc -s 200m -k33 -t 16 <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) <(seqtk mergepe reads_1.fastq.gz reads_2.fastq.gz) | gzip -1 > bfc-corrected.fastq.gz"

Count k-mers
kmc -k21 -ci2 -m100 -t12 -v bfc-corrected.fastq.gz bfc-corrected_kmc3 ./tmp

This is an example of a read pair after BFC correction:
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:104_0_3:0_0
aTAACATATAATGTTTTTAAATAAATTTTAATTTAATTGGAATACTTATTTATTCAATAAAATTATTAACAATAATTTACCTCTATTTTGGTTTCAATTAAATAAATTTATAgAGAAATAaTAAATAAATAAAGCTTCTAACTTTATAATA
+
&???????????????????????????????????????+??????+??????++??+++???+???????????????+???+?????+????++??+++?+++??????%++++???%++?????+?+???+????+?++????++??
@E00476:214:HHLTNALXX:8:1101:21217:1186 ec:Z:0_0:103_0_3:0_0
aTATATTTTTGTTTATTATTTTAAGTATAGGTTAATTGAAGAATTATTTAATTTATTAAAATTAGATTATTTTGTTTATTATAAAATATTTTATTTTTTTTTTATAATTATAATTTTTTATTATTTTTTATTTgATTAAAATaTATGAATA
+
&?????????????????????????????????????????????????????++????????????????????+??????????+?????????????????++?++++++?????++????????++??%+??++++?#???+++?+

I would really appreciate any help.

marekkokot · 2017-10-30T18:18:07Z

Hello,

Thanks for reporting that issue.
Are your input files publicly available? If so, could you point me to where I can get them? If not, could you at least specify the size of the input files?
On your short example KMC works fine on my machine.
By "stats table", do you mean something like this:

1st stage: 0.390857s
2nd stage: 1.96024s
Total    : 2.3511s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :          262
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :          262
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :          262
   Total no. of reads                 :            2
   Total no. of super-k-mers          :           37

Or do you mean some other "stats table" (which one)? I am asking because this table should be printed after stage 2 finishes.
BTW, KMC usually does not need that much memory, unless you have really big files.

flopezo · 2017-10-30T18:55:34Z

Hi Marek,

The input files are not publicly available. The input includes 211,646,643 interleaved read pairs, and the size of the gzipped FASTQ file is approximately 36GB.

Yes, I meant the table printed after finishing stage 2. I used the command below, and its output follows:
kmc3 -k21 -ci2 -t12 -v bfc-corrected.fastq.gz bfc-corrected_kmc3 ./tmp

******* Stage 1 configuration: *******

No. of bins                  : 512
Bin part size                : 65536
Input buffer size            : 16777216

No. of readers               : 1
No. of splitters             : 11

Max. mem. size               : 12000MB
Max. mem. per storer         :  6088MB
Max. mem. for single package :    23MB

Max. mem. for PMM (bin parts):  9367MB
Max. mem. for PMM (FASTQ)    :  1819MB
Max. mem. for PMM (reads)    :     2MB
Max. mem. for PMM (b. reader):   805MB

Stage 1: 100%

******* Stage 2 configuration: *******
No. of threads               : 12

Max. mem. for 2nd stage      :    16MB

Stage 2: 100%
1st stage: 586.464s
2nd stage: 1.68055s
Total    : 588.144s
Tmp size : 0MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :            0
   No. of unique counted k-mers       :            0
   Total no. of k-mers                :            0
   Total no. of reads                 :            0
   Total no. of super-k-mers          :            0

I usually use the default setting for memory (-m12), but I was trying different settings and forgot to change that one. Thank you.

marekkokot · 2017-10-30T19:32:41Z

Oh, OK you mean 0s as zeros, not as zero seconds :)
Hmm, for now, I don't know the reason for this behavior :(
Did you compile KMC yourself from the latest commit on GitHub, or do you use our precompiled version?
I will try to generate some files to reproduce this behavior, but if you notice it again on public files or on a smaller example that you could send me it would be really helpful.

flopezo · 2017-10-30T20:48:11Z

I tried with both KMC v2 installed with Conda and your KMC v3 pre-compiled version. I don't understand what might be the problem. KMC works fine when I use the raw FASTQ files; that is, without prior error correction with BFC. I will send you an e-mail and attach a few thousand reads.

marekkokot · 2017-10-30T23:25:27Z

Hi,
OK, I know the reason. There are tabs ('\t') in the read headers. I am not sure whether the FASTQ file format allows tabs in headers; you are the first person to report a problem related to tabs in headers (even the first published version of KMC assumes no tabs in headers).

I am not sure if we should allow tabs; what do you think? On the other hand, at first glance it seems that only a small change is required in the KMC code, so maybe I will do it in the next couple of days.

Anyway, thanks for reporting that bug and for using KMC.
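
To check whether a given file is affected, one option is to count the header lines that contain a tab. This is only a sketch: `count_tab_headers` is a hypothetical helper name (not part of KMC or BFC), and it assumes strict four-line FASTQ records, i.e. no wrapped sequence or quality lines.

```shell
# Hypothetical helper: count FASTQ header lines that contain a tab.
# Assumption: every record is exactly 4 lines, so headers are lines 1, 5, 9, ...
count_tab_headers() {
  awk 'NR % 4 == 1 && /\t/ { n++ } END { print n + 0 }'
}

# Possible usage on the file from this thread:
#   gzip -dc bfc-corrected.fastq.gz | count_tab_headers
```

A non-zero count would suggest the headers carry tab-separated fields such as the `ec:Z:` tag that BFC appends.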

flopezo · 2017-10-31T13:31:38Z

Thank you for your help! I also think that FASTQ headers are not supposed to have tabs. At least in Illumina reads, a space should precede the read number element. I have read that the format of reads corrected with BFC might cause parsing problems in other tools, such as SGA and khmer.

However, BFC-corrected reads have been used with various genome assemblers, and I have used them without any prior reformatting in assemblies with SPAdes.

Anyway, I will replace tabs with spaces and try again.
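
For anyone hitting the same problem, the replacement could look roughly like the sketch below. It is an illustration rather than a tested recipe for this dataset: `fix_fastq_headers` and the output file name are hypothetical, and the function assumes strict four-line FASTQ records so that only header lines (1, 5, 9, ...) are modified.

```shell
# Hypothetical helper: replace tabs with spaces in FASTQ header lines only.
# Sequence, separator, and quality lines are passed through unchanged.
fix_fastq_headers() {
  awk 'NR % 4 == 1 { gsub(/\t/, " ") } { print }'
}

# Possible usage (output file name is an assumption):
#   gzip -dc bfc-corrected.fastq.gz | fix_fastq_headers | gzip -1 > bfc-corrected.notabs.fastq.gz
```

Restricting the substitution to header lines avoids touching quality strings, which should not contain tabs in the first place but are left untouched as a precaution.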

marekkokot added the possible bug label Oct 30, 2017

flopezo closed this as completed Oct 31, 2017

marekkokot mentioned this issue Sep 13, 2019

Can't count kmer on fastq file #137

Open
