Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Clarification] Why are alignments fractional? #54

Open
kmuench opened this issue Sep 11, 2023 · 5 comments
Open

[Clarification] Why are alignments fractional? #54

kmuench opened this issue Sep 11, 2023 · 5 comments

Comments

@kmuench
Copy link

kmuench commented Sep 11, 2023

Hello - thank you for engineering such a helpful tool.

I'm running through the tutorial and examining the file "gene_wise_quantifications_combined.tsv", which I understand contains genewise/transcript counts along with raw counts.

I notice that the count data is between 0 and 76.83, with plenty of fractional counts. In this pre-normalized data, why do fractional counts occur? What do they represent? I was under the impression that the gene/transcript-wise quantification at this step was simply counting (i.e. integer quantification) of segemehl alignments to a region.

Thank you for your help!

@Tillsa
Copy link
Collaborator

Tillsa commented Sep 12, 2023

Per default each alignment of a read is normalized by the total number of alignments the given read has. Additionally each alignment is normalized by the number of features it overlaps with. Both normalization methods are turned on per default an will result in fractions. You can turn off these options via:
--no_count_split_by_alignment_no
--no_count_splitting_by_gene_no
Further information can be found in the gene quantification documentation

We'll change the documentation to make this point more clear!

@kmuench
Copy link
Author

kmuench commented Sep 12, 2023

Ah, thank you so much for explaining!

My ultimate goal is to use the output of the gene quantification step (ideally) and feed it into a DESeq2-based downstream pipeline - we have an unusual experimental design and I'd like to try altering the design matrix beyond what we'd be able to do in READemption.

Is the split-multimapped-read count table "gene_wise_quantifications_combined.tsv" also the count matrix that is fed into DESeq2? (The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs.)

And am I correct in understanding that --unique_only would count only the reads that are not multimapping?

Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?

Thank you very much!

@Tillsa
Copy link
Collaborator

Tillsa commented Sep 13, 2023

"And am I correct in understanding that --unique_only would count only the reads that are not multimapping?"
Yes!
"Said another way, setting no_count_split_by_alignment_no (and no_count_splitting_by_gene_no) to true would mean that reads are double-counted (e.g. a read mapping to multiple genes is +1 for the count of each gene, or +0?), right? Not that they are thrown out?"
Yes, you are right!
"The source of my confusion was that I thought DESeq2 only accepted matrix integer inputs"
I believe it also accepts floats as inputs, but maybe they are rounded by DESeq2

@kmuench kmuench changed the title [Clarification] Why are fractional a [Clarification] Why are alignments fractional? Sep 19, 2023
@kmuench kmuench closed this as completed Sep 19, 2023
@kmuench
Copy link
Author

kmuench commented Sep 28, 2023

Hello! I had a follow-up to this -- when I ran this command: reademption gene_quanti --unique_only -p 4 --features gene,cds,region,exon --project_path READemption_analysis , I still produced an output with fractional reads (e.g., 547.33, 258.5). My understanding is that there should not be any fractional reads because the --unique_only option would eliminate multimapping reads, and therefore there are no reads to be split to produce the multimapping fractional counts. Could you help me understand what is happening here? (I am looking at the file "gene_wise_quantifications_combined.csv")

@kmuench kmuench reopened this Sep 28, 2023
@Tillsa
Copy link
Collaborator

Tillsa commented Oct 10, 2023

The fractions likely result from normalizing alignments by the number of features they overlap with. So if a read aligns to a single position, but the position is part of two features (e.g. gene and exon) each feature gets a count of 0.5.
You can turn this option off by using the parameter --no_count_splitting_by_gene_no.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants