mm10plus #30

Open
hanhyebin opened this issue Jul 26, 2021 · 3 comments

@hanhyebin

Hi,

I see that Tabula Muris Senis used "mm10plus" as its genome reference. I assume it is a modified version of mm10. If so, could you tell me what modifications or adjustments were made?

Thanks!

@aopisco
Contributor

aopisco commented Jul 27, 2021

@hanhyebin the reference genome is available at s3://czb-tabula-muris-senis/reference-genome/

@hanhyebin
Author

Thanks, but I wanted to know more about how it differs from mm10.

The reason I ask is that I am trying to integrate this dataset with other datasets. If mm10plus differs substantially from mm10, I will need to realign the data to mm10 (which I can do); if the two are nearly the same, I can continue to use the data as is.

Thank you in advance.

@txemaheredia

I am also having issues with this.

I have downloaded the .h5ad files for all datasets, and the matrices contain genes that are not present in this reference release.

For example, the droplet-Liver dataset contains the gene "Fam150a". I have just downloaded the reference .tgz from AWS, and neither "Fam150a" nor the partial strings "Fam150" and "am150" appear anywhere in gencode.vM19/genes/genes.gtf.

Out of 20138 genes in the object matrix, there are 2081 genes that do not exist in the gencodeM19 gtf file.

> length(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name])
[1] 2081
> head(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name], 20)
 [1] "Fam150a"       "3110035E14Rik" "6030422M02Rik" "4932411L15"    "Gm106"         "Tceb1"        
 [7] "1110058L19Rik" "Bai3"          "Fam123c"       "4632411B12Rik" "6330578E17Rik" "D1Bwg0212e"   
[13] "2610017I09Rik" "2900092D14Rik" "A530098C11Rik" "1700029F09Rik" "4832428D23Rik" "Dnahc7b"      
[19] "Sdpr"          "Obfc2a" 
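
For anyone who wants to repeat this check outside R, here is a minimal Python sketch of the same comparison: collect the `gene_name` attributes from a GTF and list the matrix genes that are absent from it. The helper names (`gtf_gene_names`, `missing_from_annotation`) and the toy GTF records are made up for illustration; they are not taken from the Tabula Muris Senis pipeline, and the coordinates in the toy records are not real.

```python
import re

# Matches the gene_name attribute in the 9th GTF column, e.g. gene_name "Xkr4";
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gtf_gene_names(lines):
    """Collect the unique gene_name values from an iterable of GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:  # skip malformed or truncated records
            continue
        match = GENE_NAME_RE.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def missing_from_annotation(matrix_genes, annotation_genes):
    """Return the matrix gene names absent from the annotation, in matrix order."""
    return [g for g in matrix_genes if g not in annotation_genes]


# Toy stand-in for genes.gtf (hypothetical records, made-up coordinates).
toy_gtf = [
    "##description: toy example\n",
    'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1"; gene_name "Xkr4";\n',
    'chr1\tHAVANA\tgene\t200\t300\t.\t-\t.\tgene_id "G2"; gene_name "Sox17";\n',
]
# Toy stand-in for the gene names of the expression matrix.
toy_matrix_genes = ["Xkr4", "Fam150a", "Sox17", "Tceb1"]

annotation = gtf_gene_names(toy_gtf)
missing = missing_from_annotation(toy_matrix_genes, annotation)
print(len(missing), missing)
```

With real data, the GTF lines would come from an open file handle (`with open("genes.gtf") as fh: gtf_gene_names(fh)`), and the matrix gene names could be read from the .h5ad file, for example via `anndata.read_h5ad(path).var_names`, assuming the gene symbols are stored as the variable index.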

I found a gtf file on the internet when googling for mm10plus ( http://waxmanlabvm.bu.edu/kkarri/G171/ref/updated-usethis-mm10plus-pcg-ercc-lnc-nodups-mcherry/genes/ ). This file does include the genes "Fam150a" and "Tceb1". It does not cover every gene in the object matrix either, but it does contain 426 of the 2081 genes that are missing from the gencodeM19 file.

> sum(rownames(seu)[!rownames(seu) %in% gencodeM19_genes$gene_name] %in% mm10plus_genes$gene_name)
[1] 426
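
The same cross-check can be sketched in Python by splitting the genes missing from one annotation into those a second annotation recovers and those still unaccounted for. The function name and the toy gene lists below are hypothetical; in particular, the membership of the toy second annotation is illustrative and does not reflect the real contents of either gtf file.

```python
def split_by_second_annotation(missing_genes, second_annotation_genes):
    """Partition missing genes into (recovered, unresolved) against a second annotation."""
    second = set(second_annotation_genes)
    recovered = [g for g in missing_genes if g in second]
    unresolved = [g for g in missing_genes if g not in second]
    return recovered, unresolved


# Toy inputs: genes missing from the first annotation, and the gene names
# of a second annotation (illustrative membership only).
toy_missing = ["Fam150a", "Tceb1", "4932411L15", "Gm106"]
toy_second = {"Fam150a", "Tceb1", "Xkr4"}

recovered, unresolved = split_by_second_annotation(toy_missing, toy_second)
print(f"{len(recovered)} of {len(toy_missing)} missing genes recovered")
print("still unresolved:", unresolved)
```

Keeping the unresolved list, rather than only the count, makes it easier to inspect which names neither annotation explains.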

Which annotation was used to create the matrices for these datasets? Am I messing this up big time, or is there a serious mismatch between the data matrices and the gtf files provided?

Thanks in advance
