gene2transcripts and gene2transcripts_v2 don't like HGNC IDs. #578
Comments
This will be a documentation change, @ifokkema. I will not just accept the numeric value, as it may get confused with the numeric value of NIH gene IDs.
This is annotation provided by Ensembl directly. Not from us. Ensembl need to be more responsible for their standards. We correct as much as we can :) |
OK, all these are fixed. Going to close, but I still need to update the server, @ifokkema, so please nudge me next week.
Makes sense! And a doc fix is just fine!
Weeiiiird! OK, thanks!
Excellent, thanks a lot! There's no rush, but when you do update the server, please let me know and I'll have another look! |
I want to say weird, but Ensembl do this sort of thing. There was a bug though: the gene symbol was coming out as MT, not MT-ATP6, which is now fixed. Also, now that I have fixed the code, some slight changes will happen once we update the databases in the next few days. Hope to release the new software version next week.
Hi Pete!
Re HGNC:7414, it looks like there is another issue that is causing MT to migrate into the db instead of MT-something. I will look at this and see if I can patch rather than do a new release. The numeric HGNC entry should return an error. But do we want it to? The main reason we may want to require "HGNC:" is that we may in the future WANT TO use other numeric gene searches???
Cool, thanks!
Personally, in LOVD, I consider all numeric references to genes as HGNC IDs. My logic is simply that the HGNC hands out the gene symbols and they name the genes. They're the representative source, so I use their IDs. I actually don't know why they prefix their numeric IDs with "HGNC:" as I've never seen other resources prefix their numeric IDs. I do see the benefit, of course, as it identifies the ID. However, I also see a downside, as it causes inconsistent use of the prefix and, therefore, ambiguity in the ID. Either way, I show NCBI gene IDs, but I don't use them as keys or so. So they don't clash in LOVD. NCBI gene IDs are only used for linking to the NCBI. If you want to keep the possibility open to use multiple numeric identifiers, by all means, don't accept the numeric input. However, I would recommend returning an error rather than an HTTP 500. |
The only other one I am aware of is GenBank gene IDs, but we do not currently use these. So I see no issue with dropping the "HGNC:" from the input, really.
OK, the code is fixed, but I think it will need another database build for HGNC:7414. This is not a quick process. I need to liaise with @John-F-Wagstaff.
We may still want to allow users to include 'HGNC:', even if we do allow just the plain number, as others including the NCBI do (in their genbank records for example you get '/db_xref="HGNC:HGNC:25180"'). Also the number of users that write things down without context, unless prompted, is large enough that I would prefer to keep the 'HGNC:' prefix on the output too. The only transcript we include currently for this in the underlying VVTA is ENST00000361899.2. The RefSeq record that can be found in the HGNC record for this is a 'YP_' with a DBSOURCE of " REFSEQ: accession NC_012920.1", it is currently a "PROVISIONAL REFSEQ" and has no associated transcript. We don't include any protein sequences without transcripts so this is missed out. I am intending to build a new version of the VVTA soon, @Peter-J-Freeman should I bump this up the priority queue? |
I think new versions of all dbs are needed. I found some more errors in the validator db: newline characters in some fields. Explains why the updates weren't successful! 🙄
The RefSeq alignments have moved to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/historical/GRCh38/current/ but have not been updated since 2023-09-18. Hence me not moving too fast on this: new RefSeq transcript data is all well and good, but without alignments to go with it, it is not much use for the alignment database, as I can't load identifiers that don't have alignments. I can get you updated RefSeqGene data, HGNC gene data for 2024/06, and Ensembl Release 112 (which was done in May), though. I will try to get it done before Friday. Have you seen any issues in the VVTA data other than the Ensembl stuff that we already patched?
Nope, no additional errors I am aware of. Just the patch to make sure of. The RefSeqGene FE data will be handled via the validator database. We just need to make sure the sequences are in SeqRepo which I believe they are now. Perhaps we need to contact RefSeq and find out why they stopped producing alignments. We need this data and it is vital. Surely they need it too |
Also @John-F-Wagstaff once done, please can I have the file of all transcript IDs in VVTA. Thanks |
The link to the RefSeq alignments just does not look right. The URL path includes the directory I wonder if, instead, the current data are to be found in Homo_sapiens.gene_info.gz |
@leicray They archived the current alignments and put a link in the FTP to this location, after producing this newer file set in parallel for a while. It is called historical because it includes all historic transcript variants back to a certain cut-off date, as well as the current data. Yes, the naming is bad. I have checked elsewhere, and the RefSeq annotation pipeline last ran to completion on a human genome at that date too, so there should not be newer alignment data either way. Unfortunately, the gene_info files only include map locations per whole gene, in the form of 19q13.43, which does not work for us. @Peter-J-Freeman I will get the transcript IDs to you as soon as the database is finished.
Thanks, guys!
Oh, yes, never remove allowed input or change the formatting of a variable in the output in a "live" API that doesn't have versioning! I'm personally OK with adding additional fields to a JSON API, as I assume that existing implementations won't crash if additional data is returned. Other implementers are more strict and even increase the version number when adding fields. In any case, allowing more diverse input doesn't change existing implementations ever, so IMO never requires an increment of the version number. |
The updated code will accept "HGNC:1234" or "1234" and return the same result. Just not pushing yet because having a few database difficulties :) |
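As a rough illustration of that behaviour, here is a minimal sketch of how an input could be normalised so that "HGNC:1234" and "1234" end up as the same query. The helper name `normalise_gene_query` and the returned keys `kind` and `value` are hypothetical and are not taken from the VariantValidator code base; only the `error` / `requested_symbol` shape mirrors what the API output in this thread looks like.

```python
import re

# Hypothetical helper, NOT part of VariantValidator: treat "HGNC:1234" and
# "1234" as the same HGNC ID, and anything else as a gene symbol.
_HGNC_RE = re.compile(r"^(?:HGNC:)?(\d+)$", re.IGNORECASE)


def normalise_gene_query(query: str) -> dict:
    """Classify the input and report bad input as an error dict, not an exception."""
    text = query.strip()
    if not text:
        # Returning an error record avoids an unhandled exception (HTTP 500).
        return {"error": "Empty gene query", "requested_symbol": query}
    match = _HGNC_RE.match(text)
    if match:
        # Always keep the "HGNC:" prefix on the normalised value.
        return {"kind": "hgnc_id", "value": f"HGNC:{match.group(1)}"}
    return {"kind": "symbol", "value": text}


print(normalise_gene_query("7414"))
print(normalise_gene_query("HGNC:7414"))
print(normalise_gene_query("MT-ATP6"))
```

Keeping the prefix on the normalised value matches the preference expressed earlier in the thread to keep "HGNC:" on the output.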
We will be updating the version numbers for all tools because recent changes to the VV engine required breaking changes, and I like to keep all major versions of all tools the same. May not be correct from an engineering point of view, but it prevents my brain from hurting.
Now working and active on the server @ifokkema. On my system, this
import json
import VariantValidator
vval = VariantValidator.Validator()
gene = '7414'
select_transcripts = None
g_and_t = vval.gene2transcripts(gene, validator=vval, select_transcripts=select_transcripts, transcript_set="ensembl")
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
will now return
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "mitochondria",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "201"
},
"coding_end": 681,
"coding_start": 1,
"description": "ATP6-201",
"genomic_spans": {},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
}
We will roll out new database builds ASAP to make this work on the server. This is to show what a patch would look like, but we want to make a full db release.
Hmm, seems I need to fix the alignments. They are missing!!! Will look into this since it works for other genes e.g. COL1A1 |
Now also fixed, but again, will not work until the dbs are recreated. Will take a few weeks |
I meant the API version, e.g.,
Excellent, thanks! |
Hi Pete, I'm going through old emails; this doesn't work yet (sending HGNC:7414 to the gene2transcripts_v2 when the gene is a mitochondrial gene). Is the mentioned database build delayed, or didn't it fix the problem? Thanks! |
Not sure why this keeps popping back up. Will look asap |
Looking again at this. On my local setup, I see
Which looks like the HGNC ID is working but the Symbol is not. |
And now without the
So all is working. Now to test the server since the db is good and the code is good |
The server setup is showing
So is working. So, now to look at whether the REST interface is the issue |
Local REST interface:
[
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "mitochondria",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "201"
},
"coding_end": 681,
"coding_start": 1,
"description": "ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
},
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "mitochondria",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "201"
},
"coding_end": 681,
"coding_start": 1,
"description": "ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
}
]
Which makes me happy because we can now generate c. and p. for mito genes thanks to Ensembl.
gives an error:
[
{
"error": "Unable to recognise gene symbol MT",
"requested_symbol": "MT"
},
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "chrMT:8527:9207",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "ATP6"
},
"coding_end": 681,
"coding_start": 1,
"description": "MT-ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
}
]
I will look at the rest interface. May need an update |
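For anyone consuming these responses, a small sketch of how a client might separate the error records from the successful ones, since one response can contain both. The function name `split_results` is made up for this example, and the sample data is a heavily trimmed copy of the output above.

```python
def split_results(records):
    """Split a gene2transcripts-style response into (errors, found)."""
    errors = [rec for rec in records if "error" in rec]
    found = [rec for rec in records if "error" not in rec]
    return errors, found


# Trimmed copy of the response shown above (transcript details omitted).
sample_response = [
    {"error": "Unable to recognise gene symbol MT", "requested_symbol": "MT"},
    {
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "requested_symbol": "MT-ATP6",
        "transcripts": [{"reference": "ENST00000361899.2"}],
    },
]

errors, found = split_results(sample_response)
for rec in errors:
    print("failed:", rec["requested_symbol"], "->", rec["error"])
for rec in found:
    refs = [t["reference"] for t in rec["transcripts"]]
    print("ok:", rec["current_symbol"], rec["hgnc"], refs)
```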
OK, I updated the server with the latest local version. Local versions are
Live versions are
which were out due to being on different branches. Going to test a local install from master. The versions are now in line with the live:
{'variantvalidator_version': '2.2.1.dev734+ga70a50c', 'variantvalidator_hgvs_version': '2.2.0', 'vvta_version': 'vvta_2024_09', 'vvseqrepo_db': '/Users/user/variantvalidator_data/seqdata/VV_SR_2024_09/master', 'vvdb_version': 'vvdb_2024_8'}
Note: the updated VVTA and SR do not affect this; we already know from the above that the validator db has the correct info.
[
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "mitochondria",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "201"
},
"coding_end": 681,
"coding_start": 1,
"description": "ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
},
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "mitochondria",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "201"
},
"coding_end": 681,
"coding_start": 1,
"description": "ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
}
]
So it does not seem to be software. @John-F-Wagstaff, I can only think that there is something odd with the mounting to Apache.
@John-F-Wagstaff @ifokkema, it looks to me from the error
[
{
"error": "Unable to recognise gene symbol MT",
"requested_symbol": "MT"
},
that decoding in mod_wsgi/Apache is deleting the "-" character. We have seen this before when trying to pass HGVS intronic descriptions. I think it is somewhere in the VVweb code. The "-" character, when passed, can become a space.
Thank you for the research! How awesome is it, by the way, being able to handle MT variants 😍 !
Very interesting! But wasn't the issue with intronic variants the "+" character, maybe? That needs to be URL encoded to "%2B" to not be interpreted as a space, indeed. However, the hyphen doesn't have a URL-encoded equivalent. There is no encoding, as far as I know, that translates a hyphen in a space. Google doesn't help me much here. The only thing that I am thinking of is hyphens can be used as argument separators, but then they still need whitespace... I don't know enough about mod_wsgi to know what's going on here... 🤔 |
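To make the point about "+" versus "-" concrete, here is a small check with Python's standard `urllib.parse`. The intronic description is made up for the example, and whether a given server decodes "+" as a space in a path segment depends on its setup, so this only shows what the standard encoder and a form-style decoder do.

```python
from urllib.parse import quote, unquote_plus

# Illustrative inputs; the intronic description is invented for this example.
intronic = "c.100+5G>A"
symbol = "MT-ATP6"
hgnc_id = "HGNC:7414"

# Percent-encode everything that is not an unreserved character.
encoded_intronic = quote(intronic, safe="")
encoded_symbol = quote(symbol, safe="")
encoded_hgnc = quote(hgnc_id, safe="")

print(encoded_intronic)  # the "+" and ">" are percent-encoded
print(encoded_symbol)    # the hyphen is left alone
print(encoded_hgnc)      # the colon is percent-encoded

# A form-style decoder turns an unencoded "+" into a space, but never
# touches a hyphen.
print(repr(unquote_plus(intronic)))
print(repr(unquote_plus(symbol)))
```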
It doesn't seem to be the hyphen. I realized there are other gene symbols with hyphens, like A1BG-AS1. Both
[
{
"current_name": "A1BG antisense RNA 1",
"current_symbol": "A1BG-AS1",
"hgnc": "HGNC:37133",
"previous_symbol": "NCRNA00181,A1BGAS,A1BG-AS",
"requested_symbol": "A1BG-AS1",
"transcripts": []
}
]
So it's not the hyphen. Right?
Thanks @ifokkema. This is useful, although really confusing. Why is it happening with this symbol? I'll keep digging. P.S. Can LOVD use Ensembl transcripts for MT?
Hold on
[
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": [
{
"annotations": {
"chromosome": "MT",
"db_xref": {
"CCDS": null,
"ensemblgene": "ENSG00000198899",
"hgnc": "HGNC:7414",
"ncbigene": null,
"select": "Ensembl"
},
"ensembl_select": true,
"mane_plus_clinical": false,
"mane_select": false,
"map": "chrMT:8527:9207",
"note": "mitochondrially encoded ATP synthase membrane subunit 6",
"refseq_select": false,
"variant": "ATP6"
},
"coding_end": 681,
"coding_start": 1,
"description": "MT-ATP6-201",
"genomic_spans": {
"NC_012920.1": {
"end_position": 9207,
"exon_structure": [
{
"cigar": "681=",
"exon_number": 1,
"genomic_end": 9207,
"genomic_start": 8527,
"transcript_end": 681,
"transcript_start": 1
}
],
"orientation": 1,
"start_position": 8527,
"total_exons": 1
}
},
"length": 681,
"reference": "ENST00000361899.2",
"translation": "ENSP00000354632.2"
}
]
}
]
It just worked. Submitted as a single entry.
@ifokkema , please test |
I guess all MT symbols, but I could double-check that, if you'd like.
We have had an "Ensembl ID" field for our transcripts since forever, but we never really used it. For MT genes, we used to have a "fake" NCBI ID that triggered Mutalyzer to use the annotation given in the MT GenBank file as a transcript. Now, I guess we'll solve it using the Ensembl IDs that VV gives us! |
Interesting?
[
{
"error": "Unable to recognise gene symbol MT",
"requested_symbol": "MT"
}
]
Using the gene symbol,
[
{
"current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
"current_symbol": "MT-ATP6",
"hgnc": "HGNC:7414",
"previous_symbol": "MTATP6,RP",
"requested_symbol": "MT-ATP6",
"transcripts": []
}
]
Uhhhh odd! So there's something up in the conversion from HGNC ID to gene symbol, but only for the MT genes?
Oh yes, so it is the decoding of the HGNC ID. I concur. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7414/False/all/GRCh38?content-type=application%2Fjson works fine on my laptop running locally, but not on the live server.
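For reproducing this against both a local and the live instance, a sketch that rebuilds the URL above with each path segment percent-encoded. The function name `build_gene2transcripts_url` and the parameter names for the three trailing path segments are guesses made for this example; only the URL layout itself comes from the link above. No request is sent.

```python
from urllib.parse import quote, urlencode

# Base path copied from the URL above.
BASE = "https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2"


def build_gene2transcripts_url(gene, limit_transcripts="False",
                               transcript_set="all", genome_build="GRCh38",
                               base=BASE):
    """Build the request URL; the parameter names are assumptions."""
    segments = (gene, limit_transcripts, transcript_set, genome_build)
    # Encode each segment separately so ":" becomes %3A.
    path = "/".join(quote(segment, safe="") for segment in segments)
    query = urlencode({"content-type": "application/json"})
    return f"{base}/{path}?{query}"


print(build_gene2transcripts_url("HGNC:7414"))
print(build_gene2transcripts_url("MT-ATP6"))
```

Passing a different `base` (for example a local server address) makes it easy to compare the two deployments with otherwise identical requests.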
I am now wondering if that version of the database contains a duplicate entry. My local version does not. I am making a new build anyway. Will complete over the weekend. We will then NUKE the old database and install the new and test again before digging further |
Alright! Sounds good! I'm anyway working on other stuff right now. I decided to torture myself and rebuild our HGVS tool from the ground up before doing the data analysis and writing that paper... let's hope that was a good idea 😆 |
Sounds fun :P |
Attempting to convert over 3000 lines of unreadable and unmanageable code into something readable and manageable in a completely different structure, what's not to like? 😂 |
Describe the bug
API endpoints gene2transcripts and gene2transcripts_v2 allow for genes to be passed as "HGNC:2197". That's great for genes that have recently changed their symbols, and I'm going to use this now. However, the "HGNC:" addition is required but undocumented. If sent as "2197", calls return an HTTP 500. It actually took me some time to realize I needed to add "HGNC:" and I was preparing this bug report as an "it doesn't work" when I realized what the required format was.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
gene2transcripts endpoint ("v1") can also have documented that HGNC IDs are accepted; this is currently also undocumented on the swagger interface.
Thank you!
EDIT