gene2transcripts and gene2transcripts_v2 don't like HGNC IDs. #578

Open
1 of 4 tasks
ifokkema opened this issue Jan 23, 2024 · 45 comments

Comments

@ifokkema
Collaborator

ifokkema commented Jan 23, 2024

Describe the bug
API endpoints gene2transcripts and gene2transcripts_v2 allow for genes to be passed as "HGNC:2197". That's great for genes that have recently changed their symbols, and I'm going to use this now. However, the "HGNC:" addition is required but undocumented. If sent as "2197", calls return an HTTP 500. It actually took me some time to realize I needed to add "HGNC:" and I was preparing this bug report as an "it doesn't work" when I realized what the required format was.

To Reproduce
Steps to reproduce the behavior:

  1. Sending a gene symbol works and lists the transcripts. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/COL1A1/False/refseq/GRCh37?content-type=application%2Fjson
  2. Sending "HGNC:2197" works and lists the transcripts. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC:2197/False/refseq/GRCh37?content-type=application%2Fjson
  3. Sending just the numeric ID doesn't work and returns an HTTP 500. https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/2197/False/refseq/GRCh37?content-type=application%2Fjson
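The three requests above differ only in the gene segment of the path. A minimal Python sketch that builds them, assuming the path layout seen in the URLs of this report; the helper name `gene2transcripts_v2_url` and its parameter names are hypothetical labels, not part of the VariantValidator API:

```python
from urllib.parse import quote

BASE = "https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2"


def gene2transcripts_v2_url(gene_query, limit_transcripts="False",
                            transcript_set="refseq", genome_build="GRCh37"):
    """Build a request URL; the segment order is copied from the URLs in this report."""
    # Percent-encode the gene segment so that ':' is sent as data (%3A).
    gene_segment = quote(gene_query, safe="")
    return (f"{BASE}/{gene_segment}/{limit_transcripts}/{transcript_set}/"
            f"{genome_build}?content-type=application%2Fjson")


# The three cases from the steps above: symbol, prefixed HGNC ID, bare number.
for query in ("COL1A1", "HGNC:2197", "2197"):
    print(gene2transcripts_v2_url(query))
```

The sketch only builds the URLs; it does not send any request, so it says nothing about what the server returns.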

Expected behavior

  • Either the numeric ID should be interpreted as an HGNC ID, or the swagger interface should document how the HGNC ID must be passed, preferably with an example.
  • The gene2transcripts endpoint ("v1") should also document that HGNC IDs are accepted; this is currently undocumented on the swagger interface as well.

Thank you!

EDIT

  • Also; not all HGNC IDs work. HGNC:7414 doesn't work, while its gene symbol, MT-ATP6, does work.
[
  {
    "error": "Unable to recognise gene symbol NO DATA",
    "requested_symbol": "NO DATA"
  }
]
  • Also; when using MT-ATP6, VV uses "MT" as the letter for chromosome "M".
[
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "chrMT:8527:9207",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "ATP6"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "MT-ATP6-201",
        "genomic_spans": {},
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  }
]
@Peter-J-Freeman
Collaborator

This will be a documentation change @ifokkema. I will not just accept the numeric value as it may get confused with the numeric value of NIH gene IDs.

@Peter-J-Freeman
Collaborator

Peter-J-Freeman commented Feb 16, 2024

Also; when using MT-ATP6, VV uses "MT" as the letter for chromosome "M".

This is annotation provided by Ensembl directly. Not from us. Ensembl need to be more responsible for their standards. We correct as much as we can :)

@Peter-J-Freeman
Collaborator

OK, all these are fixed. Going to close, but I still need to update the server @ifokkema, so please nudge me next week.

@ifokkema
Collaborator Author

This will be a documentation change @ifokkema. I will not just accept the numeric value as it may get confused with the numeric value of NIH gene IDs.

Makes sense! And a doc fix is just fine!

This is annotation provided by Ensembl directly. Not from us. Ensembl need to be more responsible for their standards. We correct as much as we can :)

Weeiiiird! OK, thanks!

OK, all these are fixed. Going to close, but I still need to update the server @ifokkema, so please nudge me next week.

Excellent, thanks a lot! There's no rush, but when you do update the server, please let me know and I'll have another look!

@Peter-J-Freeman
Collaborator

I want to say weird, but Ensembl do this sort of thing

There was a bug, though: the gene symbol was coming out as MT rather than MT-ATP6, which is now fixed. Now that the code is fixed, there will also be some slight changes to the output once we update the databases in the next few days.

Hope to release the new software version next week

@ifokkema
Collaborator Author

Hi Pete!
Just to be sure:

  • Sending a numeric HGNC ID still returns an HTTP 500 - is this intended, or should it show a warning/error now?
  • Using HGNC:7414 for mitochondrial genes doesn't work yet; the gene symbol shows up as "MT", as you mentioned in February.

@Peter-J-Freeman
Collaborator

r.e. HGNC:7414, looks like there is another issue that is causing MT to migrate into the db instead of MT- something. I will look at this. See if I can patch rather than do a new release.

The numeric HGNC entry should return an error. But do we want it to? The main reason we may want to add "HGNC:" is that we may in the future WANT TO use other numeric gene searches.

@ifokkema
Collaborator Author

r.e. HGNC:7414, looks like there is another issue that is causing MT to migrate into the db instead of MT- something. I will look at this. See if I can patch rather than do a new release.

Cool, thanks!

The numeric HGNC entry should return an error. But do we want it to? The main reason we may want to add "HGNC:" is that we may in the future WANT TO use other numeric gene searches.

Personally, in LOVD, I consider all numeric references to genes as HGNC IDs. My logic is simply that the HGNC hands out the gene symbols and they name the genes. They're the representative source, so I use their IDs. I actually don't know why they prefix their numeric IDs with "HGNC:" as I've never seen other resources prefix their numeric IDs. I do see the benefit, of course, as it identifies the ID. However, I also see a downside, as it causes inconsistent use of the prefix and, therefore, ambiguity in the ID. Either way, I show NCBI gene IDs, but I don't use them as keys or so. So they don't clash in LOVD. NCBI gene IDs are only used for linking to the NCBI. If you want to keep the possibility open to use multiple numeric identifiers, by all means, don't accept the numeric input. However, I would recommend returning an error rather than an HTTP 500.
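To illustrate the "return an error rather than an HTTP 500" suggestion, here is a minimal Python sketch. It is not VariantValidator code: `safe_gene_lookup` and `fake_lookup` are hypothetical names, and the error entry simply mimics the `error` / `requested_symbol` shape that the API already returns elsewhere in this thread.

```python
def safe_gene_lookup(query, lookup):
    """Call lookup(query); on any failure, return an error entry instead of raising."""
    try:
        return lookup(query)
    except Exception as exc:  # deliberately broad: every failure becomes a reported error
        return {
            "error": f"Unable to process gene query {query}: {exc!r}",
            "requested_symbol": query,
        }


def fake_lookup(query):
    """Hypothetical stand-in for the real lookup: it fails on bare numbers."""
    if query.isdigit():
        raise KeyError(query)
    return {"requested_symbol": query}


print(safe_gene_lookup("COL1A1", fake_lookup))
print(safe_gene_lookup("2197", fake_lookup))
```

With a wrapper like this at the endpoint boundary, an unrecognised input would reach the client as a structured entry that it can inspect, rather than as a server error.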

@Peter-J-Freeman
Collaborator

The only other I am aware of is GenBank gene IDs. But we do not currently use these. So I see no issue with dropping the HGNC from the input, really.

@Peter-J-Freeman
Collaborator

OK, the code is fixed, but I think it will need another database build for HGNC:7414. This is not a quick process. I need to liaise with @John-F-Wagstaff.

@John-F-Wagstaff
Collaborator

We may still want to allow users to include 'HGNC:', even if we do allow just the plain number, as others including the NCBI do (in their genbank records for example you get '/db_xref="HGNC:HGNC:25180"'). Also the number of users that write things down without context, unless prompted, is large enough that I would prefer to keep the 'HGNC:' prefix on the output too.

The only transcript we include currently for this in the underlying VVTA is ENST00000361899.2. The RefSeq record that can be found in the HGNC record for this is a 'YP_' with a DBSOURCE of " REFSEQ: accession NC_012920.1", it is currently a "PROVISIONAL REFSEQ" and has no associated transcript. We don't include any protein sequences without transcripts so this is missed out.

I am intending to build a new version of the VVTA soon, @Peter-J-Freeman should I bump this up the priority queue?

@Peter-J-Freeman
Collaborator

I think new versions of all dbs are needed. I found some more errors in validator. New line characters in some fields. Explains why the updates weren't successful! 🙄

@John-F-Wagstaff
Collaborator

The RefSeq alignments have moved to https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/historical/GRCh38/current/ but have not been updated since 2023-09-18. Hence me not moving too fast on this: new RefSeq transcript data is all well and good, but without alignments to go with it, it is not much use for the alignment database. I can't load identifiers that don't have alignments.

I can get you updated RefSeqGene data, HGNC gene data for 2024/06, and Ensembl Release 112 (which was done in May), though. I will try to get it done before Friday.

Have you seen any issues in the VVTA data other than the Ensembl stuff that we already patched?

@Peter-J-Freeman
Collaborator

Nope, no additional errors I am aware of. Just the patch to make sure of. The RefSeqGene FE data will be handled via the validator database. We just need to make sure the sequences are in SeqRepo which I believe they are now. Perhaps we need to contact RefSeq and find out why they stopped producing alignments. We need this data and it is vital. Surely they need it too

@Peter-J-Freeman
Collaborator

Also @John-F-Wagstaff once done, please can I have the file of all transcript IDs in VVTA. Thanks

@leicray
Contributor

leicray commented Jul 16, 2024

The link to the RefSeq alignments just does not look right. The URL path includes the directory historical and that simply feels wrong. Why would the current alignments be classified as historical?

I wonder if, instead, the current data are to be found in Homo_sapiens.gene_info.gz

@John-F-Wagstaff
Collaborator

@leicray They archived the current alignments and put a link in the ftp to this location, after producing this newer file set in parallel for a while. It is called historical because it includes all historic transcript variants back to a certain cut-off date, as well as the current data. Yes, the naming is bad. I have checked elsewhere and the RefSeq annotation pipeline last ran to completion on a human genome at that date too, so there should not be newer alignment data either way. Unfortunately, the gene_info files only include map locations per whole gene, in the form of 19q13.43, which does not work for us.

@Peter-J-Freeman I will get the transcript IDs to you as soon as the database is finished.

@ifokkema
Collaborator Author

Thanks, guys!

We may still want to allow users to include 'HGNC:', even if we do allow just the plain number, as others including the NCBI do (in their genbank records for example you get '/db_xref="HGNC:HGNC:25180"'). Also the number of users that write things down without context, unless prompted, is large enough that I would prefer to keep the 'HGNC:' prefix on the output too.

Oh, yes, never remove allowed input or change the formatting of a variable in the output in a "live" API that doesn't have versioning! I'm personally OK with adding additional fields to a JSON API, as I assume that existing implementations won't crash if additional data is returned. Other implementers are more strict and even increase the version number when adding fields. In any case, allowing more diverse input doesn't change existing implementations ever, so IMO never requires an increment of the version number.
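The "additional fields shouldn't crash existing implementations" assumption only holds for clients that read the fields they know and ignore the rest. A minimal Python sketch of such a client, using field names from the responses in this thread; `KNOWN_FIELDS`, `read_gene_record` and `some_future_field` are hypothetical names for illustration:

```python
import json

# Fields this hypothetical client relies on; anything else is ignored.
KNOWN_FIELDS = ("current_symbol", "hgnc", "requested_symbol", "transcripts")


def read_gene_record(record):
    """Pick only the known fields, so extra fields in the output are harmless."""
    return {field: record.get(field) for field in KNOWN_FIELDS}


payload = json.loads(
    '{"current_symbol": "MT-ATP6", "hgnc": "HGNC:7414", '
    '"requested_symbol": "MT-ATP6", "transcripts": [], '
    '"some_future_field": 1}'
)
print(read_gene_record(payload))
```

A client written this way keeps working when fields are added, but still breaks if a field it reads is removed or changes format, which is the case that needs a new endpoint version.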

@Peter-J-Freeman
Collaborator

The updated code will accept "HGNC:1234" or "1234" and return the same result.

Just not pushing yet because having a few database difficulties :)
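As an illustration of accepting both forms, here is a minimal Python sketch of input normalisation. This is an assumption about how it could be done, not the actual VariantValidator code; `normalise_gene_query` is a hypothetical helper.

```python
import re

# Optional "HGNC:" prefix (any case) followed by digits only.
_HGNC_ID = re.compile(r"^(?:HGNC:)?(\d+)$", re.IGNORECASE)


def normalise_gene_query(query):
    """Return 'HGNC:<digits>' for HGNC-style input, else the query unchanged."""
    query = query.strip()
    match = _HGNC_ID.match(query)
    if match:
        return f"HGNC:{match.group(1)}"
    return query  # assumed to be a gene symbol


for example in ("HGNC:1234", "1234", "MT-ATP6"):
    print(example, "->", normalise_gene_query(example))
```

Normalising to the prefixed form internally keeps the output format unchanged while letting "HGNC:1234" and "1234" resolve to the same record.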

@Peter-J-Freeman
Collaborator

We will be updating the version numbers for all tools because recent changes to the VV engine required breaking changes, and I like to keep all major versions of all tools the same. May not be engineering correct, but prevents my brain from hurting.

@Peter-J-Freeman
Collaborator

Sending a numeric HGNC ID still returns an HTTP 500 - is this intended, or should it show a warning/error now?

Now working and active on the server @ifokkema

on my system, this

import json
import VariantValidator
vval = VariantValidator.Validator()
gene = '7414'
select_transcripts = None
g_and_t = vval.gene2transcripts(gene, validator=vval, select_transcripts=select_transcripts, transcript_set="ensembl")
print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))

will now return

{
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
        {
            "annotations": {
                "chromosome": "MT",
                "db_xref": {
                    "CCDS": null,
                    "ensemblgene": "ENSG00000198899",
                    "hgnc": "HGNC:7414",
                    "ncbigene": null,
                    "select": "Ensembl"
                },
                "ensembl_select": true,
                "mane_plus_clinical": false,
                "mane_select": false,
                "map": "mitochondria",
                "note": "mitochondrially encoded ATP synthase membrane subunit 6",
                "refseq_select": false,
                "variant": "201"
            },
            "coding_end": 681,
            "coding_start": 1,
            "description": "ATP6-201",
            "genomic_spans": {},
            "length": 681,
            "reference": "ENST00000361899.2",
            "translation": "ENSP00000354632.2"
        }
    ]
}

We will roll out new database builds ASAP to make this work on the server. This is to show what a patch would look like, but we want to make a full db release

@Peter-J-Freeman
Collaborator

Hmm, seems I need to fix the alignments. They are missing!!! Will look into this since it works for other genes e.g. COL1A1

@Peter-J-Freeman
Collaborator

Now also fixed, but again, will not work until the dbs are recreated. Will take a few weeks

@ifokkema
Collaborator Author

We will be updating the version numbers for all tools because recent changes to the VV engine required breaking changes, and I like to keep all major versions of all tools the same. May not be engineering correct, but prevents my brain from hurting.

I meant the API version, e.g., /api/v1/method?arguments vs /api/v2/method?arguments. The versions in the meta data of the output are a different thing. I meant that as long as the endpoint isn't versioned, stuff from the output shouldn't be removed, input requirements shouldn't be changed, but additions to the output are generally OK.

Now also fixed, but again, will not work until the dbs are recreated. Will take a few weeks

Excellent, thanks!

@ifokkema
Collaborator Author

OK, the code is fixed, but I think it will need another database build for HGNC:7414. This is not a quick process. I need to liaise with @John-F-Wagstaff.

Hi Pete, I'm going through old emails; this doesn't work yet (sending HGNC:7414 to the gene2transcripts_v2 when the gene is a mitochondrial gene). Is the mentioned database build delayed, or didn't it fix the problem? Thanks!

@Peter-J-Freeman
Collaborator

Not sure why this keeps popping back up. Will look asap

@Peter-J-Freeman
Collaborator

Looking again at this

On my setup, local, I see

>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> gene = '["HGNC:7414", "MT-APT6"]'
>>> g_and_t = vval.gene2transcripts(gene)
>>> print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
[
    {
        "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "previous_symbol": "MTATP6,RP",
        "requested_symbol": "MT-ATP6",
        "transcripts": []
    },
    {
        "error": "Unable to recognise gene symbol MT-APT6",
        "requested_symbol": "MT-APT6"
    }
]
>>> 

Which looks like the HGNC ID is working but the Symbol is not.
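When several genes are requested at once, the response mixes successful entries with error entries, as in the output above. A minimal Python sketch for separating the two, based only on the response shape shown in this thread; `split_results` is a hypothetical helper:

```python
def split_results(entries):
    """Split a gene2transcripts response list into (found, failed) entries."""
    found = [entry for entry in entries if "error" not in entry]
    failed = [entry for entry in entries if "error" in entry]
    return found, failed


# Abbreviated copy of the response above.
response = [
    {"current_symbol": "MT-ATP6", "hgnc": "HGNC:7414",
     "requested_symbol": "MT-ATP6", "transcripts": []},
    {"error": "Unable to recognise gene symbol MT-APT6",
     "requested_symbol": "MT-APT6"},
]

found, failed = split_results(response)
print([entry["requested_symbol"] for entry in found])
print([entry["requested_symbol"] for entry in failed])
```

Note that the successful entry reports `requested_symbol` as "MT-ATP6" although "HGNC:7414" was sent, so entries cannot be matched back to the input by that field alone.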

@Peter-J-Freeman
Collaborator

Peter-J-Freeman commented Nov 15, 2024

And now without the typo

>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> gene = '["HGNC:7414", "MT-ATP6"]'
>>> g_and_t = vval.gene2transcripts(gene)
>>> print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
[
    {
        "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "previous_symbol": "MTATP6,RP",
        "requested_symbol": "MT-ATP6",
        "transcripts": []
    },
    {
        "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "previous_symbol": "MTATP6,RP",
        "requested_symbol": "MT-ATP6",
        "transcripts": []
    }
]
>>> 

So all is working. Now to test the server since the db is good and the code is good

@Peter-J-Freeman
Collaborator

The server setup is showing

>>> import json
>>> import VariantValidator
>>> vval = VariantValidator.Validator()
>>> gene = '["HGNC:7414", "MT-ATP6"]'
>>> g_and_t = vval.gene2transcripts(gene)
>>> print(json.dumps(g_and_t, sort_keys=True, indent=4, separators=(',', ': ')))
[
    {
        "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "previous_symbol": "MTATP6,RP",
        "requested_symbol": "MT-ATP6",
        "transcripts": []
    },
    {
        "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
        "current_symbol": "MT-ATP6",
        "hgnc": "HGNC:7414",
        "previous_symbol": "MTATP6,RP",
        "requested_symbol": "MT-ATP6",
        "transcripts": []
    }
]
>>> 

So is working. So, now to look at whether the REST interface is the issue
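The library calls in this thread pass several genes as a JSON list in a string, while the REST URLs in this thread pass them pipe-delimited and percent-encoded (`HGNC%3A7414%7CMT-ATP6`). A minimal Python sketch that derives both forms from one list, so the same input can be compared across the two interfaces; the helper names are hypothetical:

```python
import json
from urllib.parse import quote


def library_gene_argument(genes):
    """JSON list in a string, as passed to vval.gene2transcripts() in this thread."""
    return json.dumps(genes)


def rest_gene_segment(genes):
    """Pipe-delimited, percent-encoded path segment, as in the REST URLs in this thread."""
    return quote("|".join(genes), safe="")


genes = ["HGNC:7414", "MT-ATP6"]
print(library_gene_argument(genes))  # ["HGNC:7414", "MT-ATP6"]
print(rest_gene_segment(genes))      # HGNC%3A7414%7CMT-ATP6
```

Note that the hyphen is left untouched by the encoding; only ":" and "|" are escaped.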

@Peter-J-Freeman
Collaborator

Peter-J-Freeman commented Nov 15, 2024

local rest interface
http://127.0.0.1:8000/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7414%7CMT-ATP6/False/all/GRCh38?content-type=application%2Fjson

[
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "mitochondria",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "201"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  },
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "mitochondria",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "201"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  }
]

Which makes me happy because we can now generate c. and p. for mito genes thanks to Ensembl
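As a rough illustration of what the `genomic_spans` block allows, here is a minimal Python sketch that maps a transcript position onto NC_012920.1 using the `exon_structure` values from the output above. It is a simplification under stated assumptions (plus strand, gap-free exons only), not how VariantValidator performs mapping; `transcript_to_genomic` is a hypothetical helper.

```python
def transcript_to_genomic(position, exon_structure, orientation=1):
    """Map a 1-based transcript position to a genomic position.

    Only plus-strand, gap-free exons (cigar of the form '<length>=') are
    handled; anything else raises rather than guessing at conventions.
    """
    if orientation != 1:
        raise NotImplementedError("only orientation 1 is sketched here")
    for exon in exon_structure:
        start, end = exon["transcript_start"], exon["transcript_end"]
        if start <= position <= end:
            if exon["cigar"] != f"{end - start + 1}=":
                raise NotImplementedError("gapped alignments are not handled")
            return exon["genomic_start"] + (position - start)
    raise ValueError(f"position {position} is outside all exons")


# Exon structure copied from the ENST00000361899.2 output above.
exons = [{"cigar": "681=", "exon_number": 1, "genomic_end": 9207,
          "genomic_start": 8527, "transcript_end": 681, "transcript_start": 1}]

print(transcript_to_genomic(1, exons))    # 8527
print(transcript_to_genomic(681, exons))  # 9207
```

The first and last transcript positions land on `start_position` and `end_position` of the span, which is a quick sanity check on the alignment data.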

@Peter-J-Freeman
Collaborator

http://127.0.0.1:8000/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7414%7CMT-ATP6/False/all/GRCh38?content-type=application%2Fjson

gives an error

[
  {
    "error": "Unable to recognise gene symbol MT",
    "requested_symbol": "MT"
  },
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "chrMT:8527:9207",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "ATP6"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "MT-ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  }
]

I will look at the rest interface. May need an update

@Peter-J-Freeman
Collaborator

OK, I updated the server with the latest local version.

local versions are

[VariantValidator](https://github.com/openvar/rest_variantValidator) version 2.2.1.dev685+g607f552
[VariantFormatter](https://github.com/openvar/variantFormatter) version 2.2.1.dev66+g99f5b9a
[vv_hgvs](https://github.com/openvar/vv_hgvs) version 2.2.0
[VVTA](https://www528.lamp.le.ac.uk/) release vvta_2024_09
[vvSeqRepo](https://www528.lamp.le.ac.uk/) release VV_SR_2024_09

Live versions are

[VariantValidator](https://github.com/openvar/rest_variantValidator) version 2.2.1.dev734+ga70a50c
[VariantFormatter](https://github.com/openvar/variantFormatter) version 2.2.1.dev73+g6cb7954
[vv_hgvs](https://github.com/openvar/vv_hgvs) version 2.2.0
[VVTA](https://www528.lamp.le.ac.uk/) release vvta_2024_01
[vvSeqRepo](https://www528.lamp.le.ac.uk/) release VV_SR_2024_04

which were out due to being on different branches. Gonna test a local from master install

the versions are now in line with the live

{'variantvalidator_version': '2.2.1.dev734+ga70a50c', 'variantvalidator_hgvs_version': '2.2.0', 'vvta_version': 'vvta_2024_09', 'vvseqrepo_db': '/Users/user/variantvalidator_data/seqdata/VV_SR_2024_09/master', 'vvdb_version': 'vvdb_2024_8'}

note: The updated VVTA and SR do not affect this, we already know from the above that the validator db has the correct info
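A quick way to see which components differ between two installs is to diff their version listings. A minimal Python sketch, using the local and live versions listed earlier in this comment as plain dictionaries; `version_mismatches` is a hypothetical helper, not part of VariantValidator:

```python
def version_mismatches(local, live):
    """Return {component: (local_value, live_value)} for every differing component."""
    return {
        name: (local.get(name), live.get(name))
        for name in sorted(set(local) | set(live))
        if local.get(name) != live.get(name)
    }


# Values copied from the "local versions" and "Live versions" lists above.
local = {
    "VariantValidator": "2.2.1.dev685+g607f552",
    "VariantFormatter": "2.2.1.dev66+g99f5b9a",
    "vv_hgvs": "2.2.0",
    "VVTA": "vvta_2024_09",
    "vvSeqRepo": "VV_SR_2024_09",
}
live = {
    "VariantValidator": "2.2.1.dev734+ga70a50c",
    "VariantFormatter": "2.2.1.dev73+g6cb7954",
    "vv_hgvs": "2.2.0",
    "VVTA": "vvta_2024_01",
    "vvSeqRepo": "VV_SR_2024_04",
}

for name, (local_value, live_value) in version_mismatches(local, live).items():
    print(f"{name}: local={local_value} live={live_value}")
```

On the lists above, everything except vv_hgvs differs, which separates the software difference (branches) from the data difference (VVTA and SeqRepo releases).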

http://127.0.0.1:8000/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7414%7CMT-ATP6/False/all/GRCh38?content-type=application%2Fjson

[
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "mitochondria",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "201"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  },
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "mitochondria",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "201"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  }
]

So it does not seem to be software. @John-F-Wagstaff , I can only think that there is something odd with the mounting to APACHE

@Peter-J-Freeman
Collaborator

@John-F-Wagstaff @ifokkema, it looks to me, from the error

[
  {
    "error": "Unable to recognise gene symbol MT",
    "requested_symbol": "MT"
  },

that decoding in Apache mod_wsgi is deleting the "-" character. We have seen this before when trying to pass HGVS intronic descriptions. I think it is somewhere in the VVweb code. The "-" character, when passed, can become a space.

@ifokkema
Collaborator Author

Thank you for the research! How awesome is it, by the way, being able to handle MT variants 😍 !

That decoding in mod_wsgi apache is deleting the "-" character. We have seen this before when trying to pass HGVS intronic descriptions. I think it is somewhere in the VVweb code. The "-" character when passed can become a space.

Very interesting! But wasn't the issue with intronic variants the "+" character, maybe? That needs to be URL encoded to "%2B" to not be interpreted as a space, indeed. However, the hyphen doesn't have a URL-encoded equivalent. There is no encoding, as far as I know, that translates a hyphen into a space. Google doesn't help me much here. The only thing that I am thinking of is that hyphens can be used as argument separators, but then they still need whitespace... I don't know enough about mod_wsgi to know what's going on here... 🤔
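The difference between "+" and "-" can be checked with Python's standard `urllib.parse`. This only demonstrates generic URL decoding rules; it says nothing about what mod_wsgi or the VVweb code actually do. The two variant strings are arbitrary made-up examples.

```python
from urllib.parse import quote, unquote, unquote_plus

plus_variant = "c.100+5G>A"    # arbitrary example containing "+"
minus_variant = "c.100-5G>A"   # arbitrary example containing "-"

# Form-style decoding turns a literal "+" into a space...
print(unquote_plus(plus_variant))    # c.100 5G>A
# ...but leaves a hyphen alone.
print(unquote_plus(minus_variant))   # c.100-5G>A
print(unquote_plus("MT-ATP6"))       # MT-ATP6

# Plain percent-decoding leaves "+" untouched.
print(unquote(plus_variant))         # c.100+5G>A

# Encoding "+" as %2B makes it survive form-style decoding.
encoded = quote(plus_variant, safe="")
print(encoded)                       # c.100%2B5G%3EA
print(unquote_plus(encoded))         # c.100+5G>A
```

So a standards-conforming decoder should never turn the hyphen in "MT-ATP6" into a space; only "+" is affected, and only under form-style decoding.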

@ifokkema
Collaborator Author

ifokkema commented Nov 15, 2024

It doesn't seem to be the hyphen. I realized there are other gene symbols with hyphens, like A1BG-AS1.

Both
https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A37133/mane/all/GRCh38?content-type=application%2Fjson
and
https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/A1BG-AS1/mane/all/GRCh38?content-type=application%2Fjson
output:

[
  {
    "current_name": "A1BG antisense RNA 1",
    "current_symbol": "A1BG-AS1",
    "hgnc": "HGNC:37133",
    "previous_symbol": "NCRNA00181,A1BGAS,A1BG-AS",
    "requested_symbol": "A1BG-AS1",
    "transcripts": []
  }
]

So it's not the hyphen. Right?

@Peter-J-Freeman
Collaborator

It doesn't seem to be the hyphen. I realized there are other gene symbols with hyphens, like A1BG-AS1.

Thanks @ifokkema. This is useful, although really confusing. Why is it happening with this symbol? I'll keep digging.

p.s. Can LOVD use ensembl transcripts for MT?

@Peter-J-Freeman
Collaborator

Hold on

https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/MT-ATP6/False/all/GRCh38?content-type=application%2Fjson

[
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": [
      {
        "annotations": {
          "chromosome": "MT",
          "db_xref": {
            "CCDS": null,
            "ensemblgene": "ENSG00000198899",
            "hgnc": "HGNC:7414",
            "ncbigene": null,
            "select": "Ensembl"
          },
          "ensembl_select": true,
          "mane_plus_clinical": false,
          "mane_select": false,
          "map": "chrMT:8527:9207",
          "note": "mitochondrially encoded ATP synthase membrane subunit 6",
          "refseq_select": false,
          "variant": "ATP6"
        },
        "coding_end": 681,
        "coding_start": 1,
        "description": "MT-ATP6-201",
        "genomic_spans": {
          "NC_012920.1": {
            "end_position": 9207,
            "exon_structure": [
              {
                "cigar": "681=",
                "exon_number": 1,
                "genomic_end": 9207,
                "genomic_start": 8527,
                "transcript_end": 681,
                "transcript_start": 1
              }
            ],
            "orientation": 1,
            "start_position": 8527,
            "total_exons": 1
          }
        },
        "length": 681,
        "reference": "ENST00000361899.2",
        "translation": "ENSP00000354632.2"
      }
    ]
  }
]

It just worked. Submitted as a single entry

@Peter-J-Freeman
Collaborator

@ifokkema , please test

@ifokkema
Collaborator Author

It doesn't seem to be the hyphen. I realized there are other gene symbols with hyphens, like A1BG-AS1.

Thanks @ifokkema. This is useful, although really confusing. Why is it happening with this symbol? I'll keep digging.

I guess all MT symbols, but I could double-check that, if you'd like.

p.s. Can LOVD use ensembl transcripts for MT?

We have had an "Ensembl ID" field for our transcripts since forever, but we never really used it. For MT genes, we used to have a "fake" NCBI ID that triggered Mutalyzer to use the annotation given in the MT GenBank file as a transcript. Now, I guess we'll solve it using the Ensembl IDs that VV gives us!

@ifokkema
Collaborator Author

It just worked. Submitted as a single entry

Interesting?

https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7415/mane/all/GRCh38?content-type=application%2Fjson
still gives:

[
  {
    "error": "Unable to recognise gene symbol MT",
    "requested_symbol": "MT"
  }
]

Using the gene symbol,
https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/MT-ATP6/mane/all/GRCh38?content-type=application%2Fjson
now works:

[
  {
    "current_name": "mitochondrially encoded ATP synthase membrane subunit 6",
    "current_symbol": "MT-ATP6",
    "hgnc": "HGNC:7414",
    "previous_symbol": "MTATP6,RP",
    "requested_symbol": "MT-ATP6",
    "transcripts": []
  }
]

Uhhhh odd! So there's something up in the conversion between HGNC ID to gene symbol, but only for the MT genes?

@Peter-J-Freeman
Collaborator

Oh yes, so it is decoding of the HGNC ID. I concur

https://rest.variantvalidator.org/VariantValidator/tools/gene2transcripts_v2/HGNC%3A7414/False/all/GRCh38?content-type=application%2Fjson works fine on my laptop running on local but not on the live server

@Peter-J-Freeman
Collaborator

I am now wondering if that version of the database contains a duplicate entry. My local version does not.

I am making a new build anyway. Will complete over the weekend. We will then NUKE the old database and install the new and test again before digging further

@ifokkema
Collaborator Author

Alright! Sounds good! I'm anyway working on other stuff right now. I decided to torture myself and rebuild our HGVS tool from the ground up before doing the data analysis and writing that paper... let's hope that was a good idea 😆

@Peter-J-Freeman
Collaborator

Sounds fun :P

@ifokkema
Collaborator Author

Attempting to convert over 3000 lines of unreadable and unmanageable code into something readable and manageable in a completely different structure, what's not to like? 😂
