Use the "full" variant description for intronic variants #616
Comments
It is probable that the layout of the validation results page could be altered to accommodate your request. That said, I have a long-standing dislike of descriptions such as I had this argument several years ago when I was a member of the HGVS nomenclature committee, but the others decided to adopt the format that is now used. Just out of interest, are there any known examples where it's essential that the genome build be specified to enable validation with tools other than VariantValidator? |
That actually depends on the exact interpretation of
Do you mean as a separate input field? I'm not sure - Mutalyzer simply requires the NC-based syntax; idem for LOVD. I'm not that familiar with all the other tools out there. |
For what reason is the interpretation of @Peter-J-Freeman and/or @John-F-Wagstaff are better placed than I am to explain the process by which VariantValidator validates a description. The sequential steps may differ from the process used by Mutalyzer. I wonder if the adopted format for describing intronic sequence variants was influenced by what best suited Mutalyzer.
I think that I did not ask my question clearly enough. Let's try again. Are there any real-world examples where a given intronic variant description is valid with respect to just one genome build, but not the other? |
We already make these recommendations in the Recommended Variant Descriptions table. Also, if you print the PDF, this is very clear.
I disagree. The HGVS nomenclature states that variants should be described at g., c. and p. If the g. is provided, i.e. in the top table, the format NC_000020.10(NM_001283009.2):c.1266+3_1266+80del is redundant and needlessly long. It is also not very interoperable since most databases use both the g. and the c. My preference when advising on publications is to ditch NC_000020.10(NM_001283009.2):c.1266+3_1266+80del and use the g. and the c. I think this is related to education, not processing, and I think we do a decent job already. As for the way Mutalyzer and other platforms like VEP handle NM_ vs NC_, we have reference sequences. The variant MUST be in the context of the actual reference sequence, hence VV does make corrections to descriptions based on the content of the reference sequence under review. There is a simple solution for Mutalyzer and others: drop NM_ and use ENST, then this goes away, as do any worries over alignment gaps :) |
Simply because the HVNC didn't specify how it should be implemented on a technical level, and Mutalyzer and VariantValidator implemented it in different ways, which then highlighted the ambiguity of how the HVNC "defines" the NC(NM) mapping (mostly, the lack of a clear definition).
I don't know. I don't think there are meeting notes from that period, so we may need to rely on what people remember if we would want to find out.
Yes, but I don't think I tagged it or stored it somewhere with a label so I can find it back. I ran into it by accident because an intronic variant didn't validate, and I had just always used the hg19 NCs. The variant was valid in the context of hg38, so it clarified for me that the variant must have been called on hg38.
People rarely read properly, though...
The NC(NM) description may be redundant when the genomic variant is present, but that doesn't make the NM-based description valid HGVS nomenclature.
What is not interoperable? Do you mean that most databases use NM-based c. descriptions instead of NC(NM)-based descriptions?
Education is indeed lacking. In the sense that, in general, people don't understand the complexity. That's also why they often don't use valid descriptions at all.
That will then first require the entire user base of Mutalyzer, VEP, and VV, to switch over 😉 |
your point is? ;)
However, we also try to make descriptions precise, so providing the g. and the c. is fine and negates the need for NC(NM); otherwise we are needlessly complicating descriptions. So, the simple solution is to ensure the correct NC_ is added to the top table (since, as you say, the intronic sequence varies depending on the genome build), since we provide both in a separate table. We could make this clearer, but I personally think it is pointless to provide the NC_(NM_) if the correct g. is provided. And, as you say, all users would have to switch to the NC_(NM_) format because I know of pretty much nowhere that it is used :)
But I totally agree, and hopefully the professional standard will address this. So I think the only real action could be to add the correct g. into the top table and perhaps add some information to the interface to state what authors should use when publishing. This we could do for sure |
We currently have 3039 transcripts with variation in the internal transcript exon structure between mappings, though this includes alt mappings (NG_ and older BLAT alignments were excluded). Many of these will just be truncated mappings, but other, more complex differences do exist. For those transcripts with multiple mappings sourced from identical within-transcript exon position sets, we have 11182 transcripts where the same intron has differing lengths in different mappings. Of these, 7464 are between the different versions of the main (NC_) chromosomes, so the rest are main versus alt differences for the same genome build. This ignores SNPs or other length-neutral changes between reference genomes or between main and alt mappings. As such, the same NM_ definition definitely could mean different things depending on the NC_ or alt version it is paired with. Just specifying GRCh37/8 won't be enough for alt sequences. I think this means that any support of alts, particularly in batch inputs, is basically dependent on the two-sequence bracketed form just to make the input boxes make sense, let alone provide accurate answers. We need to be able to support alts, particularly if we want to be able to handle rare disease data. I unfortunately also have to agree that if you give the users a pair of definitions that "should go together" they often won't bother with the genomic one. The original quote from Murphy that led to Murphy's law is after all "If there are two or more ways to do something and one of those results in a catastrophe, then someone will do it that way." Edit: all of the problems with Ensembl having the same ID for sequences that had different sequence content in different genome builds arose from the assumption that you should always know which genome build you are working with, so it won't be a problem... |
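A minimal sketch of the kind of comparison described in the comment above, assuming each mapping is available as a list of exon coordinates on the genomic sequence. The function names and all coordinates are hypothetical and invented for illustration; this is not how VariantValidator stores or compares its alignments.

```python
from typing import List, Tuple

# (genomic start, genomic end), 1-based inclusive, plus strand (assumed layout)
Exon = Tuple[int, int]


def intron_lengths(exons: List[Exon]) -> List[int]:
    """Return the length of each intron lying between consecutive exons."""
    ordered = sorted(exons)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def differing_introns(mapping_a: List[Exon], mapping_b: List[Exon]) -> List[int]:
    """Return 1-based intron numbers whose lengths differ between two mappings."""
    a, b = intron_lengths(mapping_a), intron_lengths(mapping_b)
    if len(a) != len(b):
        raise ValueError("mappings have different exon counts")
    return [i + 1 for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical coordinates for one transcript mapped to two genomic sequences.
build_a = [(100, 199), (300, 399), (500, 599)]
build_b = [(1100, 1199), (1300, 1399), (1510, 1609)]

print(intron_lengths(build_a))              # [100, 100]
print(intron_lengths(build_b))              # [100, 110]
print(differing_introns(build_a, build_b))  # [2]
```

In this toy case an offset counted from the end of intron 2 lands on different bases in the two mappings, which is the situation where an NM-only intronic description becomes ambiguous.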
On a side note, and slightly relevant. I mapped the MANE Select from Ensembl COL1A1 and RefSeq to GRCh38 NM_000088.4:c.589-1_589delGGinsG > NC_000017.11:g.50198002_50198003delCCinsC I thought MANE Select were identical from start to finish, but I guess the alignments can be different. @John-F-Wagstaff can you please check these alignments against the source? I am really surprised by this because the UCSC database shows something different. I will keep trying to figure out what is going on. |
@John-F-Wagstaff, is this to do with the dodgy ENST data we got from an archive? It's COL1A1 playing up again, and this is post your patch on my local system. |
ugh yes, looks like we need a fix for the alignment table as well as the exons, I will get a fix to you ASAP |
Arguably, LOVD does just that. If I select a gene, say COL1A1, and then select the Variants tab, a long list of variants is displayed including intronic variants such as However, no information is displayed regarding the corresponding genome build. That information can only be found by clicking on one or other of the two displayed instances of the variant. Once there, the header at the top of the page says: I cannot find where the variant is described as Am I missing something in the interface that ought to be more obvious to me? |
@leicray scroll down to the recommended variant description table or print the PDF. |
My comment was entirely about LOVD. Nothing to do with VariantValidator. @ifokkema had asked about examples of databases that do not display "NC(NM)-based descriptions". |
I'm not sure if we need to do anything our end. We display all relevant descriptions including the genome build selected. So the full description can be used, but is really not necessary since we provide the Genome build and the relevant HGVS genomic description. |
I agree that we do not have to do anything more at our end. |
This sounds like an argument for dropping the NM description entirely, not an argument for using the invalid NM-based description over the valid NC(NM)-based description. Like I said, the NC(NM) description may be redundant when the genomic variant is present, but that doesn't make the NM-based description valid HGVS nomenclature.
This is Heidi's argument for removing the parentheses from predicted protein descriptions - "Because others aren't using the standards, neither should we..."
LOVD is far from perfect, but it doesn't print
The genomic DNA field shows the build in the header, it's on the detailed page, and on all data entry forms. It's not perfect, but we do mention it. We'll run into issues when we start supporting multiple genome builds, but that's a different story for which I still don't have a solution. Anyway, we're not perfect, but we don't write
I never said we do that 😅 The HGVS nomenclature states you can use the variant description without the reference sequence as long as that is mentioned elsewhere, and that is what we're currently doing. We mention the genome build, the NC, the NM, and then display the DNA descriptions (g. and c.) without the reference sequences. The page titles, a feature built separately, then uses an invalid gene-based format for all cDNA descriptions. I don't remember how that happened, but I'll fix it. What I asked was what you meant with, "It is also not very interoperable since most databases use both the g. and the c.".
I didn't ask for that; I know, for instance, that ClinVar shows invalid NM-based intronic variant descriptions. |
I see what you are saying but I slightly disagree. I'm not saying use it because others do. Rather, I see this as another area where there is duplication in the HGVS guidelines. HGVS states that variants should be described at all relevant levels, usually g., c. and p. In my opinion, and certainly based on my editing experience (especially when reviewing ACMG papers), this is essential since it stops a lot of significant errors which ought to be avoided. In this case, the full NC_(NM_):c. description contains redundancy, i.e. it duplicates information. It would be useful to discuss this in an HGVS meeting because I am not saying either is incorrect. I agree that the NC_(NM_):c. is correct, but I also argue that there is no need to use it if HGVS is correctly applied and the g. and c. are both provided.
Absolutely it is, and I think we can make a much stronger statement about the use of the variant descriptions in the recommended variant descriptions table. We could also pull the NC_(NM_):c. descriptions into the top table, i.e. restructure the layout. My worry is that by dropping NM_:c. descriptions we lose a format that is used in all databases like LOVD and ClinVar, journals, dbSNP etc. More than happy to look at the layout, but are you suggesting we drop the NM_:c. descriptions and just show the NC_(NM_):c. descriptions? My feeling is that this would lose us users and open up a lot of complaints :). We do provide all the correct descriptions, so this seems to be a matter of adjusting how we display the data. The alternatives I can think of are:
|
I would certainly not wish to see removal of NM_:c. descriptions from the VV output. Arguably, we could rearrange the results page order to emphasise that there are HGVS recommended variant descriptions. However, I would prefer just to emphasise use of HGVS recommended variant descriptions. I would certainly be against any suggestion that an NM_:c. description submitted by a user should be immediately converted to an NC_(NM_):c. description at the top of the results page. That would be confusing for users. I agree with @Peter-J-Freeman that this needs to be carefully discussed by the HVNC as there do seem to be two sets of recommendations. I would be happy to participate in such discussions in my role as an "emeritus" committee member. Finally, as Garry Cutting has often said "Do not let perfection be the enemy of progress". |
Of the NC reference sequence? Or of the c. and p. descriptions? I believe, but the HVNC may want to correct me, that it's the "separation" that I mentioned that is the problem. As far as I know, each and every variant description should, by itself, be interpretable. That would mean that the c. notation should be interpretable, with or without a g. notation somewhere near (in the same table, sentence, etc). Therefore, with that assumption, the NC in the NC(NM) is not considered redundant, as it facilitates the interpretation of the variant description.
I believe Alex opened up the agenda already for the next meeting; we could put it in there to discuss it?
I totally get that feeling... and of course, the whole NM/NC(NM) debate only applies to intronic variants, so we're talking about a subset of all c. descriptions.
Only for intronic variants, but yes... since, as far as I interpret the HGVS rules, the NM:c. description of an intronic variant is always invalid, also when elsewhere the g. description is given. I'm wondering, though, related to the Aries integration, if authors provide a list of variant descriptions as used in their manuscript and the g. and c. notations end up on a different line, how will VV determine which genome build to use for intronic variants? Will it again be an input field? If so, we'd be assuming (but probably, rightfully so), that all descriptions in one manuscript will use the same genome build.
Perhaps, depending on what the HVNC says, any clear indication that the NM:c. description for intronic variants is not a valid description by itself would already be a great addition. Easier access to what is the valid standalone HGVS c. description for that variant, would help the user find the HGVS-compliant description they might want to use somewhere.
Even for intronic variants? I wasn't trying to suggest doing it for exonic variants...
If you could point out what in the HGVS documentation conflicts with what other part, there is a better chance of having a good discussion within the committee of what conflict needs resolution. Otherwise, I think the clearest question would be "Is an NM:c. intronic variant description valid when the genome build is mentioned elsewhere?"
That is definitely true, although I don't know what progress we're holding back 😝 |
I only intended this comment to refer to intronic variants. I still maintain that NC_(NM_) descriptions are inherently confusing. Parentheses are defined in HGVS as follows Placing parentheses around the NM_ implies uncertainty. I cannot find any alternative account of the use of parentheses, but I may have missed something. The next point is that the parentheses could be interpreted to indicate that the NM_ component of the description is, in some way, additional but non-essential (subsidiary) information. (I am thinking here in terms of how parentheses are used in normal English grammar.) If (NM_) is just additional, but non-essential, information, an NC_(NM_):c. description becomes an NC_:c. description, which is certainly not valid. The variant description |
The parentheses in NC(NM) format don't indicate uncertainty, indeed. The definition given on the "general" recommendations page doesn't explain the use of parentheses in reference sequences, indeed.
It's not meant as uncertainty or non-essential additional information. It is meant as additional information, but in the opposite way that you mention. It's the NM that provides the context, not the NC. This is also why Mutalyzer interprets NC(NM) so differently from VariantValidator. For Mutalyzer, the NC in NC(NM) provides all the sequence. For VariantValidator, both reference sequences provide sequence; exons are provided by the NM, and intronic sequences by the NC. This allows VariantValidator to drop the NC from the NC(NM) descriptions and create valid variant descriptions (for exonic positions, that is), while The NM in NC(NM) is meant as a form of selection; the NM annotation (positions) is selected from within the NC reference sequence. Mutalyzer takes the data from the GenBank file; VariantValidator takes the mappings from the official alignments and constructs a new sequence based on that. With this, we are actually moving into the domain of "what does NC(NM) actually mean?", where Mutalyzer and VV are doing completely opposite things. The HVNC does not define it well enough. |
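To make the two roles of the reference sequences concrete, here is a rough sketch that splits a description into its NC and NM parts. The regular expression is a deliberate simplification written for this illustration; it does not cover the full HGVS grammar and is not how Mutalyzer or VariantValidator parse input. The names `ParsedDescription` and `parse_description` are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedDescription(NamedTuple):
    genomic: Optional[str]     # e.g. the NC_ accession, if given
    transcript: Optional[str]  # e.g. the NM_ accession, if given
    coord_type: str            # "c", "g", ...
    body: str                  # everything after the "c." / "g." prefix


# Simplified pattern: ACCESSION or ACCESSION(ACCESSION), then ":<type>.<body>"
_PATTERN = re.compile(
    r"^(?P<first>[A-Z]{2}_\d+\.\d+)"
    r"(?:\((?P<second>[A-Z]{2}_\d+\.\d+)\))?"
    r":(?P<type>[a-z])\.(?P<body>.+)$"
)


def parse_description(description: str) -> ParsedDescription:
    match = _PATTERN.match(description)
    if match is None:
        raise ValueError(f"unrecognised description: {description!r}")
    first, second = match["first"], match["second"]
    if second is not None:
        # Bracketed form: the outer accession is the genomic context,
        # the inner accession selects the transcript annotation within it.
        return ParsedDescription(first, second, match["type"], match["body"])
    if first.startswith("NC_"):
        return ParsedDescription(first, None, match["type"], match["body"])
    return ParsedDescription(None, first, match["type"], match["body"])


full = parse_description("NC_000020.10(NM_001283009.2):c.1266+3_1266+80del")
short = parse_description("NM_001283009.2:c.1266+3_1266+80del")
print(full.genomic, full.transcript)    # NC_000020.10 NM_001283009.2
print(short.genomic, short.transcript)  # None NM_001283009.2
```

The second call shows what is missing from the NM-only form: nothing in the description itself says which genomic sequence supplies the intron.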
I think that I agree with you.
The definitions need to be updated to indicate that there are two usages of parentheses.
It's clear that VariantValidator and Mutalyzer work in different ways. What's also clear is that Mutalyzer is incapable of working out an intronic variant in the absence of the corresponding NC_ sequence record. If I submit The wording of the second part is interesting. It implies that the designation of introns is inherent in genomic reference sequences. As far as I can see, NC_(NM_) is part of the HGVS guidelines solely to satisfy the operational need of Mutalyzer. I would again argue that any need to specify the genome build because of possible intronic sequence differences between builds could, where necessary, be satisfied by descriptions such as This needs open and honest discussion at HVNC with participation of people, such as me, from outside the committee. |
@leicray
I agree, although I think the definitions currently explain the use of characters in variant descriptions. Technically, this is the use of a character in the reference sequence. But I've added it to my (very long) list to figure out where that should go.
Well... technically, VV has the same issue. VV just fetches the NC based on the genome build input. The issue is hidden, but, in reality, both tools have the same limitation.
Only because there is a genome build as an input.
Yes, that's what I meant when I said Mutalyzer uses the NC for all sequences in an NC(NM) context, while VV uses the NC only for the intronic sequences.
Although I can't be sure whether VV was considered when that syntax was invented, any tool processing intronic variants will require an NC for this. The logic on how to obtain that NC can differ (VV uses a genome build input for this), but requiring an NC input is unambiguous (genome build is not, actually).
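As a small illustration of why an NC accession pins things down more tightly than a build name, the sketch below uses a hand-written lookup limited to accessions that appear in this thread. The table and function names are assumptions made for this example; a real tool would read this from assembly metadata rather than a hard-coded dictionary.

```python
from typing import List, Optional

# Hand-written for illustration; only accessions mentioned in this thread.
ACCESSION_TO_BUILD = {
    "NC_000020.10": "GRCh37",
    "NC_000020.11": "GRCh38",
    "NC_000017.11": "GRCh38",
}


def build_for(accession: str) -> Optional[str]:
    """One accession.version identifies at most one build in this table."""
    return ACCESSION_TO_BUILD.get(accession)


def accessions_for(build: str) -> List[str]:
    """A build name, by contrast, covers many sequences."""
    return sorted(acc for acc, b in ACCESSION_TO_BUILD.items() if b == build)


print(build_for("NC_000020.10"))  # GRCh37
print(accessions_for("GRCh38"))   # ['NC_000017.11', 'NC_000020.11']
```

Going from accession to build is a single lookup, whereas going from build to sequence needs extra information (which chromosome, and whether a main or alt sequence is meant).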
That would be ambiguous; does
Any tool will need to, however, and it will need to be able to do so unambiguously. Not only that, but I think we also already have lots of examples where people don't use tools to create their variant descriptions and, therefore, make mistakes. Tools can do this for people, just like applying the 3' forward rule. I wouldn't expect people to do that manually, either.
Sure! The official way to go about this is to start a new discussion (see the list of discussions). A discussion is also easier to add to the HVNC meeting agenda, and it allows all committee members to catch up without long email threads. |
Describe the bug
VariantValidator provides incorrect HGVS descriptions for intronic variants. E.g.,
NM_001283009.2:c.1266+3_1266+80del
NC_000020.10(NM_001283009.2):c.1266+3_1266+80del
Obviously, these descriptions are valid in the context of a given genome build but not as a standalone description. Also, the "HGVS-compliant variant descriptions" table contains the description NM_001283009.2:c.1266+3_1266+80del, but it's not HGVS-compliant. The only place on the page where the correct description is given is further down the page in the table "Transcript and protein descriptions".
To Reproduce
See links above.
Expected behavior
All mentions of NM_001283009.2:c.1266+3_1266+80del should be changed to NC_000020.11(NM_001283009.2):c.1266+3_1266+80del, using the genome build that was used for the input.
Additional context
Although VV validates NM_001283009.2:c.1266+3_1266+80del well because a genome build must always be selected, it would make sense to educate users to always use the NC in the description for intronic variants.
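The conversion requested in this issue could look roughly like the sketch below. It assumes the NC accession for the selected genome build is already known and uses a simplified regular expression to spot intronic offsets; `is_intronic` and `to_full_description` are hypothetical helpers written for this example, not part of VariantValidator.

```python
import re

# An intronic position carries an offset after the coding position,
# e.g. 1266+3, 589-1, *32+5 or -14-2 (simplified check, not full HGVS).
_INTRONIC = re.compile(r"[-*]?\d+[+-]\d+")


def is_intronic(c_body: str) -> bool:
    """Return True if the c. description body contains an intronic offset."""
    return _INTRONIC.search(c_body) is not None


def to_full_description(description: str, nc_accession: str) -> str:
    """Wrap an NM-based intronic c. description in its genomic context."""
    reference, sep, body = description.partition(":c.")
    if not sep:
        raise ValueError("expected a c. description")
    if "(" in reference or not is_intronic(body):
        # Already in NC(NM) form, or exonic: leave unchanged.
        return description
    return f"{nc_accession}({reference}):c.{body}"


print(to_full_description("NM_001283009.2:c.1266+3_1266+80del", "NC_000020.11"))
# NC_000020.11(NM_001283009.2):c.1266+3_1266+80del

# Hypothetical exonic variant, used only to show the pass-through case.
print(to_full_description("NM_001283009.2:c.1266del", "NC_000020.11"))
# NM_001283009.2:c.1266del
```

Only descriptions with an intronic offset are rewritten, which matches the point made in the discussion that the NM versus NC(NM) question applies to intronic variants only.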