Allow c. positions for intergenic variants in the context of a genomic reference sequence #652

Open
ifokkema opened this issue Oct 8, 2024 · 12 comments

Comments

@ifokkema
Collaborator

ifokkema commented Oct 8, 2024

Is your feature request related to a problem? Please describe.
Yesterday, the HVNC approved the suggestion to update the nomenclature to allow c. positions for intergenic variants in the context of a genomic reference sequence. See the HVNC issue on this subject. This allows notations like NC_000023.10(NM_004006.2):c.-128354C>T and treats positions beyond the UTR the same as positions in introns. Currently, VV does not support this, which causes issues with variants that are (partially) intergenic.

Describe the solution you'd like
While it is entirely up to you when to provide mappings to transcripts for a given genomic variant, I see a few changes that should be considered:

  • (required) When variants like NC_000023.10(NM_004006.2):c.-128354C>T are given as input, VV should support mapping these to the NC (NC_000023.10:g.33357783G>A, in this case, for hg19).
  • (recommended) When variants span an entire gene or part thereof, e.g., a deletion with one intragenic breakpoint or two intergenic breakpoints but overlapping with a certain gene, provide mappings for each affected transcript.
  • (optional) When a genomic variant is given and the user requests a mapping on a transcript that lies on the same chromosome, map the variant to the requested transcript and provide a c. notation.
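As a rough illustration of the first (required) item, the sketch below only splits such a description into its parts. It does not perform the mapping to the NC, which needs transcript alignment data. The names `parse_substitution`, `Substitution`, and `complement` and the regular expression are assumptions made for this example, not part of VV's API, and only simple substitutions are handled. In the example from this issue, c. C>T corresponds to g. G>A, which is consistent with a transcript annotated on the minus strand; the `complement` helper shows that base flip.

```python
import re
from typing import NamedTuple, Optional


class Substitution(NamedTuple):
    reference: str           # e.g. NC_000023.10
    selector: Optional[str]  # e.g. NM_004006.2, or None when absent
    coordinate_system: str   # "c", "g" or "n"
    position: str            # kept as text, e.g. "-128354"
    ref: str
    alt: str


# Hypothetical pattern: reference(selector):c.<position><ref>><alt>
_PATTERN = re.compile(
    r"(?P<reference>[A-Z]{2}_\d+\.\d+)"
    r"(?:\((?P<selector>[A-Z]{2}_\d+\.\d+)\))?"
    r":(?P<coord>[cgn])\."
    r"(?P<position>[-*]?\d+(?:[+-]\d+)?)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])"
)

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def parse_substitution(description: str) -> Substitution:
    """Split a simple substitution description into its components."""
    match = _PATTERN.fullmatch(description)
    if match is None:
        raise ValueError(f"Unsupported description: {description}")
    return Substitution(
        match["reference"], match["selector"], match["coord"],
        match["position"], match["ref"], match["alt"],
    )


def complement(base: str) -> str:
    """Return the complementary base, as needed for a minus-strand transcript."""
    return _COMPLEMENT[base]


variant = parse_substitution("NC_000023.10(NM_004006.2):c.-128354C>T")
print(variant.position)                                   # -128354
print(complement(variant.ref), complement(variant.alt))   # G A
```

Keeping the position as text avoids deciding at parse time whether c.-128354 lies in the UTR or beyond it; that can only be answered with the transcript's annotation.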

Describe alternatives you've considered
N/A

Additional context
Related to #333.

@Peter-J-Freeman
Collaborator

@ifokkema, we are waiting on the outcome of HGVSnomenclature/hgvs-nomenclature#186 (comment) before updating.

@Peter-J-Freeman
Collaborator

I would not support this format because it does not support the intron format. It breaks it :). See the link to the above HGVS discussion

@ifokkema
Collaborator Author

> I would not support this format because it does not support the intron format. It breaks it :). See the link to the above HGVS discussion

What do you mean by it breaking the intron format? It's basically the same format as the current UTR format, but with positions calculated into the NC, the same way as for intronic variants. Or do you mean that an "intronic" implementation of this description would natively be "c.-100-100", which should now become "c.-200"?

@Peter-J-Freeman
Collaborator

I think the c.-100-100 is more correct, assuming a 100 nt UTR.

My reasoning is that a position in an intron within the UTR would be c.-50-100, where the intronic position requires stating the exonic base at the intron boundary. We should treat positions beyond the UTR in exactly the same way. c.-200 would therefore not be correct, because you need to state the last position of the UTR and then go into the flanking region.

Same at the 3' end. c.*200 should be c.*100+100, not c.*200.

Using c.-200 or c.*200 would imply that the UTR is still going.
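To make the difference between the two notations concrete, here is a minimal sketch that converts a contiguous position (c.-200, c.*200) into the split, intron-like form argued for above (c.-100-100, c.*100+100). The function name `to_split_notation` is hypothetical, the UTR lengths must be supplied by the caller, and the sketch assumes the UTRs contain no introns, as in the 100 nt example.

```python
import re

# Matches only contiguous UTR-style positions such as "-200" or "*200".
_CONTIGUOUS = re.compile(r"([-*])(\d+)")


def to_split_notation(position: str, utr5_length: int, utr3_length: int) -> str:
    """Rewrite a contiguous c. position beyond the UTR in split form.

    Assumes intron-free UTRs. Coding positions and positions that
    already carry an offset are returned unchanged.
    """
    match = _CONTIGUOUS.fullmatch(position)
    if match is None:
        return position
    sign, distance = match.group(1), int(match.group(2))
    if sign == "-":
        if distance <= utr5_length:
            return position  # still inside the 5' UTR
        return f"-{utr5_length}-{distance - utr5_length}"
    if distance <= utr3_length:
        return position      # still inside the 3' UTR
    return f"*{utr3_length}+{distance - utr3_length}"


print(to_split_notation("-200", 100, 100))  # -100-100
print(to_split_notation("*200", 100, 100))  # *100+100
print(to_split_notation("-50", 100, 100))   # -50
```

Note that the conversion needs the UTR length, whereas the contiguous form can be written without it. That is the information the split form makes explicit.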

@Peter-J-Freeman
Collaborator

Perhaps I misunderstood your initial example?

@ifokkema
Collaborator Author

> Perhaps I misunderstood your initial example?

No, you nailed it!

> I think the c.-100-100 is more correct, assuming a 100 nt UTR.

There are indeed good arguments for that format, and I believe the earlier suggestion was close to that (c.-100-u100). The HVNC, however, voted against using that "intron-like" format. I believe the general idea was that it was easily confused with a (perhaps deep) intronic sequence, whereas the c.-200 format more clearly indicated that the variant could affect the promoter region. The c.-200 format could indeed also falsely suggest that this position is located in the RNA, but this was considered a less important misinterpretation than reading it as a (deeply) intronic position.

@Peter-J-Freeman
Collaborator

Well, I will trust you to be sensible in the vote. I think that the only format that is usable is the c.-X-X etc. The addition of the u is indeed unnecessary and overcomplicated. The -200 format is misleading and does not need to have a deeply intronic/intergenic position to be clear, e.g. -100-1 is very clear.

@ifokkema
Collaborator Author

> Well, I will trust you to be sensible in the vote. I think that the only format that is usable is the c.-X-X etc. The addition of the u is indeed unnecessary and overcomplicated. The -200 format is misleading and does not need to have a deeply intronic/intergenic position to be clear, e.g. -100-1 is very clear.

Although the vote has already occurred, the NC(NM) discussion is also related and still needs to be resolved. It might re-open the discussion about the notation, as a "split" intron-like notation makes more sense when NC(NM) is defined to use the NM for the exons, while the current proposal makes more sense when the NC is used for all of the sequence.

I honestly think the HVNC should start meeting on a monthly basis since we have so many things to discuss, but we'll have to make do with the current schedule. The next meeting is Dec 9th (moved from Dec 2nd), and I do believe/hope the NC(NM) issue is on the agenda.

@Peter-J-Freeman
Collaborator

Peter-J-Freeman commented Nov 28, 2024

Don't get me started on NC_(NM_). As I said in a previous email, it is not sufficient as a unique identifier. It is also dangerous. For some very important clinical genes, an NC(NM) description based on the UCSC alignment gives a different outcome than one based on the RefSeq alignment. There is also the handling of gapped alignments. This really needs a lot more thought. It is being oversimplified and is going to lead to misdiagnoses or missed diagnoses, and to a lack of reproducibility and findability in some clinical genes.

P.S. I have lots of examples :)

@leicray
Contributor

leicray commented Nov 28, 2024

I have never understood the rationale of the choice of reference sequence order when specifying an intronic variant. The variant description NM_000088.4:c.543G>C is valid. However, NM_000088.4:c.543+1G>C is not formally valid because the reference sequence NM_000088.4 does not contain any information that confirms the identity of the nucleotide at the +1 position. Hence, the current HGVS recommendations specify that the description ought to be NC_000017.11(NM_000088.4):c.543+1G>C, which defies logic regarding the order of the two reference sequence identifiers in the description.

The extra information required to confirm the nucleotide at the +1 position is derived from the genomic sequence, in this instance GRCh38, which is NC_000017.11. Note that the actual nucleotide position on chromosome 17 is not specified in the fully qualified description. I do recognise the need to specify the chromosome build, especially for deep-intronic variants. For example, the imaginary variant description NM_00123456789.1:c.765+55G>T might be valid in the context of NC_000017.11 but not in the context of NC_000017.10 because of a sequence difference in the chromosomal nucleotide corresponding to the c.765+55 position.

In written English (and perhaps in other languages) brackets are commonly used to enclose information that might be regarded as being subsidiary to what is being described or explained. For that reason, I think that the logical order for the presentation of intronic variants ought to be, for example, NM_000088.4(NC_000017.11):c.543+1G>C. In other words, the chromosomal reference sequence is subsidiary to the transcript reference sequence. The former is required for a fully-qualified variant description, but its omission yields NM_000088.4:c.543+1G>C, which is still understandable. However, omission of the latter yields NC_000017.11:c.543+1G>C, which is not understandable, even through the use of computational tools.

I have never (to my knowledge) seen a reasoned argument on behalf of "NC_(NM_)" as the logical order.
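The rule described above, that a c. position with an intronic offset cannot be confirmed from the transcript alone, can be sketched as a small check. The function names `has_offset` and `describe` are hypothetical, and the output follows the NC(NM) order of the current recommendations rather than the NM(NC) order suggested here. Positions beyond the UTR written without an offset (e.g. c.-128354) would also need the genomic sequence, but that cannot be detected from the text of the position alone, so this sketch does not cover it.

```python
import re
from typing import Optional

# A c. position followed by an intronic offset, e.g. "543+1" or "-50-100".
_OFFSET_POSITION = re.compile(r"[-*]?\d+[+-]\d+")


def has_offset(position: str) -> bool:
    """True when the position points outside the transcript sequence."""
    return _OFFSET_POSITION.fullmatch(position) is not None


def describe(transcript: str, position: str, change: str,
             genomic: Optional[str] = None) -> str:
    """Build a c. description, requiring a genomic reference for offsets."""
    if genomic is not None:
        return f"{genomic}({transcript}):c.{position}{change}"
    if has_offset(position):
        raise ValueError(
            f"c.{position} lies outside {transcript}; a genomic reference is required"
        )
    return f"{transcript}:c.{position}{change}"


print(describe("NM_000088.4", "543", "G>C"))
# NM_000088.4:c.543G>C
print(describe("NM_000088.4", "543+1", "G>C", genomic="NC_000017.11"))
# NC_000017.11(NM_000088.4):c.543+1G>C
```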

@ifokkema
Collaborator Author

@leicray

> I have never (to my knowledge) seen a reasoned argument on behalf of "NC_(NM_)" as the logical order.

As far as I know, the logic was that the annotation of the NM was located within the GenBank file of the NC (or NC slice, more likely). As we already had the format reference_sequence(selector), the result was NC(NM). But please do take part in the (very long, by now) discussion here: HGVSnomenclature/hgvs-nomenclature#182
The NM(NC) format was already suggested as an alternative, so you can add your support for that there.

@leicray
Contributor

leicray commented Nov 28, 2024

Thank you for the pointer to the discussion. I will have another look in the early morning when my head is clearer.

3 participants