Add error codes to all warnings #637

Peter-J-Freeman · 2024-07-18T13:49:20Z

Is your feature request related to a problem? Please describe.
We need consistent error codes for all errors and warnings so other tools can searh for them without error messages breaking flow.

Loop in @ifokkema

Peter-J-Freeman · 2024-07-18T13:50:05Z

We also need to document the codes and come up with a release strategy

Peter-J-Freeman · 2024-07-18T14:00:58Z

This arose from a recent push that broke LOVD. Sorry @ifokkema.

#635 (comment)

So I have updated the warning to include the error code "TranscriptVersionWarning: A more recent version of the selected reference sequence ENST00000382483.3 is available for genome " "build GRCh38 (ENST00000382483.4)"

However, we have lots of error messages still lacking codes. This will be a big effort that I think is vital though @John-F-Wagstaff

John-F-Wagstaff · 2024-07-18T14:34:39Z

Hello Pete
This looks like a good idea in the long term, but also like a lot of work. Do you want me to add this to my current priorities and join in on this? or keep track of it for now and help, at request, for specific regions?

Code wise we probably want to create a organised pattern for our warning/error/info codes and probably want to work out and document the codes before we start to work on it. How are you thinking of structuring this?

The current suggestion seems to follow the pattern of camel case with an order of
region-problem-severity: description.
This would then make patterns like -
[Transcript/hgvs/GenomeBuild][Version/DataAvailability/Missmatch][Warning/Error/Caution]: description,
for example -
hgvsDataAvailabilityError: This sequence NM_XXXXX.X could not be found in our sequence store
but other people do things like catagory-severity:number description, so for example
hgvsError:1 This sequence NM_XXXXX.X could not be found in our sequence store

Either way we need a constant pattern, since a lot of the time other consumers of these codes will only care about some specific part of the code, e.g. is this an error a significant warning or cautionary information, or is this a transcript error. If it is not consistent internally then our other tools, and the other users that consume these errors won't be able to actually make consistent use of them.

Any thoughts on the pattern?

Peter-J-Freeman · 2024-07-18T14:38:11Z

I like the camel case codes. We have a few in the test set alrady. Look in the test_warnings. Lets drag these into this issue and agree a pattern. We can work together on the coding. Yes, I think this woul be best to prevent us from going crazy.

What we can do is find the errors, and come up with the error code for each error together, then apply throughout in turn. I think the low hanging fruit is our own errors. Then we get more complex trying to resolve how to handle the HGVS.py errors

Peter-J-Freeman · 2024-07-18T14:40:16Z

The core search term to spot our warnings in the VV Repo will be .warnings.append( I would think

ifokkema · 2024-07-24T14:45:48Z

Thanks for this, guys! It will make our code a lot simpler and reduce the chances of breaking stuff in the LOVD/VV communication.

Code wise we probably want to create a organised pattern for our warning/error/info codes and probably want to work out and document the codes before we start to work on it.

Agreed; this was a lot of effort for me (I use codes rather than description parsing for internal communication between functions), but it really helped get everything consistent. I personally started with my test sets, assuming that most possible errors would be in there. I resorted my tests based on the type of error and then started adding the codes.

Either way we need a constant pattern, since a lot of the time other consumers of these codes will only care about some specific part of the code, e.g. is this an error a significant warning or cautionary information, or is this a transcript error.

Exactly; currently, I have a list of regexes that parse the descriptions from VV and categorize them so that I know what is what. I also rewrite some warnings and errors but still rely on consistent descriptions to recognize what we're receiving. Being able to check for some codes that I want to rewrite and knowing what the rest is based on the structure of the code will reduce the need for me to make updates whenever VV is updated. I may, in the future, even incorporate the VV error codes in the LOVD3 source code. Currently, I'm assigning my own codes based on the VV descriptions.

John-F-Wagstaff · 2024-08-01T14:42:27Z

We are adding code to better handle allele variant descriptions. This is done by separating the variant out into position edits, validating the parts and finally re-composing the variant(s) from the component validated position edits. This generates the possibility of new and "exiting" errors/warnings when data fails to match up, which will also need their own prefix.

So in the case of a Genomic/Alt/LRG/RSG mapping fails/does not exist, for some but not all of the position edits, or a, Genomic/LRG/RSG mapping where the target of the mapping is different for some of the position edits (this seems most likely as a non-error case for LRG/RSG mappings for genomic alleles), would there be a useful difference between an error/warning suffix like AlleleReasembalyDataMissmatchWarning and AlleleReasembalyDataMissingWarning, for end user purposes, or would just the more generic AlleleReasembalyWarning be sufficient on its own for both types of problem.

If you think that AlleleReasembalyLRGWarning/AlleleReasembalyGenomicWarning etc would be more useful then this is also a possibility.

ifokkema · 2024-08-13T14:14:29Z

Hi John, was this question (also) for me or for Pete? In case my input is requested, I'm not sure I fully follow the question. Do you have (theoretic if needed) examples?

John-F-Wagstaff · 2024-08-13T16:20:48Z

Hello Ivo

Yes it was mostly a question for you, Pete thinks the more generic warning for any issue putting an allele back together probably should be OK, but I was concerned that they might be too simple to actually be usefully informative that way, so I just wanted to check before putting them in. We don't want to have to change error prefixes after the fact. Both of these errors/warnings would be triggered by hgvs alleles (see Here for examples if you are unfamiliar) which need splitting up and analysing as parts, when the parts are re-composed into a singe allele again. So the question was, could you as an end user see some utility in having the separate error prefixes for the different types of problem our would just one (provisionally AlleleReasembalyWarning) be better?

If we did split this up I would currently split the warnings into 2 categories:

The first type of warning (provisionally AlleleReasembalyDataMissmatchWarning unless you or Pete had a better name) would be triggered if we had an allele description that was genomic type (but which could be main Genomic, Alt, LRG, or RSG) but for which there was no single transcript mapping that contained the locations of all of the variants described. An example of this would be an allele description that included changes to 2 neighbouring genes, all of which validate but not on the same transcript (which could then have differing LRG/RSG mappings too). There could be no single set of transcript mapping data for this allele, and any between genome build mappings (GRCh37<->GRCh38) will be using data from multiple transcript alignments. (I can also see some arguments for returning this when the input positions are partly intergenic but sometimes map to one transcript.)

The second type of warning (which might be better as an error, provisionally AlleleReasembalyDataMissingWarning) would be triggered if we had an allele describing a set of positions, but only some of them could be mapped to the current reference. I can see three types of data that might trigger this, without actual errors

If we, as above, had a single input g type allele description, that affected two genes, then we could get two sets of transcript mappings, each only representing parts of the input allele description. These HGVS transcript results would need a warning to note that while they where valid internally they did not include all of the input's described positions.
Alternatively allele descriptions contained within one of the longer splice variants of a gene would, produce mappings for other transcripts, but some might turn up without the full set of input position edits.
Finally this might be triggered by data returned for alleles that partly overlap a gene in/with a RSG/LRG/Alt sequence. If there was a transcript mapping to both the main genomic reference and one of these we could get partial g type mappings back. For now if the input was a transcript, or a main chromosomal position, I am currently thinking that I would just remove the partial alt mapping or RSG/LRG mapping annotations from the returned data, treating it as a non error, since such returned data would be centred around the original genomic/transcript input. This might change in the future (and if so might need different error prefixes so you can tell which mappings are affected hence the suggestion that we might want AlleleReasembalyLRGWarning / AlleleReasembalyGenomicAltWarning etc if we did this to define the error out, but this would probably best be left to a later question if needed).

As an example of how you might use this, you might allow data to be used if it only has a AlleleReasembalyDataMissmatchWarning attached, since it is complete if not ideal in it's consistency, but exclude data with AlleleReasembalyDataMissingWarning as incomplete. Alternatively for users wanting data specific to individual genes data with AlleleReasembalyDataMissingWarning: attached might be fit for their purposes, if incomplete, but AlleleReasembalyDataMissmatchWarning would indicate data that might be affecting multiple genes and could need to be excluded. The question was whether you thought this would be useful in practice, or whether the detail was overkill? Sometimes it is hard to get perspective on this kind of thing when you are in the middle of writhing the code.

With thanks

ifokkema · 2024-09-09T15:36:37Z

@John-F-Wagstaff
Sorry for the late reply! When I see a long text, I know I need to sit down and properly read this with focus, and there are not many days when I can do that.

The first type of warning (provisionally AlleleReasembalyDataMissmatchWarning unless you or Pete had a better name) would be triggered if we had an allele description that was genomic type (but which could be main Genomic, Alt, LRG, or RSG) but for which there was no single transcript mapping that contained the locations of all of the variants described. An example of this would be an allele description that included changes to 2 neighbouring genes, all of which validate but not on the same transcript (which could then have differing LRG/RSG mappings too). There could be no single set of transcript mapping data for this allele, and any between genome build mappings (GRCh37<->GRCh38) will be using data from multiple transcript alignments. (I can also see some arguments for returning this when the input positions are partly intergenic but sometimes map to one transcript.)

Hmm... but that's not a problem of the actual variant... it's a technical issue when mapping/projecting that variant onto the transcript level. In fact, whether you can or can not create a description of that allele syntax on the transcript level at all, i.e., one description with multiple reference sequences, is under debate.

That said, there are some typos in AlleleReasembalyDataMissmatchWarning so if you do want to use that, probably AlleleReassemblyDataMismatchWarning is meant.

The second type of warning (which might be better as an error, provisionally AlleleReasembalyDataMissingWarning) would be triggered if we had an allele describing a set of positions, but only some of them could be mapped to the current reference.

That's an error if the positions don't apply to the current reference, indeed, of the same type of error as g.9999999999999999999del on an NC that's shorter than that. But only if we're talking about the current (input) reference.

I can see three types of data that might trigger this, without actual errors

If we, as above, had a single input g type allele description, that affected two genes, then we could get two sets of transcript mappings, each only representing parts of the input allele description. These HGVS transcript results would need a warning to note that while they where valid internally they did not include all of the input's described positions.

However, this is again not a problem of the input variant, but of the technical process/decision making when mapping/projecting the variant onto the transcript level.

Alternatively allele descriptions contained within one of the longer splice variants of a gene would, produce mappings for other transcripts, but some might turn up without the full set of input position edits.

Also here, not a problem of the variant. I'd say that if your input is transcript-level, NO other transcript-level output should be generated that does not include all of the input.

Finally this might be triggered by data returned for alleles that partly overlap a gene in/with a RSG/LRG/Alt sequence. If there was a transcript mapping to both the main genomic reference and one of these we could get partial g type mappings back. For now if the input was a transcript, or a main chromosomal position, I am currently thinking that I would just remove the partial alt mapping or RSG/LRG mapping annotations from the returned data, treating it as a non error, since such returned data would be centred around the original genomic/transcript input. This might change in the future (and if so might need different error prefixes so you can tell which mappings are affected hence the suggestion that we might want AlleleReasembalyLRGWarning / AlleleReasembalyGenomicAltWarning etc if we did this to define the error out, but this would probably best be left to a later question if needed).

This one I don't follow, sorry. What's the input here?

As an example of how you might use this, you might allow data to be used if it only has a AlleleReasembalyDataMissmatchWarning attached, since it is complete if not ideal in it's consistency, but exclude data with AlleleReasembalyDataMissingWarning as incomplete. Alternatively for users wanting data specific to individual genes data with AlleleReasembalyDataMissingWarning: attached might be fit for their purposes, if incomplete, but AlleleReasembalyDataMissmatchWarning would indicate data that might be affecting multiple genes and could need to be excluded. The question was whether you thought this would be useful in practice, or whether the detail was overkill?

I think we should first discuss what variants should be handled at all... because we might not want to / be able to handle any or all of those variants.

For my use case, in LOVD, we discourage the use of the allele syntax. Within LOVD, you can indicate that two variants are on the same allele. This allows you to submit them separately, so they'll collect all their annotation separately, as well. Within the same gene, it may make sense to use the allele syntax, especially for frameshift variants that are sequential and together cause only a deletion/insertion instead of a fully destructive frameshift, but since tooling (mapping etc) never used to handle it, we don't support it either.

John-F-Wagstaff · 2024-09-30T16:51:24Z

Hello @ifokkema Sorry about my own delay replying, I have had fairly bad covid, which resurged on me and caused complications. I went from sick enough that I felt unable to produce a good reply to needing to catch up with stuff urgently needed for our next VV release.

Thanks for spotting the typos on "AlleleReassemblyDataMismatchWarning" this is actually intended to go on an "intergenic" type variant, which would not count as mapping to multiple references as it would not contain any transcript data, beyond any warnings generated by component mappings. It would be paired with an explanatory text like "AlleleReassemblyDataMismatchWarning: This allele variant is reported as intergenic due to not fully mapping inside any single transcript, however some transcript mappings are present. You are encouraged to submit the individual components of this allele variant for validation to get further detail on these mappings."

Although this might count as a "technical issue when mapping/projecting that variant onto the transcript level" you can imagine that even somewhat informed users who, for example, submit an allele affecting a promoter and it's associated transcript would react badly to just getting an intergenic variant returned without explanation. Many users are not informed, and hurried uses will often not consider things logically without clear reasons given either, people do in fact use our site to check things just before submission deadlines. We could switch this to an "Info" rather than "Warning" but regardless of the error code we chose to attach we must inform users when we make this kind of decision.

For AlleleReasembalyDataMissingWarning this would turn up in the same example as above (a genomic allele affecting a promoter and locations within it's associated transcript) as a warning attached to the set of variation affecting just the transcript. This would happen even if we dispose of any incomplete transcript mappings when we have complete transcript inputs, as you suggest. Again if we return any type of incomplete data but don't tell the users what is going on they will at least occasionally get confused and need help at best, or at worst use the data improperly and consider it our fault that we did not warn them. An example warning text (with the varying data set based on user input in angle brackets) would be something like "AlleleReasembalyDataMissingWarning: This mapping of the input allele is incomplete, and included in part to show which allele components map to the transcript reference , please consult the full <genomic/transcript> mapping of <intergenic/transcript variant> for more details."

The key question I had was, when phrased in terms of this example of a genomic allele affecting a promoter, and locations within it's associated transcript: If you got back an output containing at least 2 allele variant descriptions, one of which is complete in allele terms, but lacks all transcript type annotations, and the other of which is incomplete (being limited to transcript mapping variants) but is more fully annotated, would you expect it to be useful to have different error codes/prefixes for the different types of output?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add error codes to all warnings #637

Add error codes to all warnings #637

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

John-F-Wagstaff commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

ifokkema commented Jul 24, 2024

John-F-Wagstaff commented Aug 1, 2024

ifokkema commented Aug 13, 2024

John-F-Wagstaff commented Aug 13, 2024 •

edited

Loading

ifokkema commented Sep 9, 2024

John-F-Wagstaff commented Sep 30, 2024

Add error codes to all warnings #637

Add error codes to all warnings #637

Comments

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

John-F-Wagstaff commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

Peter-J-Freeman commented Jul 18, 2024

ifokkema commented Jul 24, 2024

John-F-Wagstaff commented Aug 1, 2024

ifokkema commented Aug 13, 2024

John-F-Wagstaff commented Aug 13, 2024 • edited Loading

ifokkema commented Sep 9, 2024

John-F-Wagstaff commented Sep 30, 2024

John-F-Wagstaff commented Aug 13, 2024 •

edited

Loading