Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VRS representations to the VV output #655

Open
Peter-J-Freeman opened this issue Nov 8, 2024 · 8 comments
Open

Add VRS representations to the VV output #655

Peter-J-Freeman opened this issue Nov 8, 2024 · 8 comments
Assignees

Comments

@Peter-J-Freeman
Copy link
Collaborator

Is your feature request related to a problem? Please describe.
There is a good reason to represent VRS in journal article metadata.

Describe the solution you'd like
VRS representations in VV outputs. It would seem like each variant requires a single VRS representation. I would like to get input from the authors on this.

Describe alternatives you've considered
We could also add SPDI and ClinGen Allele Registry IDs, if these can be also obtained.

Additional context
Would VRS be able to use VVTA? What is the data provider? We would beed to check this and potentially add in some glue code

Here is the VRS Repo

https://github.com/ga4gh/vrs-python

@Peter-J-Freeman
Copy link
Collaborator Author

@John-F-Wagstaff One for you to start looking into. I recommend opening dialogue with Reece Hart, Larry Babb and Alex Wagner. We have their detais.

Leaving this one to you so that you can investigate how VRS could run on VVTA rather than UTA. We may need a custom dataprovider in VRS-Python, or a modification that we may be able to request with the team there.

@John-F-Wagstaff
Copy link
Collaborator

Firstly there is a VRS repo, separate from the vrs-python project, namely https://github.com/ga4gh/vrs . If we want an easy VRS input/output filter vrs-python is probably the best starting project, but it is not required.

Given this VRS compatible output should not need any of that, if we don't want it to. The fundamental nature of the spec is relatively simple, if clunky from a human perspective. The question is, is it more complex to build a VRS output generator for our pre-validated output, including the risk of adding our own bugs? or is it more complex to lug the extra set of (unused by us and redundant) validation from "VRS-python" around? The spec was basically written to be easy to implement in python, so worst case complexity for this is the slightly clunky version. Unless you specifically want to get into contact with other people over this there will be no need for any outside help to get this done reasonably quickly.

Part of this depends on how easy it is to use just the bits of VRS python that we actually need. This should be possible, if the documentation is valid, both the SeqRepo and UTA dataproviders are supposed to only be necessary if we want to validate inputs. We don't want to validate, since this would be an output filter for pre-validated variant inputs. As such, if we can trust the docs, it looks like missing them should not cause issues for the VRS output parts.

If we do have to include them then SeqRepo part should also be fine with our local version of SeqRepo, if we want to use it, we have not needed to branch the code, we just don't want to be tied to their data release rules/limits/schedule, though we might chose to do different for performance reasons etc. We might even want to use the VVTA->UTA compatibility views to run the UTA code unmodified, which should work as-is (though this wont have the best performance and unlike the SeqRepo connection this does not seem to have any use case for us).

The biggest inconvenience I can see if we don't use either is generating the ids for the sequence. We would otherwise get our hash ids via our local build of SeqRepo, but we will want to generate them in advance if we don't, on the other hand this is one table in a database of our choice at worst.

This is likely to be a fairly big project but probably only on the same scale as the allele work at worst, how urgently do you want this scheduled? I am going to keep on working on the performance issues particularly WRT to dels/indels that you asked me to drop my allele work to do, as a high priority due to our server performance issues, and then switch back to trying to finish off the tests for the fallback on the allele type variants, to finally get that finished, before pushing on this, unless you specifically want this bumped up the priority list.

@larrybabb
Copy link

@John-F-Wagstaff lots of great points, ideas, and concerns. I'm going to ask two of our devs, Kori @korikuzma and Kyle @theferrit32, that have considerable experience creating and applying VRS-python to weigh in along with @ahwagner.

My cursory feedback is that vrs-python is the best starting place. Of course, you can roll your own implementation as the ClinGen Allele Registry did in C++, but they are behind a release currently and working to upgrade. I believe it should be relatively straight forward to back vrs-python with your equivalent store of seqrepo and uta. I believe you will need the former at minimum but you will not need the uta if you are not using vrs-python to deal with transcript alignment - which I believe you will have already deteremined.

I think the larger concern may be around the use of the hgvs package, which I believe you've forked and modified/improved from the biocommons version. If you use hgvs expressions as the input to generate the VRS classes this may cause some issues. So I might suggest looking into passing the 'chr', 'start', 'stop', 'ref', 'alt' form that we refer to as 'gnomad'. Even though this format is vcf-based and therefore left-shifted in cases where there are repeating nucleotides. I believe the fully-justified normalization step will make it all work out fine. It is also possible that our use of the hgvs package to parse hgvs will not actuallly get into the aspects of the biocommons hgvs package that you guys had to correct/improve.

I think a small pilot task to try it out would be super helpful. (assuming you agree and have the bandwidth to do it and report back with any concerning findings).

@korikuzma @theferrit32 could you please validate paragraph 3 in @John-F-Wagstaffs post above?

Part of this depends on how easy it is to use just the bits of VRS python that we actually need. This should be possible, if the documentation is valid, both the SeqRepo and UTA dataproviders are supposed to only be necessary if we want to validate inputs. We don't want to validate, since this would be an output filter for pre-validated variant inputs. As such, if we can trust the docs, it looks like missing them should not cause issues for the VRS output parts.

@John-F-Wagstaff
Copy link
Collaborator

The main README.md says that these are both only required for ga4gh.vrs.extras. As such unless I have misread things completely, if we make the object directly we should be fine. Otherwise they can't by any means be called "extras".

Obviously the convenience functions in vrs.extras look useful, making the objects without them might be needed if there is extra validation going on (we have had performance issues caused by re-validation of known good data) but hopefully won't. I don't intend to re-implement anything I don't have too, and they also don't look like they would use anything that might cause problems, but even if we do we should have a better fallback than re-doing the VRS creation code, if that assumption (that VRS extras is not needed to create basic VRS objects directly from known good input and ready derived IDs) is valid.

@korikuzma
Copy link

korikuzma commented Nov 12, 2024

@John-F-Wagstaff correct. You do not need SeqRepo or UTA if you are just generating VRS objects. For example,

from ga4gh.vrs import models

allele_dict = {
  "type": "Allele",
  "location": {
    "type": "SequenceLocation",
    "sequenceReference": {
      "type": "SequenceReference",
      "refgetAccession": "SQ.Pw3Ch0x3XWD6ljsnIfmk_NERcZCI9sNM"
    },
    "start": 80656509,
    "end": 80656510
  },
  "state": {
    "type": "LiteralSequenceExpression",
    "sequence": "TT"
  }
}
allele = models.Allele(**allele_dict)

You are also able to generate VRS computed identifiers without SeqRepo/UTA:

from ga4gh.core import ga4gh_identify

ga4gh_identify(allele)

Note: Above used ga4gh.vrs==2.0.0a12

@John-F-Wagstaff
Copy link
Collaborator

Thank you.

It is as I hoped. We may be able to benefit from more than this, and it would be a shame to not take advantage of the existing work where it can be productive, but knowing that this is possible means I can just get started with the simpler integration as/when I get time to work on this, and we can expand our out/inputs later as necessary/useful.

@Peter-J-Freeman
Copy link
Collaborator Author

This is really great news @John-F-Wagstaff. Thanks @larrybabb and @korikuzma . We willk deploy VRS ASAP and look at what else we can do with the VRS-Python too.

@John-F-Wagstaff when you have something up and running paste some data here so we can get feedback from the others. Make sure we are using VRS optimally

@korikuzma
Copy link

korikuzma commented Nov 13, 2024

Awesome. Please don't hesitate to reach out if y'all have any other questions! Always happy to assist if I can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants