Optimize VCF import #87

Open
5 tasks
Arkanosis opened this issue Sep 28, 2018 · 6 comments
Assignees
Labels
optimization Issue regarding performance and code optimization

Comments

@Arkanosis
Member

Arkanosis commented Sep 28, 2018

Import in regovar/core/managers/imports/vcf_manager.py is slow. Therefore:

  • find a test VCF to import;
  • create an empty PostgreSQL database;
  • import the VCF into the database;
  • profile the import;
  • either improve the existing code or rewrite it (possibly ditching pysam, SQLAlchemy or Python in the process, if they turn out to prevent some optimization).
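
The profiling step above could be sketched roughly as follows, using only the standard library's `cProfile` and `pstats`. This is a minimal sketch, not the project's actual tooling: `profile_call` is a hypothetical helper, and `fake_import` is a stand-in for whatever entry point in `vcf_manager.py` actually runs the import.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def fake_import(n):
    """Hypothetical stand-in for the real VCF import entry point."""
    return sum(len(str(i)) for i in range(n))


result, report = profile_call(fake_import, 1000)
print(report)
```

Sorting by cumulative time tends to show first whether the time goes to parsing, to the ORM layer or to the database round trips, which is what the "improve or rewrite" decision depends on.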

@Oodnadatta @ikit: do you have a relevant, annotated VCF for me that is slow enough to import? The slower the better (I can shorten it as needed if it turns out to be too slow for me). Thanks!

@Arkanosis Arkanosis added the optimization Issue regarding performance and code optimization label Sep 28, 2018
@Arkanosis Arkanosis self-assigned this Sep 28, 2018
@ikit
Member

ikit commented Sep 29, 2018

@Arkanosis: You will find several annotated VCFs on Brownie, in the /var/regovar/files directory.

@ikit
Member

ikit commented Sep 29, 2018

The new importer should be wrapped in a pipeline to become the new official VCF importer: https://github.com/REGOVAR-Pipelines/VCFImporter

@Arkanosis
Member Author

Ok great! :)

@Arkanosis
Member Author

Which means, by the way, that we really have no reason to prefer Python over anything else, right?

@ikit
Member

ikit commented Sep 29, 2018

Yes.

@ikit
Member

ikit commented Sep 29, 2018

In addition to this task, we should think about the DB schema. My intuition is that we can remove the "variant id". I think we need it only in the "working table"; if we can avoid using it (the "insert or update" query) at the import step, we should be able to improve performance a lot.

See: https://github.com/REGOVAR/Regovar/blob/master/regovar/core/managers/imports/vcf_manager.py#L569
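
One possible reading of the schema idea above, sketched under loud assumptions: if each sample's variants are keyed directly by the natural key (chromosome, position, ref, alt) instead of a surrogate "variant id", the import can be a plain bulk insert with no per-variant "insert or update" lookup. The table and column names below are hypothetical and not taken from the Regovar schema, and `sqlite3` is used only to keep the sketch self-contained (the project targets PostgreSQL).

```python
import sqlite3

# Hypothetical table: one row per (sample, variant), no surrogate variant id.
SCHEMA = """
CREATE TABLE sample_variant (
    sample_id INTEGER NOT NULL,
    chrom     TEXT    NOT NULL,
    pos       INTEGER NOT NULL,
    ref       TEXT    NOT NULL,
    alt       TEXT    NOT NULL,
    PRIMARY KEY (sample_id, chrom, pos, ref, alt)
)
"""


def import_variants(conn, sample_id, records):
    """Bulk-insert (chrom, pos, ref, alt) records for one sample.

    No lookup or upsert is needed to resolve a variant id, because the
    natural key itself identifies the variant.
    """
    rows = ((sample_id, chrom, pos, ref, alt) for chrom, pos, ref, alt in records)
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO sample_variant VALUES (?, ?, ?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
import_variants(conn, 1, [("1", 12345, "A", "G"), ("2", 500, "C", "T")])
import_variants(conn, 2, [("1", 12345, "A", "G")])  # same variant, other sample
row_count = conn.execute("SELECT COUNT(*) FROM sample_variant").fetchone()[0]
print(row_count)
```

The trade-off is a wider key repeated in every row; whether that costs less than resolving ids during import is exactly what profiling would need to confirm.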


2 participants