Mutalyzer Lsdb
Mutalyzer can help curators of gene variant databases (also known as Locus-Specific DataBases (LSDBs)) to improve the quality of variant descriptions. Curators using the Leiden Open (source) Variation Database (LOVD) software can turn on the Mutalyzer nomenclature checker module. This module modifies the variant submission form to allow the DNA change value to be checked with the Mutalyzer Name Checker on a remote server. Mutalyzer will need a suitable (genomic) reference sequence, but LOVD can store and provide the accession number of an LRG, RefSeqGene or custom Mutalyzer UD reference sequence for each gene.
Curators using other gene variant database software can use Web services or Mutalyzer's web interface to check individual variants (Name Checker) or lists of variants (Name Checker batch interface). Transcript reference sequences have become very popular, because in the past they were more stable than chromosomal reference sequences and because they result in smaller position numbers in variant descriptions. One inevitable problem with transcript reference sequences is their lack of intergenic and intronic sequences (see http://www.hgvs.org/mutnomen/refseq.html for a discussion). One of the advantages of Mutalyzer's Reference File Loader is that it can generate transcript-related variant descriptions using a genomic reference sequence.
Mutalyzer currently accepts LRGs and files in GenBank format (See reference sequence).
According to the recommendations of the Human Genome Variation Society (HGVS), all gene variant databases should use the stable Locus Reference Genomic (LRG) sequences (Dalgleish et al., 2010). Information on LRGs, and how to get one for your gene of interest, can be found on the LRG website.
Do yourself and your users a favor: use well-annotated genomic reference sequences to describe changes
If you are using only transcript or protein reference sequences in your database, you have already limited your possibilities to catch all genetic alterations affecting the gene, transcript and protein of interest. After all, transcripts and proteins are reflections of the cell's genetic information, which has been processed by the transcription and translation machinery. Starting at the genomic DNA level, it is possible to reconstruct how a specific transcript may have been produced and how sequence changes may lead to specific protein alterations. When you describe the variant at the amino acid level only, it will be impossible in most cases to convert this to an unambiguous description at the genomic DNA level. What if you want to perform an in silico analysis to investigate potential splicing or other RNA level effects on the observed reduction in protein activity?
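The ambiguity of protein-level descriptions can be illustrated with a minimal Python sketch. It is not part of Mutalyzer; the function name `candidate_codons` and the example codon are hypothetical, and the only data used is the standard genetic code. Given a reference codon and the reported new amino acid, it lists every codon that would produce that amino acid, together with the number of bases that would have to change.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def candidate_codons(ref_codon, alt_aa):
    """Map each codon encoding alt_aa to the number of bases differing from ref_codon."""
    candidates = {}
    for codon, aa in CODON_TABLE.items():
        if aa == alt_aa:
            candidates[codon] = sum(r != a for r, a in zip(ref_codon, codon))
    return candidates


# Hypothetical example: an arginine codon CGT reported to change to histidine (H).
print(candidate_codons("CGT", "H"))
```

For this example, two DNA-level changes are compatible with the same protein-level description: CAT (one base changed) and CAC (two bases changed). A description at the amino acid level alone does not say which one was observed.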
Mutalyzer can help by converting any description at the genomic DNA level into descriptions at the transcript and protein level using the Position Converter. The Position Converter's database does not contain non-RefSeq transcript mappings.
Tips for curators who need to work with a non-RefSeq transcript:
- Search the UCSC Genome Browser using the accession number of the non-RefSeq transcript or use Blat to map the transcript sequence. Please note that the UCSC Genome Browser will not display small exons (2-10 nucleotides), because Blat is unable to map these exons in RefSeq transcripts correctly. These exons are displayed in NCBI's MapViewer and Ensembl and also may be visible in UCSC's Ensembl Gene Prediction track. Example: CDH23 exon 33 (3 nts, hg19 chr 10:73,494,399-73,494,428) invisible in NM_022124.4 on the RefSeq Genes track, but present in ENST00000224721 in the Ensembl Gene Prediction track.
- Upload the chromosomal range of the transcript with sufficient flanking sequences with the Reference File Loader.
One of the challenges facing curators is the description of variants in highly polymorphic genes. The RefSeqGene sequence will only represent one of the alleles present in a specific population. Alleles from other populations may carry multiple variants relative to the RefSeqGene sequence. A disease-causing variant found in one of these populations can be described relative to the RefSeqGene sequence and have its functional effect predicted. This prediction would neglect the context of the other variants present on that allele and thus their potential modifying effects. On the other hand, the disease-causing variant might be invisible if all variants were included in an allele description.
For the HLA cluster on human chromosome 6, alternative assemblies with their own accession numbers have been made to solve this problem. A similar approach can be followed for other highly polymorphic genes, for which haplotype-specific reference sequences based on the RefSeqGene sequence might be created. A haplotype-specific reference sequence could be described as:
Hap1_Accession_number = NG_xxxxxxxx:g.[variant_1; variant_2; ...; variant_x] (1)
If necessary, additional subhaplotype reference sequences can be described in a similar manner. The disease-causing variant g.12345A>T identified on haplotype 1 can then simply be described as:
Hap1_Accession_number:g.12345A>T (2)
When a description relative to the RefSeqGene sequence is needed, the disease-causing variant from (2) can simply be added to the variant list in (1):
NG_xxxxxxxx:g.[variant_1; variant_2; ...; variant_x; 12345A>T] (3)
The Mutalyzer Name Checker should return the same mutated protein sequence for (2) and (3).
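The relation between descriptions (1), (2) and (3) can be sketched in a few lines of Python. This is only an illustration of the string bookkeeping, not a Mutalyzer function: the helper names and the two haplotype variants are hypothetical, and the accession numbers are the placeholders used above. The sketch assumes that all haplotype variants are substitutions, so that positions in the haplotype-specific reference sequence coincide with positions in the RefSeqGene sequence; insertions or deletions in the haplotype would shift the coordinates.

```python
def allele_description(accession, variants, coord="g"):
    """Format accession:g.[v1; v2; ...], following the layout of (1) and (3)."""
    joined = "; ".join(variants)
    return f"{accession}:{coord}.[{joined}]"


def haplotype_variant(hap_accession, variant, coord="g"):
    """Format a single variant relative to a haplotype-specific reference, as in (2)."""
    return f"{hap_accession}:{coord}.{variant}"


# Hypothetical variants that define haplotype 1 relative to the RefSeqGene sequence.
HAP1_VARIANTS = ["100C>T", "250G>A"]
DISEASE_VARIANT = "12345A>T"

# Description (2): relative to the haplotype-specific reference sequence.
print(haplotype_variant("Hap1_Accession_number", DISEASE_VARIANT))

# Description (3): the same variant appended to the haplotype's variant list.
print(allele_description("NG_xxxxxxxx", HAP1_VARIANTS + [DISEASE_VARIANT]))
```

Both printed descriptions could then be submitted to the Name Checker to compare the predicted protein sequences.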
The Reference File Loader can be used to load custom Mutalyzer UD reference sequences to test this approach. When it works, we recommend submission of the haplotype-specific reference sequences to GenBank.
The Reference File Loader exercise shows how a reference sequence for a specific allele could be created.