Skip to content

A bioinformatics toolset for multiomics analysis

License

Notifications You must be signed in to change notification settings

danoguevara/OmicsKit

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

85 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

OmicsKit

R-CMD-check License: CC BY-NC-SA 4.0 Lifecycle: experimental

The goal of OmicsKit is to help in manipulating tables and generating plots for multi-omics analyses including genomics, transcriptomics, proteomics, methylomics and immunoinformatics.


Installation

You can install the development version of OmicsKit from GitHub with:

# Install the remotes package if needed
install.packages("remotes")

# Install from GitHub
remotes::install_github("BigMindLab/OmicsKit")

# Call library for usage
library("OmicsKit")

Key features

Gene Annotations

Retrieve information from Ensembl and BioMart to annotate gene counts tables, including transcript and gene names, genomic coordinates, and cross-references from various annotation databases.

# Example on generating transcript annotations file from Ensembl release 103

tx2gene <- get_annotations(rownames(txi$counts),
                           version = "103",
                           format = "xlsx")
#>         transcriptID          geneID  symbol        biotype chromosome
#> 1 ENST00000000233.10 ENSG00000004059    ARF5 protein_coding          7
#> 2  ENST00000000412.8 ENSG00000003056    M6PR protein_coding         12
#> 3 ENST00000000442.11 ENSG00000173153   ESRRA protein_coding         11
#> 4  ENST00000001008.6 ENSG00000004478   FKBP4 protein_coding         12
#> 5  ENST00000001146.7 ENSG00000003137 CYP26B1 protein_coding          2
#> 6  ENST00000002125.9 ENSG00000003509 NDUFAF7 protein_coding          2
#>   gene_start  gene_end gene_length
#> 1  127588386 127591700        3315
#> 2    8940361   8949761        9401
#> 3   64305497  64316743       11247
#> 4    2794970   2805423       10454
#> 5   72129238  72147862       18625
#> 6   37231631  37253403       21773
#>                                                                                    description
#> 1                                  ADP ribosylation factor 5 [Source:HGNC Symbol;Acc:HGNC:658]
#> 2            mannose-6-phosphate receptor, cation dependent [Source:HGNC Symbol;Acc:HGNC:6752]
#> 3                           estrogen related receptor alpha [Source:HGNC Symbol;Acc:HGNC:3471]
#> 4                                   FKBP prolyl isomerase 4 [Source:HGNC Symbol;Acc:HGNC:3720]
#> 5           cytochrome P450 family 26 subfamily B member 1 [Source:HGNC Symbol;Acc:HGNC:20581]
#> 6 NADH:ubiquinone oxidoreductase complex assembly factor 7 [Source:HGNC Symbol;Acc:HGNC:28816]

Dimensionality Reduction

Generate a range of visually appealing plots for high-dimensional data. Includes unsupervised clustering methods as well.

  1. PCA (Principal Component Analysis).
nice_PCA(object = transf.data,
         PCs = c(1, 2),
         ntop =  200,
         returnData = FALSE,
         variables = c(fill = "group", shape = "sex"),
         legend_names = c(fill = "Group", shape = "Sex"),
         size = 9,
         alpha = 1,
         shapes = 21:22,
         colors = my_colors,
         legend_title = 10,
         legend_elements = 8,
         legend_pos = c(0.80, 0.80, "right"),
         labels = c(var = "id", size = 3))

  1. tSNE (t-distributed Stochastic Neighbor Embedding).
nice_tSNE(object = transf.data,
          seed = 0,
          perplexity = 3,
          max_iterations = 10000,
          returnData = FALSE,
          variables = c(fill = "group", shape = "sex"),
          legend_names = c(fill = "Group", shape = "Sex"),
          size = 9,
          alpha = 1,
          shapes = 21:22,
          colors = my_colors,
          legend_title = 10,
          legend_elements = 8,
          legend_pos = c(0.80, 0.80, "right"),
          labels = c(var = "num", size = 3))

  1. UMAP (Uniform Manifold Approximation and Projection).
nice_UMAP(object = transf.data,
          neighbors = 4,
          components = 3,
          epochs = 10000,
          returnData = FALSE,
          variables = c(fill = "group", shape = "sex"),
          legend_names = c(fill = "Group", shape = "Sex"),
          shapes = 21:22,
          colors = my_colors,
          size = 9,
          alpha = 1,
          legend_title = 10,
          legend_elements = 8,
          legend_pos = c(0.80, 0.80, "right"),
          labels = c(var = "num", size = 3))

Counts Normalization

Compute and extract normalized counts such as TPM, RPKM, FPKM, and the normalized counts from DESeq2.

# Retrieve TPMs
gene.tpm <- tpm(raw_counts = counts.gene,
                gene_lengths = counts.gene_annotations$gene_length)

# Convert to data frame
gene.tpm <- data.frame(gene.tpm)

# Add annotations
gene.tpm.annotated <- add_annotations(object = gene.tpm,
                                      reference = geneID.details,
                                      variables = annotations)
#>                        S7505     S7588       S7644          geneID   symbol
#> ENSG00000000003  0.007897175 0.0133242 0.005832601 ENSG00000000003   TSPAN6
#> ENSG00000000005  0.000000000 0.0000000 0.000000000 ENSG00000000005     TNMD
#> ENSG00000000419  0.425217339 0.3134232 0.266468292 ENSG00000000419     DPM1
#> ENSG00000000457  0.319121112 0.2230616 0.152358455 ENSG00000000457    SCYL3
#> ENSG00000000460  0.008210803 0.0082671 0.003325549 ENSG00000000460 C1orf112
#> ENSG00000000938 11.005523543 5.4143061 4.858797243 ENSG00000000938      FGR
#>                        biotype chromosome gene_start  gene_end gene_length
#> ENSG00000000003 protein_coding          X  100627108 100639991       12884
#> ENSG00000000005 protein_coding          X  100584936 100599885       14950
#> ENSG00000000419 protein_coding         20   50934867  50958555       23689
#> ENSG00000000457 protein_coding          1  169849631 169894267       44637
#> ENSG00000000460 protein_coding          1  169662007 169854080      192074
#> ENSG00000000938 protein_coding          1   27612064  27635185       23122
#>                                                                                                    description
#> ENSG00000000003                                              tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
#> ENSG00000000005                                                tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
#> ENSG00000000419 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005]
#> ENSG00000000457                                   SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285]
#> ENSG00000000460                        chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565]
#> ENSG00000000938              FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697]

Differential Expression Results

Filter and export differential expression analysis results into MS Excel or CSV formats. Filtering criteria include:

  1. Expression change (log2 fold change).

  2. Significance (False Discovery Rate, FDR).

  3. Detectability (Requena et al., 2024, Nat. Comms.).

detect_list <- detect_filter(norm.counts = normalized.counts[, 1:21],
                             df.BvsA = res.T_N.sig,
                             df.CvsA = res.M_N.sig,
                             df.DvsA = NULL,
                             cutoffs = c(50, 50, 0),
                             samples.baseline = 1:3,
                             samples.condition1 = 4:6,
                             samples.condition2 = 7:9,
                             samples.condition3 = NULL)
#> $Comparison1
#>                  baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000015532 257.21417     -1.5441220 0.4673795 -3.303786 9.538854e-04
#> ENSG00000040633 439.99903      0.4652085 0.1393893  3.337477 8.454280e-04
#> ENSG00000043514  74.19499     -0.7064732 0.2003565 -3.526082 4.217567e-04
#> ENSG00000050748 259.94720     -0.6654422 0.1947707 -3.416541 6.342210e-04
#> ENSG00000067840 499.56771     -1.5485951 0.3585761 -4.318735 1.569261e-05
#> ENSG00000073417 126.05703     -1.2149724 0.2967272 -4.094577 4.229395e-05
#>                       padj         ensembl symbol        biotype chromosome
#> ENSG00000015532 0.24746629 ENSG00000015532  XYLT2 protein_coding         17
#> ENSG00000040633 0.24746629 ENSG00000040633  PHF23 protein_coding         17
#> ENSG00000043514 0.19637741 ENSG00000043514  TRIT1 protein_coding          1
#> ENSG00000050748 0.22608000 ENSG00000050748  MAPK9 protein_coding          5
#> ENSG00000067840 0.04826785 ENSG00000067840  PDZD4 protein_coding          X
#> ENSG00000073417 0.08672609 ENSG00000073417  PDE8A protein_coding         15
#>                 gene_start  gene_end gene_length
#> ENSG00000015532   50346126  50363138       17013
#> ENSG00000040633    7235029   7239722        4694
#> ENSG00000043514   39838110  39883511       45402
#> ENSG00000050748  180233143 180292099       58957
#> ENSG00000067840  153802166 153830565       28400
#> ENSG00000073417   84980440  85139145      158706
#>                                                                           description
#> ENSG00000015532              xylosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:15517]
#> ENSG00000040633             PHD finger protein 23 [Source:HGNC Symbol;Acc:HGNC:28428]
#> ENSG00000043514     tRNA isopentenyltransferase 1 [Source:HGNC Symbol;Acc:HGNC:20286]
#> ENSG00000050748 mitogen-activated protein kinase 9 [Source:HGNC Symbol;Acc:HGNC:6886]
#> ENSG00000067840           PDZ domain containing 4 [Source:HGNC Symbol;Acc:HGNC:21167]
#> ENSG00000073417               phosphodiesterase 8A [Source:HGNC Symbol;Acc:HGNC:8793]
#> 
#> $DetectGenes
#>  [1] "ENSG00000015532" "ENSG00000040633" "ENSG00000043514" "ENSG00000050748"
#>  [5] "ENSG00000067840" "ENSG00000073417" "ENSG00000092820" "ENSG00000102317"
#>  [9] "ENSG00000109519" "ENSG00000115289" "ENSG00000115607" "ENSG00000118503"
#> [13] "ENSG00000118507" "ENSG00000118520" "ENSG00000121057" "ENSG00000123329"
#> [17] "ENSG00000125835" "ENSG00000134256" "ENSG00000138777" "ENSG00000145241"
#> [21] "ENSG00000145675" "ENSG00000146859" "ENSG00000147459" "ENSG00000148488"
#> [25] "ENSG00000152061" "ENSG00000154640" "ENSG00000155749" "ENSG00000156675"
#> [29] "ENSG00000160633" "ENSG00000164284" "ENSG00000167118" "ENSG00000168887"
#> [33] "ENSG00000169136" "ENSG00000171865" "ENSG00000174007" "ENSG00000181026"
#> [37] "ENSG00000181523" "ENSG00000183484" "ENSG00000183726" "ENSG00000186407"
#> [41] "ENSG00000188315" "ENSG00000197448" "ENSG00000198841" "ENSG00000234389"
#> [45] "ENSG00000267283" "ENSG00000078589" "ENSG00000100416" "ENSG00000101191"
#> [49] "ENSG00000135164" "ENSG00000135763"
#> 
#> $Comparison2
#>                  baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000078589 456.36884     -0.5806294 0.1631107 -3.559726 3.712419e-04
#> ENSG00000100416 191.21091     -0.8072890 0.1874153 -4.307487 1.651198e-05
#> ENSG00000101191 807.06214     -0.4319645 0.1163969 -3.711135 2.063322e-04
#> ENSG00000121057 161.09141     -1.0596203 0.2722106 -3.892649 9.915571e-05
#> ENSG00000135164 713.36398     -2.4825790 0.6778906 -3.662212 2.500468e-04
#> ENSG00000135763  55.37574     -1.1755918 0.3069112 -3.830397 1.279366e-04
#>                       padj         ensembl symbol        biotype chromosome
#> ENSG00000078589 0.24733759 ENSG00000078589 P2RY10 protein_coding          X
#> ENSG00000100416 0.03650248 ENSG00000100416   TRMU protein_coding         22
#> ENSG00000101191 0.19621498 ENSG00000101191  DIDO1 protein_coding         20
#> ENSG00000121057 0.15908922 ENSG00000121057  AKAP1 protein_coding         17
#> ENSG00000135164 0.21630136 ENSG00000135164  DMTF1 protein_coding          7
#> ENSG00000135763 0.15908922 ENSG00000135763   URB2 protein_coding          1
#>                 gene_start  gene_end gene_length
#> ENSG00000078589   78945332  78963727       18396
#> ENSG00000100416   46330875  46357340       26466
#> ENSG00000101191   62877738  62937952       60215
#> ENSG00000121057   57085092  57121346       36255
#> ENSG00000135164   87152409  87196337       43929
#> ENSG00000135763  229626247 229660200       33954
#>                                                                                          description
#> ENSG00000078589                    P2Y receptor family member 10 [Source:HGNC Symbol;Acc:HGNC:19906]
#> ENSG00000100416               tRNA mitochondrial 2-thiouridylase [Source:HGNC Symbol;Acc:HGNC:25481]
#> ENSG00000101191                       death inducer-obliterator 1 [Source:HGNC Symbol;Acc:HGNC:2680]
#> ENSG00000121057                       A-kinase anchoring protein 1 [Source:HGNC Symbol;Acc:HGNC:367]
#> ENSG00000135164 cyclin D binding myb like transcription factor 1 [Source:HGNC Symbol;Acc:HGNC:14603]
#> ENSG00000135763                 URB2 ribosome biogenesis homolog [Source:HGNC Symbol;Acc:HGNC:28967]
#> 
#> $Comparison3
#>                  baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000004777  59.42425     -0.7664742 0.1664979 -4.603506 4.154375e-06
#> ENSG00000014914  50.00342      0.9984404 0.2941904  3.393858 6.891546e-04
#> ENSG00000015532 257.21417     -1.5853106 0.3652982 -4.339771 1.426310e-05
#> ENSG00000040633 439.99903      0.4286551 0.1094576  3.916175 8.996484e-05
#> ENSG00000051108 611.57428     -0.5931505 0.1560919 -3.800009 1.446909e-04
#> ENSG00000070366 309.66550     -0.5852590 0.1641947 -3.564421 3.646608e-04
#>                        padj         ensembl   symbol        biotype chromosome
#> ENSG00000004777 0.009346896 ENSG00000004777 ARHGAP33 protein_coding         19
#> ENSG00000014914 0.229188351 ENSG00000014914   MTMR11 protein_coding          1
#> ENSG00000015532 0.019606051 ENSG00000015532    XYLT2 protein_coding         17
#> ENSG00000040633 0.077291042 ENSG00000040633    PHF23 protein_coding         17
#> ENSG00000051108 0.114745426 ENSG00000051108  HERPUD1 protein_coding         16
#> ENSG00000070366 0.167326005 ENSG00000070366     SMG6 protein_coding         17
#>                 gene_start  gene_end gene_length
#> ENSG00000004777   35774532  35788822       14291
#> ENSG00000014914  149928651 149936879        8229
#> ENSG00000015532   50346126  50363138       17013
#> ENSG00000040633    7235029   7239722        4694
#> ENSG00000051108   56932142  56944864       12723
#> ENSG00000070366    2059839   2303785      243947
#>                                                                                                        description
#> ENSG00000004777                               Rho GTPase activating protein 33 [Source:HGNC Symbol;Acc:HGNC:23085]
#> ENSG00000014914                                myotubularin related protein 11 [Source:HGNC Symbol;Acc:HGNC:24307]
#> ENSG00000015532                                           xylosyltransferase 2 [Source:HGNC Symbol;Acc:HGNC:15517]
#> ENSG00000040633                                          PHD finger protein 23 [Source:HGNC Symbol;Acc:HGNC:28428]
#> ENSG00000051108 homocysteine inducible ER protein with ubiquitin like domain 1 [Source:HGNC Symbol;Acc:HGNC:13744]
#> ENSG00000070366                       SMG6 nonsense mediated mRNA decay factor [Source:HGNC Symbol;Acc:HGNC:17809]

Case Organization

Automatically categorize results from three pairwise differential expression analyses or Gene Set Enrichment Analysis (e.g., B vs A, C vs A, C vs B) into 10 mutually exclusive cases (BigMind, 2024, manuscript in preparation).

DEGs_sig <- split_cases(df.BvsA = res.T_N,
                        df.CvsA = res.M_N,
                        df.BvsC = res.M_T,
                        unique_id = "ensembl",
                        significance_var = "padj",
                        significance_cutoff = 0.25,
                        change_var = "log2FoldChange",
                        change_cutoff = 0)

# Filter the whole detectability list by a new threshold

DEGs_sig <- lapply(DEGs_sig, function(x) {
  x <- x[rownames(x) %in% detect_list$DetectGenes, ]
  x <- x[x$padj < 0.05, ]
  return(x)
})
#> $Case6
#>                 baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000109519 158.4705     -0.7058074 0.1620767 -4.354774 1.332044e-05
#>                       padj         ensembl symbol        biotype chromosome
#> ENSG00000109519 0.04826785 ENSG00000109519 GRPEL1 protein_coding          4
#>                 gene_start gene_end gene_length
#> ENSG00000109519    7058895  7068064        9170
#>                                                                    description
#> ENSG00000109519 GrpE like 1, mitochondrial [Source:HGNC Symbol;Acc:HGNC:19696]
#>                 trend
#> ENSG00000109519    dn
#> 
#> $Case8
#>                  baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000130656 124.20885      4.3999189 0.9629323  4.569292 4.893748e-06
#> ENSG00000004777  59.42425     -0.7664742 0.1664979 -4.603506 4.154375e-06
#> ENSG00000137842  97.42166     -0.6824943 0.1504292 -4.536981 5.706537e-06
#>                        padj         ensembl   symbol        biotype chromosome
#> ENSG00000130656 0.009346896 ENSG00000130656      HBZ protein_coding         16
#> ENSG00000004777 0.009346896 ENSG00000004777 ARHGAP33 protein_coding         19
#> ENSG00000137842 0.009346896 ENSG00000137842   TMEM62 protein_coding         15
#>                 gene_start gene_end gene_length
#> ENSG00000130656     142728   154503       11776
#> ENSG00000004777   35774532 35788822       14291
#> ENSG00000137842   43123279 43185144       61866
#>                                                                          description
#> ENSG00000130656           hemoglobin subunit zeta [Source:HGNC Symbol;Acc:HGNC:4835]
#> ENSG00000004777 Rho GTPase activating protein 33 [Source:HGNC Symbol;Acc:HGNC:23085]
#> ENSG00000137842         transmembrane protein 62 [Source:HGNC Symbol;Acc:HGNC:26269]
#>                 trend
#> ENSG00000130656    up
#> ENSG00000004777    dn
#> ENSG00000137842    dn
#> 
#> $Case9
#>                  baseMean log2FoldChange     lfcSE      stat       pvalue
#> ENSG00000067840  499.5677     -1.5485951 0.3585761 -4.318735 1.569261e-05
#> ENSG00000102317 1755.7926     -0.5060843 0.1169605 -4.326966 1.511769e-05
#>                       padj         ensembl symbol        biotype chromosome
#> ENSG00000067840 0.04826785 ENSG00000067840  PDZD4 protein_coding          X
#> ENSG00000102317 0.04826785 ENSG00000102317   RBM3 protein_coding          X
#>                 gene_start  gene_end gene_length
#> ENSG00000067840  153802166 153830565       28400
#> ENSG00000102317   48574449  48581162        6714
#>                                                                    description
#> ENSG00000067840    PDZ domain containing 4 [Source:HGNC Symbol;Acc:HGNC:21167]
#> ENSG00000102317 RNA binding motif protein 3 [Source:HGNC Symbol;Acc:HGNC:9900]
#>                 trend
#> ENSG00000067840    dn
#> ENSG00000102317    dn

Customary plots

Generate a range of visually appealing plots to display differential expression analysis results. Here are some examples:

  1. Volcano plots

  1. Heatmaps

  1. Enrichment plots

  1. Box-Scatter-Violin (BSV) plots

Examples

For a more detailed workflow on Differential Expression Analysis with the application of the OmicsKit suit please check the custom BigMind’s pipeline for DESeq2.

Developed by

License

CC BY-NC-SA 4.0

Contact

[email protected]

About

A bioinformatics toolset for multiomics analysis

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • R 100.0%