
planet-microbe-semantic-web-analysis

Description

The planet-microbe-semantic-web-analysis GitHub repository contains scripts to generate the Resource Description Framework (RDF) database back-end for the Planet Microbe web service, as well as to query and analyze data subsets retrieved from the system. The system consists of an RDF database loaded with 1) ontology-annotated functional and taxonomic data computed from marine metagenomes sourced from the Planet Microbe database, 2) ontology-annotated environmental, physicochemical, and spatiotemporal data corresponding to the same metagenomic samples, and 3) life-science ontologies from the Open Biological and Biomedical Ontology (OBO) Foundry and Library, including the Gene Ontology, the NCBI Taxonomy ontology, the Environment Ontology, and the Planet Microbe Application Ontology. The system is accessible through an API, which can be called using the python3 query script contained in this repository. The system lets users ask novel biological questions of the marine metagenomic datasets loaded in the RDF database. Queries to the API leverage the hierarchical structure of the ontologies to sub-select data relevant to natural language questions such as "What data do we have about metagenomes from the 'HOT 224-283' project, where we have observed occurrences of 'cellular lipid metabolic process'(es) [GO:0044255], as well as recorded 'water temperature' [ENVO:09200014] values?".
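
To illustrate what "leveraging the hierarchical structure of the ontologies" means, here is a minimal in-memory sketch. It is not the repository's implementation (the real system resolves hierarchies inside the RDF database); the term IDs, sample names, and function names below are hypothetical placeholders chosen only for illustration. The idea is that a query for a parent term also matches samples annotated with any of its subclasses.

```python
from collections import deque

# Toy "is_a" edges (child -> list of parents).
# These IDs are hypothetical placeholders, not real GO or ENVO terms.
IS_A = {
    "EX:0000002": ["EX:0000001"],
    "EX:0000003": ["EX:0000001"],
    "EX:0000004": ["EX:0000002"],
}

# Hypothetical sample annotations (sample name -> set of term IDs).
ANNOTATIONS = {
    "sample_A": {"EX:0000004"},
    "sample_B": {"EX:0000003"},
    "sample_C": {"EX:0000009"},
}


def descendants(term, is_a):
    """Return the term itself plus all of its transitive subclasses."""
    # Invert the child -> parents map into parent -> children.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    # Breadth-first walk down the hierarchy.
    seen = {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def samples_matching(term, is_a, annotations):
    """Return sorted sample names annotated with the term or any subclass."""
    wanted = descendants(term, is_a)
    return sorted(name for name, terms in annotations.items() if terms & wanted)


if __name__ == "__main__":
    print(samples_matching("EX:0000001", IS_A, ANNOTATIONS))
```

Querying the top-level placeholder term matches both samples annotated with its subclasses, while the sample annotated with an unrelated term is left out.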

Requirements

python3 (3.8.5+)
R (4.2.2+)

Installation

git clone git@github.com:hurwitzlab/planet-microbe-semantic-web-analysis.git

Contributions

Completed as part of Kai Blumberg's PhD thesis work.

Usage

To use this repository, see the following protocol: http://dx.doi.org/10.17504/protocols.io.e6nvwkw19vmk/v2

Query script reference arguments

Please see protocol above for full usage instructions.

The python3 query script usage is summarized as follows:

usage: assemble_query.py [-m str] [-b str] [-l str] [-g str] [-t str]
                         [-q str] [-ql str [str ...]] [-o str] [-p str]
                         [-u str] [-dmin int] [-dmax int]

The following optional flags can be added to a run command:

  -m str, --env_medium
                        Environmental medium 
                        Expects an ENVO CURIE from the environmental material hierarchy
                        E.g., ENVO:00002149

  -b str, --env_broad
                        Environment broad scale context 
                        Expects an ENVO CURIE from the biome hierarchy
                        E.g., ENVO:00000447

  -l str, --env_local
                        Environment local scale context 
                        Expects an ENVO CURIE from the astronomical body part, or layer, hierarchies
                        E.g., ENVO:01000061

  -g str, --go      
                        Gene Ontology term
                        Expects a GO CURIE 
                        E.g., GO:0015979*

  -t str, --taxon   
                        NCBI Taxonomy ontology term
                        Expects a NCBITaxon CURIE from the Bacteria or Archaea lineages
                        E.g., NCBITaxon:1117

  -q str, --quality
                        Query for subclasses of input quality argument
                        Expects a BFO, ENVO or PMO quality CURIE
                        E.g., BFO:0000019
                        See the protocol's Appendix section for list of qualities

  -ql str [str ...], --quality_list 
                        Query for a list of input quality arguments 
                        Expects BFO, ENVO or PMO quality CURIEs; see the protocol's Appendix section for the full list
                        E.g., ENVO:09200014 ENVO:3100031
 
  -o str, --output  
                        Output file path to write tsv file of GO term counts
                        Typical use would be `output/custom_file_name`

  -p str, --project
                        Query for project name
                        E.g., "Amazon Plume Metagenomes"

  -u str, --universal
                        File path to input sparql query file with query for
                        basic metadata universal across samples 
                        Default is: base_metadata.rq 
                        Only needs to be run once, but should be run first to generate the metadata table

  -dmin int, --depth_minimum
                        Filter samples by depth with minimum value cutoff
                        Default: 0
                        E.g., 300

  -dmax int, --depth_maximum
                        Filter samples by depth with maximum value cutoff
                        E.g., 400
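
As a rough illustration of how the flags above might combine, the sketch below assembles a command line for the example question from the Description (project 'HOT 224-283', GO:0044255, ENVO:09200014). Several things here are assumptions rather than documented behavior: that the script is invoked as `python3 assemble_query.py` from the repository root, that these flags can be combined in a single run, and the helper names `is_curie` and `build_command`, which are hypothetical. The CURIE check is a loose pre-check of our own, not something the script is known to perform. See the protocol for the authoritative usage.

```python
import re
import shlex

# Loose, hypothetical CURIE shape check: PREFIX:digits (e.g., GO:0044255).
CURIE_PATTERN = re.compile(r"^[A-Za-z]+:\d+$")


def is_curie(value):
    """Return True if the value looks like PREFIX:digits."""
    return bool(CURIE_PATTERN.match(value))


def build_command(project=None, go=None, quality_list=(), output=None,
                  depth_min=None, depth_max=None):
    """Build an argv list for assemble_query.py from the documented flags."""
    # Reject malformed ontology identifiers before building the command.
    for curie in [go, *quality_list]:
        if curie is not None and not is_curie(curie):
            raise ValueError(f"not a CURIE: {curie!r}")

    # Assumption: the script is run from the directory that contains it.
    argv = ["python3", "assemble_query.py"]
    if project:
        argv += ["-p", project]
    if go:
        argv += ["-g", go]
    if quality_list:
        argv += ["-ql", *quality_list]
    if output:
        argv += ["-o", output]
    if depth_min is not None:
        argv += ["-dmin", str(depth_min)]
    if depth_max is not None:
        argv += ["-dmax", str(depth_max)]
    return argv


command = build_command(
    project="HOT 224-283",
    go="GO:0044255",
    quality_list=["ENVO:09200014"],
    output="output/custom_file_name",
)

# shlex.join quotes arguments containing spaces so the line is shell-safe.
print(shlex.join(command))
```

Building the command as a list and quoting it with `shlex.join` keeps project names that contain spaces, such as "Amazon Plume Metagenomes", intact as a single argument.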