The planet-microbe-semantic-web-analysis GitHub repository contains scripts to generate the Resource Description Framework (RDF) database back-end for the Planet Microbe web service, as well as to query and analyze data subsets retrieved from the system. The system consists of an RDF database loaded with 1) ontology-annotated functional and taxonomic data computed from marine metagenomes sourced from the Planet Microbe database, 2) ontology-annotated environmental, physicochemical, and spatiotemporal data corresponding to the same metagenomic samples, and 3) life-science ontologies from the Open Biological and Biomedical Ontology (OBO) Foundry and Library, including the Gene Ontology, the NCBI Taxonomy ontology, the Environment Ontology, and the Planet Microbe Application Ontology. The system is accessible through an API, which can be called using the python3 query script contained within this repository. The system enables users to ask novel biological questions of the marine metagenomic datasets loaded within the RDF database. Queries to the API leverage the hierarchical structure of the ontologies to sub-select data relevant to natural language questions such as "What data do we have about metagenomes from the 'HOT 224-283' project, where we have observed occurrences of 'cellular lipid metabolic process'(es) [GO:0044255], as well as recorded 'water temperature' [ENVO:09200014] values?".
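As an illustration of how an ontology hierarchy can be used to sub-select data, here is a minimal Python sketch. It assumes the standard OBO PURL convention for expanding a CURIE into an IRI, and it uses a SPARQL `rdfs:subClassOf*` property path to match a term together with all of its descendants. The function names `curie_to_iri` and `subclass_pattern` and the variable name `?term` are hypothetical; they are not taken from this repository's query script, and the queries that script actually generates may differ.

```python
# Illustrative sketch only: these helpers are hypothetical and are not
# part of assemble_query.py.

# Base of the OBO Foundry permanent URLs (PURLs) for ontology terms.
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def curie_to_iri(curie: str) -> str:
    """Expand an OBO CURIE such as 'GO:0044255' into its PURL IRI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"Not a CURIE of the form PREFIX:ID: {curie!r}")
    return f"{OBO_PURL_BASE}{prefix}_{local_id}"


def subclass_pattern(curie: str, var: str = "term") -> str:
    """Return a SPARQL triple pattern matching the term and its subclasses.

    The '*' property path means 'zero or more rdfs:subClassOf steps', so
    the pattern matches the term itself as well as every descendant.
    """
    return f"?{var} rdfs:subClassOf* <{curie_to_iri(curie)}> ."


# Pattern for 'cellular lipid metabolic process' and all of its subclasses.
print(subclass_pattern("GO:0044255"))
```

A pattern like this would sit inside a larger SPARQL `WHERE` clause that links samples to the matched terms; that surrounding query is not shown here because it depends on how the RDF database is modeled.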
python3 (3.8.5+)
R (4.2.2+)
git clone git@github.com:hurwitzlab/planet-microbe-semantic-web-analysis.git
Completed as part of Kai Blumberg's PhD thesis work.
To use this repository, see the following protocol: http://dx.doi.org/10.17504/protocols.io.e6nvwkw19vmk/v2
Please see the protocol above for full usage instructions.
The python3 query script usage is summarized as follows:
usage: assemble_query.py [-m str] [-b str] [-l str] [-g str] [-t str]
[-q str] [-ql str [str ...]] [-o str] [-p str]
[-u str] [-dmin int] [-dmax int]
The following optional flags can be added to a run command:
-m str, --env_medium
Environmental medium
Expects an ENVO CURIE from the environmental material hierarchy
E.g., ENVO:00002149
-b str, --env_broad
Environment broad scale context
Expects an ENVO CURIE from the biome hierarchy
E.g., ENVO:00000447
-l str, --env_local
Environment local scale context
Expects an ENVO CURIE from the astronomical body part or layer hierarchies
E.g., ENVO:01000061
-g str, --go
Gene Ontology term
Expects a GO CURIE
E.g., GO:0015979*
-t str, --taxon
NCBI Taxonomy ontology term
Expects a NCBITaxon CURIE from the Bacteria or Archaea lineages
E.g., NCBITaxon:1117
-q str, --quality
Query for subclasses of input quality argument
Expects a BFO, ENVO or PMO quality CURIE
E.g., BFO:0000019
See the protocol's Appendix section for list of qualities
-ql str [str ...], --quality_list
Query for a list of input quality arguments
Expects BFO, ENVO or PMO quality CURIEs; see the protocol's Appendix section for the full list
E.g., ENVO:09200014 ENVO:3100031
-o str, --output
Output file path for the TSV file of GO term counts
Typical use would be `output/custom_file_name`
-p str, --project
Query for project name
E.g., "Amazon Plume Metagenomes"
-u str, --universal
File path to an input SPARQL query file that retrieves
basic metadata universal across samples
Default is: base_metadata.rq
Only needs to be run once, but should be run first to obtain the metadata table
-dmin int, --depth_minimum
Filter samples by depth with minimum value cutoff
Default: 0
E.g., 300
-dmax int, --depth_maximum
Filter samples by depth with maximum value cutoff
E.g., 400
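To show how the flags above combine, the following Python sketch assembles a command line for the example question from the introduction (project 'HOT 224-283', GO:0044255, and ENVO:09200014). It is a hedged illustration rather than part of the repository: the helper `build_command` and its parameter names are hypothetical, it covers only a subset of the flags, and it assumes the script is invoked as `python3 assemble_query.py` from the directory that contains it. It only builds and prints the command; it does not run it. See the protocol for the authoritative invocation.

```python
import shlex
from typing import List, Optional, Sequence


def build_command(
    script: str = "assemble_query.py",  # assumed script location
    *,
    project: Optional[str] = None,
    go: Optional[str] = None,
    quality_list: Optional[Sequence[str]] = None,
    depth_min: Optional[int] = None,
    depth_max: Optional[int] = None,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using only flags documented above."""
    cmd = ["python3", script]
    if project is not None:
        cmd += ["-p", project]           # project name
    if go is not None:
        cmd += ["-g", go]                # Gene Ontology CURIE
    if quality_list:
        cmd += ["-ql", *quality_list]    # one or more quality CURIEs
    if depth_min is not None:
        cmd += ["-dmin", str(depth_min)]
    if depth_max is not None:
        cmd += ["-dmax", str(depth_max)]
    if output is not None:
        cmd += ["-o", output]            # output TSV path
    return cmd


cmd = build_command(
    project="HOT 224-283",
    go="GO:0044255",
    quality_list=["ENVO:09200014"],
    output="output/custom_file_name",
)
# shlex.join quotes arguments containing spaces, such as the project name.
print(shlex.join(cmd))
```

Building the command as a list keeps the project name, which contains a space, as a single argument; the same list could be passed to `subprocess.run` if the command is to be executed from Python.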