Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. It solves this problem by applying machine learning classification to dimensionally reduced datasets.
With conda (recommended)
$ conda install -c conda-forge -c maxibor sourcepredict
With pip
$ pip install sourcepredict
- Sink taxonomic count file (see example file and documentation)
- Source taxonomic count file (see example file and documentation)
- Source label file (see example file and documentation)
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
== Sample: ERR1915662 ==
Adding unknown
Normalizing (GMPR)
Computing Bray-Curtis distance
Performing MDS embedding in 2 dimensions
KNN machine learning
Training KNN classifier on 2 cores...
-> Testing Accuracy: 1.0
----------------------
- Sample: ERR1915662
known:98.61%
unknown:1.39%
Step 2: Checking for source proportion
Computing weighted_unifrac distance on species rank
TSNE embedding in 2 dimensions
KNN machine learning
Performing 5 fold cross validation on 2 cores...
Trained KNN classifier with 10 neighbors
-> Testing Accuracy: 0.99
----------------------
- Sample: ERR1915662
Canis_familiaris:96.1%
Homo_sapiens:2.47%
Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower-dimensional space. See the documentation for details.
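The log above mentions a Bray-Curtis distance and a KNN classifier. As a rough illustration of those two ideas only, here is a minimal pure-Python sketch. It is not Sourcepredict's implementation: it skips the normalization and embedding steps entirely and votes directly on the distances. The function names and the toy counts are made up for this example.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two all-zero vectors are treated as identical.
    return numerator / denominator if denominator else 0.0


def knn_proportions(sink, sources, labels, k=3):
    """Share of each label among the k sources closest to the sink."""
    ranked = sorted(range(len(sources)),
                    key=lambda i: bray_curtis(sink, sources[i]))
    nearest = ranked[:k]
    votes = Counter(labels[i] for i in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


# Toy taxon counts (made up): one row per source sample, one column per taxon.
toy_sources = [[10, 0, 1], [9, 1, 0], [0, 8, 2], [1, 9, 1], [0, 1, 10]]
toy_labels = ["Canis_familiaris", "Canis_familiaris",
              "Homo_sapiens", "Homo_sapiens", "Soil"]
toy_sink = [8, 1, 1]

proportions = knn_proportions(toy_sink, toy_sources, toy_labels, k=3)
print(proportions)
```

With these toy values, two of the three nearest sources are labelled Canis_familiaris and one Homo_sapiens, so the sink is reported as two thirds dog and one third human.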
Depending on the normalization method (-n), the embedding method (-me), the CPUs available for parallel processing (-t), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
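To experiment with these options from a script, the command line can be assembled in Python. The sketch below uses only the flags named in this README (-s, -l, -n, -me, -t). The helper name `build_command` is hypothetical, and the values "GMPR" and "MDS" are copied from the example log; whether the CLI accepts exactly these spellings is an assumption, so check the tool's help output first.

```python
import shlex


def build_command(sink, sources, labels,
                  normalization=None, embedding=None, threads=None):
    """Assemble a sourcepredict argument list from the README's flags."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]   # normalization method
    if embedding is not None:
        cmd += ["-me", embedding]      # embedding method
    if threads is not None:
        cmd += ["-t", str(threads)]    # CPUs for parallel processing
    cmd.append(sink)                   # the sink file comes last, as in the example
    return cmd


cmd = build_command("dog_example.csv", "sp_sources.csv", "sp_labels.csv",
                    normalization="GMPR", embedding="MDS", threads=2)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file path contains spaces.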
The Sourcepredict documentation is available at sourcepredict.readthedocs.io
- The sources were obtained with a simple Nextflow pipeline, using Kraken2 with the MiniKraken2_v2_8GB database. See the documentation for more information on how to build a custom source file.
- The example source file is here: modern_gut_microbiomes_sources.csv
- The example label file is here: modern_gut_microbiomes_labels.csv
- Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
- Canis familiaris gut microbiome (1)
- Soil microbiome (1, 2, 3)
Contributions to Sourcepredict are welcome and encouraged: open an issue or create a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.
Sourcepredict has been published in JOSS.
@article{Borry2019Sourcepredict,
journal = {Journal of Open Source Software},
doi = {10.21105/joss.01540},
issn = {2475-9066},
number = {41},
publisher = {The Open Journal},
title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
url = {http://dx.doi.org/10.21105/joss.01540},
volume = {4},
author = {Borry, Maxime},
pages = {1540},
date = {2019-09-04},
year = {2019},
month = {9},
day = {4}
}