maxibor/sourcepredict: Prediction/source tracking of metagenomic sample sources using machine learning

Sourcepredict is a Python package, distributed through Conda, that classifies and predicts the origin of metagenomic samples given a reference dataset of known origins, a problem also known as source tracking. Sourcepredict solves this problem by applying machine learning classification to dimensionally reduced datasets.

Installation

With conda (recommended)

$ conda install -c conda-forge -c maxibor sourcepredict

With pip

$ pip install sourcepredict

Example

Input

Sink taxonomic count file (see example file and documentation)
Source taxonomic count file (see example file and documentation)
Source label file (see example file and documentation)

Usage

$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/test/dog_test_sink_sample.csv -O dog_example.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_labels.csv -O sp_labels.csv
$ wget https://raw.githubusercontent.com/maxibor/sourcepredict/master/data/modern_gut_microbiomes_sources.csv -O sp_sources.csv
$ sourcepredict -s sp_sources.csv -l sp_labels.csv dog_example.csv
Step 1: Checking for unknown proportion
  == Sample: ERR1915662 ==
	Adding unknown
	Normalizing (GMPR)
	Computing Bray-Curtis distance
	Performing MDS embedding in 2 dimensions
	KNN machine learning
	Training KNN classifier on 2 cores...
	-> Testing Accuracy: 1.0
	----------------------
	- Sample: ERR1915662
		 known:98.61%
		 unknown:1.39%
Step 2: Checking for source proportion
	Computing weighted_unifrac distance on species rank
	TSNE embedding in 2 dimensions
	KNN machine learning
	Performing 5 fold cross validation on 2 cores...
	Trained KNN classifier with 10 neighbors
	-> Testing Accuracy: 0.99
	----------------------
	- Sample: ERR1915662
		 Canis_familiaris:96.1%
		 Homo_sapiens:2.47%
		 Soil:1.43%
Sourcepredict result written to dog_test_sample.sourcepredict.csv
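
The log above names the stages of the analysis: normalization, a pairwise distance, a low-dimensional embedding, and a K-nearest-neighbors (KNN) classifier. As a rough illustration only, and not Sourcepredict's actual code, the following Python sketch computes Bray-Curtis dissimilarities on made-up counts and reports the share of each label among the nearest sources. It assumes that neighbor shares can stand in for source proportions, and it skips the normalization and embedding steps entirely. The function names and the toy counts are hypothetical.

```python
from collections import Counter


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def knn_proportions(sink, sources, labels, k):
    """Share of each label among the k source samples closest to the sink."""
    ranked = sorted(range(len(sources)),
                    key=lambda i: bray_curtis(sink, sources[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return {label: count / k for label, count in votes.items()}


# Hypothetical toy data: each row is a sample, each column a made-up taxon.
sources = [
    [90, 5, 5],
    [80, 10, 10],
    [85, 10, 5],
    [5, 90, 5],
    [10, 80, 10],
    [5, 5, 90],
]
labels = [
    "Canis_familiaris", "Canis_familiaris", "Canis_familiaris",
    "Homo_sapiens", "Homo_sapiens",
    "Soil",
]
sink = [70, 20, 10]

proportions = knn_proportions(sink, sources, labels, k=4)
print(proportions)  # {'Canis_familiaris': 0.75, 'Homo_sapiens': 0.25}
```

With k=4, the three dog-like rows and one human-like row are the nearest neighbors of the toy sink, hence the 0.75 / 0.25 split.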

Output

Sourcepredict outputs the predicted source contribution to each sink sample, and the embedding of all samples in the lower-dimensional space. See documentation for details.

Runtime

Depending on the normalization method (-n), the embedding method (-me), the number of CPUs available for parallel processing (-t), and the data, the runtime should be between a few seconds and a few minutes per sink sample.
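
For scripted runs, the options above can be assembled programmatically. The Python sketch below only builds and prints a command line; it does not run Sourcepredict. The helper build_command is hypothetical, and the only flags it uses are those that appear in this README (-s, -l, -n, -me, -t). Valid values for -n and -me are not listed here, so the example leaves them unset; see the documentation for the accepted names.

```python
import shlex


def build_command(sink, sources, labels,
                  normalization=None, embedding=None, threads=None):
    """Assemble a sourcepredict command line as a list of arguments."""
    cmd = ["sourcepredict", "-s", sources, "-l", labels]
    if normalization is not None:
        cmd += ["-n", normalization]      # normalization method
    if embedding is not None:
        cmd += ["-me", embedding]         # embedding method
    if threads is not None:
        cmd += ["-t", str(threads)]       # CPUs for parallel processing
    cmd.append(sink)                      # sink file comes last, as in Usage
    return cmd


cmd = build_command("dog_example.csv", "sp_sources.csv", "sp_labels.csv",
                    threads=2)
print(shlex.join(cmd))
# sourcepredict -s sp_sources.csv -l sp_labels.csv -t 2 dog_example.csv
```

Passing cmd to subprocess.run(cmd, check=True) would execute it, provided sourcepredict is installed and on the PATH.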

Documentation

The documentation of Sourcepredict is available at sourcepredict.readthedocs.io

Sourcepredict example files

The sources were obtained with a simple Nextflow pipeline running Kraken2 with the MiniKraken2_v2_8GB database.
See the documentation for more information on how to build a custom source file.
The example source file is here: modern_gut_microbiomes_sources.csv
The example label file is here: modern_gut_microbiomes_labels.csv

Environments included in the example source file

Homo sapiens gut microbiome (1, 2, 3, 4, 5, 6)
Canis familiaris gut microbiome (1)
Soil microbiome (1, 2, 3)

Contributing Code, Documentation, or Feedback

Contributions to Sourcepredict are welcome and encouraged: open an issue or create a pull request. All contributions will be made under the GPLv3 license. More information can be found on the contributing page.

How to cite

Sourcepredict has been published in JOSS.

@article{Borry2019Sourcepredict,
	journal = {Journal of Open Source Software},
	doi = {10.21105/joss.01540},
	issn = {2475-9066},
	number = {41},
	publisher = {The Open Journal},
	title = {Sourcepredict: Prediction of metagenomic sample sources using dimension reduction followed by machine learning classification},
	url = {http://dx.doi.org/10.21105/joss.01540},
	volume = {4},
	author = {Borry, Maxime},
	pages = {1540},
	date = {2019-09-04},
	year = {2019},
	month = {9},
	day = {4}
}

