Tool to retrieve protein-protein interactions and calculate protein/gene symbol occurrence in the scientific literature (PubMed & PubMedCentral). Contains two Python modules (core and report) and a Python script (ppaxe).
Available for Python 2.7 and Python 3.x, and also as a standalone Docker image.
Visit the PPaxe web application to use PPaxe on the web.
S. Castillo-Lara, J.F. Abril
PPaxe: easy extraction of protein occurrence and interactions from the scientific literature
Bioinformatics, AOP November 2018, bty988.
To download and use the ppaxe Docker image:
docker pull compgenlabub/ppaxe:latest
docker run -v /local/path/to/output:/ppaxe/output:rw \
compgenlabub/ppaxe -v -p ./papers.pmids -o ./output.tbl -r ./report
If you want to install PPaxe manually, go to the Install ppaxe manually section.
usage: ppaxe [-h] -p PMIDS [-d DATABASE] [-o OUTPUT] [-r REPORT] [-i IP] [-v]
             [-e]

Command-line tool to retrieve protein-protein interactions from the scientific
literature.

optional arguments:
  -h, --help            show this help message and exit
  -p PMIDS, --pmids PMIDS
                        Text file with a list of PMids or PMCids
  -d DATABASE, --database DATABASE
                        Download whole articles from database "PMC", or only
                        abstracts from "PUBMED".
  -o OUTPUT, --output OUTPUT
                        Output file to print the retrieved interactions in
                        tabular format.
  -r REPORT, --report REPORT
                        Print html report with the specified name.
  -i IP, --ip IP        Change the IP address of the StanfordCoreNLP server.
                        Default: http://localhost:9000
  -v, --verbose         Increase output verbosity.
  -e, --exclude         Exclude protein symbols not annotated in dictionary.
from ppaxe import core as ppcore
from ppaxe import report

# Perform query to PubMedCentral
pmids = ["28615517", "28839427", "28831451", "28824332", "28819371", "28819357"]
query = ppcore.PMQuery(ids=pmids, database="PMC")
query.get_articles()

# Retrieve interactions from text
for article in query:
    article.extract_interactions()

    # Get the predictions for this article
    for prediction in article.predictions:
        print(prediction.to_html())

# Print html report
# Will create 'report_file.html'
summary = report.ReportSummary(query)
summary.make_report("report_file")
# Will read PubMed ids in pmids.txt, predict the interactions
# in their fulltext from PubMedCentral, and print a tabular output
# and an html report
ppaxe -p pmids.txt -d PMC -v -o output.tbl -r report
# Or with docker image
docker run -v /local/path/to/output:/ppaxe/output:rw compgenlabub/ppaxe -v -p pmids.txt -o output.tbl -r report
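The command-line call above can also be prepared from Python, for instance when the identifiers come from another script. The following is a minimal sketch, not part of ppaxe: the helper names write_pmid_file and build_ppaxe_command are hypothetical, and the one-identifier-per-line layout of the input file is an assumption. Only the flags listed in the usage message are used.

```python
import shlex


def write_pmid_file(pmids, path):
    """Write identifiers to `path`, one per line (assumed input layout)."""
    cleaned = [str(p).strip() for p in pmids if str(p).strip()]
    with open(path, "w") as handle:
        handle.write("\n".join(cleaned) + "\n")
    return len(cleaned)


def build_ppaxe_command(pmid_file, database="PMC", output="output.tbl",
                        report="report", verbose=True, exclude=False):
    """Assemble the argument list for the ppaxe script (hypothetical helper)."""
    command = ["ppaxe", "-p", pmid_file, "-d", database,
               "-o", output, "-r", report]
    if verbose:
        command.append("-v")   # increase output verbosity
    if exclude:
        command.append("-e")   # drop symbols not annotated in the dictionary
    return command


# Show the command that would be executed, without running it
print(" ".join(shlex.quote(part) for part in build_ppaxe_command("pmids.txt")))
```

Passing the returned list to subprocess.run(command, check=True) would execute it, provided ppaxe is installed and a StanfordCoreNLP server is running.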
The report output (option -r) will contain a simple summary of the analysis, the interactions retrieved (including the sentences from which they were retrieved), a table with the protein/gene counts and a graph visualization made using cytoscape.js.
- Prerequisites
xml.dom
numpy
pycorenlp
cPickle
scipy
You can install this package manually using pip. However, before doing so, you have to download the Random Forest predictor and place it in ppaxe/data.
# Clone the repository
git clone https://github.com/scastlara/ppaxe.git
# Download pickle with RF
wget "https://www.dropbox.com/s/t6qcl19g536c0zu/RF_scikit.pkl?dl=0" -O ppaxe/ppaxe/data/RF_scikit.pkl
# Install from the cloned directory so the downloaded pickle is included
pip install ./ppaxe
- Download StanfordCoreNLP
To use the package you will need a StanfordCoreNLP server set up with the protein/gene tagger.
# Download StanfordCoreNLP
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
unzip stanford-corenlp-full-2017-06-09.zip
# Download the Protein tagger
wget "https://www.dropbox.com/s/ec3a4ey7s0k6qgy/FINAL-ner-model.AImed%2BMedTag%2BBioInfer.ser.gz?dl=0" -O FINAL-ner-model.AImed+MedTag+BioInfer.ser.gz
# Download English tagger models
wget http://nlp.stanford.edu/software/stanford-english-corenlp-2017-06-09-models.jar -O stanford-corenlp-full-2017-06-09/stanford-english-corenlp-2017-06-09-models.jar
# Change the location of the tagger in ppaxe/data/server.properties if necessary
# ...
# Start the StanfordCoreNLP server
cd stanford-corenlp-full-2017-06-09/
java -mx1000m -cp ./stanford-corenlp-3.8.0.jar:stanford-english-corenlp-2017-06-09-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -serverProperties ~/ppaxe/ppaxe/data/server.properties
Once the server is up and running and ppaxe has been installed, you are good to go.
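Before launching a long analysis, it can help to confirm that something is actually answering at the server address. The sketch below is a generic HTTP reachability check written with the Python 3 standard library; it is not part of ppaxe, the name server_is_up is hypothetical, and it assumes that any HTTP response from the root URL means the server process is running (it does not verify that the protein tagger is loaded).

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def server_is_up(url="http://localhost:9000", timeout=5.0):
    """Return True if an HTTP server answers at `url` (hypothetical helper)."""
    try:
        with urlopen(url, timeout=timeout):
            return True
    except HTTPError:
        # The server answered, even if with an error status code
        return True
    except (URLError, OSError):
        # Connection refused, timeout, unknown host, etc.
        return False
```

For example, `if not server_is_up(): raise SystemExit("Start the StanfordCoreNLP server first")` would stop a script early instead of failing midway through the articles.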
By default, ppaxe assumes the server is available at localhost:9000. To use a different address, start the server on the appropriate port and assign a StanfordCoreNLP client for the new address to the NLP variable of ppaxe.core (imported as ppcore in the example):
- Start the server
# Change the location of the ner tagger in server.properties manually
java -mx10000m -cp ./stanford-corenlp-3.8.0.jar:stanford-english-corenlp-2017-06-09-models.jar edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port your_port -serverProperties ppaxe/data/server.properties
- Use the ppaxe package
from ppaxe import core as ppcore
from pycorenlp import StanfordCoreNLP
# your_new_address is a placeholder, e.g. "http://localhost:9001"
ppcore.NLP = StanfordCoreNLP(your_new_address)
# Do whatever you want
By default, PPaxe uses the HGNC dictionary of gene symbols to normalize the protein/gene symbols found in the article. The ppaxe command-line tool has the option -e, which restricts the results to only those proteins that match the HGNC database. Users can change this file (located at ppaxe/data/HGNC_gene_dictionary.txt) in order to restrict their searches to specific genes or proteins, or to normalize gene names using a different dictionary.
Refer to the wiki of the package.
To run the tests:
python -m pytest -v tests
- Sergio Castillo-Lara - at the Computational Genomics Lab
This project is licensed under the GNU GPL3 license. See the LICENSE file for details.