Paper #62

Open: wants to merge 3 commits into base: master
Binary file added paper/graph.png
53 changes: 53 additions & 0 deletions paper/paper.md
@@ -0,0 +1,53 @@
---
title: 'All of PLOS (allofplos): A toolkit to download and analyze the entire PLOS corpus'
tags:
  - Science
  - Open Access
  - Data mining
  - Data analysis
  - Big Data
  - Publishing
authors:
  - name: Elizabeth Seiver
    orcid: 0000-0003-3765-3841
    affiliation: 1
  - name: Sebastian Bassi
    orcid: 0000-0002-2241-2001
    affiliation: 2
affiliations:
  - name: Public Library of Science, 1160 Battery Street, Koshland Building East, Suite 225, San Francisco, CA 94111, USA
    index: 1
  - name: Globant, 875 Howard St, Suite 320, San Francisco, CA 94103, USA
    index: 2
date: 15 December 2017
bibliography: ref.bib
---


# Summary

All of PLOS (allofplos) is a collection of Python scripts for downloading, keeping up to date, and analyzing the entire PLOS corpus of more than 220,000 scientific articles.
PLOS (the Public Library of Science) is a nonprofit open access science publisher with a library of open access journals and other scientific literature published under an open content license [@OpenContent]. As the volume of published literature grows, Text and Data Mining (TDM) is an increasingly important research methodology that derives insights from the automated analysis of text and data.
The PLOS corpus is well suited to TDM because all articles are available as machine-readable XML files that follow the JATS standard [@JATS]. These files contain all article content except figures and supplemental data.


# Program description

The main script is plos_corpus.py, which downloads the whole PLOS corpus as a zip file of about 5 gigabytes containing more than 220,000 XML files (about 20 gigabytes of disk space when uncompressed). After uncompressing the files, it queries the PLOS Search API [@PLOSAPI] for the list of DOIs and compares that list against the DOIs of the local XML files. This comparison produces a list of DOIs to download. For each DOI in the list, the corresponding XML file is downloaded from the PLOS article page, or from the internal ‘content-repo’ when the script is run from within the PLOS network. The script then checks for amended articles (corrections, retractions, and expressions of concern) and uncorrected proofs. If the remote XML version of any of those articles differs from the local file, the script downloads the new version and merges it into the main XML folder.
Users who want to start analyzing the corpus without performing the download step can use the starter_corpus directory included in the project, which provides seed data: a subset of 131 manually selected XML articles.
The main class for working with PLOS articles is the Article class, located in article_class.py. It provides access to most article metadata via its properties.
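
The comparison between remote and local DOIs described above amounts to a set difference. The following is a minimal sketch of that idea, not the code in plos_corpus.py: the function names, the `10.1371/` prefix mapping, and the sample file names are assumptions made for illustration, and the real file-naming scheme may differ.

```python
from pathlib import Path

# Assumed mapping between file names and DOIs; illustrative only.
DOI_PREFIX = "10.1371/"


def filename_to_doi(filename: str) -> str:
    """Map a local file name such as 'journal.pone.0000001.xml' to a DOI."""
    return DOI_PREFIX + Path(filename).stem


def dois_to_download(remote_dois, local_filenames):
    """Return the DOIs listed remotely that have no local XML file, sorted."""
    local_dois = {filename_to_doi(name) for name in local_filenames}
    return sorted(set(remote_dois) - local_dois)


# Hypothetical inputs standing in for the Search API result and the local folder.
remote = [
    "10.1371/journal.pone.0000001",
    "10.1371/journal.pone.0000002",
    "10.1371/journal.pbio.0000003",
]
local = ["journal.pone.0000001.xml"]

missing = dois_to_download(remote, local)
print(missing)
```

Using sets keeps the comparison fast even for a corpus of this size, since membership checks do not depend on the number of local files.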

An example of what can be done with this data can be found in the corpus_analysis.py file in the samples directory, which provides a set of functions for analyzing articles in the PLOS corpus. A Jupyter Notebook [@JUPYTER] is also provided, with examples such as spot-checking JATS fields for irregularities, generating a list of retracted or corrected articles, and creating summaries of articles by type, publication date, and more.
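
To give a flavor of this kind of analysis, the sketch below summarizes articles by type using only the Python standard library. It does not use the Article class or the functions in corpus_analysis.py; the helper name `article_summary` and the three tiny XML snippets (including their placeholder DOIs) are invented for illustration, and real JATS files contain far more markup.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Minimal, made-up JATS-like documents; the DOIs are placeholders.
SAMPLE_ARTICLES = [
    """<article article-type="research-article"><front><article-meta>
    <article-id pub-id-type="doi">10.1371/journal.example.0000001</article-id>
    <title-group><article-title>First example</article-title></title-group>
    </article-meta></front></article>""",
    """<article article-type="correction"><front><article-meta>
    <article-id pub-id-type="doi">10.1371/journal.example.0000002</article-id>
    <title-group><article-title>Correction: First example</article-title></title-group>
    </article-meta></front></article>""",
    """<article article-type="research-article"><front><article-meta>
    <article-id pub-id-type="doi">10.1371/journal.example.0000003</article-id>
    <title-group><article-title>Second example</article-title></title-group>
    </article-meta></front></article>""",
]


def article_summary(xml_text):
    """Extract the DOI, article type, and title from one JATS XML string."""
    root = ET.fromstring(xml_text)
    return {
        "doi": root.findtext(".//article-id[@pub-id-type='doi']"),
        "type": root.get("article-type"),
        "title": root.findtext(".//article-title"),
    }


summaries = [article_summary(text) for text in SAMPLE_ARTICLES]
type_counts = Counter(summary["type"] for summary in summaries)
print(type_counts)
```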

# PLOS Corpus as a SQL Database

The project includes a ready-to-use SQLite [@sqlite] database containing the seed data, along with a script for generating a custom database (makedb.py). With this script, the user can generate a database from the whole PLOS corpus or from a random subset with any given number of articles. The database includes most article metadata and the abstract, so the corpus can be searched with any standard SQL query. As an alternative to SQL, an Object Relational Mapper (ORM) based search is provided in dbtoorm.py.
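
As a rough illustration of querying such a database from Python, the sketch below builds a small in-memory SQLite database and counts articles by type. The table name `article`, its columns, and the sample rows are assumptions made for this example; the schema produced by makedb.py may be different.

```python
import sqlite3

# Hypothetical rows: (doi, title, article_type, pubdate). All values are placeholders.
rows = [
    ("10.1371/journal.example.0000001", "First example", "research-article", "2016-01-15"),
    ("10.1371/journal.example.0000002", "Correction: First example", "correction", "2016-03-02"),
    ("10.1371/journal.example.0000003", "Second example", "research-article", "2017-06-30"),
]

# An in-memory database stands in for the SQLite file shipped with the project.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE article ("
    "doi TEXT PRIMARY KEY, title TEXT, article_type TEXT, pubdate TEXT)"
)
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", rows)

# Count articles per type with a standard SQL aggregate query.
counts = conn.execute(
    "SELECT article_type, COUNT(*) FROM article "
    "GROUP BY article_type ORDER BY article_type"
).fetchall()

# Parameterized query: titles of articles published in or after a given year.
recent = conn.execute(
    "SELECT title FROM article WHERE pubdate >= ? ORDER BY pubdate",
    ("2017-01-01",),
).fetchall()

print(counts)
print(recent)
conn.close()
```

Against the real database file, the same queries would be issued after `sqlite3.connect()` is pointed at that file instead of `":memory:"`, with table and column names adjusted to the actual schema.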

![Database schema showing the relationship between different paper metadata.](graph.png)

# Data Availability

PLOS participates in industry efforts to further facilitate TDM research, including the construction of The Hague Declaration on Knowledge Discovery in the Digital Age [@haguedeclaration].
All articles published by PLOS are licensed under the Creative Commons Attribution (CC BY) license. Under this license, authors retain ownership of the copyright for their content, but they allow anyone to download, reuse, reprint, modify, distribute, and/or copy the content as long as the original authors and source are cited.

# References
51 changes: 51 additions & 0 deletions paper/ref.bib
@@ -0,0 +1,51 @@
@online{OpenContent,
author = {PLOS},
title = {Who We Are},
year = 2017,
url = {https://www.plos.org/who-we-are},
urldate = {2017-12-15}
}

@online{JATS,
title = {Journal Publishing Tag Set},
year = 2017,
url = {https://jats.nlm.nih.gov/publishing/},
urldate = {2017-12-15}
}

@online{PLOSAPI,
author = {PLOS},
title = {PLOS API},
year = 2017,
url = {http://api.plos.org/solr/},
urldate = {2017-12-15}
}

@incollection{JUPYTER,
title = {Jupyter Notebooks – a publishing format for reproducible computational workflows},
author = {Thomas Kluyver and Benjamin Ragan-Kelley and Fernando Pérez and Brian Granger and Matthias Bussonnier and Jonathan Frederic and Kyle Kelley and Jessica Hamrick and Jason Grout and Sylvain Corlay and Paul Ivanov and Damián Avila and Safia Abdalla and Carol Willing and {Jupyter Development Team}},
booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas},
editor = {Loizides, F. and Schmidt, B.},
pages = {87--90},
year = {2016},
publisher = {IOS Press},
doi = {10.3233/978-1-61499-649-1-87}
}

@software{sqlite,
title = {SQLite},
rights = {Public Domain},
version = {3.13.0},
url = {https://www.sqlite.org/download.html},
author = {Hipp, R. and others},
year = {2016},
date = {2016-05-18}
}

@online{haguedeclaration,
author = {Multiple authors},
title = {The Hague Declaration on Knowledge Discovery in the Digital Age},
year = 2017,
url = {http://thehaguedeclaration.com},
urldate = {2017-12-15}
}