This repository has been archived by the owner on Mar 1, 2023. It is now read-only.

Releases: blekhmanlab/rxivist

1.2.1

29 Jul 17:22

Fixes a missing configuration value in the v1.2.0 release that caused the spider to exit with an exception when running in certain modes.

1.2

23 Jul 05:21

Spider

  • Squashed bug that prevented papers in an unknown collection from being updated (038eefa)
  • Outdated Crossref results are no longer deleted until we verify we have valid data to replace them (0e1a398)
  • Author institution is now recorded for each preprint, not just the most recent one (1cb5707)
  • Papers automatically updated if they have missing URLs (5a26612), missing dates (ec271cb) or missing authors (7fd60cc).
  • Paper abstracts are pulled from a different page location now, which is available more consistently.
  • Better retry logic for fetching data from Crossref (f6eb529)
  • Changes to accommodate the modified article metrics format on the bioRxiv website, which now includes download statistics for the "full-text HTML" in addition to the other metrics. (9d0077f)
  • Simplified get_publication_dates function (02fdbaf)
  • Squashed retry bug in fetching article stats (bf0dfb1)
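
The retry behavior mentioned above for Crossref requests could look roughly like the following. This is only a minimal sketch, not the spider's actual code: the helper name `fetch_with_retry`, its parameters, and the doubling delay are all assumptions made for illustration.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: the name, parameters, and doubling delay are
    assumptions for illustration, not taken from the Rxivist spider.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as err:  # a real crawler would catch narrower errors
            last_error = err
            if attempt < attempts - 1:
                # Wait longer after each failure: 1s, 2s, 4s, ...
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed, so surface the most recent error.
    raise last_error
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real waiting; in normal use the default `time.sleep` applies.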

API

  • Added endpoint at /v1/data/stats for stats reflecting data quality (c4940b6)

1.1

13 Feb 20:51

API

  • The /db directory has been added to document the pre-built Docker images now being released with the Rxivist database dumps.
  • The author_translations database table is no longer used to redirect outdated author profile page URLs to the new ones.

Spider

  • The methods to retrieve a preprint's date of publication have been properly integrated into the web crawler. Previously, they were used only to collect data for the Rxivist preprint; they are now part of regular data collection and can be toggled with a new option in the config file.
  • More command-line options for launching the spider. Primarily, running python spider.py refresh no longer requires the ID of a single preprint, and will launch a regular refresh session.
  • More nuanced handling of errors encountered when querying the publication status of a preprint. Rather than bailing on the entire session if too many errors are encountered when calling this endpoint, that feature is instead simply turned off for that run.
  • Fixed a bug in which scraped DOI information was not appropriately validated.
  • Workaround for counting the number of recognized papers when searching for new preprints—previously, we used a new URL to indicate that a revision had been posted, which caused problems when bioRxiv changed the format of all their URLs. The new way is less accurate, but less fragile.
  • Increased the default cap on the number of articles refreshed per category in a single run. This cap is now also doubled automatically for the neuroscience collection.
  • Removal of several irrelevant utilities—a sitemap builder, for example.
  • Toned down excessively verbose logging when searching for publication status.
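
The error handling described above for publication-status lookups (turning the feature off for the rest of a run rather than ending the session) might be sketched as follows. The class name `PublicationChecker`, the `max_errors` threshold, and the `lookup` callable are hypothetical and are not taken from the spider's source.

```python
class PublicationChecker:
    """Wraps a lookup and switches itself off after too many failures.

    Hypothetical class: the name, `max_errors`, and the lookup callable
    are assumptions for illustration only.
    """

    def __init__(self, lookup, max_errors=5):
        self.lookup = lookup
        self.max_errors = max_errors
        self.errors = 0
        self.enabled = True

    def check(self, doi):
        """Return the lookup result, or None if it failed or is disabled."""
        if not self.enabled:
            return None
        try:
            return self.lookup(doi)
        except Exception:
            self.errors += 1
            if self.errors >= self.max_errors:
                # Too many failures: skip this feature for the rest of
                # the run instead of aborting the whole session.
                self.enabled = False
            return None
```

Once `enabled` is false, later calls return `None` without touching the remote service, so the rest of the refresh can carry on.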

1.0

18 Jan 22:27

The full Rxivist web application, as detailed in our preprint.

0.8

21 Oct 02:38
Pre-release
  • Added more code comments and tidied up function definitions
  • General refactoring

Web crawler

  • The old "Author" entities, and all references to them, are gone. Only the "DetailedAuthor" class and its attendant methods remain, and the class is now called simply Author.
  • Author institutions are updated every time we find a new paper of theirs, so we'll always have the most recent entry.
  • Cutting off semicolons that keep showing up at the end of author institution names in the bioRxiv HTML
  • More error handling, better recovery from botched HTTP calls
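
As a rough illustration of the semicolon cleanup mentioned above, a helper along these lines would do the job. The function name `clean_institution` and the sample strings are hypothetical; the crawler's real implementation may differ.

```python
def clean_institution(name):
    """Trim whitespace and any trailing semicolons from an institution name.

    Hypothetical helper, not the crawler's actual function.
    """
    # rstrip("; ") removes any run of semicolons and spaces at the end,
    # but leaves semicolons inside the name untouched.
    return name.strip().rstrip("; ")


print(clean_institution("Example University;"))
```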

API

  • Added redirects for old author IDs. Many authors were indexed by Google, and the switch from Author to DetailedAuthor objects changed all the IDs. A translation step now returns a 301 response that redirects each old ID to the (most likely) new one.
  • Changed URLs to not have /api/ in them anymore, since that will be in our hostname going forward.
  • Negative page numbers no longer allowed
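
The redirect for old author IDs described above can be pictured as a lookup that runs before a "not found" response is returned. The sketch below is an assumption-laden illustration: `resolve_author`, the `current_ids` set, the `translations` mapping, and the `/authors/<id>` URL pattern are all hypothetical.

```python
def resolve_author(author_id, current_ids, translations):
    """Return (status, location) for an author profile request.

    Hypothetical sketch: `current_ids` is a set of valid IDs and
    `translations` maps old IDs to new ones. The URL pattern is assumed.
    """
    if author_id in current_ids:
        return 200, None
    new_id = translations.get(author_id)
    if new_id is not None:
        # Permanent redirect, so search engines can update their index.
        return 301, f"/authors/{new_id}"
    return 404, None
```

A 301 (rather than a temporary 302) fits here because the old IDs are never expected to become valid again.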

0.7

13 Oct 23:30
Pre-release

The Rxivist web application has been removed from this repo and totally decoupled from the underlying API.

Web crawler

  • DOI is no longer a unique constraint when recording bioRxiv papers published elsewhere, since sometimes multiple posts to bioRxiv end up being published as one paper.
  • Requests-HTML was upgraded to avoid a ridiculous design decision in the fake_useragent package that renders it useless if their website goes down. Required upgrading to Python 3.6 also.
  • Detailed author entities now ranked alongside the old authors.
  • User can configure whether logs are sent to stdout
  • Fixed bug that created empty log files when logging to file is disabled
  • Author lists are updated when a bioRxiv revision is posted

API

  • Responses now use only the "detailed author" entities, rather than the old authors that had only names associated with them.
  • Created Docker container for deployment
  • Twitter data returned in paper queries

0.6

05 Oct 18:59
Pre-release
  • Replacing Altmetric data with Twitter statistics from Crossref
  • A more encapsulated container image for the web crawler

0.5

04 Oct 17:16
Pre-release
  • License info added

Web crawler

  • Recording data about where bioRxiv papers were published in peer-reviewed journals
  • Now pulling exact date a paper was first posted, rather than inferring based on traffic data
  • Pulling more detailed information about authors: rather than just name, we now record ORCID, email, and institutional affiliation
  • Added more configuration options
  • "Session limit" on papers that get refreshed is enforced at category level instead of overall, preventing some categories from getting skipped more often
  • Lots of little refactoring tasks
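
To show why a per-category session limit avoids skipping categories, here is a minimal sketch. The function name, the dictionary layout, and the assumption that each list is ordered with the stalest papers first are all hypothetical.

```python
def select_for_refresh(papers_by_category, limit_per_category):
    """Pick up to `limit_per_category` papers from each category.

    Hypothetical sketch: input is a dict of category name -> list of
    paper IDs, assumed to be ordered with the stalest papers first.
    """
    selected = {}
    for category, papers in papers_by_category.items():
        # The cap applies per category, so a large category cannot use
        # up the whole session's budget before later ones are reached.
        selected[category] = papers[:limit_per_category]
    return selected
```

With a single overall limit, a category processed late in the loop could be left with no budget at all; capping each category separately gives every one a turn in every session.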

Website

  • Privacy policy
  • Users can adjust results per page
  • Pagination buttons are snazzy now, and display page numbers
  • Most badges removed from results page because they looked like baby buttons
  • Site header with stats (and social media buttons) at top of every page
  • Homepage link to "author leaderboards"
  • No more lab logo at the bottom

0.4

21 Sep 02:39
Pre-release

Web crawler

  • Ranking process reduced from hours to ~64 seconds
  • Logs recorded in file, rather than stdout
  • Fixed a bug that broke the process for updating all download stats on existing papers
  • Download stat refresh can be capped at a set number of papers
  • More configuration options pulled out into config module
  • Spider iterates through all categories now, rather than only a single one
  • Authors ranked in individual categories
  • General refactoring

Website

  • New logo
  • Fields in search interface change based on limitations of other choices
  • Search results paginated now—no more 20-result limit
  • Author leaderboards added for each category
  • Prototype of news site style front page
  • Google Analytics added
  • More modularized templates

0.3

27 Aug 03:11
Pre-release

Web crawler

  • Altmetric daily data is now pulled and is the default display on the homepage
  • Papers identified via DOI instead of title
  • Incomplete traffic data fetched midway through a month is now replaced accurately
  • More error handling
  • Delays added in between web requests
  • No longer crashes if papers are added to a collection that is currently being crawled
  • No longer confused by authors with only a single name

Website

  • Papers and authors have profile pages that display all of their categorical rankings
  • Paper details page has graph for downloads over time
  • Text search can now include spaces