Releases · blekhmanlab/rxivist
1.2.1
1.2
Spider
- Squashed a bug that prevented papers in an unknown collection from being updated (038eefa)
- Outdated Crossref results are now not deleted until we verify we have valid data to replace them (0e1a398)
- Author institution is now recorded for each preprint, not just the most recent one (1cb5707)
- Papers automatically updated if they have missing URLs (5a26612), missing dates (ec271cb) or missing authors (7fd60cc).
- Paper abstracts are now pulled from a different location on the page, one that is available more consistently.
- Better retry logic for fetching data from Crossref (f6eb529); a sketch of the pattern appears after this list.
- Changes to accommodate the modified article metrics format on the bioRxiv website, which now includes download statistics for the "full-text HTML" in addition to the other metrics. (9d0077f)
- Simplified `get_publication_dates` function (02fdbaf)
- Squashed retry bug in fetching article stats (bf0dfb1)
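Both retry items above follow the same pattern: keep retrying a fetch and only replace stored data once a valid response comes back. Below is a minimal sketch of that pattern for a Crossref lookup, assuming the public Crossref REST API; the function name, timeout, and retry counts are illustrative, not the spider's actual code.

```python
# Minimal sketch of bounded retries with backoff for a Crossref lookup.
# The retry counts and timeout are illustrative assumptions.
import time
import requests

def fetch_crossref_record(doi, attempts=3, backoff=2.0):
    """Return Crossref metadata for a DOI, retrying transient failures."""
    url = f"https://api.crossref.org/works/{doi}"
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()  # only replace stored data once this succeeds
        except requests.RequestException:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(backoff * attempt)  # simple linear backoff between tries
```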
API
- Added endpoint at `/v1/data/stats` for stats reflecting data quality (c4940b6)
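As a quick illustration, the new endpoint can be queried like any other Rxivist API route; the sketch below assumes the public API host, and the fields named in the comment are examples rather than a documented response schema.

```python
# Sketch of calling the data-quality stats endpoint.
# The host is assumed to be the public Rxivist API; the response shape
# is not documented here, so the fields mentioned below are examples.
import requests

resp = requests.get("https://api.rxivist.org/v1/data/stats", timeout=10)
resp.raise_for_status()
stats = resp.json()
print(stats)  # e.g. counts of papers missing abstracts, dates, or authors
```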
1.1
API
- The `/db` directory has been added to document the pre-built Docker images now being released with the Rxivist database dumps.
- The `author_translations` database table is no longer used to redirect outdated author profile page URLs to the new ones.
Spider
- The methods to retrieve a preprint's date of publication have been pulled into the web crawler properly. Previously, this was used only to collect data for the Rxivist preprint; it is now part of regular data collection, toggled by a new option in the config file.
- More command-line options for launching the spider. Primarily, running `python spider.py refresh` no longer requires the ID of a single preprint, and will launch a regular refresh session. (A sketch of this interface appears after this list.)
- More nuanced handling of errors encountered when querying the publication status of a preprint. Rather than bailing on the entire session if too many errors are encountered when calling this endpoint, that feature is instead simply turned off for that run.
- Fixed a bug in which scraped DOI information was not appropriately validated.
- Workaround for counting the number of recognized papers when searching for new preprints—previously, we used a new URL to indicate that a revision had been posted, which caused problems when bioRxiv changed the format of all their URLs. The new way is less accurate, but less fragile.
- Increased the default cap on the number of articles refreshed per category in a single run. The cap is also now doubled automatically for the neuroscience collection.
- Removal of several irrelevant utilities—a sitemap builder, for example.
- Modified excessively verbose logging when searching for publication status.
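For a concrete picture of the refresh interface mentioned above, here is a minimal command-line sketch in which `refresh` accepts an optional preprint ID; the argument layout and the helper functions are hypothetical stand-ins, not the actual `spider.py`.

```python
# Hypothetical sketch of a spider CLI where "refresh" takes an optional ID.
import argparse

def run_full_refresh():
    print("Launching a regular refresh session...")

def refresh_single(preprint_id):
    print(f"Refreshing preprint {preprint_id}...")

def main():
    parser = argparse.ArgumentParser(description="Rxivist spider (sketch)")
    parser.add_argument("command", choices=["crawl", "refresh"],
                        help="which spider task to run")
    parser.add_argument("preprint_id", nargs="?", default=None,
                        help="optional: refresh a single preprint by ID")
    args = parser.parse_args()

    if args.command == "refresh":
        if args.preprint_id is None:
            run_full_refresh()  # no ID given: regular refresh session
        else:
            refresh_single(args.preprint_id)

if __name__ == "__main__":
    main()
```

Under these assumptions, `python spider.py refresh` with no further arguments would launch a regular refresh session, while `python spider.py refresh 12345` would target a single preprint.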
1.0
The full Rxivist web application, as detailed in our preprint.
0.8
- Added more code comments and tidied up function definitions
- General refactoring
Web crawler
- The old "Author" entities, and all references to them, are gone. Only the "DetailedAuthor" class and its attendant methods remain, and they are now called simply `Author`.
- Author institutions are updated every time we find a new paper of theirs, so we'll always have the most recent entry.
- Stripped the semicolons that keep showing up at the end of author institution names in the bioRxiv HTML
- More error handling, better recovery from botched HTTP calls
API
- Added redirects for old author IDs. Lots and lots of authors were indexed by Google, and the switch from `Author` to `DetailedAuthor` objects changed all the IDs. Now we have a translation to add 301 responses pointing each old ID to its (most likely) new one. (A sketch of the redirect pattern appears after this list.)
- Changed URLs to not have `/api/` in them anymore, since that will be in our hostname going forward.
- Negative page numbers are no longer allowed.
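To illustrate the redirect behavior described above, here is a small sketch that uses Flask purely as an example framework; the route path, the ID mapping, and the placeholder response are assumptions, not the API's real code.

```python
# Illustrative sketch of 301 redirects from outdated author IDs to current ones.
# Flask is used only for illustration; the mapping below is hypothetical.
from flask import Flask, redirect

app = Flask(__name__)

# Hypothetical translation from old Author IDs to new DetailedAuthor-based IDs.
OLD_TO_NEW = {1234: 98765, 1235: 98770}

@app.route("/v1/authors/<int:author_id>")
def author_profile(author_id):
    new_id = OLD_TO_NEW.get(author_id)
    if new_id is not None:
        # Permanent redirect so search engines update their indexed URLs.
        return redirect(f"/v1/authors/{new_id}", code=301)
    return {"id": author_id}  # placeholder for the real author profile response
```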
0.7
The Rxivist web application has been removed from this repo and totally decoupled from the underlying API.
Web crawler
- DOI is no longer a unique constraint when recording bioRxiv papers published elsewhere, since sometimes multiple posts to bioRxiv end up being published as one paper.
- Requests-HTML was upgraded to avoid a ridiculous design decision in the `fake_useragent` package that renders it useless if their website goes down. Required upgrading to Python 3.6 also.
- Detailed author entities now ranked alongside the old authors.
- User can configure whether logs are sent to stdout (a sketch of the toggle appears after this list)
- Fixed bug that created empty log files when logging to file is disabled
- Author lists are updated when a bioRxiv revision is posted
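A minimal sketch of the stdout/file logging toggle mentioned above, using the standard `logging` module; the option names, format string, and file name are assumptions rather than the crawler's actual configuration.

```python
# Sketch of toggling log destinations; only creates a file handler when
# file logging is enabled, so no empty log files are left behind.
import logging
import sys

def configure_logging(log_to_stdout=True, log_to_file=False, logfile="spider.log"):
    handlers = []
    if log_to_stdout:
        handlers.append(logging.StreamHandler(sys.stdout))
    if log_to_file:
        handlers.append(logging.FileHandler(logfile))
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s",
                        handlers=handlers)

configure_logging(log_to_stdout=True, log_to_file=False)
logging.info("crawler session started")
```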
API
- Responses now use only the "detailed author" entities, rather than the old authors that had only names associated with them.
- Created Docker container for deployment
- Twitter data returned in paper queries
0.6
0.5
- License info added
Web crawler
- Recording data about where bioRxiv papers were published in peer-reviewed journals
- Now pulling exact date a paper was first posted, rather than inferring based on traffic data
- Pulling more detailed information about authors: rather than just name, we now record ORCID, email, and institutional affiliation (see the sketch after this list)
- Added more configuration options
- "Session limit" on papers that get refreshed is enforced at category level instead of overall, preventing some categories from getting skipped more often
- Lots of little refactoring tasks
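To sketch the richer author record mentioned above, here is a small dataclass with the fields named in these notes; the layout and example values are illustrative, not the actual database schema.

```python
# Illustrative author record with the newly collected fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Author:
    name: str
    orcid: Optional[str] = None
    email: Optional[str] = None
    institution: Optional[str] = None

# Example values are placeholders, not real scraped data.
author = Author(name="Jane Doe",
                orcid="0000-0002-1825-0097",
                email="jdoe@example.edu",
                institution="Example University")
print(author)
```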
Website
- Privacy policy
- Users can adjust results per page
- Pagination buttons are snazzy now, and display page numbers
- Most badges removed from results page because they looked like baby buttons
- Site header with stats (and social media buttons) at top of every page
- Homepage link to "author leaderboards"
- No more lab logo at the bottom
0.4
Web crawler
- Ranking process reduced from hours to ~64 seconds
- Logs recorded in file, rather than stdout
- Fixed a bug that broke the process for updating all download stats on existing papers
- Download stat refresh can be capped at a set number of papers
- More configuration options pulled out into config module
- Spider iterates through all categories now, rather than only a single one
- Authors ranked in individual categories
- General refactoring
Website
- New logo
- Fields in search interface change based on limitations of other choices
- Search results paginated now—no more 20-result limit
- Author leaderboards added for each category
- Prototype of news site style front page
- Google Analytics added
- More modularized templates
0.3
Web crawler
- Altmetric daily data is now being pulled and is the default display on the homepage
- Papers identified via DOI instead of title
- Incomplete traffic data fetched midway through a month is now replaced accurately
- More error handling
- Delays added in between web requests (see the sketch after this list)
- No longer crashes if papers are added to a collection that is currently being crawled
- No longer confused by authors with only a single name
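A minimal sketch of the request pacing mentioned above; the URLs and the delay length are illustrative assumptions rather than the crawler's real settings.

```python
# Sketch of spacing out requests so the crawler stays polite.
import time
import requests

urls = [
    "https://www.biorxiv.org/collection/neuroscience",
    "https://www.biorxiv.org/collection/genomics",
]

for url in urls:
    resp = requests.get(url, timeout=30)
    print(url, resp.status_code)
    time.sleep(3)  # pause between requests
```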
Website
- Papers and authors have profile pages that display all of their categorical rankings
- Paper details page has graph for downloads over time
- Text search can now include spaces