Releases · blekhmanlab/rxivist
1.2.1
1.2
Spider
- Squashed a bug that prevented papers in an unknown collection from being updated (038eefa)
- Outdated Crossref results are now not deleted until we verify we have valid data to replace them (0e1a398)
- Author institution is now recorded for each preprint, not just the most recent one (1cb5707)
- Papers automatically updated if they have missing URLs (5a26612), missing dates (ec271cb) or missing authors (7fd60cc).
- Paper abstracts are now pulled from a different location on the page, one that is available more consistently.
- Better retry logic for fetching data from Crossref (f6eb529); a sketch of the pattern appears after this list.
- Changes to accommodate the modified article metrics format on the bioRxiv website, which now includes download statistics for the "full-text HTML" in addition to the other metrics. (9d0077f)
- Simplified `get_publication_dates` function (02fdbaf)
- Squashed retry bug in fetching article stats (bf0dfb1)
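Both retry items above follow the same pattern: keep retrying a fetch and only replace stored data once a valid response comes back. Below is a minimal sketch of that pattern for a Crossref lookup, assuming the public Crossref REST API; the function name, timeout, and retry counts are illustrative, not the spider's actual code.

```python
# Minimal sketch of bounded retries with backoff for a Crossref lookup.
# The retry counts and timeout are illustrative assumptions.
import time
import requests

def fetch_crossref_record(doi, attempts=3, backoff=2.0):
    """Return Crossref metadata for a DOI, retrying transient failures."""
    url = f"https://api.crossref.org/works/{doi}"
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.json()  # only replace stored data once this succeeds
        except requests.RequestException:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(backoff * attempt)  # simple linear backoff between tries
```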
API
- Added endpoint at `/v1/data/stats` for stats reflecting data quality (c4940b6)
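As a quick illustration, the new endpoint can be queried like any other Rxivist API route; the sketch below assumes the public API host, and the fields named in the comment are examples rather than a documented response schema.

```python
# Sketch of calling the data-quality stats endpoint.
# The host is assumed to be the public Rxivist API; the response shape
# is not documented here, so the fields mentioned below are examples.
import requests

resp = requests.get("https://api.rxivist.org/v1/data/stats", timeout=10)
resp.raise_for_status()
stats = resp.json()
print(stats)  # e.g. counts of papers missing abstracts, dates, or authors
```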
1.1
API
- The `/db` directory has been added to document the pre-built Docker images now being released with the Rxivist database dumps.
- The `author_translations` database table is no longer used to redirect outdated author profile page URLs to the new ones.
Spider
- The methods to retrieve a preprint's date of publication have been pulled into the web crawler properly. Previously, this was used only to collect data for the Rxivist preprint; it is now part of regular data collection, toggled by a new option in the config file.
- More command-line options for launching the spider. Primarily, running `python spider.py refresh` no longer requires the ID of a single preprint, and will launch a regular refresh session. (A sketch of this interface appears after this list.)
- More nuanced handling of errors encountered when querying the publication status of a preprint. Rather than bailing on the entire session if too many errors are encountered when calling this endpoint, that feature is instead simply turned off for that run.
- Fixed a bug in which scraped DOI information was not appropriately validated.
- Workaround for counting the number of recognized papers when searching for new preprints—previously, we used a new URL to indicate that a revision had been posted, which caused problems when bioRxiv changed the format of all their URLs. The new way is less accurate, but less fragile.
- Increased the default cap on the number of articles refreshed per category in a single run. The cap is also now doubled automatically for the neuroscience collection.
- Removal of several irrelevant utilities—a sitemap builder, for example.
- Modified excessively verbose logging when searching for publication status.
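For a concrete picture of the refresh interface mentioned above, here is a minimal command-line sketch in which `refresh` accepts an optional preprint ID; the argument layout and the helper functions are hypothetical stand-ins, not the actual `spider.py`.

```python
# Hypothetical sketch of a spider CLI where "refresh" takes an optional ID.
import argparse

def run_full_refresh():
    print("Launching a regular refresh session...")

def refresh_single(preprint_id):
    print(f"Refreshing preprint {preprint_id}...")

def main():
    parser = argparse.ArgumentParser(description="Rxivist spider (sketch)")
    parser.add_argument("command", choices=["crawl", "refresh"],
                        help="which spider task to run")
    parser.add_argument("preprint_id", nargs="?", default=None,
                        help="optional: refresh a single preprint by ID")
    args = parser.parse_args()

    if args.command == "refresh":
        if args.preprint_id is None:
            run_full_refresh()  # no ID given: regular refresh session
        else:
            refresh_single(args.preprint_id)

if __name__ == "__main__":
    main()
```

Under these assumptions, `python spider.py refresh` with no further arguments would launch a regular refresh session, while `python spider.py refresh 12345` would target a single preprint.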
1.0
The full Rxivist web application, as detailed in our preprint.
0.8
- Added more code comments and tidied up function definitions
- General refactoring
Web crawler
- The old "Author" entities, and all references to them, are gone. Only the "DetailedAuthor" class and its attendant methods remain, and they are now called simply `Author`.
- Author institutions are updated every time we find a new paper of theirs, so we'll always have the most recent entry.
- Stripped the semicolons that keep showing up at the end of author institution names in the bioRxiv HTML
- More error handling, better recovery from botched HTTP calls
API
- Added redirects for old author IDs. Lots and lots of authors were indexed by Google, and the switch from `Author` to `DetailedAuthor` objects changed all the IDs. Now we have a translation to add 301 responses pointing each old ID to its (most likely) new one. (A sketch of the redirect pattern appears after this list.)
- Changed URLs to not have `/api/` in them anymore, since that will be in our hostname going forward.
- Negative page numbers are no longer allowed.
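To illustrate the redirect behavior described above, here is a small sketch that uses Flask purely as an example framework; the route path, the ID mapping, and the placeholder response are assumptions, not the API's real code.

```python
# Illustrative sketch of 301 redirects from outdated author IDs to current ones.
# Flask is used only for illustration; the mapping below is hypothetical.
from flask import Flask, redirect

app = Flask(__name__)

# Hypothetical translation from old Author IDs to new DetailedAuthor-based IDs.
OLD_TO_NEW = {1234: 98765, 1235: 98770}

@app.route("/v1/authors/<int:author_id>")
def author_profile(author_id):
    new_id = OLD_TO_NEW.get(author_id)
    if new_id is not None:
        # Permanent redirect so search engines update their indexed URLs.
        return redirect(f"/v1/authors/{new_id}", code=301)
    return {"id": author_id}  # placeholder for the real author profile response
```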
0.7
The Rxivist web application has been removed from this repo and totally decoupled from the underlying API.
Web crawler
- DOI is no longer a unique constraint when recording bioRxiv papers published elsewhere, since sometimes multiple posts to bioRxiv end up being published as one paper.
- Requests-HTML was upgraded to avoid a ridiculous design decision in the `fake_useragent` package that renders it useless if their website goes down. Required upgrading to Python 3.6 also.
- Detailed author entities now ranked alongside the old authors.
- User can configure whether logs are sent to stdout (a sketch of the toggle appears after this list)
- Fixed bug that created empty log files when logging to file is disabled
- Author lists are updated when a bioRxiv revision is posted
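A minimal sketch of the stdout/file logging toggle mentioned above, using the standard `logging` module; the option names, format string, and file name are assumptions rather than the crawler's actual configuration.

```python
# Sketch of toggling log destinations; only creates a file handler when
# file logging is enabled, so no empty log files are left behind.
import logging
import sys

def configure_logging(log_to_stdout=True, log_to_file=False, logfile="spider.log"):
    handlers = []
    if log_to_stdout:
        handlers.append(logging.StreamHandler(sys.stdout))
    if log_to_file:
        handlers.append(logging.FileHandler(logfile))
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s",
                        handlers=handlers)

configure_logging(log_to_stdout=True, log_to_file=False)
logging.info("crawler session started")
```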
API
- Responses now use only the "detailed author" entities, rather than the old authors that had only names associated with them.
- Created Docker container for deployment
- Twitter data returned in paper queries
0.6
0.5
- License info added
Web crawler
- Recording data about where bioRxiv papers were published in peer-reviewed journals
- Now pulling exact date a paper was first posted, rather than inferring based on traffic data
- Pulling more detailed information about authors: rather than just name, we now record ORCID, email, and institutional affiliation (see the sketch after this list)
- Added more configuration options
- "Session limit" on papers that get refreshed is enforced at category level instead of overall, preventing some categories from getting skipped more often
- Lots of little refactoring tasks
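To sketch the richer author record mentioned above, here is a small dataclass with the fields named in these notes; the layout and example values are illustrative, not the actual database schema.

```python
# Illustrative author record with the newly collected fields.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Author:
    name: str
    orcid: Optional[str] = None
    email: Optional[str] = None
    institution: Optional[str] = None

# Example values are placeholders, not real scraped data.
author = Author(name="Jane Doe",
                orcid="0000-0002-1825-0097",
                email="jdoe@example.edu",
                institution="Example University")
print(author)
```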
Website
- Privacy policy
- Users can adjust results per page
- Pagination buttons are snazzy now, and display page numbers
- Most badges removed from results page because they looked like baby buttons
- Site header with stats (and social media buttons) at top of every page
- Homepage link to "author leaderboards"
- No more lab logo at the bottom
0.4
Web crawler
- Ranking process reduced from hours to ~64 seconds
- Logs recorded in file, rather than stdout
- Fixed a bug that broke the process for updating all download stats on existing papers
- Download stat refresh can be capped at a set number of papers
- More configuration options pulled out into config module
- Spider iterates through all categories now, rather than only a single one
- Authors ranked in individual categories
- General refactoring
Website
- New logo
- Fields in search interface change based on limitations of other choices
- Search results paginated now—no more 20-result limit
- Author leaderboards added for each category
- Prototype of news site style front page
- Google Analytics added
- More modularized templates
0.3
Web crawler
- Altmetric daily data is now being pulled and is the default display on the homepage
- Papers identified via DOI instead of title
- Incomplete traffic data fetched midway through a month is now replaced accurately
- More error handling
- Delays added in between web requests (see the sketch after this list)
- No longer crashes if papers are added to a collection that is currently being crawled
- No longer confused by authors with only a single name
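A minimal sketch of the request pacing mentioned above; the URLs and the delay length are illustrative assumptions rather than the crawler's real settings.

```python
# Sketch of spacing out requests so the crawler stays polite.
import time
import requests

urls = [
    "https://www.biorxiv.org/collection/neuroscience",
    "https://www.biorxiv.org/collection/genomics",
]

for url in urls:
    resp = requests.get(url, timeout=30)
    print(url, resp.status_code)
    time.sleep(3)  # pause between requests
```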
Website
- Papers and authors have profile pages that display all of their categorical rankings
- Paper details page has graph for downloads over time
- Text search can now include spaces