
[WIP] Asynchronous corpus updates #46

Open

wants to merge 11 commits into `master`
Conversation

@mpacer (Collaborator) commented Dec 1, 2017

This is a first pass at making the corpus update commands work asynchronously.

There's still a lot I need to do, but this is enough to at least get the ball rolling about how to think about this.

Using async/await syntax requires Python 3.5+, so I updated the requirements accordingly.

Some of the code here originates from the tutorials I was splicing together to accomplish this, so if something doesn't seem to make sense, please point it out so I can address it.

Notes to individuals

@eseiver
Explaining `plos_corpus.download_check_and_move`: Could you explicitly describe the reasoning behind each step of your `plos_corpus.download_check_and_move` function, from the perspective of working on an individual file? Right now the approach is to process everything as a batch in steps, but from an async perspective that works against using the available cycles efficiently.

Review of `Article.from_bytes`: Also… if I don't use the BytesIO wrapper and read the bytes directly, the object created by et.from_string has different content than what is currently being saved to disk.

I don't like the `from_bytes` method having the ability to write articles to the corpus, but it seems like there should be some way to do that. I wanted to start with a minimal surface area on existing objects, so I included a single method that does everything I want. Nonetheless, I think there are a lot of API discussions we'll need to have around this.

@njsmith it would be nice to be able to add tasks from inside the inner loop (`fetch`, based on how the code is structured as of 07ffabc). That way we could trigger the corrected-article download the moment it's needed, rather than in a batch after the fact. Is the only way to add tasks from an inner coroutine to use a global variable, define the inner loop as a closure (so it has access to the enclosing scope's list), or pass the list in explicitly? It seems like that kind of state management could be handled more cleanly (a global variable, in particular, seems like a terrible idea), but I'm not sure how to approach this.

@njsmith commented Dec 1, 2017

I guess you could have a dedicated job dispatching task that does

while True:
    await asyncio.sleep(...)
    job = await q.get()
    asyncio.ensure_future(job)

I guess the tricky bit then would be figuring out when to shut everything down... the problem is you don't know ahead of time how many jobs you'll need...
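One conventional answer to the shutdown question is a sentinel: the producer enqueues a marker object once it knows no more jobs are coming, and the dispatcher drains the queue until it sees that marker, then waits for everything it started. A minimal sketch (all names here are illustrative, not from this PR):

```python
import asyncio

STOP = object()  # sentinel meaning "no more jobs will arrive"

async def dispatcher(q):
    """Run queued coroutines until the sentinel is seen, then wait for them."""
    running = []
    while True:
        job = await q.get()
        if job is STOP:
            break
        running.append(asyncio.ensure_future(job))
    if running:
        await asyncio.gather(*running)

async def main():
    results = []

    async def fetch(n):
        await asyncio.sleep(0)   # stand-in for real network I/O
        results.append(n * 2)

    q = asyncio.Queue()
    for n in (1, 2, 3):
        await q.put(fetch(n))
    await q.put(STOP)            # the producer signals it is done
    await dispatcher(q)
    return results

out = sorted(asyncio.run(main()))
print(out)  # → [2, 4, 6]
```

This sidesteps "how many jobs?" entirely: the dispatcher doesn't need a count, only a promise that the sentinel arrives after the last job.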

@mpacer (Collaborator, Author) commented Dec 1, 2017

I was thinking that something like trio's nurseries would be good for that, since they do the babysitting for all of the tasks (so adding a task could happen through the context manager that would be passed around)… I didn't use trio because it doesn't integrate with aiohttp, and I thought raw sockets were overpowered for this task.

It really seems like there should be a nice, ready-made pattern for this somewhere in the concurrent-programming universe… isn't handling an unknown number of incoming jobs part of a server's job description?

@eseiver (Collaborator) commented Dec 5, 2017

@mpacer, happy to explain the different checking steps of `download_check_and_move`. For the most part, these are run on the new article files right after they've been downloaded to the temporary directory. Also note that a number of these functions will change slightly in my current feature branch, which brings the Article class into the plos_corpus functions.

  1. `check_for_corrected_articles`: for the newly downloaded articles, are any of them the 'correction' type? (In the new version, I'm changing this to three "amendment" article types that refer to other PLOS articles: correction, retraction, and expression of concern.) The idea is that this article type implies that the PLOS article named in the related-article field (with the appropriate related-article-type) may have been updated in some way. It's used to create a list of articles whose XML content should be checked for updates since it was last downloaded. Those articles are likely to be in corpusdir rather than the temp download folder, depending on how long ago the last update was run.

  2. `download_corrected_articles`: for the list of articles derived in step 1, compare the XML on the server with the local copy. If there's a difference, download the server version into the temp folder.

  3. `download_vor_updates`: for 5 of the 7 PLOS journals, an 'early version' or uncorrected proof of the article is published before the final version replaces it in place at the same DOI (the version of record, or 'VOR'). allofplos maintains a text list of DOIs of uncorrected proofs, generated from local articles (uncorrected_proofs_text_list). This function reads in that text list and checks the proof status (uncorrected vs. VOR) in the remote XML file. If it's now a VOR, it downloads the new version to the tempdir and removes that DOI from the in-memory uncorrected-proof list.

  4. `check_for_uncorrected_proofs`: this checks every file in the tempdir to see if it's an uncorrected proof. If it is, its DOI is added to the in-memory uncorrected-proof list. That list is then written back to uncorrected_proofs_text_list.

So 1 and 4 are standard checks that could be pretty easily incorporated into an async workflow, whereas 2 and 3 check and/or download a different set of files, which are likely not to be in the temporary directory.
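A per-file version of checks 1 and 4 might be sketched like this. Every helper name and check below is a hypothetical stand-in, not the real allofplos API; the point is only that each article can be inspected as soon as its bytes arrive, rather than in batch passes:

```python
import asyncio

# Hypothetical per-file pipeline: each downloaded article is handled
# independently, so checks 1 and 4 run as soon as its XML is in hand.

async def is_amendment(xml):
    return b"correction" in xml            # stand-in for the real check

async def is_uncorrected_proof(xml):
    return b"uncorrected" in xml           # stand-in for the real check

async def process_article(doi, xml, to_recheck, proofs):
    # Check 1: amendment articles point at others that may need re-download.
    if await is_amendment(xml):
        to_recheck.append(doi)
    # Check 4: record uncorrected proofs so the VOR poll can find them later.
    if await is_uncorrected_proof(xml):
        proofs.add(doi)

async def main():
    to_recheck, proofs = [], set()
    downloads = {                          # fake downloaded articles
        "10.1371/a": b"<article>correction</article>",
        "10.1371/b": b"<article>uncorrected proof</article>",
    }
    await asyncio.gather(*(
        process_article(doi, xml, to_recheck, proofs)
        for doi, xml in downloads.items()
    ))
    return to_recheck, sorted(proofs)

out = asyncio.run(main())
print(out)  # → (['10.1371/a'], ['10.1371/b'])
```

Checks 2 and 3 would then consume `to_recheck` and the proof list as follow-up download tasks, which is exactly where the add-tasks-from-inside-the-loop question above comes in.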

@sbassi (Collaborator) commented Dec 9, 2017

@mpacer Could you resolve the merge conflicts?

@mpacer (Collaborator, Author) commented Dec 10, 2017

I'm not too concerned with the merge conflicts, as I don't want this to be merged yet. It's still a long way off, and I'm going to need a lot more time to think about how to approach this. When I push more commits I'll definitely fix the conflicts, but I'll wait until then so that I can resolve them all at once. That's generally how I handle merge conflicts on slow-moving WIP PRs, since they only matter once the PR is no longer WIP.

@sbassi (Collaborator) commented Dec 10, 2017 via email

4 participants