
[WIP] Asynchronous corpus updates #46

Open

wants to merge 11 commits into `master`
Conversation

@mpacer (Collaborator) commented Dec 1, 2017

This is a first pass at making the corpus update commands work asynchronously.

There's still a lot I need to do, but this is enough to at least get the ball rolling about how to think about this.

Using async/await syntax requires Python 3.5+, so I updated the requirements accordingly.

Some of the code here originates from the tutorials I was splicing together to accomplish this, so if something doesn't seem to make sense, please point it out so I can address it.

Notes to individuals

@eseiver
Explaining `plos_corpus.download_check_and_move`: Could you explicitly describe the reasoning behind each step of your `plos_corpus.download_check_and_move` function, from the perspective of working on an individual file? Right now the approach is to process everything as a batch in steps, but from an async perspective that works against using the available cycles efficiently.

Review of `Article.from_bytes`: Also… if I don't use the BytesIO wrapper and read the bytes directly, the object created by et.from_string has different content than what is currently being saved to disk.

I don't like the `from_bytes` method having the ability to write articles to the corpus, but it seems like there should be some way to do that. I wanted to start with a minimal surface area on existing objects, so I included a single method that does everything I want. Nonetheless, I think there are a lot of API discussions we'll need to have around this.

@njsmith it would be nice to be able to add tasks from inside the inner loop (`fetch`, based on how the code is structured as of 07ffabc). That way we could trigger the corrected-article download the moment it's needed, rather than in a batch after the fact. Is the only way to add tasks from an inner coroutine to use a global variable, define the inner loop as a closure (so it has access to the enclosing scope's list), or pass the list in explicitly? It seems like that kind of state management could be handled more cleanly (a global variable, in particular, seems like a terrible idea), but I'm not sure how to approach this.

@njsmith commented Dec 1, 2017

I guess you could have a dedicated job dispatching task that does

while True:
    await asyncio.sleep(...)
    job = await q.get()
    asyncio.ensure_future(job)

I guess the tricky bit then would be figuring out when to shut everything down... the problem is you don't know ahead of time how many jobs you'll need...
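One conventional answer to the shutdown question is a sentinel: the producer enqueues a marker object once it knows no more jobs are coming, and the dispatcher drains the queue until it sees that marker, then waits for everything it started. A minimal sketch (all names here are illustrative, not from this PR):

```python
import asyncio

STOP = object()  # sentinel meaning "no more jobs will arrive"

async def dispatcher(q):
    """Run queued coroutines until the sentinel is seen, then wait for them."""
    running = []
    while True:
        job = await q.get()
        if job is STOP:
            break
        running.append(asyncio.ensure_future(job))
    if running:
        await asyncio.gather(*running)

async def main():
    results = []

    async def fetch(n):
        await asyncio.sleep(0)   # stand-in for real network I/O
        results.append(n * 2)

    q = asyncio.Queue()
    for n in (1, 2, 3):
        await q.put(fetch(n))
    await q.put(STOP)            # the producer signals it is done
    await dispatcher(q)
    return results

out = sorted(asyncio.run(main()))
print(out)  # → [2, 4, 6]
```

This sidesteps "how many jobs?" entirely: the dispatcher doesn't need a count, only a promise that the sentinel arrives after the last job.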

@mpacer (Collaborator, Author) commented Dec 1, 2017

I was thinking that something like trio's nurseries would be good for that, since they do the babysitting for all of the tasks (so adding a task could happen through the context manager that would be passed around)… I didn't use trio because it doesn't integrate with aiohttp, and I thought raw sockets were overpowered for this task.

It really seems like there should be a nice, ready-made pattern for this somewhere in the concurrent-programming universe… isn't handling an unknown number of incoming jobs part of a server's job description?

@eseiver (Collaborator) commented Dec 5, 2017

@mpacer, happy to explain the different checking steps of `download_check_and_move`. For the most part, these are run on the new article files right after they've been downloaded to the temporary directory. Also note that a number of these functions will change slightly in my current feature branch, which brings the Article class into the plos_corpus functions.

  1. `check_for_corrected_articles`: for the newly downloaded articles, are any of them the 'correction' type? (In the new version, I'm changing this to three "amendment" article types that refer to other PLOS articles: correction, retraction, and expression of concern.) The idea is that this article type implies that the PLOS article named in the related-article field (with the appropriate related-article-type) may have been updated in some way. It's used to create a list of articles whose XML content should be checked for updates since it was last downloaded. Those articles are likely to be in corpusdir rather than the temp download folder, depending on how long ago the last update was run.

  2. `download_corrected_articles`: for the list of articles derived in step 1, compare the XML on the server with the local copy. If there's a difference, download the server version into the temp folder.

  3. `download_vor_updates`: for 5 of the 7 PLOS journals, an 'early version' or uncorrected proof of the article is published before the final version replaces it in place at the same DOI (the version of record, or 'VOR'). allofplos maintains a text list of DOIs of uncorrected proofs, generated from local articles (uncorrected_proofs_text_list). This function reads in that text list and checks the proof status (uncorrected vs. VOR) in the remote XML file. If it's now a VOR, it downloads the new version to the tempdir and removes that DOI from the in-memory uncorrected-proof list.

  4. `check_for_uncorrected_proofs`: this checks every file in the tempdir to see if it's an uncorrected proof. If it is, its DOI is added to the in-memory uncorrected-proof list. That list is then written back to uncorrected_proofs_text_list.

So 1 and 4 are standard checks that could be pretty easily incorporated into an async workflow, whereas 2 and 3 check and/or download a different set of files, which are likely not to be in the temporary directory.
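A per-file version of checks 1 and 4 might be sketched like this. Every helper name and check below is a hypothetical stand-in, not the real allofplos API; the point is only that each article can be inspected as soon as its bytes arrive, rather than in batch passes:

```python
import asyncio

# Hypothetical per-file pipeline: each downloaded article is handled
# independently, so checks 1 and 4 run as soon as its XML is in hand.

async def is_amendment(xml):
    return b"correction" in xml            # stand-in for the real check

async def is_uncorrected_proof(xml):
    return b"uncorrected" in xml           # stand-in for the real check

async def process_article(doi, xml, to_recheck, proofs):
    # Check 1: amendment articles point at others that may need re-download.
    if await is_amendment(xml):
        to_recheck.append(doi)
    # Check 4: record uncorrected proofs so the VOR poll can find them later.
    if await is_uncorrected_proof(xml):
        proofs.add(doi)

async def main():
    to_recheck, proofs = [], set()
    downloads = {                          # fake downloaded articles
        "10.1371/a": b"<article>correction</article>",
        "10.1371/b": b"<article>uncorrected proof</article>",
    }
    await asyncio.gather(*(
        process_article(doi, xml, to_recheck, proofs)
        for doi, xml in downloads.items()
    ))
    return to_recheck, sorted(proofs)

out = asyncio.run(main())
print(out)  # → (['10.1371/a'], ['10.1371/b'])
```

Checks 2 and 3 would then consume `to_recheck` and the proof list as follow-up download tasks, which is exactly where the add-tasks-from-inside-the-loop question above comes in.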

@sbassi (Collaborator) commented Dec 9, 2017

@mpacer Could you resolve the merge conflicts?

@mpacer (Collaborator, Author) commented Dec 10, 2017

I'm not too concerned with the merge conflicts, as I don't want this to be merged yet. It's still a long way off, and I'm going to need a lot more time to think about how to approach this. When I push more commits I'll definitely fix the conflicts, but I'll wait until then so that I can resolve them all at once. That's generally how I handle merge conflicts on slow-moving WIP PRs, since they only matter once the PR is no longer WIP.

@sbassi (Collaborator) commented Dec 10, 2017 via email

4 participants