Add support for additional web archives using Memento TimeMaps. #1
Comments
I'd love to support more archives; however, two things make the Wayback Machine attractive for this:
Time Travel is lovely, but it takes a really long time to load (30-50 seconds for any given lookup, and many of them just time out), and it doesn't (as far as I can tell) support any form of site search or URL prefix matching. Given the connection speed of some of the devices I'm targeting, I'd much prefer to reduce waiting as much as possible, as a lot of it will be in the network. The Internet Archive's CDX API also allows significant optimisation of the code on my end because it supports complex filtering of snapshots. As far as I can tell, Memento APIs don't permit this, which in turn makes their responses take longer, as they have to return all their data at once. Do you have any suggestions for mitigating these performance issues?
Yeah, there's no doubt IA WM was the first, is the biggest, etc., and if you can only support one, that's the one. For the 90s, IA is pretty much the only game in town, and if the other archives have pages, they're typically just copies of IA's WARCs (not always, but mostly). Most other archives don't support prefix search, etc. yet, so there will be a trade-off between breadth and features.

One solution would be to offer different branches: IA in one, and various non-IA archives in another. You don't have to go through TT; you could contact some of the other archives directly. Or you could run your own instance of MemGator and specify the non-IA archives that you'd like to poll (e.g., just arquivo.pt, archive.today, perma.cc, and wayback.vefsafn.is). The non-IA archives are likely to be sparse for many URLs, so the responses should be small and relatively quick.

Try MemGator; it doesn't do any processing or pagination and is thus pretty fast.

$ time curl -isL memgator.cs.odu.edu/timemap/link/www.nasa.gov
real 0m18.784s
real 0m6.465s

The second call responded quickly (6s) because IA had cached its response. But the first call at 18s isn't too bad given the size of the response. Other formats are similar:

$ time curl -isL memgator.cs.odu.edu/timemap/json/www.nasa.gov | wc -l
real 0m10.538s
real 0m13.450s

Finally, and I know you've already mentioned it in your repo, but regardless of the endpoint, some kind of caching would be a huge win for your application. It might even be worth it to go custom, since you're focused on data prior to a certain year (2000? 2005? 2010?). Most of the updates from all archives are going to come from the recent past. But even a standard reverse proxy would be super speedy. If you can keep robots out of your service, you'll probably get a lot of cache hits.
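As a rough illustration of the caching suggestion above, here is a minimal in-memory sketch in Python. The `TTLCache` class, its `fetch` callback, and the injectable `clock` are hypothetical names made up for this example; nothing here is taken from the project or from MemGator. Because captures from before the cut-off year rarely change, a long time-to-live seems plausible, but the right value is an assumption to tune.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Tiny time-based cache in front of a slow lookup (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # the slow call, e.g. an HTTP request to an archive
        self._ttl = ttl_seconds      # how long a stored response stays fresh
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]            # fresh entry: skip the network entirely
        value = self._fetch(key)     # missing or stale: fetch and remember
        self._entries[key] = (now, value)
        return value
```

In practice `fetch` would wrap whatever HTTP client the service already uses; a reverse proxy in front of the service would achieve a similar effect without any application code.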
I'm guessing you happened to prime a cache on their end before your testing, because when I time a request to mementoweb's API it times out after two minutes, which isn't enough time to fetch the history for, for instance, apple.com.

I just don't believe that the Memento protocol as designed is fit for this purpose, given that it's missing fundamental features like filtering, date range specifiers, or even rudimentary pagination. Features like those are the only way I can see that something like this can consume the API in an efficient, performant way; the CDX API lets me reduce the complexity on both ends of this equation by requesting only a small subset of the data for the initial query and, once the user has drilled down, a month's worth of less-filtered data.
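The two-stage approach described above could look roughly like the following Python sketch. It only builds query URLs for the Wayback Machine CDX server using its documented `url`, `from`, `to`, `filter`, `collapse`, `fl`, and `output` parameters. The function names and the specific choice of filters are assumptions for illustration, not the project's actual queries.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def overview_query(url: str, last_year: int) -> str:
    """Coarse first query: roughly one successful capture per year."""
    params = [
        ("url", url),
        ("to", str(last_year)),          # assumed inclusive upper bound, per the CDX docs
        ("filter", "statuscode:200"),    # drop redirects and errors server-side
        ("collapse", "timestamp:4"),     # collapse captures sharing the same yyyy prefix
        ("fl", "timestamp,original"),    # return only the fields needed for an overview
        ("output", "json"),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def month_query(url: str, year: int, month: int) -> str:
    """Drill-down query: every capture in one month, with fewer filters."""
    stamp = f"{year:04d}{month:02d}"     # partial timestamp, e.g. 199807
    params = [
        ("url", url),
        ("from", stamp),
        ("to", stamp),
        ("fl", "timestamp,original,statuscode"),
        ("output", "json"),
    ]
    return f"{CDX_ENDPOINT}?{urlencode(params)}"
```

For example, `overview_query("apple.com", 1999)` would ask for a yearly summary, and `month_query("apple.com", 1998, 7)` would fetch July 1998 once the user has picked that month.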
I've realised overnight that the site search thing isn't really a blocker to this; there's no reason I couldn't leverage the Wayback Machine for site search and an aggregator for the actual history, but the performance issues due to limitations of the Memento APIs remain. I missed this bit in your prior response, about MemGator not doing any processing or pagination:
The lack of processing or pagination is, IMO, exactly the cause of the trouble! It means that each archive used by the aggregator has its full history queried, which means fetching a potentially huge number of items.
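To make the cost concrete, here is a hedged Python sketch of what client-side filtering of a link-format TimeMap looks like. The whole response has to be downloaded and parsed before any date filter can be applied. The function names are hypothetical, and the regular expressions assume the simple `<uri>; attr="value"` layout shown in RFC 7089's examples rather than the full link-format grammar.

```python
import re
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Iterator, List, Tuple

# One link entry: <uri> followed by its attribute text, up to the next '<'.
LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
# One quoted attribute such as rel="memento" or datetime="...".
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def iter_mementos(timemap: str) -> Iterator[Tuple[datetime, str]]:
    """Yield (capture time, memento URI) for each memento link in a TimeMap."""
    for uri, raw_attrs in LINK_RE.findall(timemap):
        attrs = dict(ATTR_RE.findall(raw_attrs))
        rel_tokens = attrs.get("rel", "").split()
        # rel may be "memento", "first memento", "last memento", and so on.
        if "memento" in rel_tokens and "datetime" in attrs:
            yield parsedate_to_datetime(attrs["datetime"]), uri


def mementos_before(timemap: str, year: int) -> List[Tuple[datetime, str]]:
    """Keep only captures made before the given year (filtering happens locally)."""
    return [(when, uri) for when, uri in iter_mementos(timemap) if when.year < year]
```

Every entry is transferred and parsed even when only a handful survive the filter, which is the overhead a server-side date range or pagination parameter would avoid.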
There are many additional web archives that could be supported, esp. if this service used Memento TimeMaps, either aggregated through TimeTravel or directly via their TimeMap URIs.
Some lists of archives:
Memento Quick Intro