Incremental time series #115

Merged
tunetheweb merged 3 commits into master from incremental-timeseries
Jul 7, 2021


tunetheweb
Member

Each month we run the complete time series across all the dates, even though only the latest month is updated each time. This is wasteful and slow, as discussed in #110 and also in requests to expand the lenses.

This PR changes the default to check the latest date already in the results JSON on Google Cloud Storage, add WHERE clauses so the queries only look at dates after that, and then merge the new data in.
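As a rough illustration of the incremental flow described above (not the actual script in this PR), here is a minimal Python sketch. The function names, the `date` and `client` field names, and the `YYYY_MM_DD` date strings are all assumptions made for the example; it only relies on dates being zero-padded, year-first strings so that they sort correctly as text.

```python
def latest_date(rows, date_key="date"):
    """Return the most recent date string in rows, or None if rows is empty."""
    return max((row[date_key] for row in rows), default=None)


def date_filter(latest, column="date"):
    """Build a WHERE fragment that only selects data after `latest`."""
    if latest is None:
        # Nothing stored yet, so fall back to querying the full time series.
        return ""
    return f"WHERE {column} > '{latest}'"


def merge_rows(existing, new, keys=("date", "client")):
    """Merge new rows into existing ones; new rows win on key collisions."""
    def key_of(row):
        return tuple(row[k] for k in keys)

    merged = {key_of(row): row for row in existing}
    merged.update({key_of(row): row for row in new})
    return sorted(merged.values(), key=key_of)


# Hypothetical data standing in for the results JSON already in storage.
existing = [
    {"date": "2021_04_01", "client": "desktop", "value": 1},
    {"date": "2021_05_01", "client": "desktop", "value": 2},
]
# Hypothetical rows returned by the date-limited query.
new = [{"date": "2021_06_01", "client": "desktop", "value": 3}]

where_clause = date_filter(latest_date(existing))
combined = merge_rows(existing, new)
```

Keying the merge on date plus client means a rerun of an already stored month overwrites that month rather than duplicating it, which is one plausible way to keep the merged JSON consistent.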

For certain queries (CrUX and blink_features) it runs the full time series (as per the old way) as joins are messy and these are quick anyway.

Using the -f flag forces the whole time series to regenerate. The date limits also cap the future date (e.g. to prevent including partial data when desktop for the next run is already complete). The -f flag can also be used with the previously introduced -r option to limit full reruns to one or more reports.
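To make the interaction between the default, -f, and -r concrete, here is a small Python sketch of the decision per report. It is only an illustration under assumptions: the `run_mode` function, the report names, and matching CrUX and blink_features reports by a substring of the report name are all hypothetical and not taken from the real script.

```python
# Assumption: reports that always run the full time series can be spotted
# by a substring of their name. The real script may detect them differently.
ALWAYS_FULL = ("crux", "blink_features")


def run_mode(report, force=False, only_reports=None):
    """Decide how one report is run: 'skip', 'full', or 'incremental'.

    force        mirrors the -f flag (regenerate the whole time series).
    only_reports mirrors the -r option (restrict the run to these reports).
    """
    if only_reports and report not in only_reports:
        return "skip"
    if force or any(tag in report.lower() for tag in ALWAYS_FULL):
        return "full"
    return "incremental"


# Hypothetical report names, used only to exercise the function.
modes = {
    name: run_mode(name, force=True, only_reports={"a11yScore"})
    for name in ("a11yScore", "bytesJs")
}
```

With these inputs, only the report named in `only_reports` is fully regenerated and the other is skipped, which is the "limit full reruns to one or more reports" behaviour described above.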

To test, I ran the drupal lens for the June rerun (see HTTPArchive/httparchive.org#361) with this new method and it only took a couple of hours to run (though ignore some of the a11y queries for drupal, as I messed those up while testing and am rerunning them now). The wordpress lens, run with the old method, is still going nearly 24 hours later. I'll run the magento lens using this new method after merging this as a post-release test.

Member

rviscomi left a comment


I'm unable to give this one a more thorough review, but at a glance I don't see any major issues. Happy to merge it in and prove it works by running it on the next crawl.

tunetheweb merged commit e316b8f into master Jul 7, 2021
tunetheweb deleted the incremental-timeseries branch July 7, 2021 21:05
tunetheweb
Member Author

magento June reports have been rerun successfully using this new incremental logic.

A couple of small issues came up with the recently updated a11y and pwa score queries, but those have been fixed now. Other than that it went well and ran quickly (an hour or so in total, instead of 12 hours last month for this lens).

2 participants