Improvements: low hanging fruits #104
Comments
Hi @PathosEthosLogos! Thanks for the input. We really do appreciate anyone who takes the time to give us feedback. To answer your questions/suggestions:
To clarify, it takes roughly 30 hours to run a full cross-validation cycle of the model that checks a huge amount of the hyperparameter space. Training a single LightGBM model using the finalized parameters and data takes <10 minutes. We have some open issues (#36, #38) to reduce the CV time, but they aren't high priority. Boosted tree models tend to have a hyperparameter "plateau", i.e. a stable region of parameters where tuning further doesn't change performance much.

RE: Hardware and AWS. I did a whole dive benchmarking different models, hardware, and costs. The tl;dr is that it's cheapest and most efficient to simply provision AWS Batch jobs (with any hardware specs we want) to run our models. This lets us run many jobs in parallel and spin down the jobs when we aren't modeling (which is most of the year).
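For context, a minimal sketch of the fast "single model" case, using the `lightgbm` R package with made-up data and parameter values (not our finalized ones):

```r
library(lightgbm)

# Toy stand-in for the real training data
dtrain <- lgb.Dataset(
  data = as.matrix(mtcars[, -1]),
  label = mtcars$mpg
)

# Fixed, already-tuned parameters: no CV search over the
# hyperparameter space, so this trains quickly.
params <- list(objective = "regression", learning_rate = 0.05, num_leaves = 31)
model <- lgb.train(params = params, data = dtrain, nrounds = 100)
```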
The vast majority of the pipeline's runtime comes from the model itself, or from calculating feature importance, SHAP values, etc. Given the choice between refactoring for a marginal speedup of something that already works and trying new modeling methods/features/diagnostics, we're opting for the latter.
This is a totally fair point 😄. At some point we need to go through and clean up the old comments.
It seems that there is a bit of a misunderstanding; I will try to clarify. (It seems the AWS + consumer-grade GPU part isn't really a low hanging fruit anymore, so I will skip it, as well as LGBM, since it requires a bit more work for saving model objects for production.) I want to emphasise the low hanging fruit of `tidytable`.
Ah, got it. Sure, I'm happy to test out `tidytable`.
@Damonamajor or @wagnerlmichael, one of you want to take a crack at this upcoming week?
Looks like using `tidytable` …
Hi there, saw the talk at Rconf. Looks like you've done a lot of work here. It looks solid, though there are many improvements that could be made. After a quick glance at the repo, I have some low hanging fruit suggestions that I think should be seriously considered.
I will start with a bit of a sanity check -- in the README, it says:
Is 6GB RAM a typo? In the talk, Dan says that this model takes a few days to run, but 6GB is below average (some might argue below half the average) for today's $500 consumer-grade home laptop, so it was confusing why AWS would be necessary, or why it would take a few days with LGBM. I saw the 5 download files, and combined they total well below 1GB, though the joined data is probably a different story; that makes 6GB of RAM sound about right for modelling, which makes the multi-day runtime all the more puzzling. For handling $16B in assets, a GPU plus some extra RAM at a cost of $400 to $1,500 is virtually nothing in comparison and would serve the public much better. Some clarification on this would be appreciated.
So this leads me to the low hanging fruit: better code/lib choices.
I see a lot of `dplyr` functions, `magrittr` and `%>%` instead of `|>`, etc. Switching to `tidytable` would also be an extremely low hanging fruit while keeping 99% of the `dplyr` codebase. You can remove all `dplyr::`, `tidyr::`, `tidyselect::`, and `purrr::` prefixes (I might be missing a few more), add `library(tidytable); conflict_prefer_all(winner = "tidytable", quiet = TRUE)`, and the 1% change needed would be the 12 places where you use `group_by`. Here is a demo of the very simple changes (sketched below). And the performance gains would feel the same as switching from XGBoost to LGBM, while still not breaking 99% of the code.
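A minimal sketch of the kind of change being suggested, using hypothetical data and column names (`sales`, `township`, `sale_price`) rather than the repo's actual code:

```r
# Before: dplyr + magrittr pipe, explicit group_by()
library(dplyr)
sales %>%
  group_by(township) %>%
  summarize(median_price = median(sale_price, na.rm = TRUE))

# After: tidytable masks the same verbs; conflicted resolves name clashes.
# The group_by() calls become the .by argument (the "1% change").
library(conflicted)
library(tidytable)
conflict_prefer_all(winner = "tidytable", quiet = TRUE)
sales |>
  summarize(median_price = median(sale_price, na.rm = TRUE), .by = township)
```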
Plus, though a very minor gain, `|>` is faster than `%>%`, and switching also removes the dependency on the `magrittr` library (a micro-benchmark sketch is at the end of this comment).

And lastly, base R is slower than C++-based libraries akin to `tidytable`. Though this is not as low-hanging a fruit, converting code from base R to `tidytable` makes it much easier to read, more collaborative, and reduces human communication time.

And as a minor nitpick bonus, some of your code comments are 2x or 5x longer than the code itself. This shouldn't be necessary, especially for tidy-piped syntax when the code is already so easy to read.