Improvements: low hanging fruits #104
Comments
Hi @PathosEthosLogos! Thanks for the input. We really do appreciate anyone who takes the time to give us feedback. To answer your questions/suggestions:
To clarify, it takes roughly 30 hours to run a full cross-validation cycle of the model that checks a huge amount of the hyperparameter space. Training a single LightGBM model using the finalized parameters and data takes <10 minutes. We have some open issues (#36, #38) to reduce the CV time, but they aren't high priority. Boosted tree models tend to have a hyperparameter "plateau", i.e. a stable region of parameters where tuning further doesn't change performance much.

RE: Hardware and AWS. I did a whole dive benchmarking different models, hardware, and costs. The tl;dr is that it's cheapest and most efficient to simply provision AWS Batch jobs (with any hardware specs we want) to run our models. This lets us run many jobs in parallel and spin down the jobs when we aren't modeling (which is most of the year).
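For context, a minimal sketch of the fast "single model" case, using the `lightgbm` R package with made-up data and parameter values (not our finalized ones):

```r
library(lightgbm)

# Toy stand-in for the real training data
dtrain <- lgb.Dataset(
  data = as.matrix(mtcars[, -1]),
  label = mtcars$mpg
)

# Fixed, already-tuned parameters: no CV search over the
# hyperparameter space, so this trains quickly.
params <- list(objective = "regression", learning_rate = 0.05, num_leaves = 31)
model <- lgb.train(params = params, data = dtrain, nrounds = 100)
```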
The vast majority of the pipeline's runtime comes from the model itself, or from calculating feature importance, SHAP values, etc. Given the choice between refactoring for a marginal speedup of something that already works and trying new modeling methods/features/diagnostics, we're opting for the latter.
This is a totally fair point 😄. At some point we need to go through and clean up the old comments.
It seems that there is a bit of a misunderstanding; I will try to clarify. (It seems the AWS + consumer-grade GPU part isn't really a low hanging fruit anymore, so I will skip it, as well as LGBM, since it requires a bit more work for saving model objects for production.) I want to emphasise the low hanging fruit of `tidytable`.
Ah, got it. Sure, I'm happy to test out `tidytable`.
@Damonamajor or @wagnerlmichael, one of you want to take a crack at this upcoming week?
Looks like using `tidytable` …
Hi there, saw the talk at Rconf. Looks like you've done a lot of work here. It looks solid, though there are many improvements that could be made. After a quick glance at the repo, I have some low hanging fruit suggestions that I think should be seriously considered.
I will start with a bit of a sanity check -- in the README, it says:
Is 6GB RAM a typo? In the talk, Dan says that this model takes a few days to run, but 6GB is below average (some might argue below half the average) for today's $500 consumer-grade home laptop, so it was confusing why AWS would be necessary, or why it would take a few days with LGBM. I saw the 5 download files, and combined they total well below 1GB, though the joined data is probably a different story; that makes 6GB of RAM sound about right for modelling, which makes the multi-day runtime all the more puzzling. For handling $16B in assets, a GPU plus some extra RAM at a cost of $400 to $1,500 is virtually nothing in comparison and would serve the public much better. Some clarification on this would be appreciated.
So this leads me to the low hanging fruit: better code/lib choices.
I see a lot of `dplyr` functions, `magrittr` and `%>%` instead of `|>`, etc. Switching to `tidytable` would also be an extremely low hanging fruit while keeping 99% of the `dplyr` codebase. You can remove all `dplyr::`, `tidyr::`, `tidyselect::`, and `purrr::` prefixes (I might be missing a few more), add `library(tidytable); conflict_prefer_all(winner = "tidytable", quiet = TRUE)`, and the 1% change needed would be the 12 places where you use `group_by`. Here is a demo of the very simple changes (sketched below). And the performance gains would feel the same as switching from XGBoost to LGBM, while still not breaking 99% of the code.
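A minimal sketch of the kind of change being suggested, using hypothetical data and column names (`sales`, `township`, `sale_price`) rather than the repo's actual code:

```r
# Before: dplyr + magrittr pipe, explicit group_by()
library(dplyr)
sales %>%
  group_by(township) %>%
  summarize(median_price = median(sale_price, na.rm = TRUE))

# After: tidytable masks the same verbs; conflicted resolves name clashes.
# The group_by() calls become the .by argument (the "1% change").
library(conflicted)
library(tidytable)
conflict_prefer_all(winner = "tidytable", quiet = TRUE)
sales |>
  summarize(median_price = median(sale_price, na.rm = TRUE), .by = township)
```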
Plus, though a very minor gain, `|>` is faster than `%>%`, and switching also removes the dependency on the `magrittr` library (a micro-benchmark sketch is at the end of this comment).

And lastly, base R is slower than C++-based libraries akin to `tidytable`. Though this is not as low-hanging a fruit, converting code from base R to `tidytable` makes it much easier to read, more collaborative, and reduces human communication time.

And as a minor nitpick bonus, some of your code comments are 2x or 5x longer than the code itself. This shouldn't be necessary, especially for tidy-piped syntax when the code is already so easy to read.