IbisML and GridSearchCV #136
Implementing CV isn't necessarily out of scope, but IbisML will probably need to reimplement it, because reproducible shuffling/splitting on databases is hard-ish; we do have an implementation of the much simpler single split. CV was something we intentionally deprioritized for the initial release, but we will definitely take a second look!
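To illustrate why reproducible splitting without an RNG is at least feasible, here is a minimal sketch (not the IbisML implementation, just the idea) that buckets rows deterministically by an integer key; the toy table and column names are placeholders:

```python
import ibis

# Toy in-memory table stands in for a real backend table.
t = ibis.memtable({"id": list(range(100)), "x": list(range(100))})

bucket = t.id % 10               # deterministic bucket derived from the key, no RNG
train = t.filter(bucket < 8)     # ~80% of rows, same rows on every run and backend
test = t.filter(bucket >= 8)     # ~20% of rows
print(train.count().execute(), test.count().execute())
```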
Hi @koaning, thanks for your input. The reason we deprioritized it for our initial work is below; I've copied my reply from #135 (comment):
In my limited and outdated modeling experience, random or grid search may not be suitable for tuning some preprocessing transformations. Instead, preprocessing parameter tuning often requires feature analysis, for example, choosing between imputing a feature with the median or the mean. Do you have some insights here? We would like to learn more, to see whether we should prioritize this work.
It can be pretty reasonable to ask questions like "what if we add this feature, how much uplift do we get?" If you ever want to do stuff like that, it would sure help to be able to hyperparam that. It's not unreasonable, given the current design, not to allow for this, though. Feels like fair game considering that it is a different backend.
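To make that concrete, here is a minimal sketch of what grid-searching a preprocessing choice could look like, assuming IbisML's `Recipe`, `ImputeMean`/`ImputeMedian` steps, and `numeric()` selector plug into a scikit-learn `Pipeline` as documented; whether the fit works end to end with Ibis-backed data is exactly the open question in this issue:

```python
import ibis_ml as ml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Two candidate recipes; the "hyperparameter" is which preprocessing to use.
prep_mean = ml.Recipe(ml.ImputeMean(ml.numeric()), ml.ScaleStandard(ml.numeric()))
prep_median = ml.Recipe(ml.ImputeMedian(ml.numeric()), ml.ScaleStandard(ml.numeric()))

pipe = Pipeline([("prep", prep_mean), ("model", LogisticRegression())])

# GridSearchCV can swap out the entire "prep" step, so an optional feature or an
# imputation strategy becomes a searchable parameter like any other.
search = GridSearchCV(pipe, param_grid={"prep": [prep_mean, prep_median]}, cv=3)
# search.fit(X_train, y_train)  # X_train/y_train are placeholders for training data
```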
I agree, this is a very fundamental part of the workflow.
Part of the benefit of Ibis (and, by extension, IbisML) is that you can scale by choosing a more appropriate backend, instead of changing code. The way I see it (actually, the way I'm working on a large dataset myself), it would be a very reasonable user workflow to want to experiment locally on a smaller sample using the DuckDB backend, then scale to the full dataset using a distributed backend. As a result, I want to be able to use IbisML to define my preprocessing once. I never planned to tune hyperparameters on the full, multi-TB dataset (what's the point?), but I do want to run the same preprocessing at scale.
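A sketch of that workflow, under the assumption that a `Recipe` can be fit directly against an Ibis table expression (connection details and table names below are placeholders):

```python
import ibis
import ibis_ml as ml

# Define the preprocessing once...
recipe = ml.Recipe(ml.ImputeMean(ml.numeric()), ml.ScaleStandard(ml.numeric()))

# ...fit it on a local DuckDB sample while experimenting,
sample = ibis.duckdb.connect("local.ddb").table("events_sample")
recipe.fit(sample)

# ...then run the identical definition against a distributed backend later,
# with no changes to the recipe itself (connection shown commented out).
# full = ibis.pyspark.connect(session).table("events")
# recipe.fit(full)
```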
It makes sense to support this, but I wanted to highlight a few potential scenarios to consider; I did some investigation this morning.
Some updates here: it's something we plan to address soon.
I guess there is also another thing to consider: isn't the goal of IbisML to support not just scikit-learn but also other frameworks? I am mainly mentioning this because there is a risk of overfitting to sklearn, too.
It's a bit of both. :) As you say, we don't want to (and possibly can't) overfit to scikit-learn. For example, regarding some of the concerns brought up in https://github.com/ibis-project/ibis-ml/issues/135#issuecomment-2307729193, we probably don't need to focus too heavily on enabling things like caching; especially as users look to scale, the right approach with IbisML may involve adding a bit of simple custom code. For hyperparameter tuning, limited research when the IbisML project was started suggested that many target users in industry were using tools like Optuna, so we may focus more on enabling that workflow. That said, scikit-learn is the most popular way for Python-oriented data scientists to work, and where it's reasonable we want to meet users where they are and fit into existing workflows. To that end, I can probably look into a few things at least.
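As a rough illustration of that Optuna-style workflow (the preprocessing choice and scoring below are dummy placeholders; in a real setup you would build an IbisML recipe for the chosen strategy and return a validation score):

```python
import optuna

def objective(trial):
    # The preprocessing choice itself becomes a searchable parameter.
    imputer = trial.suggest_categorical("imputer", ["mean", "median"])
    # Placeholder scoring keeps the sketch runnable; swap in real fitting/evaluation.
    return 1.0 if imputer == "median" else 0.5

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params)
```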
Let's start with sklearn and consider other frameworks during design and implementation if possible. Thank you. PyTorch and TensorFlow themselves do not support k-fold CV; some higher-level libraries may do this.
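For reference, a framework-agnostic k-fold loop needs only scikit-learn's splitter; the indices it yields can drive a PyTorch or TensorFlow training loop just as well (toy data below):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # placeholder data
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # train_idx/test_idx index into the data however your framework loads it.
    print(fold, train_idx.tolist(), test_idx.tolist())
```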
I may have found a bug related to how IbisML integrates with grid search from sklearn, but it could be that this is out of scope for the project.
This gave me this error:
It seems that the issue originates from the Ibis side of things, hence the ping. If this is out of scope for this project, though, I will gladly hear it.