Tune function and CLI command #1397
Conversation
Great job @seraphimstreets! 👏 Let's implement the changes and discuss what needs to be discussed over the next weekend.
@programmer290399 if you have anything to add, please do. There are some things here I have marked for discussion; let's do that over the weekend.
... [
...     {"Years": 0, "Salary": 10},
...     {"Years": 1, "Salary": 20},
...     {"Years": 2, "Salary": 30},
...     {"Years": 3, "Salary": 40}
... ],
... [
...     {"Years": 6, "Salary": 70},
...     {"Years": 7, "Salary": 80}
... ]
so we want the train and test sets to be passed in as keyword arguments like this:
score = await tune(
    model,
    ParameterGrid(objective="min"),
    MeanSquaredErrorAccuracy(),
    Features(
        Feature("Years", float, 1),
    ),
    train=[
        {"Years": 0, "Salary": 10},
        {"Years": 1, "Salary": 20},
        {"Years": 2, "Salary": 30},
        {"Years": 3, "Salary": 40}
    ],
    test=[
        {"Years": 6, "Salary": 70},
        {"Years": 7, "Salary": 80}
    ],
)
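For reference, a minimal sketch of what the keyword-argument form could look like (parameter names other than train/test are taken from the call above; this is not necessarily the final signature):

async def tune(model, tuner, accuracy_scorer, features, *, train, test):
    """Tune the model's hyperparameters on train, scoring each candidate on test."""
    ...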
dffml/high_level/ml.py
Outdated
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in train_ds
):
    train_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *train_ds,
        model=model,
    )
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in valid_ds
):
    valid_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *valid_ds,
        model=model,
    )
Avoid repetition of code.
I don't think we want another function for this.
And the conditions have different loops, so it might be too complex to combine them. If you can figure out a simple way, that is most preferable.
Otherwise, let's just loop over both datasets outside the condition, I suppose (see the sketch below).
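One possible shape for that, sketched against the block above (not meant as the final code):

# Convert whichever splits arrived as lists of lists into record dicts,
# using a single loop instead of repeating the block per dataset.
datasets = {"train": train_ds, "test": valid_ds}
for name, ds in datasets.items():
    if hasattr(model.config, "features") and any(isinstance(td, list) for td in ds):
        datasets[name] = list_records_to_dict(
            [feature.name for feature in model.config.features] + predict_feature,
            *ds,
            model=model,
        )
train_ds, valid_ds = datasets["train"], datasets["test"]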
dffml/high_level/ml.py
Outdated
elif isinstance(model, ModelContext):
    mctx = model

# Allow for keep models open
# Allow scorers to be kept open
@@ -51,6 +51,7 @@ def inpath(binary):
("operations", "nlp"),
("service", "http"),
("source", "mysql"),
("tuner", "bayes_opt_gp"),
Let's have a simpler, more understandable entrypoint.
dffml/tuner/parameter_grid.py
Outdated
if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc

This blank line doesn't feel right.
for i in range(len(combination)):
    param = names[i]
    setattr(model.parent.config, names[i], combination[i])
await train(model.parent, *train_data)
acc = await score(
    model.parent, accuracy_scorer, feature, *test_data
)

logging.info(f"Accuracy of the tuned model: {acc}")
if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
elif self.parent.config.objective == "max":
    if acc > highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
Seems like there could be some code recurring through the different tuners. Let's consider the possibility of using helper/utility functions (a rough sketch follows below).
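For example, the objective check and best-config bookkeeping could live in a shared helper roughly like this (name and placement are hypothetical):

def update_best(objective, acc, best_acc, names, config, best_config):
    # Record the current parameter values whenever acc improves on best_acc
    # under the given objective ("min" or "max"), and return the new best.
    improved = acc < best_acc if objective == "min" else acc > best_acc
    if improved:
        best_acc = acc
        for param in names:
            best_config[param] = getattr(config, param)
    return best_acc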
examples/rockpaperscissors/tune.sh
Outdated
Remove the extra blank lines.
with model.parent.config.no_enforce_immutable():
    for _ in range(self.parent.config.trials):
        combination = []
        for pvs in self.parent.config.parameters.values():
What does .values() get here?
It's the values for the search space of the parameter, e.g. [1, 10, 100] for n_iterations.
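In other words, a parameters config might look something like this (values are only illustrative):

parameters = {
    "n_iterations": [1, 10, 100],
    "learning_rate": [0.01, 0.1, 1.0],
}
# parameters.values() yields the candidate lists, e.g. [1, 10, 100],
# and one value is drawn from each list to build a combination.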
async def test_03_tune(self):
    acc = await tune(
        self.model,
        self.tuner,
        self.scorer,
        Feature("label", str, 1),
        [DirectorySource(
            foldername=str(self.traindir) + "/rps",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
        [DirectorySource(
            foldername=str(self.testdir) + "/rps-test-set",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
    )
    self.assertGreater(acc, 0.0)
Okay, so here's the drill: normally we won't want to test tuners for each model unless they behave differently for each model.
We need to add the unit tests in their respective tuner modules,
i.e. you add a test for each tuner in its respective directory.
Even if they do behave differently for each model, you add a unit test that loops through all available models and tunes them.
Also, a reasonable way to check whether a tuner is performing well would be to see if it is actually tuning. E.g. for parameter grid you can have two or three sets of values: make one of them optimal and the other two absurd values that would throw predictions off, then place a check for the optimized parameters (see the sketch below).
It might still end up as self.assertGreater(acc, 0.0) for the random search and similar tuners, but you get the idea; let's see if you can derive anything else.
Let's discuss these further in a meeting with @programmer290399 and figure out the best way to set these up.
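As a rough sketch of that kind of check (setup, sources, and parameter names are illustrative, not the final tests):

async def test_parameter_grid_picks_sane_values(self):
    # One sensible value plus two absurd ones that should throw predictions off.
    tuner = ParameterGrid(
        parameters={"learning_rate": [0.1, 1000.0, -5.0]},
        objective="max",
    )
    acc = await tune(
        self.model, tuner, self.scorer, self.features,
        self.train_source, self.test_source,
    )
    self.assertGreater(acc, 0.0)
    # The tuned config should have settled on the sensible value.
    self.assertEqual(self.model.config.learning_rate, 0.1)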
packages = find:
install_requires =
    dffml>=0.4.0
    bayesian-optimization>=1.2.0
Is there an implementation in a library we already have in requirements, say sklearn? We always want to keep dependencies minimal.
There is BayesSearchCV in the scikit-optimize library (which is also a separate dependency), but it requires that the model implement the sklearn estimator API.
dffml/high_level/ml.py
Outdated
else:
    predict_feature = [model.config.predict.name]

def records_to_dict_check(ds):
Let's pull this out into the global scope or dffml/util/internal.py
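For reference, the pulled-out helper might look roughly like this in dffml/util/internal.py (the exact signature is an assumption):

def records_to_dict_check(ds, model, predict_feature):
    # Only convert when the model declares features and the split was passed
    # as lists of lists rather than record dicts.
    if hasattr(model.config, "features") and any(isinstance(td, list) for td in ds):
        ds = list_records_to_dict(
            [feature.name for feature in model.config.features] + predict_feature,
            *ds,
            model=model,
        )
    return ds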
nest_asyncio.apply()

def check_parameters(pars):
Let's pull this out into a method in the same class.
f"Optimizing model with Bayesian optimization with gaussian processes: {self.parent.config.parameters}" | ||
) | ||
|
||
def func(**vals): |
Let's pull this out into a method as well
    return acc

optimizer = BayesianOptimization(
    f=func,
f=func,
f=functools.partial(self.func, model),
It becomes:
def func(self, model, **vals):
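Concretely, the binding would look something like this (a sketch, assuming func becomes a method on the tuner context and pbounds holds the parameter bounds):

import functools

optimizer = BayesianOptimization(
    # The optimizer then calls self.func(model, **vals) for each candidate.
    f=functools.partial(self.func, model),
    pbounds=pbounds,
)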
    The highest score value
    """

nest_asyncio.apply()
nest_asyncio.apply()
This is probably something for a different PR, but it could potentially go well at the start of the noasync functions.
Since the train/score coroutines inside the asynchronous tune function have to be driven from synchronous callbacks, nest_asyncio is necessary. Shall I move it to the noasync file then?
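For context, the pattern that makes it necessary is roughly this (the coroutine name is illustrative):

import asyncio
import nest_asyncio

nest_asyncio.apply()  # allow run_until_complete() to be nested inside the running loop

def objective(**vals):
    # bayesian-optimization only accepts a plain synchronous callable, so the
    # async train/score coroutines have to be driven to completion from here.
    return asyncio.get_event_loop().run_until_complete(train_and_score(**vals))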
First, download the xgboost plugin for the DFFML library, which can be done via pip:

.. code-block:: console
    :test:
:test:
:test:

.. code-block:: console
    :test:

    $ python -u bayes_opt_gp_xgboost.py
In the same folder, we perform the CLI tune command.

.. code-block:: console
.. code-block:: console
.. code-block:: console
    :test:
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
-model-predict classification \
-model-location tempDir \
-tuner bayes_opt_gp \
-tuner-parameters @parameters.json \
-tuner-objective max \
-scorer clf \
-sources train=csv test=csv \
-source-train-filename iris_training.csv \
-source-test-filename iris_test.csv \
-source-train-tag train \
-source-test-tag test \
-features classification:int:1
These lines all need two more spaces of indentation in front of them.
@@ -0,0 +1,162 @@
Tuning a DFFML model with ParameterGrid
Let's add both these tutorials to the line in the CI YAML file here:
dffml/.github/workflows/testing.yml
Line 219 in 748c1b9
- docs/tutorials/sources/file.rst
Created a tune function in dffml/high_level/ml.py and allowed usage via the CLI. First step as part of the AutoML GSoC project: #968.
Testing
Tested the tune command with the ParameterGrid tuner against the XGBClassifier model (iris dataset) and XGBRegressor (small housing dataset). An example CLI command is as follows:
Download Iris datasets:
xgbtest.json
CLI command: