This repository has been archived by the owner on Aug 25, 2024. It is now read-only.

Tune function and CLI command #1397

Open
wants to merge 9 commits into main

Conversation

seraphimstreets commented:

Created a tune function in high_level.ml and allowed usage via the CLI. First step as part of the AutoML GSoC project: #968.

Testing

Tested the tune command with the ParameterGrid tuner on an XGBClassifier model (Iris dataset) and an XGBRegressor model (small housing dataset). An example CLI command is as follows:

Download the Iris datasets:

wget http://download.tensorflow.org/data/iris_training.csv 
wget http://download.tensorflow.org/data/iris_test.csv 
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
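(The sed command rewrites the nonstandard header row of the TensorFlow Iris CSVs into named columns, so the features SepalLength, SepalWidth, PetalLength, PetalWidth and the classification label can be referenced by name below.)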

xgbtest.json

{
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [20, 100, 200],
    "max_depth": [3,5,8]
}
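With three candidate values for each of the three hyperparameters, ParameterGrid will exhaustively evaluate 3 × 3 × 3 = 27 configurations.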

CLI command:

dffml tune \
-model xgbclassifier \
-model-features \
 SepalLength:float:1 \
 SepalWidth:float:1 \
 PetalLength:float:1 \
-model-predict classification \
-model-location tempDir \
-tuner parameter_grid \
-tuner-parameters @xgbtest.json \
-tuner-objective max \
-scorer clf \
-sources train=csv test=csv \
-source-train-filename iris_training.csv \
-source-test-filename iris_test.csv
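For reference, a rough Python-API equivalent of this CLI invocation, assuming the tune() signature exercised in the PR's tests below; the import paths for tune, the tuner, the scorer, and the xgboost plugin are assumptions:

import asyncio

from dffml import CSVSource, Feature, Features
from dffml.high_level.ml import tune  # assumed location of the new function
from dffml.tuner.parameter_grid import ParameterGrid  # assumed module path
from dffml.accuracy import ClassificationAccuracy  # assumed path for the "clf" scorer
from dffml_model_xgboost.xgbclassifier import (  # assumed plugin import path
    XGBClassifierModel,
    XGBClassifierModelConfig,
)

model = XGBClassifierModel(
    XGBClassifierModelConfig(
        features=Features(
            Feature("SepalLength", float, 1),
            Feature("SepalWidth", float, 1),
            Feature("PetalLength", float, 1),
        ),
        predict=Feature("classification", int, 1),
        location="tempDir",
    )
)

async def main():
    # Mirrors the -tuner-parameters @xgbtest.json and -tuner-objective max flags
    score = await tune(
        model,
        ParameterGrid(
            parameters={
                "learning_rate": [0.01, 0.05, 0.1],
                "n_estimators": [20, 100, 200],
                "max_depth": [3, 5, 8],
            },
            objective="max",
        ),
        ClassificationAccuracy(),
        Features(Feature("classification", int, 1)),
        [CSVSource(filename="iris_training.csv")],
        [CSVSource(filename="iris_test.csv")],
    )
    print(score)

asyncio.run(main())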

seraphimstreets changed the title from "tune function and CLI command" to "Tune function and CLI command" on Jun 19, 2022
mhash1m (Contributor) left a comment:

Great job @seraphimstreets! 👏 Let's implement the changes and discuss what needs to be discussed over the next weekend.
@programmer290399, if you have anything to add, please do. There are some things here I have marked for discussion; let's do that over the weekend.

Comment on lines +359 to +368
... [
... {"Years": 0, "Salary": 10},
... {"Years": 1, "Salary": 20},
... {"Years": 2, "Salary": 30},
... {"Years": 3, "Salary": 40}
... ],
... [
... {"Years": 6, "Salary": 70},
... {"Years": 7, "Salary": 80}
... ]
Contributor:

So we want the train and test sets to be passed in as keyword arguments, like this:

score = await tune(
    model,
    ParameterGrid(objective="min"),
    MeanSquaredErrorAccuracy(),
    Features(
        Feature("Years", float, 1),
    ),
    train=[
        {"Years": 0, "Salary": 10},
        {"Years": 1, "Salary": 20},
        {"Years": 2, "Salary": 30},
        {"Years": 3, "Salary": 40},
    ],
    test=[
        {"Years": 6, "Salary": 70},
        {"Years": 7, "Salary": 80},
    ],
)

Comment on lines 391 to 408
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in train_ds
):
    train_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *train_ds,
        model=model,
    )
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in valid_ds
):
    valid_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *valid_ds,
        model=model,
    )
Contributor:

Avoid repeating code. I don't think we want another function for this, and the conditions have different loops, so it might be too complex to combine them. If you can figure out a simple way, that is most preferable.
Otherwise, let's just loop over both datasets outside the condition, I suppose; a sketch of that follows.
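A sketch of that single-loop approach (not the merged code, just an illustration using the names from the diff above):

# Normalize both datasets in one pass instead of two near-identical branches.
if hasattr(model.config, "features"):
    feature_names = [feature.name for feature in model.config.features]
    datasets = [train_ds, valid_ds]
    for i, ds in enumerate(datasets):
        if any(isinstance(td, list) for td in ds):
            datasets[i] = list_records_to_dict(
                feature_names + predict_feature, *ds, model=model
            )
    train_ds, valid_ds = datasets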

elif isinstance(model, ModelContext):
    mctx = model

# Allow for keep models open
Contributor:

# Allow scorers to be kept open

@@ -51,6 +51,7 @@ def inpath(binary):
("operations", "nlp"),
("service", "http"),
("source", "mysql"),
("tuner", "bayes_opt_gp"),
Contributor:

Let's have a simpler, more understandable entrypoint name.

if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc

Contributor:

This line spacing doesn't feel right.

Comment on lines +92 to +114
for i in range(len(combination)):
    param = names[i]
    setattr(model.parent.config, names[i], combination[i])
await train(model.parent, *train_data)
acc = await score(
    model.parent, accuracy_scorer, feature, *test_data
)

logging.info(f"Accuracy of the tuned model: {acc}")
if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
elif self.parent.config.objective == "max":
    if acc > highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
Contributor:

It seems like some code recurs across the different tuners. Let's consider using helper/utility functions; a sketch follows.
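For instance, the min/max bookkeeping above could move into one shared helper that every tuner calls (a sketch; the helper name and placement are illustrative):

def update_best(objective, acc, best_acc, config, names, best_config):
    """Record acc and the current parameter values if they beat best_acc."""
    improved = acc < best_acc if objective == "min" else acc > best_acc
    if improved:
        best_acc = acc
        for param in names:
            best_config[param] = getattr(config, param)
    return best_acc

Each tuner would then just call highest_acc = update_best(self.parent.config.objective, acc, highest_acc, model.parent.config, names, best_config) after scoring.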

Comment on lines 28 to 30



Contributor:

Remove the extra blank lines.

with model.parent.config.no_enforce_immutable():
    for _ in range(self.parent.config.trials):
        combination = []
        for pvs in self.parent.config.parameters.values():
Contributor:

What does .values() get here?

Author:

It's the list of candidate values in the parameter's search space, e.g. [1, 10, 100] for n_iterations.
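For example, with the grid from the PR description:

parameters = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [20, 100, 200],
    "max_depth": [3, 5, 8],
}
# parameters.values() yields one candidate list per hyperparameter:
# [[0.01, 0.05, 0.1], [20, 100, 200], [3, 5, 8]]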

Comment on lines +155 to +173
async def test_03_tune(self):
    acc = await tune(
        self.model,
        self.tuner,
        self.scorer,
        Feature("label", str, 1),
        [DirectorySource(
            foldername=str(self.traindir) + "/rps",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
        [DirectorySource(
            foldername=str(self.testdir) + "/rps-test-set",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
    )
    self.assertGreater(acc, 0.0)

Contributor:

Okay, so here's the drill: normally we won't want to test the tuners for each model unless they behave differently for each model.
We need to add the unit tests in their respective tuner modules,
i.e. you add a test for each tuner in its respective directory.
Even if they do behave differently for each model, you add a unit test that loops through all available models and tunes them.

Also, a reasonable way to check whether a tuner is performing well would be to see if it is actually tuning. E.g. for parameter grid you can have two or three sets of values: have one of them be optimal and the other two be absurd values that would throw predictions off, then place a check for the optimized parameters.
It might still end up as self.assertGreater(acc, 0.0) for random search and similar tuners, but you get the idea; let's see if you can derive anything else. A sketch of such a check follows.
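A sketch of that optimal-vs-absurd check (a hypothetical test; best_config appears only as internal tuner state in this diff, so surfacing it as an attribute is an assumption):

async def test_actually_tunes(self):
    # Only 0.05 is a sensible learning rate; the other two should hurt accuracy.
    self.tuner.config.parameters = {
        "learning_rate": [0.05, 500.0, -3.0],
    }
    acc = await tune(
        self.model,
        self.tuner,
        self.scorer,
        Feature("classification", int, 1),
        [self.train_source],
        [self.test_source],
    )
    self.assertGreater(acc, 0.0)
    # Hypothetical attribute surfacing the tuner's chosen combination
    self.assertEqual(self.tuner.best_config["learning_rate"], 0.05)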

Let's discuss these further in a meeting with @programmer290399 and figure out the best way to set these up.

packages = find:
install_requires =
    dffml>=0.4.0
    bayesian-optimization>=1.2.0
Contributor:

Is there an implementation in a library we already have in requirements, say sklearn? We always want to have minimal dependencies.

Author (seraphimstreets) commented on Jul 12, 2022:

There is BayesSearchCV in the scikit-optimize library (which is also a separate dependency), but it requires that the model implement the sklearn estimator API.

else:
    predict_feature = [model.config.predict.name]

def records_to_dict_check(ds):


Let's pull this out into the global scope or dffml/util/internal.py


nest_asyncio.apply()

def check_parameters(pars):


Let's pull this out into a method in the same class.

f"Optimizing model with Bayesian optimization with gaussian processes: {self.parent.config.parameters}"
)

def func(**vals):


Let's pull this out into a method as well

return acc

optimizer = BayesianOptimization(
    f=func,


Suggested change
f=func,
f=functools.partial(model, func),

It becomes:

        def func(self, model, **vals):
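For context, functools.partial pre-binds leading positional arguments. A minimal standalone illustration (generic Python, not DFFML code):

import functools

def func(model, **vals):
    # model is pre-bound by partial; the optimizer supplies **vals each step
    return sum(vals.values())  # stand-in for "train, then score"

bound = functools.partial(func, "my-model")
print(bound(learning_rate=0.1, max_depth=3.0))  # func("my-model", ...) -> 3.1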

The highest score value
"""

nest_asyncio.apply()


Suggested change: remove the nest_asyncio.apply() call here.


This is probably something for a different PR, but it could potentially go well at the start of the noasync functions.

Author:

Since the train/score calls inside the asynchronous tune function need to run synchronously, nest_asyncio is necessary. Shall I move it to the noasync file, then?
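If it does move, the noasync wrapper might look roughly like this (a sketch, assuming DFFML's usual pattern of wrapping the async API; the import path is an assumption):

import asyncio
import nest_asyncio

from dffml.high_level.ml import tune as async_tune  # assumed module path

def tune(*args, **kwargs):
    # Patch asyncio once so the tuners' nested event-loop usage works
    # when tune() is driven from synchronous code.
    nest_asyncio.apply()
    return asyncio.run(async_tune(*args, **kwargs))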

First, install the xgboost plugin for the DFFML library, which can be done via pip:

.. code-block:: console
:test:


Suggested change
:test:
    :test:

Comment on lines +95 to +96



Suggested change
.. code-block:: console
    :test:

    $ python -u bayes_opt_gp_xgboost.py


In the same folder, we run the CLI tune command.

.. code-block:: console


Suggested change
.. code-block:: console
.. code-block:: console
    :test:

Comment on lines 146 to 162
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
-model-predict classification \
-model-location tempDir \
-tuner bayes_opt_gp \
-tuner-parameters @parameters.json \
-tuner-objective max \
-scorer clf \
-sources train=csv test=csv \
-source-train-filename iris_training.csv \
-source-test-filename iris_test.csv \
-source-train-tag train \
-source-test-tag test \
-features classification:int:1


Suggested change (these all need two more spaces in front of them):

  -model xgbclassifier \
  -model-features \
  SepalLength:float:1 \
  SepalWidth:float:1 \
  PetalLength:float:1 \
  -model-predict classification \
  -model-location tempDir \
  -tuner bayes_opt_gp \
  -tuner-parameters @parameters.json \
  -tuner-objective max \
  -scorer clf \
  -sources train=csv test=csv \
  -source-train-filename iris_training.csv \
  -source-test-filename iris_test.csv \
  -source-train-tag train \
  -source-test-tag test \
  -features classification:int:1

@@ -0,0 +1,162 @@
Tuning a DFFML model with ParameterGrid


Let's add both of these tutorials next to this line in the CI YAML file:

- docs/tutorials/sources/file.rst

johnandersen777 added the "awaiting maintainer" label (the PR is waiting for a maintainer to review it) on Feb 9, 2023