This repository has been archived by the owner on Aug 25, 2024. It is now read-only.

Tune function and CLI command #1397

Open
wants to merge 9 commits into main

Conversation

seraphimstreets commented:

Created a tune function in high_level.ml and allowed usage via the CLI. First step as part of the AutoML GSoC project: #968.

Testing

Tested the tune command with the ParameterGrid tuner on an XGBClassifier model (Iris dataset) and an XGBRegressor model (small housing dataset). An example CLI command is as follows:

Download the Iris datasets:

wget http://download.tensorflow.org/data/iris_training.csv 
wget http://download.tensorflow.org/data/iris_test.csv 
sed -i 's/.*setosa,versicolor,virginica/SepalLength,SepalWidth,PetalLength,PetalWidth,classification/g' iris_training.csv iris_test.csv
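(The sed command rewrites the nonstandard header row of the TensorFlow Iris CSVs into named columns, so the features SepalLength, SepalWidth, PetalLength, PetalWidth and the classification label can be referenced by name below.)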

xgbtest.json

{
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [20, 100, 200],
    "max_depth": [3,5,8]
}
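With three candidate values for each of the three hyperparameters, ParameterGrid will exhaustively evaluate 3 × 3 × 3 = 27 configurations.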

CLI command:

dffml tune \
-model xgbclassifier \
-model-features \
 SepalLength:float:1 \
 SepalWidth:float:1 \
 PetalLength:float:1 \
-model-predict classification \
-model-location tempDir \
-tuner parameter_grid \
-tuner-parameters @xgbtest.json \
-tuner-objective max \
-scorer clf \
-sources train=csv test=csv \
-source-train-filename iris_training.csv \
-source-test-filename iris_test.csv
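For reference, a rough Python-API equivalent of this CLI invocation, assuming the tune() signature exercised in the PR's tests below; the import paths for tune, the tuner, the scorer, and the xgboost plugin are assumptions:

import asyncio

from dffml import CSVSource, Feature, Features
from dffml.high_level.ml import tune  # assumed location of the new function
from dffml.tuner.parameter_grid import ParameterGrid  # assumed module path
from dffml.accuracy import ClassificationAccuracy  # assumed path for the "clf" scorer
from dffml_model_xgboost.xgbclassifier import (  # assumed plugin import path
    XGBClassifierModel,
    XGBClassifierModelConfig,
)

model = XGBClassifierModel(
    XGBClassifierModelConfig(
        features=Features(
            Feature("SepalLength", float, 1),
            Feature("SepalWidth", float, 1),
            Feature("PetalLength", float, 1),
        ),
        predict=Feature("classification", int, 1),
        location="tempDir",
    )
)

async def main():
    # Mirrors the -tuner-parameters @xgbtest.json and -tuner-objective max flags
    score = await tune(
        model,
        ParameterGrid(
            parameters={
                "learning_rate": [0.01, 0.05, 0.1],
                "n_estimators": [20, 100, 200],
                "max_depth": [3, 5, 8],
            },
            objective="max",
        ),
        ClassificationAccuracy(),
        Features(Feature("classification", int, 1)),
        [CSVSource(filename="iris_training.csv")],
        [CSVSource(filename="iris_test.csv")],
    )
    print(score)

asyncio.run(main())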

seraphimstreets changed the title from "tune function and CLI command" to "Tune function and CLI command" on Jun 19, 2022
mhash1m (Contributor) left a comment:

Great job @seraphimstreets! 👏 Let's implement the changes and discuss what needs to be discussed over the next weekend.
@programmer290399, if you have anything to add, please do. There are some things here I have marked for discussion; let's do that over the weekend.

Comment on lines +359 to +368
... [
... {"Years": 0, "Salary": 10},
... {"Years": 1, "Salary": 20},
... {"Years": 2, "Salary": 30},
... {"Years": 3, "Salary": 40}
... ],
... [
... {"Years": 6, "Salary": 70},
... {"Years": 7, "Salary": 80}
... ]
Contributor:

So we want the train and test sets to be passed in as keyword arguments, like this:

score = await tune(
    model,
    ParameterGrid(objective="min"),
    MeanSquaredErrorAccuracy(),
    Features(
        Feature("Years", float, 1),
    ),
    train=[
        {"Years": 0, "Salary": 10},
        {"Years": 1, "Salary": 20},
        {"Years": 2, "Salary": 30},
        {"Years": 3, "Salary": 40},
    ],
    test=[
        {"Years": 6, "Salary": 70},
        {"Years": 7, "Salary": 80},
    ],
)

Comment on lines 391 to 408
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in train_ds
):
    train_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *train_ds,
        model=model,
    )
if hasattr(model.config, "features") and any(
    isinstance(td, list) for td in valid_ds
):
    valid_ds = list_records_to_dict(
        [feature.name for feature in model.config.features]
        + predict_feature,
        *valid_ds,
        model=model,
    )
Contributor:

Avoid repeating code. I don't think we want another function for this, and the conditions have different loops, so it might be too complex to combine them. If you can figure out a simple way, that is most preferable.
Otherwise, let's just loop over both datasets outside the condition, I suppose; a sketch of that follows.
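A sketch of that single-loop approach (not the merged code, just an illustration using the names from the diff above):

# Normalize both datasets in one pass instead of two near-identical branches.
if hasattr(model.config, "features"):
    feature_names = [feature.name for feature in model.config.features]
    datasets = [train_ds, valid_ds]
    for i, ds in enumerate(datasets):
        if any(isinstance(td, list) for td in ds):
            datasets[i] = list_records_to_dict(
                feature_names + predict_feature, *ds, model=model
            )
    train_ds, valid_ds = datasets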

elif isinstance(model, ModelContext):
    mctx = model

# Allow for keep models open
Contributor:

# Allow scorers to be kept open

@@ -51,6 +51,7 @@ def inpath(binary):
("operations", "nlp"),
("service", "http"),
("source", "mysql"),
("tuner", "bayes_opt_gp"),
Contributor:

Let's have a simpler, more understandable entrypoint name.

if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc

Contributor:

This line spacing doesn't feel right.

Comment on lines +92 to +114
for i in range(len(combination)):
    param = names[i]
    setattr(model.parent.config, names[i], combination[i])
await train(model.parent, *train_data)
acc = await score(
    model.parent, accuracy_scorer, feature, *test_data
)

logging.info(f"Accuracy of the tuned model: {acc}")
if self.parent.config.objective == "min":
    if acc < highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
elif self.parent.config.objective == "max":
    if acc > highest_acc:
        highest_acc = acc
        for param in names:
            best_config[param] = getattr(
                model.parent.config, param
            )
Contributor:

It seems like some code recurs across the different tuners. Let's consider using helper/utility functions; a sketch follows.
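For instance, the min/max bookkeeping above could move into one shared helper that every tuner calls (a sketch; the helper name and placement are illustrative):

def update_best(objective, acc, best_acc, config, names, best_config):
    """Record acc and the current parameter values if they beat best_acc."""
    improved = acc < best_acc if objective == "min" else acc > best_acc
    if improved:
        best_acc = acc
        for param in names:
            best_config[param] = getattr(config, param)
    return best_acc

Each tuner would then just call highest_acc = update_best(self.parent.config.objective, acc, highest_acc, model.parent.config, names, best_config) after scoring.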

Comment on lines 28 to 30



Contributor:

Remove the extra blank lines.

with model.parent.config.no_enforce_immutable():
    for _ in range(self.parent.config.trials):
        combination = []
        for pvs in self.parent.config.parameters.values():
Contributor:

What does .values() get here?

Author:

It's the list of candidate values in the parameter's search space, e.g. [1, 10, 100] for n_iterations.
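For example, with the grid from the PR description:

parameters = {
    "learning_rate": [0.01, 0.05, 0.1],
    "n_estimators": [20, 100, 200],
    "max_depth": [3, 5, 8],
}
# parameters.values() yields one candidate list per hyperparameter:
# [[0.01, 0.05, 0.1], [20, 100, 200], [3, 5, 8]]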

Comment on lines +155 to +173
async def test_03_tune(self):
    acc = await tune(
        self.model,
        self.tuner,
        self.scorer,
        Feature("label", str, 1),
        [DirectorySource(
            foldername=str(self.traindir) + "/rps",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
        [DirectorySource(
            foldername=str(self.testdir) + "/rps-test-set",
            feature="image",
            labels=["rock", "paper", "scissors"],
        )],
    )
    self.assertGreater(acc, 0.0)

Contributor:

Okay, so here's the drill: normally we won't want to test the tuners for each model unless they behave differently for each model.
We need to add the unit tests in their respective tuner modules,
i.e. you add a test for each tuner in its respective directory.
Even if they do behave differently for each model, you add a unit test that loops through all available models and tunes them.

Also, a reasonable way to check whether a tuner is performing well would be to see if it is actually tuning. E.g. for parameter grid you can have two or three sets of values: have one of them be optimal and the other two be absurd values that would throw predictions off, then place a check for the optimized parameters.
It might still end up as self.assertGreater(acc, 0.0) for random search and similar tuners, but you get the idea; let's see if you can derive anything else. A sketch of such a check follows.
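A sketch of that optimal-vs-absurd check (a hypothetical test; best_config appears only as internal tuner state in this diff, so surfacing it as an attribute is an assumption):

async def test_actually_tunes(self):
    # Only 0.05 is a sensible learning rate; the other two should hurt accuracy.
    self.tuner.config.parameters = {
        "learning_rate": [0.05, 500.0, -3.0],
    }
    acc = await tune(
        self.model,
        self.tuner,
        self.scorer,
        Feature("classification", int, 1),
        [self.train_source],
        [self.test_source],
    )
    self.assertGreater(acc, 0.0)
    # Hypothetical attribute surfacing the tuner's chosen combination
    self.assertEqual(self.tuner.best_config["learning_rate"], 0.05)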

Let's discuss these further in a meeting with @programmer290399 and figure out the best way to set these up.

packages = find:
install_requires =
    dffml>=0.4.0
    bayesian-optimization>=1.2.0
Contributor:

Is there an implementation in a library we already have in requirements, say sklearn? We always want to have minimal dependencies.

Author (seraphimstreets) commented on Jul 12, 2022:

There is BayesSearchCV in the scikit-optimize library (which is also a separate dependency), but it requires that the model implement the sklearn estimator API.

else:
    predict_feature = [model.config.predict.name]

def records_to_dict_check(ds):


Let's pull this out into the global scope or dffml/util/internal.py


nest_asyncio.apply()

def check_parameters(pars):


Let's pull this out into a method in the same class.

f"Optimizing model with Bayesian optimization with gaussian processes: {self.parent.config.parameters}"
)

def func(**vals):


Let's pull this out into a method as well

return acc

optimizer = BayesianOptimization(
    f=func,


Suggested change
f=func,
f=functools.partial(model, func),

It becomes:

        def func(self, model, **vals):
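For context, functools.partial pre-binds leading positional arguments. A minimal standalone illustration (generic Python, not DFFML code):

import functools

def func(model, **vals):
    # model is pre-bound by partial; the optimizer supplies **vals each step
    return sum(vals.values())  # stand-in for "train, then score"

bound = functools.partial(func, "my-model")
print(bound(learning_rate=0.1, max_depth=3.0))  # func("my-model", ...) -> 3.1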

The highest score value
"""

nest_asyncio.apply()


Suggested change: remove the nest_asyncio.apply() call here.


This is probably something for a different PR, but it could potentially go well at the start of the noasync functions.

Author:

Since the train/score calls inside the asynchronous tune function need to run synchronously, nest_asyncio is necessary. Shall I move it to the noasync file, then?
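If it does move, the noasync wrapper might look roughly like this (a sketch, assuming DFFML's usual pattern of wrapping the async API; the import path is an assumption):

import asyncio
import nest_asyncio

from dffml.high_level.ml import tune as async_tune  # assumed module path

def tune(*args, **kwargs):
    # Patch asyncio once so the tuners' nested event-loop usage works
    # when tune() is driven from synchronous code.
    nest_asyncio.apply()
    return asyncio.run(async_tune(*args, **kwargs))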

First, install the xgboost plugin for the DFFML library, which can be done via pip:

.. code-block:: console
:test:


Suggested change
:test:
    :test:

Comment on lines +95 to +96



Suggested change
.. code-block:: console
    :test:

    $ python -u bayes_opt_gp_xgboost.py


In the same folder, we run the CLI tune command.

.. code-block:: console


Suggested change
.. code-block:: console
.. code-block:: console
    :test:

Comment on lines 146 to 162
-model xgbclassifier \
-model-features \
SepalLength:float:1 \
SepalWidth:float:1 \
PetalLength:float:1 \
-model-predict classification \
-model-location tempDir \
-tuner bayes_opt_gp \
-tuner-parameters @parameters.json \
-tuner-objective max \
-scorer clf \
-sources train=csv test=csv \
-source-train-filename iris_training.csv \
-source-test-filename iris_test.csv \
-source-train-tag train \
-source-test-tag test \
-features classification:int:1


Suggested change (these all need two more spaces in front of them):

  -model xgbclassifier \
  -model-features \
  SepalLength:float:1 \
  SepalWidth:float:1 \
  PetalLength:float:1 \
  -model-predict classification \
  -model-location tempDir \
  -tuner bayes_opt_gp \
  -tuner-parameters @parameters.json \
  -tuner-objective max \
  -scorer clf \
  -sources train=csv test=csv \
  -source-train-filename iris_training.csv \
  -source-test-filename iris_test.csv \
  -source-train-tag train \
  -source-test-tag test \
  -features classification:int:1

@@ -0,0 +1,162 @@
Tuning a DFFML model with ParameterGrid


Let's add both of these tutorials next to this line in the CI YAML file:

- docs/tutorials/sources/file.rst

johnandersen777 added the "awaiting maintainer" label (the PR is waiting for a maintainer to review it) on Feb 9, 2023