Kubernetes for hyperparameter search experiments

This repository contains code and config files that accompany the following blog post: Kubernetes for AI Hyperparameter Search Experiments

Tested on Kubernetes version 1.10.11

Install guide: https://docs.nvidia.com/datacenter/kubernetes/kubernetes-install-guide/index.html

Hyperparameters for a machine learning model are options that are not optimized or learned during the training phase. They typically include the learning rate schedule, batch size, and data augmentation options, among others. Each option can greatly affect model accuracy on the same dataset. Two of the most common strategies for selecting the best hyperparameters for a model are grid search and random search. In the grid search method (also known as the parameter sweep method) you define the search space by enumerating all possible hyperparameter values and train a model on each set of values. Random search trains only on sets of values sampled at random from that exhaustive set. In both cases, the result of each training run is then evaluated on a separate validation set.
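The difference between the two strategies can be made concrete with a minimal sketch. The search space below (its names and values) is a hypothetical example and is not taken from this repository's hyperparams.yml; the sketch only illustrates enumerating the full grid versus sampling from it.

```python
import itertools
import random

# Hypothetical search space; names and values are illustrative only.
search_space = {
    "learning_rate": [0.1, 0.2, 0.4],
    "batch_size": [128, 256, 512],
    "weight_decay": [1e-4, 5e-4],
}


def grid_search(space):
    """Return every combination of values (the full parameter sweep)."""
    names = list(space)
    return [
        dict(zip(names, values))
        for values in itertools.product(*(space[n] for n in names))
    ]


def random_search(space, num_samples, seed=0):
    """Return num_samples distinct combinations drawn from the full grid."""
    rng = random.Random(seed)
    return rng.sample(grid_search(space), num_samples)


grid = grid_search(search_space)
sampled = random_search(search_space, num_samples=5)
print(len(grid), "grid combinations;", len(sampled), "randomly sampled")
```

Sampling from the enumerated grid keeps the sketch short; for very large search spaces it would likely be preferable to draw each value independently rather than materialize every combination first.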

This repository includes Kubernetes specification files and training scripts for running large-scale hyperparameter search experiments using Kubernetes on a GPU cluster, as shown in the reference architecture figure (k8s_hyperparam_ref_arch.PNG). The framework is flexible: it lets you run either grid search or random search, and it implements “version everything” so you can trace back all previously run experiments.

The training script is a modified version of the submission by David Page on Stanford’s DAWNBench webpage. The key modification is that the training script now accepts hyperparameters by reading them from a YAML specification file.
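As a rough illustration of how a training script might accept hyperparameters from a YAML file, here is a minimal sketch. It is not the code in cifar10_train.py: the `TrainConfig` class, its field names and defaults, and the helper functions are all hypothetical, and the YAML loading assumes the PyYAML package is installed.

```python
from dataclasses import dataclass, fields


@dataclass
class TrainConfig:
    # Hypothetical hyperparameters with illustrative defaults.
    learning_rate: float = 0.4
    batch_size: int = 512
    use_augmentation: bool = True


def config_from_dict(values):
    """Build a TrainConfig, rejecting keys the training routine does not know."""
    known = {f.name for f in fields(TrainConfig)}
    unknown = set(values) - known
    if unknown:
        raise ValueError(f"unknown hyperparameters: {sorted(unknown)}")
    return TrainConfig(**values)


def load_config(path):
    """Read one hyperparameter set from a YAML file (requires PyYAML)."""
    import yaml  # imported here so the rest of the sketch runs without PyYAML

    with open(path) as f:
        return config_from_dict(yaml.safe_load(f) or {})


# Values missing from the input fall back to the defaults above.
config = config_from_dict({"learning_rate": 0.1, "batch_size": 128})
print(config)
```

Rejecting unknown keys is a design choice in this sketch: a misspelled hyperparameter name fails loudly instead of silently training with the default value.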

Assuming you’ve already started by setting up a Kubernetes cluster, our solution for running hyperparameter search experiments consists of the following 7 steps:

1. Specify the hyperparameter search space
2. Develop a training script that can accept hyperparameters and apply them to the training routine
3. Push the training scripts and hyperparameters to a Git repository for tracking
4. Upload the training and test datasets to network storage such as an NFS server
5. Write Kubernetes Job specification files in YAML
6. Submit multiple Kubernetes Job requests using the above specification template
7. Analyze the results and pick the best hyperparameter set
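The job-specification and submission steps above can be sketched as follows: one Job spec is rendered per hyperparameter set from a shared template. This is a hedged sketch, not the contents of cifar10-job-template.yml or create_jobs.sh; the container image, command, job names, and file names are placeholders. The `nvidia.com/gpu` resource limit assumes the NVIDIA device plugin is available on the cluster.

```python
from string import Template

# Hypothetical Job template; the image, command, and names are placeholders.
JOB_TEMPLATE = Template("""\
apiVersion: batch/v1
kind: Job
metadata:
  name: hyperparam-job-$job_id
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer
        image: example.com/trainer:latest
        command: ["python", "train.py", "--job-id", "$job_id"]
        resources:
          limits:
            nvidia.com/gpu: 1
""")


def render_job_specs(num_jobs):
    """Return {filename: YAML text}, one Job spec per hyperparameter set."""
    return {
        f"job-{job_id}.yml": JOB_TEMPLATE.substitute(job_id=job_id)
        for job_id in range(num_jobs)
    }


specs = render_job_specs(3)
for filename in specs:
    print(filename)
```

After writing the rendered specs to a directory, they could be submitted together with `kubectl create -f <directory>`.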

