RL Model Training Setup on Google Cloud

We will use Cloud Shell in the Google Cloud Platform Console for all steps below.

0. Pre-requisites

Create a Google Cloud (GCP) account
Ensure GPU quota is enabled in GCP
Open Cloud Shell from the GCP Console

1. Setup GCE VM

This section is adapted from the fastai documentation; see Acknowledgements and References below.
We use Docker, CUDA 10.1 and the conda package manager, leveraging the existing Google Cloud Deep Learning VM images.
Ensure that GPU quota is enabled; if it is not, refer to Step 3 of the fastai GCP setup guide listed in the references.

export IMAGE_FAMILY="pytorch-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="drml"
export INSTANCE_TYPE="n1-highmem-4"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-t4,count=1" \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=200GB \
        --metadata="install-nvidia-driver=True"
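
Once the instance exists, it can help to check its state and to stop it when no training is running, since a stopped VM is not billed for its vCPUs and GPU (disk storage is still charged). The helpers below are a minimal sketch and not part of the original setup: the function names vm_status and vm_stop_if_running are hypothetical, and the defaults simply mirror the export lines above.

```shell
# Same values as the export lines above; override them via the environment.
ZONE="${ZONE:-us-west1-b}"
INSTANCE_NAME="${INSTANCE_NAME:-drml}"

# Print the instance status, for example RUNNING or TERMINATED.
vm_status() {
  gcloud compute instances describe "$INSTANCE_NAME" \
    --zone="$ZONE" --format="value(status)"
}

# Stop the instance only if it is currently running.
vm_stop_if_running() {
  if [ "$(vm_status)" = "RUNNING" ]; then
    gcloud compute instances stop "$INSTANCE_NAME" --zone="$ZONE"
  else
    echo "$INSTANCE_NAME is not running"
  fi
}
```

Run vm_status from Cloud Shell after the create command returns, and vm_stop_if_running once training has finished and the weights have been downloaded.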

2. Setup required packages, settings and code

Ensure your project has been set in Cloud Shell. If not, execute: gcloud config set project <project_id>
Log in to the VM from Cloud Shell: gcloud compute ssh --zone=us-west1-b jupyter@drml
Create a new tmux session so that training keeps running after you close Cloud Shell: tmux new-session -A -s airsimenv
Get the project code: git clone https://github.com/raymondng76/IRS-Practice-Module-Dev.git
Create the conda environment: sudo /opt/conda/bin/conda create -n airsim python=3.6.7
Activate the conda environment: conda activate airsim
Install packages: pip install -r IRS-Practice-Module-Dev/requirements.txt
Get AirSim: git clone https://github.com/microsoft/AirSim.git
Update settings file
- rm AirSim/docker/settings.json
- cp IRS-Practice-Module-Dev/airsim\ settings/settings.json.nodisplay AirSim/docker/
- mv AirSim/docker/settings.json.nodisplay AirSim/docker/settings.json
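
The three commands above amount to overwriting the stock settings.json with the no-display variant. As a hedged alternative, the sketch below does the same in one step while keeping a backup of the stock file; install_settings is a hypothetical helper name, not something provided by AirSim or the project repository.

```shell
# install_settings SRC DST: copy SRC over DST, keeping the previous DST as DST.bak.
install_settings() {
  local src="$1" dst="$2"
  if [ ! -f "$src" ]; then
    echo "settings source not found: $src" >&2
    return 1
  fi
  # Keep a backup of the stock file so it can be restored later.
  if [ -f "$dst" ]; then
    cp -- "$dst" "$dst.bak"
  fi
  cp -- "$src" "$dst"
}

# Usage, run from the VM home folder:
# install_settings "IRS-Practice-Module-Dev/airsim settings/settings.json.nodisplay" \
#   AirSim/docker/settings.json
```

Quoting the source path avoids the backslash escape needed for the space in the "airsim settings" folder name.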

3. Build and run AirSim docker

Attach to the airsimenv session created in step 2 (or create it if it does not exist): tmux new-session -A -s airsimenv
cd AirSim/docker
Execute the build script, targeting Ubuntu 18.04 and CUDA 10.1:

python build_airsim_image.py \
   --base_image=nvidia/cudagl:10.1-devel-ubuntu18.04 \
   --target_image=airsim_binary:10.1-devel-ubuntu18.04

Verify that the Docker image was built: docker images | grep airsim
To use the default Blocks environment run: ./download_blocks_env_binary.sh
To use a packaged AirSim Unreal Environment, for example Neighborhood: wget https://github.com/microsoft/AirSim/releases/download/v1.2.0Linux/Neighborhood.zip

For additional environments that can run on Linux, see the Microsoft AirSim Linux Release 1.2.0 page.

Unzip to the AirSim docker directory: unzip Neighborhood.zip -d .
Run the environment in headless mode, replacing the environment bash file as required: ./run_airsim_image_binary.sh airsim_binary:10.1-devel-ubuntu18.04 Neighborhood/AirSimNH.sh -windowed -ResX=1080 -ResY=720 -- headless
Note: no-display mode has also been set up in the settings.json file to conserve resources.
Detach tmux session: ctrl-b ctrl-b d
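
Before starting training it may be useful to confirm that the simulator is actually accepting API connections. The sketch below polls a TCP port using bash's /dev/tcp redirection. It assumes AirSim's default API port 41451 is reachable on localhost from the VM (this depends on how the run script configures Docker networking), and wait_for_port is a hypothetical helper name.

```shell
# wait_for_port HOST PORT [ATTEMPTS]: poll until a TCP connection succeeds.
wait_for_port() {
  local host="$1" port="$2" attempts="${3:-30}"
  local i
  for ((i = 1; i <= attempts; i++)); do
    # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT
    if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
      echo "port $port on $host is open"
      return 0
    fi
    sleep 1
  done
  echo "port $port on $host did not open after $attempts attempts" >&2
  return 1
}

# Usage (assumed default AirSim API port):
# wait_for_port 127.0.0.1 41451 60
```

A non-zero return suggests the environment has not finished loading or has exited; attaching to the airsimenv session shows the simulator output.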

4. Run model training file

Create new session named code: tmux new-session -A -s code
Activate the conda environment: conda activate airsim
Ensure the Python dependencies have been installed, then download and unpack the model weights:
- Download the archive: gdown 'https://drive.google.com/uc?id=1ciGqwUpfNPQu_Ua7cowU8mDIXOG_9kkf'
- Unzip the weights: unzip Final_Weights_Models.zip
Copy the YOLOv3 model weights to the IRS-Practice-Module-Dev main directory:
- cp -r Final_Weights_Models/Yolov3_drone_weights/ IRS-Practice-Module-Dev/weights
(Optional) If you are continuing training, copy the existing RL model/iteration weights to IRS-Practice-Module-Dev/code:
- cd ..
- e.g. cp -r Final_Weights_Models/RDQN_Single_Model/3rd_Iteration/* IRS-Practice-Module-Dev/code
Execute the required model training file in the IRS-Practice-Module-Dev/code folder. Note that this is for the initial run; to resume or continue a run, see item 6.
- cd IRS-Practice-Module-Dev/code
- python <model>.py --verbose
Detach tmux session: ctrl-b ctrl-b d
You can now close Cloud Shell and let training continue.
OPTIONAL: Setting up Stackdriver monitoring on CPU utilization is recommended so that any stop in training can be detected. An alert that fires when utilization stays below 40% for 5 minutes is recommended. More information can be found in the codelabs or the documentation.

5. Login to view progress

Log in to the VM as user jupyter from Cloud Shell: gcloud compute ssh --zone=us-west1-b jupyter@drml
View the training progress or the AirSim output by typing tmux attach -t <SESSION>, where <SESSION> is either airsimenv or code.
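
For a quick look without attaching (and without risking an accidental keystroke in the training session), tmux can print a pane's contents to the terminal. The sketch below wraps tmux has-session and tmux capture-pane; session_tail is a hypothetical helper name.

```shell
# session_tail SESSION [LINES]: print the last LINES non-blank lines
# currently visible in the active pane of a tmux session.
session_tail() {
  local session="$1" lines="${2:-20}"
  if ! tmux has-session -t "$session" 2>/dev/null; then
    echo "no tmux session named $session" >&2
    return 1
  fi
  # -p prints the captured pane to stdout; blank rows of a partly
  # filled pane are dropped before taking the tail.
  tmux capture-pane -p -t "$session" | grep -v '^[[:space:]]*$' | tail -n "$lines"
}

# Usage:
# session_tail code 30
# session_tail airsimenv
```

Running tmux ls lists the sessions that currently exist on the VM.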

6. Resuming model training

If for some reason (e.g. unexpected errors or a VM restart) you need to resume training, use the following command: python <model>.py --verbose --load_model
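
The manual resume step can be wrapped in a small retry loop. The following is a sketch only: it assumes the training script exits with a non-zero status when it fails and that saved weights exist by the time a restart happens. train_with_resume, PYTHON and RESTART_DELAY are hypothetical names introduced for illustration.

```shell
# train_with_resume MODEL_SCRIPT [MAX_RESTARTS]
# Runs the initial command once; after any non-zero exit it restarts with
# --load_model, up to MAX_RESTARTS times.
train_with_resume() {
  local script="$1" max_restarts="${2:-3}"
  local restarts=0
  local python="${PYTHON:-python}"   # interpreter, overridable for testing

  # Initial run, as in step 4.
  "$python" "$script" --verbose && return 0

  while [ "$restarts" -lt "$max_restarts" ]; do
    restarts=$((restarts + 1))
    echo "training exited with an error; restart $restarts of $max_restarts" >&2
    sleep "${RESTART_DELAY:-10}"
    # Resume run, as described above.
    "$python" "$script" --verbose --load_model && return 0
  done
  echo "giving up after $max_restarts restarts" >&2
  return 1
}

# Usage, inside the code tmux session and the code folder:
# train_with_resume <model>.py 5
```

This does not cover a VM restart, since the tmux session itself is lost in that case; the manual command is still needed there.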

7. Download RL model weights to run simulation locally

Install croc on both the VM and your local machine: curl https://getcroc.schollz.com | bash
From the VM home folder, send the required model: croc send IRS-Practice-Module-Dev/code/save_model/<model>.h5
Note the passcode printed by the previous command and execute the following on your local machine: croc -yes <passcode>
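
After the transfer it can be worth confirming that the weights file arrived intact. The sketch below uses sha256sum from GNU coreutils (on macOS, shasum -a 256 is the usual equivalent); make_checksum and verify_checksum are hypothetical helper names, and the .sha256 file would need to be sent alongside the model.

```shell
# On the VM: write <file>.sha256 next to the model file.
make_checksum() {
  local file="$1"
  # cd first so the checksum file records a bare file name,
  # not a path that only exists on the VM.
  (cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256")
}

# On the local machine: check the received file against <file>.sha256.
verify_checksum() {
  local file="$1"
  (cd "$(dirname "$file")" &&
    sha256sum -c "$(basename "$file").sha256")
}

# Usage:
# make_checksum IRS-Practice-Module-Dev/code/save_model/<model>.h5   # on the VM
# verify_checksum ./<model>.h5                                        # locally
```

verify_checksum returns a non-zero status if the file contents differ from what was hashed on the VM.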

Acknowledgements and References

fastai GCP setup
Croc Github
