RL Model Training Setup on Google Cloud

We will use Cloud Shell in the Google Cloud Platform Console for all steps below.

0. Pre-requisites

  1. Create a Google Cloud (GCP) Account
  2. GPU quota enabled in GCP
  3. Open Cloud Shell from GCP Console

1. Setup GCE VM

  1. This section is adapted from fastai documentation. Please see acknowledgements and references below.
  2. We want to use Docker, CUDA 10.1 and the conda package manager, leveraging existing Google Cloud Deep Learning VM images
  3. Please ensure GPU quota is enabled; otherwise, refer to Step 3 of the fastai GCP setup guide linked above
export IMAGE_FAMILY="pytorch-latest-gpu"
export ZONE="us-west1-b"
export INSTANCE_NAME="drml"
export INSTANCE_TYPE="n1-highmem-4"

gcloud compute instances create $INSTANCE_NAME \
        --zone=$ZONE \
        --image-family=$IMAGE_FAMILY \
        --image-project=deeplearning-platform-release \
        --maintenance-policy=TERMINATE \
        --accelerator="type=nvidia-tesla-t4,count=1" \
        --machine-type=$INSTANCE_TYPE \
        --boot-disk-size=200GB \
        --metadata="install-nvidia-driver=True"
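
Before creating the VM, it can help to confirm that the region actually has T4 quota. The following is a minimal sketch, not part of the original setup: the helper name `check_t4_quota` is hypothetical, and the quota metric name `NVIDIA_T4_GPUS` and the YAML field order are assumptions that should be checked against your project's output.

```shell
# Zone used in the instructions above.
ZONE="us-west1-b"

# Strip the trailing zone letter to get the region: us-west1-b -> us-west1
REGION="${ZONE%-*}"

# Hypothetical helper: print the quota entry for T4 GPUs in the region.
# Assumes the metric is named NVIDIA_T4_GPUS; adjust the pattern if your
# project reports it differently.
check_t4_quota() {
    gcloud compute regions describe "$REGION" --format="yaml(quotas)" \
        | grep -B1 -A1 "NVIDIA_T4_GPUS"
}

echo "Region for quota check: $REGION"
```

Calling `check_t4_quota` in Cloud Shell should show the limit and current usage; a limit of 0 means the `gcloud compute instances create` command above will fail until quota is granted.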

2. Setup required packages, settings and code

  1. Ensure your project has been set in Cloud Shell; if not, execute: gcloud config set project <project_id>
  2. Log in to the VM from Cloud Shell: gcloud compute ssh --zone=us-west1-b jupyter@drml
  3. Create a new tmux session so that you can leave training running after closing Cloud Shell: tmux new-session -A -s airsimenv
  4. Get project code: git clone https://github.com/raymondng76/IRS-Practice-Module-Dev.git
  5. Create conda environment: sudo /opt/conda/bin/conda create -n airsim python=3.6.7
  6. Activate conda environment: conda activate airsim
  7. Install packages: pip install -r IRS-Practice-Module-Dev/requirements.txt
  8. Get AirSim: git clone https://github.com/microsoft/AirSim.git
  9. Update settings file
    • rm AirSim/docker/settings.json
    • cp IRS-Practice-Module-Dev/airsim\ settings/settings.json.nodisplay AirSim/docker/
    • mv AirSim/docker/settings.json.nodisplay AirSim/docker/settings.json
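
The three settings-file commands above can be collapsed into a single copy that writes straight to the final file name. This is a sketch only: the function name `install_headless_settings` is hypothetical, and keeping a `.bak` copy of the stock file is an extra step not in the original instructions (which delete it).

```shell
# Hypothetical helper: install the no-display settings file into AirSim/docker.
# Usage: install_headless_settings <project_repo_dir> <airsim_repo_dir>
install_headless_settings() {
    repo_dir=$1
    airsim_dir=$2
    src="$repo_dir/airsim settings/settings.json.nodisplay"
    dst="$airsim_dir/docker/settings.json"

    # Fail early if the source file is not where the README expects it.
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        return 1
    fi

    # Assumption: keep the stock settings as a backup instead of deleting them.
    if [ -f "$dst" ]; then
        cp "$dst" "$dst.bak"
    fi

    # Copy and rename in one step.
    cp "$src" "$dst"
}

# Example, run from the VM home folder:
#   install_headless_settings IRS-Practice-Module-Dev AirSim
```

Quoting the source path avoids the backslash-escaped space in `airsim\ settings`.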

3. Build and run AirSim docker

  1. Create (or re-attach to) the session named airsimenv: tmux new-session -A -s airsimenv
  2. cd AirSim/docker
  3. Execute Build Script, targeting Ubuntu18.04 and CUDA 10.1:
python build_airsim_image.py \
   --base_image=nvidia/cudagl:10.1-devel-ubuntu18.04 \
   --target_image=airsim_binary:10.1-devel-ubuntu18.04
  4. Verify that the docker image was built: docker images | grep airsim
  5. To use the default Blocks environment, run: ./download_blocks_env_binary.sh
  6. To use a packaged AirSim Unreal Environment, for example Neighborhood: wget https://github.com/microsoft/AirSim/releases/download/v1.2.0Linux/Neighborhood.zip
  7. Unzip to the AirSim docker directory: unzip Neighborhood.zip -d .
  8. Run the environment in headless mode: ./run_airsim_image_binary.sh airsim_binary:10.1-devel-ubuntu18.04 Neighborhood/AirSimNH.sh -windowed -ResX=1080 -ResY=720 -- headless. Replace the environment bash file as required.
  9. Note: in the settings.json file, no-display mode has also been set up to conserve resources.
  10. Detach tmux session: ctrl-b ctrl-b d
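
The build and run commands in this section repeat the same image tag. A small sketch of deriving both image names from one variable, so that a different CUDA or Ubuntu tag only needs changing in one place. The variable names are illustrative assumptions; the block only prints the commands and does not run them.

```shell
# Single source of truth for the CUDA/Ubuntu tag used in this section.
CUDA_TAG="10.1-devel-ubuntu18.04"
BASE_IMAGE="nvidia/cudagl:${CUDA_TAG}"
TARGET_IMAGE="airsim_binary:${CUDA_TAG}"

# Environment launcher inside the unzipped package; replace as required.
ENV_SCRIPT="Neighborhood/AirSimNH.sh"

# Print the two commands for review before pasting them into the tmux session.
cat <<EOF
python build_airsim_image.py --base_image=${BASE_IMAGE} --target_image=${TARGET_IMAGE}
./run_airsim_image_binary.sh ${TARGET_IMAGE} ${ENV_SCRIPT} -windowed -ResX=1080 -ResY=720 -- headless
EOF
```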

4. Run model training file

  1. Create new session named code: tmux new-session -A -s code
  2. Activate conda environment: conda activate airsim
  3. Ensure python dependencies have been installed. Then execute the below commands
    • Execute gdown 'https://drive.google.com/uc?id=1ciGqwUpfNPQu_Ua7cowU8mDIXOG_9kkf'
    • Unzip the weights: unzip Final_Weights_Models.zip
  4. Copy YOLOv3 model weights to IRS-Practice-Module-Dev main directory
    • cp -r Final_Weights_Models/Yolov3_drone_weights/ IRS-Practice-Module-Dev/weights
  5. (Optional) If you are continuing training, copy the existing RL model/iteration weights to IRS-Practice-Module-Dev/code
    • cd ..
    • e.g. cp -r Final_Weights_Models/RDQN_Single_Model/3rd_Iteration/* IRS-Practice-Module-Dev/code
  6. Execute the required model training file in the IRS-Practice-Module-Dev/code folder. Note that this is for the initial run; for resuming or continuing a run, see section 6.
    • cd code
    • python <model>.py --verbose
  7. Detach tmux session: ctrl-b ctrl-b d
  8. You can close Cloud Shell and let training continue
  9. OPTIONAL: Setting up Stackdriver monitoring on CPU utilization is recommended so that any stop in training can be detected. A threshold of < 40% for 5 min is recommended. More information can be found in codelabs or the documentation
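
The alert rule in the last item (CPU below 40% for 5 minutes) amounts to checking that every sample in a window is under a threshold. The sketch below illustrates that logic locally; it is not Stackdriver configuration, and the function name `stalled` and the use of whole-number CPU percentages are assumptions made for illustration.

```shell
# Hypothetical helper: succeed (exit 0) when every CPU sample in the window
# is below the threshold, i.e. training has probably stopped.
# Usage: stalled THRESHOLD SAMPLE [SAMPLE...]   (whole-number percentages)
stalled() {
    threshold=$1
    shift

    # No samples means no evidence of a stall.
    [ "$#" -gt 0 ] || return 1

    for sample in "$@"; do
        # A single sample at or above the threshold clears the alert.
        [ "$sample" -lt "$threshold" ] || return 1
    done
    return 0
}

# Five one-minute samples, all under 40%: treated as a stall.
if stalled 40 12 8 30 5 9; then
    echo "alert: CPU below 40% for the whole window"
fi
```

Requiring the whole window to be under the threshold avoids false alarms from short dips, for example between training episodes.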

5. Login to view progress

  1. Log in to the VM as user jupyter from Cloud Shell: gcloud compute ssh --zone=us-west1-b jupyter@drml
  2. View code progress or AirSim output by typing tmux attach -t <SESSION>, where SESSION can be airsimenv or code
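
The two steps above can also be combined into one command from Cloud Shell. The sketch below only prints that command for a given session name; the helper `attach_cmd` is hypothetical, and the use of `--command` together with a trailing `-- -t` (to request a terminal for tmux) is an assumption about how `gcloud compute ssh` forwards arguments to ssh that should be verified in your environment.

```shell
# Hypothetical helper: print a one-line command that logs in and attaches
# to one of the two tmux sessions used in this guide.
attach_cmd() {
    session=$1

    # Only the session names created in sections 2 to 4 are accepted.
    case "$session" in
        airsimenv|code) ;;
        *)
            echo "unknown session: $session" >&2
            return 1
            ;;
    esac

    printf 'gcloud compute ssh --zone=us-west1-b jupyter@drml --command="tmux attach -t %s" -- -t\n' "$session"
}

attach_cmd code
```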

6. Resuming model training

  1. If for some reason (e.g. unexpected errors or a VM restart) you need to resume training, use the following command: python <model>.py --verbose --load_model
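
Resuming can be automated with a small wrapper that starts a fresh run and, if it exits with an error, retries with `--load_model` appended. This is a sketch under assumptions: the wrapper name `run_with_resume` and the `RESUME_DELAY` variable are hypothetical, and it presumes the training script has already saved a model by the time it fails.

```shell
# Hypothetical wrapper: run a command once, then retry with --load_model
# appended if it fails.
# Usage: run_with_resume MAX_RETRIES COMMAND [ARGS...]
run_with_resume() {
    max_retries=$1
    shift
    attempt=0

    # First attempt: initial run without --load_model.
    if "$@"; then
        return 0
    fi

    while [ "$attempt" -lt "$max_retries" ]; do
        attempt=$((attempt + 1))
        echo "training exited; resume attempt $attempt of $max_retries" >&2

        # Short pause before retrying (default 10 seconds).
        sleep "${RESUME_DELAY:-10}"

        # Resume attempt: same command with --load_model appended.
        if "$@" --load_model; then
            return 0
        fi
    done

    return 1
}

# Example, inside the "code" tmux session (replace the script name):
#   run_with_resume 3 python "<model>.py" --verbose
```

The wrapper does not survive a VM restart; in that case training still has to be restarted manually as described above.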

7. Download RL model weights to run simulation locally

  1. Install croc: curl https://getcroc.schollz.com | bash
  2. From the VM home folder, execute croc send IRS-Practice-Module-Dev/code/save_model/<model>.h5 for the required model
  3. Note the passcode output by the previous command and execute on your local machine: croc -yes <passcode>
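
To confirm that a weights file arrived intact, a checksum can be written on the VM and verified locally. This is an optional sketch, not part of the original steps: the function names are hypothetical, and it assumes `sha256sum` is available on both machines (on macOS a different tool may be needed).

```shell
# On the VM: write <file>.sha256 next to the file.
# The checksum is recorded against the bare file name so it can be
# verified from any directory on the receiving side.
write_checksum() {
    ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

# On the local machine: verify the received file against its .sha256 file.
# Exits non-zero if the file content does not match.
verify_checksum() {
    ( cd "$(dirname "$1")" && sha256sum -c "$(basename "$1").sha256" )
}

# Example (replace the file name):
#   VM:    write_checksum "IRS-Practice-Module-Dev/code/save_model/<model>.h5"
#   local: verify_checksum "<model>.h5"
```

The `.sha256` file needs to be transferred alongside the model file, for example by passing both paths to `croc send`.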

Acknowledgements and References