Rework MPS limit normalization #11

elezar · 2023-11-03T12:55:43Z

With this change, we always specify limits in terms of UUIDs
when passing these to the MPS control daemon. We also validate
the device indices used in per-device limits.
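The normalization described above can be sketched roughly as follows. This is only an illustration in Python, not the code from this PR: the function name `normalize_limit_keys`, the assumption that the position of a UUID in the allocated-device list is its index, and the acceptance of UUID keys alongside indices are all hypothetical.

```python
def normalize_limit_keys(limits, uuids):
    """Return a copy of `limits` keyed by GPU UUID.

    limits: dict mapping a device index (int or decimal string) or a UUID
            to a limit value. Accepting UUID keys is an assumption.
    uuids:  list of allocated GPU UUIDs; list position is assumed to be
            the device index.
    """
    normalized = {}
    for key, limit in limits.items():
        key = str(key)
        if key.isdecimal():
            # Index key: it must refer to one of the allocated devices.
            index = int(key)
            if index >= len(uuids):
                raise ValueError(
                    f"invalid device index {index}: {len(uuids)} device(s) allocated"
                )
            uuid = uuids[index]
        elif key in uuids:
            # UUID key: already in the form passed to the daemon.
            uuid = key
        else:
            raise ValueError(f"unknown device: {key!r}")
        if uuid in normalized:
            # The same device was named twice, e.g. by index and by UUID.
            raise ValueError(f"duplicate limit for device {uuid}")
        normalized[uuid] = limit
    return normalized
```

For a single allocated GPU, `normalize_limit_keys({0: "5Gi"}, [uuid])` yields `{uuid: "5Gi"}`, while an index of `1` is rejected with a `ValueError`.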

Using this change, we see the following spec for the MPS control daemon:

spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-f22fb098-d1b3-3806-2655-ba25f02229c1 10240M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-f22fb098-d1b3-3806-2655-ba25f02229c1

Assuming the following claim parameters:

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi

and

spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e 5120M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e

when using the following claim parameters, which add a per-device limit for device index 0:

---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi
      defaultPerDevicePinnedMemoryLimit:
        0: 5Gi
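The mapping from the claim parameters to the `set_default_device_pinned_mem_limit` lines in the two examples can be sketched as below. This is a hedged Python illustration rather than the driver's implementation: the helper names `to_mps_megabytes` and `pinned_mem_limit_commands` are hypothetical, and the rule that a per-device limit takes precedence over the default is inferred from the second example, not stated in the PR.

```python
# Binary suffixes as used in the claim parameters (e.g. "10Gi").
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def to_mps_megabytes(quantity):
    """Convert a quantity such as '10Gi' to the '<n>M' form seen in the specs.

    Assumption: values below 1Mi are truncated by the integer division.
    """
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            size_bytes = int(quantity[: -len(suffix)]) * factor
            return f"{size_bytes // 1024**2}M"
    raise ValueError(f"unsupported quantity: {quantity!r}")


def pinned_mem_limit_commands(uuids, default_limit=None, per_device_limits=None):
    """Render one command line per allocated GPU, always keyed by UUID.

    per_device_limits maps a device index to a quantity and is assumed to
    override default_limit for that device.
    """
    per_device_limits = per_device_limits or {}
    commands = []
    for index, uuid in enumerate(uuids):
        limit = per_device_limits.get(index, default_limit)
        if limit is None:
            continue  # no limit configured for this device
        commands.append(
            f"echo set_default_device_pinned_mem_limit {uuid} "
            f"{to_mps_megabytes(limit)} | nvidia-cuda-mps-control"
        )
    return commands
```

Called with a default of `10Gi` and no per-device limits, this reproduces the `10240M` line from the first spec; adding `{0: "5Gi"}` reproduces the `5120M` line from the second.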

elezar requested review from klueska and cdesiniotis November 3, 2023 12:55

elezar marked this pull request as ready for review November 3, 2023 12:56

elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from 8e6acf5 to b739902 Compare November 3, 2023 13:26

elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from b739902 to 0203e0f Compare November 21, 2023 14:22

klueska added enhancement and removed enhancement labels Jan 25, 2024

klueska assigned elezar Jan 26, 2024

Rework MPS limit normalization

8e42696

With this change we always specify limits in terms of UUIDs when passing these to the MPS control daemon. We also check for valid indices.

Signed-off-by: Evan Lezar <[email protected]>

elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from 0203e0f to 8e42696 Compare February 29, 2024 14:05

elezar merged commit d00050e into NVIDIA:main Mar 1, 2024
5 checks passed

elezar deleted the CNT-4683/improve-pinned-memory-limits branch March 1, 2024 16:19
