Rework MPS limit normalization #11

Merged: @elezar merged 1 commit into NVIDIA:main from CNT-4683/improve-pinned-memory-limits on Mar 1, 2024.

@elezar (Member) commented on Nov 3, 2023:

With this change, we always specify limits in terms of GPU UUIDs when passing them to the MPS control daemon (`nvidia-cuda-mps-control`). We also check that device indices used as keys for per-device limits are valid.
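The normalization described above can be sketched as follows. This is a hypothetical Python illustration, not the driver's code: `normalize_limits` and `resolve_device` are invented names, and the sketch assumes that `devices` lists the UUIDs allocated to the claim in index order and that each per-device key is either an index into that list or a UUID. Per-device entries override the default, and an out-of-range index raises an error.

```python
from typing import Dict, List, Optional, Union

# Hypothetical sketch, not the driver's implementation.
# Assumption: `devices` holds the UUIDs allocated to the claim, in index order.
DeviceKey = Union[int, str]


def resolve_device(devices: List[str], key: DeviceKey) -> str:
    """Map a per-device key (index or UUID) to a UUID, validating indices."""
    # Assumption: a key such as 0 may arrive as the string "0" after YAML/JSON decoding.
    if isinstance(key, int) or key.isdigit():
        index = int(key)
        if not 0 <= index < len(devices):
            raise ValueError(
                f"invalid device index {index}: claim has {len(devices)} device(s)"
            )
        return devices[index]
    if key in devices:
        return key
    raise ValueError(f"unknown device {key!r}")


def normalize_limits(
    devices: List[str],
    default_limit: Optional[str],
    per_device: Dict[DeviceKey, str],
) -> Dict[str, str]:
    """Return a UUID -> limit map; per-device entries override the default."""
    limits: Dict[str, str] = {}
    # Apply the default limit to every allocated device first.
    if default_limit is not None:
        for uuid in devices:
            limits[uuid] = default_limit
    # Then let per-device entries (by index or UUID) override it.
    for key, limit in per_device.items():
        limits[resolve_device(devices, key)] = limit
    return limits
```

With the parameters from the second example in this PR, `normalize_limits(["GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e"], "10Gi", {0: "5Gi"})` returns `{"GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e": "5Gi"}`, while an index of `1` would raise `ValueError`.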

For example, the generated MPS control daemon spec now includes:

```yaml
spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-f22fb098-d1b3-3806-2655-ba25f02229c1 10240M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-f22fb098-d1b3-3806-2655-ba25f02229c1
```

This is generated from the following claim parameters:

```yaml
---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi
```
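The `10Gi` in these claim parameters appears as `10240M` in the generated command. A minimal sketch of that conversion follows; `to_mps_megabytes` is a hypothetical helper, and it assumes that only the binary suffixes `Ki`, `Mi`, `Gi` and `Ti` need handling and that the `M` suffix passed to `nvidia-cuda-mps-control` denotes MiB, as the 10Gi to 10240M mapping suggests.

```python
import re

# Multipliers for Kubernetes-style binary suffixes (assumed to be the only ones used).
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}
_MIB = 2**20


def to_mps_megabytes(quantity: str) -> str:
    """Convert e.g. '10Gi' to '10240M', truncating to whole MiB (hypothetical helper)."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity {quantity!r}")
    n_bytes = int(match.group(1)) * _BINARY_SUFFIXES[match.group(2)]
    # Express the byte count in MiB with the "M" suffix used in the generated spec.
    return f"{n_bytes // _MIB}M"
```

Kubernetes quantities also allow decimal suffixes (such as `G` and `M`) and fractional values; this sketch deliberately rejects them rather than guessing how they should map.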

Adding a per-device limit for device index `0` instead produces:

```yaml
spec:
  containers:
  - args:
    - |-
      set -e
      rm -f /var/log/nvidia-mps/startup.log

      nvidia-cuda-mps-control -d
      echo set_default_active_thread_percentage 50 | nvidia-cuda-mps-control
      echo set_default_device_pinned_mem_limit GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e 5120M | nvidia-cuda-mps-control

      echo "startup complete" > /var/log/nvidia-mps/startup.log

      tail -n +1 -f /var/log/nvidia-mps/control.log
    command:
    - chroot
    - /driver-root
    - sh
    - -c
    env:
    - name: CUDA_VISIBLE_DEVICES
      value: GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e
```

from these claim parameters, where the 5Gi per-device limit overrides the 10Gi default:

```yaml
---
apiVersion: gpu.resource.nvidia.com/v1alpha1
kind: GpuClaimParameters
metadata:
  namespace: sharing-demo
  name: gpu-mps-sharing
spec:
  sharing:
    strategy: MPS
    mpsConfig:
      defaultActiveThreadPercentage: 50
      defaultPinnedDeviceMemoryLimit: 10Gi
      defaultPerDevicePinnedMemoryLimit:
        0: 5Gi
```
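Once limits are keyed by UUID, the control-daemon commands in the generated specs follow directly. The hypothetical `render_mps_commands` below sketches this step: it emits one `set_default_device_pinned_mem_limit` line per UUID, sorted so the script is deterministic. The command strings are copied from the specs in this PR; the function itself is illustrative only.

```python
from typing import Dict, List, Optional

MPS_CONTROL = "nvidia-cuda-mps-control"


def render_mps_commands(
    active_thread_percentage: Optional[int],
    pinned_limits: Dict[str, str],
) -> List[str]:
    """Build the daemon startup commands from UUID-keyed limits (hypothetical helper)."""
    lines = [f"{MPS_CONTROL} -d"]  # start the MPS control daemon in background mode
    if active_thread_percentage is not None:
        lines.append(
            f"echo set_default_active_thread_percentage {active_thread_percentage} | {MPS_CONTROL}"
        )
    # Sorting the UUIDs keeps the generated script stable across runs.
    for uuid in sorted(pinned_limits):
        lines.append(
            f"echo set_default_device_pinned_mem_limit {uuid} {pinned_limits[uuid]} | {MPS_CONTROL}"
        )
    return lines
```

For the second example, `render_mps_commands(50, {"GPU-3109fa37-4445-73c7-b695-1b5a4d13f58e": "5120M"})` yields `nvidia-cuda-mps-control -d` followed by the two `echo` lines shown in that spec.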

@elezar marked this pull request as ready for review on November 3, 2023 12:56.
@elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from 8e6acf5 to b739902 on November 3, 2023 13:26.
@elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from b739902 to 0203e0f on November 21, 2023 14:22.
Commit message: identical to the description above. Signed-off-by: Evan Lezar
@elezar force-pushed the CNT-4683/improve-pinned-memory-limits branch from 0203e0f to 8e42696 on February 29, 2024 14:05.
@elezar merged commit d00050e into NVIDIA:main on Mar 1, 2024 (5 checks passed).
@elezar deleted the CNT-4683/improve-pinned-memory-limits branch on March 1, 2024 16:19.