nvmlDeviceGetMPSComputeRunningProcesses_v2 api is missing #28

qisikai · 2021-11-25T09:13:49Z

desc

I want to know whether it is possible to call the nvmlDeviceGetMPSComputeRunningProcesses_v2 API using go-nvml, and when it can be supported.

Thanks.

api

doc

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g02098e9876e3fb86eeb9cac2222e5b5d

XuehaiPan · 2021-11-25T09:46:54Z

The nvml.h version in this repository is from CUDA 11.2.2. It does not have the definition of function nvmlDeviceGetMPSComputeRunningProcesses_v2.

klueska · 2021-11-25T09:48:11Z

Yes. We are a bit behind in keeping go-nvml in sync with the latest NVML release. We hope to have a new version out next week.

XuehaiPan · 2021-11-25T09:53:49Z

We are a bit behind in keeping go-nvml in sync with the latest NVML release.

The latest devel image of CUDA on Docker Hub is CUDA 11.4.2 (we use the Docker image to update nvml.h), which is also behind the latest CUDA release (11.5).

qisikai · 2021-12-02T10:40:51Z

That's great, thanks

XuehaiPan · 2022-03-11T09:24:01Z

This issue is resolved by PR #38.

qisikai · 2022-03-14T12:22:17Z

When I use the latest master branch, I get the following (the pid of mp[1] is wrong):

deviceGetMPSComputeRunningProcesses_v1 called
mp[0] = {Pid:4126076 UsedGpuMemory:2928672768}
mp[1] = {Pid:4294967295 UsedGpuMemory:4126048}

It works when I change the definition of ProcessInfo_v1 from

type ProcessInfo_v1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

to

type ProcessInfo_v1 struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

deviceGetMPSComputeRunningProcesses_v1 called
p[0] = {Pid:3436170 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
p[1] = {Pid:3436156 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
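
The wrong pid in the first output is consistent with a stride mismatch: the driver writes 24-byte records while the caller reads 16-byte ones. The following is a minimal sketch of that effect, assuming a 64-bit little-endian platform. The names procV1, encodeV2, and decodeV1 are illustrative and are not part of go-nvml, and the second process's pid (4126048) and memory value are assumed so that the result lines up with the output reported above.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Record sizes in bytes on a 64-bit platform. Four bytes of padding follow
// the 4-byte Pid so that UsedGpuMemory is 8-byte aligned.
const (
	sizeV1 = 16 // Pid(4) + pad(4) + UsedGpuMemory(8)
	sizeV2 = 24 // v1 fields + GpuInstanceId(4) + ComputeInstanceId(4)
)

// procV1 mirrors the two-field layout quoted in the thread.
type procV1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

// encodeV2 writes one record in the four-field (v2) layout into buf.
func encodeV2(buf []byte, pid uint32, mem uint64, gi, ci uint32) {
	binary.LittleEndian.PutUint32(buf[0:], pid)
	binary.LittleEndian.PutUint64(buf[8:], mem)
	binary.LittleEndian.PutUint32(buf[16:], gi)
	binary.LittleEndian.PutUint32(buf[20:], ci)
}

// decodeV1 reads the i-th record from buf while assuming the v1 stride.
func decodeV1(buf []byte, i int) procV1 {
	off := i * sizeV1
	return procV1{
		Pid:           binary.LittleEndian.Uint32(buf[off:]),
		UsedGpuMemory: binary.LittleEndian.Uint64(buf[off+8:]),
	}
}

func main() {
	// Two records as a driver using the v2 layout would write them.
	// The second pid and memory value are assumptions for illustration.
	buf := make([]byte, 2*sizeV2)
	encodeV2(buf[0:], 4126076, 2928672768, 0xFFFFFFFF, 0xFFFFFFFF)
	encodeV2(buf[sizeV2:], 4126048, 2928672768, 0xFFFFFFFF, 0xFFFFFFFF)

	// Reading with the v1 stride: record 1 starts at byte 16, which is
	// record 0's GpuInstanceId, and its memory field overlaps record 1's Pid.
	for i := 0; i < 2; i++ {
		fmt.Printf("mp[%d] = %+v\n", i, decodeV1(buf, i))
	}
}
```

Under these assumptions the second decoded entry has Pid 4294967295 (the 0xFFFFFFFF instance ID of the first record) and UsedGpuMemory 4126048 (the real pid of the second record), which matches the shape of the bad output.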

qisikai · 2022-03-14T12:26:54Z

Hi, PR #38 didn't work. Please help, thanks @klueska @XuehaiPan
Environment:
Tesla T4 + 450.80 / 450.142

klueska · 2022-03-14T13:15:40Z

This is obviously unexpected, and glancing at the code, it's not clear to me why / how this would happen.

Can you show me the output of the following on your machine:

$ objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses
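
The same check can be sketched in Go with the standard library's debug/elf package, which reads the dynamic symbol table directly instead of grepping disassembly. This is only an illustrative alternative to the objdump command; dynamicSymbolNames and withPrefix are hypothetical helper names, and the library path is passed on the command line.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// dynamicSymbolNames returns the names of symbols that are defined in the
// dynamic symbol table of the ELF shared object at path.
func dynamicSymbolNames(path string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	syms, err := f.DynamicSymbols()
	if err != nil {
		return nil, err
	}
	names := make([]string, 0, len(syms))
	for _, s := range syms {
		if s.Section == elf.SHN_UNDEF {
			continue // imported by the library, not defined in it
		}
		names = append(names, s.Name)
	}
	return names, nil
}

// withPrefix keeps only the names that start with prefix.
func withPrefix(names []string, prefix string) []string {
	var out []string
	for _, n := range names {
		if strings.HasPrefix(n, prefix) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: symcheck <path-to-libnvidia-ml.so.1>")
		return
	}
	names, err := dynamicSymbolNames(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	for _, n := range withPrefix(names, "nvmlDeviceGetMPSComputeRunningProcesses") {
		fmt.Println(n)
	}
}
```

If only the unversioned name is printed, the library exports nvmlDeviceGetMPSComputeRunningProcesses without a _v2 counterpart.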

qisikai · 2022-03-14T13:18:45Z

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep nvmlDeviceGetMPSComputeRunningProcesses
000000000004cbd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base>:
   4cbf9:       0f 8f d1 01 00 00       jg     4cdd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x200>
   4cc08:       0f 84 9a 00 00 00       je     4cca8 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0xd8>
   4cc12:       7f 14                   jg     4cc28 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x58>
   4ccbd:       0f 84 7d 00 00 00       je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cccb:       74 73                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccd4:       75 6a                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccdc:       0f 84 6e 01 00 00       je     4ce50 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x280>
   4cce5:       0f 84 d5 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccee:       0f 84 cc 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccfa:       0f 84 ca 01 00 00       je     4ceca <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2fa>
   4cd0c:       74 32                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd15:       75 29                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd1e:       74 20                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd49:       0f 8e c5 fe ff ff       jle    4cc14 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x44>
   4ce41:       e9 b9 fd ff ff          jmpq   4cbff <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f>
   4ce59:       0f 8e e1 fe ff ff       jle    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ceb7:       e9 84 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cec5:       e9 76 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cee4:       e9 57 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>

qisikai · 2022-03-14T13:23:52Z

klueska · 2022-03-14T13:47:28Z

Interesting. I wouldn't have expected the R450 driver (i.e. NVML 11.0) to have a symbol defined for nvmlDeviceGetMPSComputeRunningProcesses because it doesn't appear in the NVML header file until R470 (i.e. NVML version 11.4).

Can you verify that things work as expected for the other similar functions, i.e. DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses, or do these have the same problem? If these are working but nvmlDeviceGetMPSComputeRunningProcesses is not, then my assumption below is likely true.

What I think is going on is that the binary for libnvidia-ml.so with NVML 11.0 actually had the symbol for nvmlDeviceGetMPSComputeRunningProcesses compiled into it even though it wasn't available in the NVML header for this version.

And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name. Since it wasn't officially part of the header yet, people (in theory) shouldn't have known about it or be able to use it.

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

As such, we need to tell this original, unversioned function to actually operate on the v2 struct, even though that breaks the pattern from the other, similar functions (which were officially supported prior to the introduction of the v2 struct).

qisikai · 2022-03-14T13:56:19Z

@klueska
DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses from the latest master branch work as expected (and DeviceGetComputeRunningProcesses is equivalent to deviceGetComputeRunningProcesses_v2).

I also agree with this point of view:
And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name

XuehaiPan · 2022-03-14T14:07:46Z

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

In nvml.h (NVML 11.6 at master branch), the unversioned nvmlDeviceGetMPSComputeRunningProcesses uses struct nvmlProcessInfo_v1_t:

go-nvml/pkg/nvml/nvml.h

Lines 8420 to 8425 in c3a16a2

    
nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetGraphicsRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetGraphicsRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetMPSComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetMPSComputeRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);

klueska · 2022-03-14T14:30:50Z

Right -- so I'm thinking that old driver versions (where the API wasn't published) were actually buggy by operating on the v2 struct and then when they published the API, they retroactively went back and "fixed" things for it to use the v1 struct. Not sure though -- will need to check internally.

klueska · 2022-03-14T21:04:55Z

As I suspected I got confirmation from the NVML team that this is what happened. The function nvmlDeviceGetMPSComputeRunningProcesses was never meant to be exposed in the libnvidia-ml.so.1 binary until the R470 driver. It's apparently common for "internal" functions like this to make their way into the binary (for testing and ease of merging code into the code base which is still under development), but normally these "internal" functions have their names mangled so as not to interfere with the real API once it is released.

So what I said before is exactly what happened -- there was a hidden version of nvmlDeviceGetMPSComputeRunningProcesses in R450 that operated on nvmlProcessInfo_v2_t structs, but since it was never meant to be exposed, they "fixed" it to operate on nvmlProcessInfo_v1_t structs when it was officially released in R470, alongside a nvmlDeviceGetMPSComputeRunningProcesses_v2 API which operates on v2 structs.

So the short of it is that nvmlDeviceGetMPSComputeRunningProcesses is not intended to be available on your R450 driver, so you should not expect to be using it there. In the coming days I will merge something to master that prevents this function from being visible in drivers prior to R470.
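
A minimal sketch of what such a driver-version gate could look like. This is not the change that was merged to go-nvml. The helpers driverMajor and mpsProcessesSupported are hypothetical, the R470 threshold is taken from the explanation above, and 470.57.02 is an assumed example version string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMPSDriverMajor is the first driver branch in which the function was
// officially released, according to the discussion in this thread.
const minMPSDriverMajor = 470

// driverMajor extracts the leading branch number from a driver version
// string such as "450.80".
func driverMajor(version string) (int, error) {
	major := strings.SplitN(strings.TrimSpace(version), ".", 2)[0]
	return strconv.Atoi(major)
}

// mpsProcessesSupported reports whether the unversioned MPS process query
// should be exposed for the given driver version.
func mpsProcessesSupported(version string) (bool, error) {
	major, err := driverMajor(version)
	if err != nil {
		return false, fmt.Errorf("parse driver version %q: %w", version, err)
	}
	return major >= minMPSDriverMajor, nil
}

func main() {
	// The first two versions come from the reporter's environment; the
	// third is an assumed example of an R470 driver.
	for _, v := range []string{"450.80", "450.142", "470.57.02"} {
		ok, err := mpsProcessesSupported(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: supported=%t\n", v, ok)
	}
}
```

A caller would obtain the version string from the driver at runtime and return a "not supported" error instead of calling into the library when the gate reports false.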

qisikai · 2022-03-15T03:45:34Z

How about this idea:
For drivers older than R470 (such as 450.80), call nvmlDeviceGetMPSComputeRunningProcesses with the v2 struct.
For drivers at R470 or newer, use the current logic.

I find that nvidia-smi with the 450.80 driver gets the right result from nvmlDeviceGetMPSComputeRunningProcesses, so go-nvml needs to provide a way to do the same thing.

wookayin mentioned this issue Mar 15, 2022

Use NVIDIA's official pynvml binding wookayin/gpustat#107

Merged

XuehaiPan mentioned this issue Mar 21, 2022

[Bug] gpu memory-usage not show right in driver 510 version XuehaiPan/nvitop#13

Closed

XuehaiPan mentioned this issue Jul 24, 2022

feat(core/libnvml): add compatibility layers for NVML Python bindings XuehaiPan/nvitop#30

Merged
