nvmlDeviceGetMPSComputeRunningProcesses_v2 api is missing #28

Open
qisikai opened this issue Nov 25, 2021 · 16 comments

@qisikai

qisikai commented Nov 25, 2021

desc

I want to know whether it's possible to call the nvmlDeviceGetMPSComputeRunningProcesses_v2 API using go-nvml, and when it can be supported.

Thanks.

api

image

doc

https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g02098e9876e3fb86eeb9cac2222e5b5d

@XuehaiPan
Contributor

The nvml.h version in this repository is from CUDA 11.2.2. It does not have the definition of function nvmlDeviceGetMPSComputeRunningProcesses_v2.

@klueska
Contributor

klueska commented Nov 25, 2021

Yes. We are a bit behind in keeping go-nvml in sync with the latest NVML release. We hope to have a new version out next week.

@XuehaiPan
Contributor

We are a bit behind in keeping go-nvml in sync with the latest NVML release.

The latest CUDA devel image on Docker Hub is CUDA 11.4.2 (we use the Docker image to update nvml.h), which is also behind the latest CUDA release (11.5).

@qisikai
Author

qisikai commented Dec 2, 2021

That's great, thanks

@XuehaiPan
Contributor

This issue is resolved by PR #38.

@qisikai
Author

qisikai commented Mar 14, 2022

When I use the latest master branch, I get the following (mp[1]'s Pid is wrong):

deviceGetMPSComputeRunningProcesses_v1 called
mp[0] = {Pid:4126076 UsedGpuMemory:2928672768}
mp[1] = {Pid:4294967295 UsedGpuMemory:4126048}

image

It works when I change ProcessInfo_v1's definition from

type ProcessInfo_v1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

to

type ProcessInfo_v1 struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

With that change, the output is:

deviceGetMPSComputeRunningProcesses_v1 called
p[0] = {Pid:3436170 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
p[1] = {Pid:3436156 UsedGpuMemory:5420089344 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
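The wrong Pid above is consistent with a record-size mismatch: the driver appears to fill 24-byte records while the caller walks the buffer in 16-byte steps. The following is a minimal sketch of that effect, not go-nvml code. It assumes a little-endian 64-bit layout with natural alignment and zeroed padding, and the helper names (encodeV2, decodeAsV1) are hypothetical. The second process's memory value is also an assumption, since the original output does not reveal it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Assumed record sizes on a 64-bit platform with natural alignment.
const (
	sizeV1 = 16 // pid(4) + padding(4) + usedGpuMemory(8)
	sizeV2 = 24 // v1 fields + gpuInstanceId(4) + computeInstanceId(4)
)

// processInfoV1 mirrors the two-field struct from the thread.
type processInfoV1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

// encodeV2 writes one 24-byte record the way a v2-aware writer would.
func encodeV2(buf []byte, pid uint32, mem uint64, gi, ci uint32) {
	binary.LittleEndian.PutUint32(buf[0:], pid)
	binary.LittleEndian.PutUint64(buf[8:], mem)
	binary.LittleEndian.PutUint32(buf[16:], gi)
	binary.LittleEndian.PutUint32(buf[20:], ci)
}

// decodeAsV1 reads record i assuming a 16-byte stride, which is what a
// caller does when it believes the buffer holds v1 records.
func decodeAsV1(buf []byte, i int) processInfoV1 {
	off := i * sizeV1
	return processInfoV1{
		Pid:           binary.LittleEndian.Uint32(buf[off:]),
		UsedGpuMemory: binary.LittleEndian.Uint64(buf[off+8:]),
	}
}

func main() {
	buf := make([]byte, 2*sizeV2)
	// First record uses the values reported in the thread.
	encodeV2(buf, 4126076, 2928672768, 0xFFFFFFFF, 0xFFFFFFFF)
	// Second record: the pid is inferred, the memory value is assumed.
	encodeV2(buf[sizeV2:], 4126048, 2928672768, 0xFFFFFFFF, 0xFFFFFFFF)

	for i := 0; i < 2; i++ {
		fmt.Printf("mp[%d] = %+v\n", i, decodeAsV1(buf, i))
	}
}
```

Under these assumptions, the second decoded record picks up the first record's GpuInstanceId (4294967295) as its Pid and the second record's real pid as its UsedGpuMemory, which matches the pattern in the reported output.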

@qisikai
Author

qisikai commented Mar 14, 2022

Hi, PR #38 didn't work. Please help, thanks @klueska @XuehaiPan
Environment:
Tesla T4 with driver 450.80 / 450.142

@klueska
Contributor

klueska commented Mar 14, 2022

This is obviously unexpected, and glancing at the code, it's not clear to me why / how this would happen.

Can you show me the output of the following on your machine:

$ objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses
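The same check can be done without disassembling the library by listing its dynamic symbol table. Below is a rough sketch using Go's standard debug/elf package. The function name exportedSymbols is hypothetical, and the default path is the one from the command above, which may differ on other systems.

```go
package main

import (
	"debug/elf"
	"fmt"
	"os"
	"strings"
)

// exportedSymbols returns the names of dynamic symbols defined in the ELF
// file at path whose names start with prefix.
func exportedSymbols(path, prefix string) ([]string, error) {
	f, err := elf.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	syms, err := f.DynamicSymbols()
	if err != nil {
		return nil, err
	}

	var names []string
	for _, s := range syms {
		// SHN_UNDEF marks symbols the library imports rather than defines.
		if s.Section == elf.SHN_UNDEF {
			continue
		}
		if strings.HasPrefix(s.Name, prefix) {
			names = append(names, s.Name)
		}
	}
	return names, nil
}

func main() {
	// Path taken from the objdump command; pass another path as an argument.
	path := "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so"
	if len(os.Args) > 1 {
		path = os.Args[1]
	}

	names, err := exportedSymbols(path, "nvmlDeviceGetMPSComputeRunningProcesses")
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	for _, n := range names {
		fmt.Println(n)
	}
}
```

If only the unversioned name is printed, the library exports nvmlDeviceGetMPSComputeRunningProcesses without a _v2 counterpart.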

@qisikai
Author

qisikai commented Mar 14, 2022

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so | grep nvmlDeviceGetMPSComputeRunningProcesses

objdump -D /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 | grep nvmlDeviceGetMPSComputeRunningProcesses
000000000004cbd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base>:
   4cbf9:       0f 8f d1 01 00 00       jg     4cdd0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x200>
   4cc08:       0f 84 9a 00 00 00       je     4cca8 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0xd8>
   4cc12:       7f 14                   jg     4cc28 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x58>
   4ccbd:       0f 84 7d 00 00 00       je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cccb:       74 73                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccd4:       75 6a                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ccdc:       0f 84 6e 01 00 00       je     4ce50 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x280>
   4cce5:       0f 84 d5 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccee:       0f 84 cc 01 00 00       je     4cec0 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f0>
   4ccfa:       0f 84 ca 01 00 00       je     4ceca <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2fa>
   4cd0c:       74 32                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd15:       75 29                   jne    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd1e:       74 20                   je     4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cd49:       0f 8e c5 fe ff ff       jle    4cc14 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x44>
   4ce41:       e9 b9 fd ff ff          jmpq   4cbff <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x2f>
   4ce59:       0f 8e e1 fe ff ff       jle    4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4ceb7:       e9 84 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cec5:       e9 76 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>
   4cee4:       e9 57 fe ff ff          jmpq   4cd40 <nvmlDeviceGetMPSComputeRunningProcesses@@Base+0x170>

@qisikai
Author

qisikai commented Mar 14, 2022

image

@klueska
Contributor

klueska commented Mar 14, 2022

Interesting. I wouldn't have expected the R450 driver (i.e. NVML 11.0) to have a symbol defined for nvmlDeviceGetMPSComputeRunningProcesses because it doesn't appear in the NVML header file until R470 (i.e. NVML version 11.4).

Can you verify that things work as expected for the other similar functions, i.e. DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses, or do these have the same problem? If these are working but nvmlDeviceGetMPSComputeRunningProcesses is not, then my assumption below is likely true.

What I think is going on is that the binary for libnvidia-ml.so with NVML 11.0 actually had the symbol for nvmlDeviceGetMPSComputeRunningProcesses compiled into it even though it wasn't available in the NVML header for this version.

And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name. Since it wasn't officially part of the header yet, people (in theory) shouldn't have known about it or be able to use it.

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

As such, we need to tell this original, unversioned function to actually operate on the v2 struct, even though that breaks the pattern from the other, similar functions (which were officially supported prior to the introduction of the v2 struct).

@qisikai
Author

qisikai commented Mar 14, 2022

@klueska
DeviceGetComputeRunningProcesses and DeviceGetGraphicsRunningProcesses from the latest master branch work as expected (and DeviceGetComputeRunningProcesses is equivalent to deviceGetComputeRunningProcesses_v2).

I also agree with this point of view:
And I'm guessing that it already operated on the v2 version of the process struct even though it didn't explicitly have a v2 in its function name

@XuehaiPan
Contributor

Then when NVML 11.4 came out, it was officially added to the API, but only as a v2 function (since it clearly operates on the v2 struct). It was OK to not "backport" support for the v1 struct into the original, unversioned function, because it was never officially supported. However, the original unversioned function is still present in older versions of the driver and is now being picked up by our go-nvml library (which is supposed to be usable on all versions of NVML 11.0+).

In nvml.h (NVML 11.6 at master branch), the unversioned nvmlDeviceGetMPSComputeRunningProcesses uses struct nvmlProcessInfo_v1_t:

go-nvml/pkg/nvml/nvml.h

Lines 8420 to 8425 in c3a16a2

nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetComputeRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetGraphicsRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetGraphicsRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetMPSComputeRunningProcesses(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v1_t *infos);
nvmlReturn_t DECLDIR nvmlDeviceGetMPSComputeRunningProcesses_v2(nvmlDevice_t device, unsigned int *infoCount, nvmlProcessInfo_v2_t *infos);
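Since the declarations differ only in the element type of infos, what matters to a caller is the size of each element. Here is a small sketch that prints the sizes of Go structs mirroring the two layouts discussed in this thread. The type names and the sizes helper are hypothetical, and the field lists are taken from the struct definitions posted earlier in the thread, not checked against nvml.h.

```go
package main

import (
	"fmt"
	"unsafe"
)

// processInfoV1 mirrors the two-field layout from the thread.
type processInfoV1 struct {
	Pid           uint32
	UsedGpuMemory uint64
}

// processInfoV2 adds the two instance-ID fields.
type processInfoV2 struct {
	Pid               uint32
	UsedGpuMemory     uint64
	GpuInstanceId     uint32
	ComputeInstanceId uint32
}

// sizes reports the in-memory size of each layout on the current platform.
func sizes() (v1, v2 uintptr) {
	return unsafe.Sizeof(processInfoV1{}), unsafe.Sizeof(processInfoV2{})
}

func main() {
	v1, v2 := sizes()
	var p processInfoV2
	fmt.Println("v1 size:", v1)
	fmt.Println("v2 size:", v2)
	fmt.Println("GpuInstanceId offset in v2:", unsafe.Offsetof(p.GpuInstanceId))
}
```

A buffer written with one element size and read with the other goes out of step from the second element onward, so the unversioned and _v2 functions cannot share a struct type.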

@klueska
Contributor

klueska commented Mar 14, 2022

Right -- so I'm thinking that old driver versions (where the API wasn't published) were actually buggy by operating on the v2 struct and then when they published the API, they retroactively went back and "fixed" things for it to use the v1 struct. Not sure though -- will need to check internally.

@klueska
Contributor

klueska commented Mar 14, 2022

As I suspected I got confirmation from the NVML team that this is what happened. The function nvmlDeviceGetMPSComputeRunningProcesses was never meant to be exposed in the libnvidia-ml.so.1 binary until the R470 driver. It's apparently common for "internal" functions like this to make their way into the binary (for testing and ease of merging code into the code base which is still under development), but normally these "internal" functions have their names mangled so as not to interfere with the real API once it is released.

So what I said before is exactly what happened -- there was a hidden version of nvmlDeviceGetMPSComputeRunningProcesses in R450 that operated on nvmlProcessInfo_v2_t structs, but since it was never meant to be exposed they "fixed" it to operate on nvmlProcessInfo_v1_t structs when it was officially released in R470 alongside a nvmlDeviceGetMPSComputeRunningProcesses_v2 API which operates on v2 structs.

So the short of it is that nvmlDeviceGetMPSComputeRunningProcesses is not intended to be available on your R450 driver, so you should not expect to be using it there. In the coming days I will merge something to master that prevents this function from being visible in drivers prior to R470.
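One way such a guard could look is a check on the driver's major version before the function is exposed. This is only a sketch under assumptions, not the change that will be merged. The names minMPSDriverMajor, driverMajor and mpsProcessesSupported are hypothetical, and the version string is assumed to have the dotted form reported by the driver (for example via nvml.SystemGetDriverVersion()).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minMPSDriverMajor is the first driver branch where, per this thread,
// nvmlDeviceGetMPSComputeRunningProcesses is officially part of the API.
const minMPSDriverMajor = 470

// driverMajor extracts the leading number from a version like "450.80".
func driverMajor(version string) (int, error) {
	parts := strings.SplitN(strings.TrimSpace(version), ".", 2)
	return strconv.Atoi(parts[0])
}

// mpsProcessesSupported reports whether the MPS process query should be
// exposed for the given driver version string.
func mpsProcessesSupported(version string) (bool, error) {
	major, err := driverMajor(version)
	if err != nil {
		return false, fmt.Errorf("parse driver version %q: %w", version, err)
	}
	return major >= minMPSDriverMajor, nil
}

func main() {
	// The first two versions come from the thread; "470.0" is illustrative.
	for _, v := range []string{"450.80", "450.142", "470.0"} {
		ok, err := mpsProcessesSupported(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("driver %s: MPS process query exposed = %v\n", v, ok)
	}
}
```

A caller on an older driver would then get a "not supported" style result instead of silently misread records.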

@qisikai
Author

qisikai commented Mar 15, 2022

How about this idea:
For drivers older than R470 (e.g. 450.80), use the v2 variant of nvmlDeviceGetMPSComputeRunningProcesses.
For drivers R470 and later, keep the current logic.

I find that nvidia-smi with the 450.80 driver gets the right result from nvmlDeviceGetMPSComputeRunningProcesses, so go-nvml needs to provide a way to do the same thing.
