Large standard deviation when profile cutlass profiler with nsys profiler. #748

Quangnguyengiabku · 2022-12-28T03:47:36Z

Quangnguyengiabku
Dec 28, 2022

Hello everyone! Recently, I used the NVIDIA nsys profiler (https://docs.nvidia.com/nsight-systems/UserGuide/index.html) to profile some GEMMs on the cutlass profiler. As a result, I found that the performance standard deviation of the cutlass kernels is huge (approximately 10% compared to the same kernel cutlass on PyTorch is 0.3%). I tried to config parameters to reduce the StdDev but didn't work. Could anyone help me?
Another thing that confuses me is the kernel _void cutlass::Kernel<cutlass_80_simt_sgemm_256x128_8x4_nn_align1>(T1::Params) (masked in bleu) _ suddenly appeared when I profiler with nsys while if I use cutlass profiler, I don't see it. What is cutlass_80? Could anyone explain it to me?
An example of the command I use:

/usr/local/cuda/bin/nsys profile ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=2048 --kernels=cutlass_simt*nn --profiling-iterations=200 --dist=uniform,min:5,max:100,scale:2 --warmup-iterations=10 --alpha=1 --beta=0 --accum=f32 --save-workspace=always

jackkosaian · 2022-12-28T12:41:54Z

jackkosaian
Dec 28, 2022

Hi, @Quangnguyengiabku.

A few questions:

What GPU are you running on? What CUDA version are you using?
Do you see a high standard deviation if you run the CUTLASS profiler multiple times outside of nsys (e.g., if you compare the "Runtime" output in the CSV printed by the profiler)?
How is your PyTorch version set up? Does it run the same number of iterations, etc.? How have you verified that it is running the same CUTLASS kernel?
Have you locked the clocks and changed the power cap on your GPU? Without these being locked, there can be performance variation due to the clock frequency on your GPU being changed, perhaps due to power-related throttling. I would expect these to be more likely to occur in your runs of the CUTLASS profiler than running a single PyTorch kernel because the CUTLASS profiler will run many different kernels closely after one another. This higher-intensity workload would be more likely to experience throttling, and, thus, experience performance variation.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Large standard deviation when profile cutlass profiler with nsys profiler. #748

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 1 comment

{{title}}

Select a reply

Large standard deviation when profile cutlass profiler with nsys profiler. #748

Quangnguyengiabku Dec 28, 2022

Replies: 1 comment

jackkosaian Dec 28, 2022

Quangnguyengiabku
Dec 28, 2022

jackkosaian
Dec 28, 2022