Large standard deviation when profiling the cutlass profiler with the nsys profiler #748
Quangnguyengiabku started this conversation in General
Replies: 1 comment
-
Hi, @Quangnguyengiabku. A few questions:
-
Hello everyone! Recently, I used the NVIDIA nsys profiler (https://docs.nvidia.com/nsight-systems/UserGuide/index.html) to profile some GEMMs run through the cutlass profiler. I found that the performance standard deviation of the cutlass kernels is huge: approximately 10%, compared to about 0.3% for the same CUTLASS kernel invoked from PyTorch. I tried adjusting the profiler parameters to reduce the StdDev, but it didn't help. Could anyone help me?
Another thing that confuses me is that the kernel `void cutlass::Kernel<cutlass_80_simt_sgemm_256x128_8x4_nn_align1>(T1::Params)` (highlighted in blue) suddenly appears when I profile with nsys, while if I use the cutlass profiler alone, I don't see it. What does cutlass_80 mean? Could anyone explain it to me?
An example of the command I use:
/usr/local/cuda/bin/nsys profile ./tools/profiler/cutlass_profiler --operation=Gemm --m=1024 --n=1024 --k=2048 --kernels=cutlass_simt*nn --profiling-iterations=200 --dist=uniform,min:5,max:100,scale:2 --warmup-iterations=10 --alpha=1 --beta=0 --accum=f32 --save-workspace=always
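For what it's worth, the ~10% figure I quote is the relative standard deviation (StdDev divided by mean) over the per-iteration kernel times that nsys reports. A minimal sketch of that computation, using made-up placeholder timings rather than real measurements:

```python
import statistics

def relative_stddev(times_ms):
    """Coefficient of variation: sample stddev divided by mean."""
    return statistics.stdev(times_ms) / statistics.mean(times_ms)

# Hypothetical per-iteration GEMM kernel times in ms (illustrative only,
# not exported from an actual nsys report).
noisy = [1.00, 0.92, 1.15, 0.88, 1.10, 0.95]   # ~10% relative stddev
stable = [1.00, 1.00, 1.01, 0.99, 1.00, 1.00]  # well under 1%

print(f"noisy run:  {relative_stddev(noisy):.1%}")
print(f"stable run: {relative_stddev(stable):.1%}")
```

This is just to make precise what "10% vs. 0.3%" means in my question above; the actual timings come from the nsys report, not from this script.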