CUDA 11.3 significantly improved the performance of CUTLASS #241

hwu36 · 2021-04-18T02:44:20Z

hwu36
Apr 18, 2021
Maintainer

CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels.

298TFLOPS was recorded when benchmarking CUTLASS FP16 GEMM on A100. This is 14% higher than CUDA 11.2. FP32(via TF32) GEMM is improved by 39% and can reach 143TFLOPS. The same speedup applies to the CONV kernels.

To reproduce these two numbers, follow below commands

cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8,cutlass_tensorop_s1688tf32gemm_256x128_16x3_tn_align4

make cutlass_profiler -j8

sudo nvidia-smi -pm 1 -i 0 

sudo nvidia-smi -lgc 1410 -i 0

sudo nvidia-smi --power-limit=400 -i 0

./tools/profiler/cutlass_profiler --m=8192 --n=6912 --k=16384 --kernels=cutlass_tensorop_s16816gemm_f16_128x256_32x3_nt_align8 --clock=1410 --providers=cutlass

./tools/profiler/cutlass_profiler --m=8192 --n=6912 --k=8192 --kernels=cutlass_tensorop_s1688tf32gemm_256x128_16x3_tn_align4 --clock=1410 --providers=cutlass

timespacce · 2021-04-18T13:03:33Z

timespacce
Apr 18, 2021

At last has nvidia started minimizing the gap between their products and purely-cuda-written solutions.

With 11.0 - on A100_SXM_40GB our AI/ML research group was achieving on 8192x8192 * 8192x8192 4.25ms 270 TFLOP/s fp16 on cutlass and 3.71ms 297 TFLOP/s fp16 on cublass.

With 11.3 - we get 3.72ms or 295 TFLOP/s fp16 on cutlass and 3.71ms or 297 TFLOP/s fp16 on cublass.

The gap is getting smaller and smaller!

Nvidia and the cutlass development team have done an amazing job!

0 replies

hwu36 · 2021-04-18T17:02:53Z

hwu36
Apr 18, 2021
Maintainer Author

So glad to hear! 1ms or 0.3% (=3.71/3.72) difference is within the noise range.

BTW, which cutlass kernel did you run and which kernel did cublas run (nvprof or nsight can tell the kernel name) ?

0 replies

FindDefinition · 2021-04-20T06:57:57Z

FindDefinition
Apr 20, 2021

why can we get better performance in cuda 11.3 for tensor core? better compiler optimization during nvcc compilation? I check the release note and found nothing.

1 reply

hwu36 Apr 20, 2021
Maintainer Author

I added one at the bottom of https://developer.nvidia.com/blog/exploring-the-new-features-of-cuda-11-3/ 😄

hwu36 · 2021-04-22T02:54:00Z

hwu36
Apr 22, 2021
Maintainer Author

Speedup of different types of tensor cores

3 replies

alohali Aug 20, 2021

Do you have similar data on T4 & A10?

hwu36 Aug 20, 2021
Maintainer Author

The 11.3 compiler improves the tensor core performance on all architectures. The speedup on Ampere (e.g. A10) is bigger than Truing (e.g. T4)/Volta.

ginowu Nov 30, 2021

The 11.3 compiler improves the tensor core performance on all architectures. The speedup on Ampere (e.g. A10) is bigger than Truing (e.g. T4)/Volta.

BTW: My epilogue fusing codes bring in 6 register/thread increase when using cuda 11.2, after switching to 11.4.2, this issue vanished.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CUDA 11.3 significantly improved the performance of CUTLASS #241

{{title}}

Replies: 4 comments 4 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

CUDA 11.3 significantly improved the performance of CUTLASS #241

hwu36 Apr 18, 2021 Maintainer

Replies: 4 comments · 4 replies

timespacce Apr 18, 2021

hwu36 Apr 18, 2021 Maintainer Author

FindDefinition Apr 20, 2021

hwu36 Apr 20, 2021 Maintainer Author

hwu36 Apr 22, 2021 Maintainer Author

alohali Aug 20, 2021

hwu36 Aug 20, 2021 Maintainer Author

ginowu Nov 30, 2021

hwu36
Apr 18, 2021
Maintainer

Replies: 4 comments 4 replies

timespacce
Apr 18, 2021

hwu36
Apr 18, 2021
Maintainer Author

FindDefinition
Apr 20, 2021

hwu36 Apr 20, 2021
Maintainer Author

hwu36
Apr 22, 2021
Maintainer Author

hwu36 Aug 20, 2021
Maintainer Author