-
At last, NVIDIA has started closing the gap between their library products and purely CUDA-written solutions. With CUDA 11.0 on an A100-SXM-40GB, for an 8192x8192 * 8192x8192 FP16 GEMM, our AI/ML research group was getting 4.25 ms (270 TFLOP/s) with CUTLASS and 3.71 ms (297 TFLOP/s) with cuBLAS. With CUDA 11.3 we get 3.72 ms (295 TFLOP/s) with CUTLASS and 3.71 ms (297 TFLOP/s) with cuBLAS. The gap keeps getting smaller! NVIDIA and the CUTLASS development team have done an amazing job!
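For context, these TFLOP/s figures follow from the standard GEMM FLOP count: 2·M·N·K operations divided by runtime, e.g. 2·8192³ / 3.71 ms ≈ 296 TFLOP/s. Below is a minimal cuBLAS timing sketch along these lines; it is not the group's actual harness, and the iteration count and the `CUBLAS_COMPUTE_16F`/`CUBLAS_GEMM_DEFAULT` settings are assumptions. Compile with `nvcc -lcublas`.

```cpp
// Minimal sketch: time an 8192^3 FP16 GEMM with cublasGemmEx on Tensor Cores
// and convert the per-iteration time to TFLOP/s via 2*M*N*K / t.
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main() {
  const int M = 8192, N = 8192, K = 8192;
  __half *A, *B, *C;
  cudaMalloc(&A, sizeof(__half) * (size_t)M * K);
  cudaMalloc(&B, sizeof(__half) * (size_t)K * N);
  cudaMalloc(&C, sizeof(__half) * (size_t)M * N);

  cublasHandle_t handle;
  cublasCreate(&handle);
  __half alpha = __float2half(1.0f), beta = __float2half(0.0f);

  // One warm-up call so one-time setup does not skew the measurement.
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
               &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
               &beta, C, CUDA_R_16F, M,
               CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  const int iters = 100;  // assumed; average over many runs to reduce noise
  cudaEventRecord(start);
  for (int i = 0; i < iters; ++i) {
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                 &alpha, A, CUDA_R_16F, M, B, CUDA_R_16F, K,
                 &beta, C, CUDA_R_16F, M,
                 CUBLAS_COMPUTE_16F, CUBLAS_GEMM_DEFAULT);
  }
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  double sec = (ms / iters) * 1e-3;
  double tflops = 2.0 * M * N * K / sec / 1e12;  // 3.71 ms -> ~296 TFLOP/s
  printf("%.3f ms/iter, %.1f TFLOP/s\n", ms / iters, tflops);

  cublasDestroy(handle);
  return 0;
}
```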
-
So glad to hear! A 0.01 ms (~0.3%) difference is within the noise range. BTW, which CUTLASS kernel did you run, and which kernel did cuBLAS run? (nvprof or Nsight can show the kernel names.)
-
Why do we get better Tensor Core performance with CUDA 11.3? Is it better compiler optimization during nvcc compilation? I checked the release notes and found nothing.
-
Speedup of different types of Tensor Cores
-
CUDA 11.3 significantly improves the performance of Ampere/Turing/Volta Tensor Core kernels.
298 TFLOP/s was recorded when benchmarking CUTLASS FP16 GEMM on A100, which is 14% higher than CUDA 11.2. FP32 (via TF32) GEMM is improved by 39% and can reach 143 TFLOP/s. The same speedups apply to the conv kernels.
To reproduce these two numbers, use the commands below.
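The original commands aren't visible in this excerpt; the CUTLASS profiler is the usual route. As a stand-in, here is a minimal sketch of running the same FP16 problem through CUTLASS's device-level GEMM API, which illustrates the kernel class being measured. The column-major layouts, default tile shapes, and `sm_80` target are assumptions; compile with `nvcc -arch=sm_80 -I<cutlass>/include` (path is a placeholder).

```cpp
// Minimal sketch (not the original reproduction commands): run an 8192^3
// FP16 Tensor Core GEMM through CUTLASS's device-level API on SM80 (A100).
#include <cstdio>
#include <cuda_runtime.h>
#include "cutlass/gemm/device/gemm.h"

int main() {
  // FP16 inputs/outputs/accumulator, Tensor Core op class, A100 arch tag;
  // CUTLASS fills in default tile and instruction shapes for this combination.
  using Gemm = cutlass::gemm::device::Gemm<
      cutlass::half_t, cutlass::layout::ColumnMajor,  // A
      cutlass::half_t, cutlass::layout::ColumnMajor,  // B
      cutlass::half_t, cutlass::layout::ColumnMajor,  // C/D
      cutlass::half_t,                                // accumulator (FP16)
      cutlass::arch::OpClassTensorOp,                 // use Tensor Cores
      cutlass::arch::Sm80>;                           // target A100

  const int M = 8192, N = 8192, K = 8192;
  cutlass::half_t *A, *B, *C;
  cudaMalloc(&A, sizeof(cutlass::half_t) * (size_t)M * K);
  cudaMalloc(&B, sizeof(cutlass::half_t) * (size_t)K * N);
  cudaMalloc(&C, sizeof(cutlass::half_t) * (size_t)M * N);

  // D = alpha * A*B + beta * C, with C doubling as the output D here.
  Gemm gemm_op;
  Gemm::Arguments args({M, N, K},
                       {A, M},   // lda = M (column-major)
                       {B, K},   // ldb = K
                       {C, M},   // C
                       {C, M},   // D
                       {cutlass::half_t(1.0f), cutlass::half_t(0.0f)});

  cutlass::Status status = gemm_op(args);
  if (status != cutlass::Status::kSuccess) {
    printf("CUTLASS GEMM failed\n");
    return 1;
  }
  cudaDeviceSynchronize();
  printf("done\n");
  return 0;
}
```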