Another big release. Details are in the Changelog. First, we added pyCutlass, a Python interface to almost all CUTLASS features; check its README for the details. Second, we added a bunch of Transformer-related features and enhancements. Third, group conv and [depthwise conv](https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) are supported now, and we will keep improving them. Last but not least, CUTLASS 2.10 works best with CUDA 11.6u2.
CUTLASS 2.10.0
- CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors (a minimal sketch of the underlying C++ device API follows this list).
- Optimizations for CUTLASS's Grouped GEMM kernel, which can now move some of the scheduling to the host side when applicable.
- Optimizations for GEMM+Softmax (reference semantics of the fused softmax are sketched after this list).
- Grouped GEMM for Multihead Attention: a general MHA implementation that does not require equal sequence lengths in every GEMM (see the grouped GEMM sketch after this list).
- GEMM + Layernorm fusion for Ampere, which can fuse the layernorm into the GEMMs before and after it (a reference layernorm is sketched below).
- GEMM Epilogue Permutation Fusion, which can permute the GEMM output before storing it.
- Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized one.
- Depthwise separable convolution introduces the first depthwise convolution, which is also Analytical for now (reference semantics are sketched after this list).
- Standalone Layernorm and Pooling kernels.
- Back-to-back GEMM enhancements.
- Updates and bugfixes from the community (thanks!)
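For reference, here is a minimal sketch of the C++ device-level GEMM API that CUTLASS Python wraps, following the pattern of the basic GEMM example in the repository. The `run_gemm` wrapper, its parameter names, and the all-column-major fp32 configuration are illustrative assumptions, not code from this release.

```cpp
// A minimal sketch of the C++ device-level GEMM API that CUTLASS Python
// wraps. Computes D = alpha * A * B + beta * C for column-major fp32
// matrices; the wrapper function and its parameter names are illustrative.
#include "cutlass/gemm/device/gemm.h"

using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

cutlass::Status run_gemm(int M, int N, int K,
                         float const *A, int lda,
                         float const *B, int ldb,
                         float const *C, int ldc,
                         float *D, int ldd,
                         float alpha, float beta) {
  Gemm gemm_op;
  // Arguments: problem size, TensorRefs for A/B/C/D, epilogue scalars.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb}, {C, ldc}, {D, ldd},
                       {alpha, beta});
  return gemm_op(args);  // launches the kernel on the default stream
}
```

The Python interface exposes this same operation (plus Convolution and Grouped GEMM) without the template instantiation.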
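The Grouped GEMM and MHA items above boil down to running many independent GEMMs whose shapes may differ per problem. Below is a plain CPU reference of those semantics, not the CUTLASS kernel or its API; the `GroupedProblem` struct is hypothetical, and for MHA the per-problem M and N would come from each sequence's length.

```cpp
#include <vector>

// Reference semantics of a grouped GEMM: one independent GEMM per problem,
// each with its own (M, N, K). Shapes need not match across problems, which
// is what removes the equal-sequence-length requirement for MHA. This is a
// plain CPU reference, not the fused CUTLASS kernel.
struct GroupedProblem {
  int M, N, K;
  const float *A;  // M x K, row-major
  const float *B;  // K x N, row-major
  float *D;        // M x N, row-major
};

void reference_grouped_gemm(const std::vector<GroupedProblem> &problems) {
  for (const GroupedProblem &p : problems) {
    for (int m = 0; m < p.M; ++m) {
      for (int n = 0; n < p.N; ++n) {
        float acc = 0.f;
        for (int k = 0; k < p.K; ++k) {
          acc += p.A[m * p.K + k] * p.B[k * p.N + n];
        }
        p.D[m * p.N + n] = acc;
      }
    }
  }
}
```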
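What the GEMM+Softmax fusion computes is a row-wise softmax over the M x N GEMM output. A numerically stable CPU reference of that epilogue (the fused kernel avoids the extra global-memory round trip this two-pass version implies):

```cpp
#include <algorithm>
#include <cmath>

// Reference for the epilogue that GEMM+Softmax fuses: a numerically stable
// row-wise softmax applied in place to the M x N GEMM output (row-major).
void reference_row_softmax(float *data, int M, int N) {
  for (int m = 0; m < M; ++m) {
    float *row = data + m * N;
    // Subtract the row maximum before exponentiating for stability.
    float max_val = *std::max_element(row, row + N);
    float sum = 0.f;
    for (int n = 0; n < N; ++n) {
      row[n] = std::exp(row[n] - max_val);
      sum += row[n];
    }
    for (int n = 0; n < N; ++n) {
      row[n] /= sum;
    }
  }
}
```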
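Similarly, the layernorm that the fused and standalone kernels compute normalizes each row to zero mean and unit variance, then scales and shifts by learned parameters. A CPU reference of that math (the `eps` default is an illustrative choice, not taken from the release):

```cpp
#include <cmath>

// Reference layernorm over each row of an M x N row-major matrix:
// y = (x - mean) / sqrt(var + eps) * gamma + beta, with gamma/beta the
// learned per-column parameters. eps is illustrative.
void reference_layernorm(float *data, int M, int N,
                         const float *gamma, const float *beta,
                         float eps = 1e-5f) {
  for (int m = 0; m < M; ++m) {
    float *row = data + m * N;
    float mean = 0.f, var = 0.f;
    for (int n = 0; n < N; ++n) mean += row[n];
    mean /= N;
    for (int n = 0; n < N; ++n) {
      float d = row[n] - mean;
      var += d * d;
    }
    var /= N;
    float inv_std = 1.f / std::sqrt(var + eps);
    for (int n = 0; n < N; ++n) {
      row[n] = (row[n] - mean) * inv_std * gamma[n] + beta[n];
    }
  }
}
```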
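Finally, the depthwise convolution item: each input channel is convolved with its own single-channel filter (groups == channels), rather than every filter reducing over all channels. A CPU reference of that semantics for NHWC forward propagation; stride 1, no padding, and the per-channel R x S filter layout are simplifying assumptions of this sketch, not CUTLASS's layouts.

```cpp
// Reference semantics of depthwise 2D convolution forward propagation.
// Input is NHWC; each channel c is convolved only with its own R x S
// filter. Stride 1, no padding, no dilation, for brevity.
void reference_depthwise_fprop(
    const float *input,   // N x H x W x C
    const float *filter,  // C x R x S (one filter per channel; assumed layout)
    float *output,        // N x P x Q x C, P = H - R + 1, Q = W - S + 1
    int N, int H, int W, int C, int R, int S) {
  int P = H - R + 1, Q = W - S + 1;
  for (int n = 0; n < N; ++n)
    for (int p = 0; p < P; ++p)
      for (int q = 0; q < Q; ++q)
        for (int c = 0; c < C; ++c) {
          float acc = 0.f;
          for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
              acc += input[((n * H + (p + r)) * W + (q + s)) * C + c] *
                     filter[(c * R + r) * S + s];
          output[((n * P + p) * Q + q) * C + c] = acc;
        }
}
```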
This discussion was created from the release CUTLASS 2.10.0.