Another big release. Details are in the Changelog. First, we added pyCutlass, a Python interface to almost all CUTLASS features; check its README for the details. Second, we added a bunch of Transformer-related features and enhancements. Third, group conv and [depthwise conv](https://github.com/NVIDIA/cutlass/blob/master/test/unit/conv/device/depthwise_fprop_implicit_gemm_f16nhwc_f16nhwc_f16nhwc_simt_f16_sm60.cu) are supported now, and we will keep improving them. Last but not least, CUTLASS 2.10 works best with CUDA 11.6u2.
CUTLASS 2.10.0
- CUTLASS Python now supports GEMM, Convolution, and Grouped GEMM for different data types as well as different epilogue flavors (a minimal sketch of the underlying C++ device API follows this list).
- Optimizations for CUTLASS's Grouped GEMM kernel, which can now move some of the scheduling to the host side when applicable.
- Optimizations for GEMM+Softmax (reference semantics of the fused softmax are sketched after this list).
- Grouped GEMM for Multihead Attention: a general MHA implementation that does not require equal sequence lengths in every GEMM (see the grouped GEMM sketch after this list).
- GEMM + Layernorm fusion for Ampere, which can fuse the layernorm into the GEMMs before and after it (a reference layernorm is sketched below).
- GEMM Epilogue Permutation Fusion, which can permute the GEMM output before storing it.
- Grouped convolution targeting implicit GEMM introduces the first group convolution implementation to CUTLASS. It is an Analytical implementation, not an Optimized one.
- Depthwise separable convolution introduces the first depthwise convolution, which is also Analytical for now (reference semantics are sketched after this list).
- Standalone Layernorm and Pooling kernels.
- Back-to-back GEMM enhancements.
- Updates and bugfixes from the community (thanks!)
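For reference, here is a minimal sketch of the C++ device-level GEMM API that CUTLASS Python wraps, following the pattern of the basic GEMM example in the repository. The `run_gemm` wrapper, its parameter names, and the all-column-major fp32 configuration are illustrative assumptions, not code from this release.

```cpp
// A minimal sketch of the C++ device-level GEMM API that CUTLASS Python
// wraps. Computes D = alpha * A * B + beta * C for column-major fp32
// matrices; the wrapper function and its parameter names are illustrative.
#include "cutlass/gemm/device/gemm.h"

using Gemm = cutlass::gemm::device::Gemm<
    float, cutlass::layout::ColumnMajor,   // A
    float, cutlass::layout::ColumnMajor,   // B
    float, cutlass::layout::ColumnMajor>;  // C and D

cutlass::Status run_gemm(int M, int N, int K,
                         float const *A, int lda,
                         float const *B, int ldb,
                         float const *C, int ldc,
                         float *D, int ldd,
                         float alpha, float beta) {
  Gemm gemm_op;
  // Arguments: problem size, TensorRefs for A/B/C/D, epilogue scalars.
  Gemm::Arguments args({M, N, K},
                       {A, lda}, {B, ldb}, {C, ldc}, {D, ldd},
                       {alpha, beta});
  return gemm_op(args);  // launches the kernel on the default stream
}
```

The Python interface exposes this same operation (plus Convolution and Grouped GEMM) without the template instantiation.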
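The Grouped GEMM and MHA items above boil down to running many independent GEMMs whose shapes may differ per problem. Below is a plain CPU reference of those semantics, not the CUTLASS kernel or its API; the `GroupedProblem` struct is hypothetical, and for MHA the per-problem M and N would come from each sequence's length.

```cpp
#include <vector>

// Reference semantics of a grouped GEMM: one independent GEMM per problem,
// each with its own (M, N, K). Shapes need not match across problems, which
// is what removes the equal-sequence-length requirement for MHA. This is a
// plain CPU reference, not the fused CUTLASS kernel.
struct GroupedProblem {
  int M, N, K;
  const float *A;  // M x K, row-major
  const float *B;  // K x N, row-major
  float *D;        // M x N, row-major
};

void reference_grouped_gemm(const std::vector<GroupedProblem> &problems) {
  for (const GroupedProblem &p : problems) {
    for (int m = 0; m < p.M; ++m) {
      for (int n = 0; n < p.N; ++n) {
        float acc = 0.f;
        for (int k = 0; k < p.K; ++k) {
          acc += p.A[m * p.K + k] * p.B[k * p.N + n];
        }
        p.D[m * p.N + n] = acc;
      }
    }
  }
}
```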
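What the GEMM+Softmax fusion computes is a row-wise softmax over the M x N GEMM output. A numerically stable CPU reference of that epilogue (the fused kernel avoids the extra global-memory round trip this two-pass version implies):

```cpp
#include <algorithm>
#include <cmath>

// Reference for the epilogue that GEMM+Softmax fuses: a numerically stable
// row-wise softmax applied in place to the M x N GEMM output (row-major).
void reference_row_softmax(float *data, int M, int N) {
  for (int m = 0; m < M; ++m) {
    float *row = data + m * N;
    // Subtract the row maximum before exponentiating for stability.
    float max_val = *std::max_element(row, row + N);
    float sum = 0.f;
    for (int n = 0; n < N; ++n) {
      row[n] = std::exp(row[n] - max_val);
      sum += row[n];
    }
    for (int n = 0; n < N; ++n) {
      row[n] /= sum;
    }
  }
}
```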
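Similarly, the layernorm that the fused and standalone kernels compute normalizes each row to zero mean and unit variance, then scales and shifts by learned parameters. A CPU reference of that math (the `eps` default is an illustrative choice, not taken from the release):

```cpp
#include <cmath>

// Reference layernorm over each row of an M x N row-major matrix:
// y = (x - mean) / sqrt(var + eps) * gamma + beta, with gamma/beta the
// learned per-column parameters. eps is illustrative.
void reference_layernorm(float *data, int M, int N,
                         const float *gamma, const float *beta,
                         float eps = 1e-5f) {
  for (int m = 0; m < M; ++m) {
    float *row = data + m * N;
    float mean = 0.f, var = 0.f;
    for (int n = 0; n < N; ++n) mean += row[n];
    mean /= N;
    for (int n = 0; n < N; ++n) {
      float d = row[n] - mean;
      var += d * d;
    }
    var /= N;
    float inv_std = 1.f / std::sqrt(var + eps);
    for (int n = 0; n < N; ++n) {
      row[n] = (row[n] - mean) * inv_std * gamma[n] + beta[n];
    }
  }
}
```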
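Finally, the depthwise convolution item: each input channel is convolved with its own single-channel filter (groups == channels), rather than every filter reducing over all channels. A CPU reference of that semantics for NHWC forward propagation; stride 1, no padding, and the per-channel R x S filter layout are simplifying assumptions of this sketch, not CUTLASS's layouts.

```cpp
// Reference semantics of depthwise 2D convolution forward propagation.
// Input is NHWC; each channel c is convolved only with its own R x S
// filter. Stride 1, no padding, no dilation, for brevity.
void reference_depthwise_fprop(
    const float *input,   // N x H x W x C
    const float *filter,  // C x R x S (one filter per channel; assumed layout)
    float *output,        // N x P x Q x C, P = H - R + 1, Q = W - S + 1
    int N, int H, int W, int C, int R, int S) {
  int P = H - R + 1, Q = W - S + 1;
  for (int n = 0; n < N; ++n)
    for (int p = 0; p < P; ++p)
      for (int q = 0; q < Q; ++q)
        for (int c = 0; c < C; ++c) {
          float acc = 0.f;
          for (int r = 0; r < R; ++r)
            for (int s = 0; s < S; ++s)
              acc += input[((n * H + (p + r)) * W + (q + s)) * C + c] *
                     filter[(c * R + r) * S + s];
          output[((n * P + p) * Q + q) * C + c] = acc;
        }
}
```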
This discussion was created from the release CUTLASS 2.10.0.