
[QST] make_tiled_copy_B generates incompatible layouts #1953

Open
phantaurus opened this issue Nov 20, 2024 · 9 comments

phantaurus commented Nov 20, 2024

What is your question?
Hello!

I am writing an int8 GEMM layer using cute.

I use MMA_Atom<SM80_16x8x32_S32S8S8S32_TN> as my MMA atom and define my tiled MMA as:

using TiledMma = TiledMMA<MMA_Atom_Arch,
                          Layout<Shape<_4, _1, _1>>,
                          Layout<Shape<_1, _4, _1>>>;
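
(For reference, MMA_Atom_Arch here is an alias for the int8 atom named above; a minimal sketch of that alias, assuming nothing beyond what the text states:)

// Alias for the SM80 int8 tensor-core MMA atom mentioned in the text.
using MMA_Atom_Arch = MMA_Atom<SM80_16x8x32_S32S8S8S32_TN>;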

For element B, my original layout is transposed, so I use

using SmemCopyAtomTransposed = Copy_Atom<SM75_U16x8_LDSM_T, int8_t>;

Then I define a tiled copy and use it to partition my tensor in shared memory.

 auto smem_tiled_copy_B = make_tiled_copy_B(SmemCopyAtomTransposed{}, tiled_mma);  
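
(For context, a minimal sketch of that partitioning step, with hypothetical names: sB is the shared-memory tensor of B, tidx is the thread index, and thr_mma is the per-thread slice of the tiled MMA.)

auto thr_mma         = tiled_mma.get_thread_slice(tidx);
auto smem_thr_copy_B = smem_tiled_copy_B.get_thread_slice(tidx);
auto tCsB      = smem_thr_copy_B.partition_S(sB);    // LDSM source view of smem
auto tCrB      = thr_mma.partition_fragment_B(sB);   // register fragment for the MMA
auto tCrB_view = smem_thr_copy_B.retile_D(tCrB);     // destination view for the copy
// per k-block: cute::copy(smem_tiled_copy_B, tCsB(_,_,k), tCrB_view(_,_,k));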

Here I plot the MMA and smem_tiled_copy_B using print_latex.
mma_int8.pdf
tiled_copy_B.pdf
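
(The plots were produced with cute's print_latex; a minimal sketch of those calls:)

print_latex(tiled_mma);         // emits a LaTeX diagram of the tiled MMA
print_latex(smem_tiled_copy_B); // emits a LaTeX diagram of the tiled copy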

The good news is that the destination of smem_tiled_copy_B matches the MMA layout of B.
The bad news is that the source of smem_tiled_copy_B is arranged as ((2, 8), 2):((64, 1), 16) instead of something like (16, 2):(1, 16).

I am not sure why this configuration generates the (2, 8) partition. SmemCopyAtomTransposed is constructed from SM75_U16x8_LDSM_T and int8_t, which internally should use the ldmatrix instruction that takes in one 128-bit input at a time. So it seems more reasonable for make_tiled_copy_B to put 16 contiguous int8_t values in the inner dimension.

This generates errors when calling cute::copy(), as SM75_U16x8_LDSM_T for int8 is incompatible with the src layout:

In CopyAtom, src layout doesn't vectorize into registers. This src layout is incompatible with this tiled copy."

instantiation of "void cute::copy_unpack(const cute::Copy_Traits<Operation, Args...>&, const cute::Tensor<TS, SLayout> &, cute::Tensor<TD, DLayout> &) [with Operation=cute::SM75_U16x8_LDSM_T, Args=<>, TS=cute::ViewEngine<cute::smem_ptr<int8_t>>, SLayout=cute::Layout<cute::tuple<cute::tuple<cute::C<2>, cute::C<8>>>, cute::tuple<cute::tuple<int, cute::_1>>>, TD=cute::ViewEngine<int8_t *>, DLayout=cute::Layout<cute::tuple<cute::_16>, cute::tuple<cute::C<1>>>]" 

external/cutlass/include/cute/atom/copy_atom.hpp(104): here

instantiation of "void cute::Copy_Atom<cute::Copy_Traits<Args...>, T>::call(const cute::Tensor<TS, SLayout> &, cute::Tensor<TD, DLayout> &) const [with Args=<cute::SM75_U16x8_LDSM_T>, T=int8_t, TS=cute::ViewEngine<cute::smem_ptr<int8_t>>, SLayout=cute::Layout<cute::tuple<cute::tuple<cute::C<2>, cute::C<8>>>, cute::tuple<cute::tuple<int, cute::_1>>>, TD=cute::ViewEngine<int8_t *>, DLayout=cute::Layout<cute::tuple<cute::_16>, cute::tuple<cute::C<1>>>]" 

Could you help me take a look at this issue? Thank you so much!


ccecka commented Nov 20, 2024

I am not sure why this configuration generates the (2, 8) partition. SmemCopyAtomTransposed is constructed from SM75_U16x8_LDSM_T and int8_t, which internally should use the ldmatrix instruction that takes in one 128-bit input at a time. So it seems more reasonable for make_tiled_copy_B to put 16 contiguous int8_t values in the inner dimension.

It does seem more reasonable, but that's not how LDSM_T works for int8_t input. It's hinted at in the name: the U16x8 means LDSM_T was designed for 16-bit types. We can use it with int8_t types as well, but then we obviously need two int8_t values for every U16.

The solution is therefore some shared memory layout engineering. With the TiledCopy you show, we want each thread to access 128 consecutive bits, as you mention. So if I just follow tiled_copy_B.pdf, this should work:

auto smem_layout_B_atom = Layout<Shape <Shape <_8,  _4>,Shape <_2,_16>>,
                                 Stride<Stride<_2,_256>,Stride<_1,_16>>>{};   // 32N x 32K
auto smem_layout_B = tile_to_shape(smem_layout_B_atom, Shape<_128,_32>{});    // 128N x 32K

but I'm sure you can do better by also considering the stores from global memory and bank-accesses.


phantaurus commented Nov 21, 2024

Thank you so much for your reply! I didn't realize that I could actually manipulate the SmemAtomLayout. Before, I was naively using Shape<_32, _32>, Stride<_1, _32>.

In my case, since B is actually transposed, i.e., row major, I use the following SmemAtomLayout:

auto smem_layout_B_atom = Layout<Shape<Shape<_8, _4>, Shape<_2, _16>>,
                                 Stride<Stride<_2, _16>, Stride<_1, _64>>>{};

For the swizzle, I wouldn't want to swizzle the 16 consecutive indices, so MBase = 4.
My atom for B has 32 rows (for K) and 32 columns (for N). As we work on 8 rows at a time, there are 2 bits left for swizzling. We wouldn't want to swizzle the columns, so we right-shift the lowest 2 column bits by 2 bits, which guarantees that threads 0 to 3 access different banks, making BBits = 2 and SShift = 2.
If my atom size could be 64x64, I would be able to make BBits = 3 and SShift = 3, which guarantees that all 8 threads in an ldmatrix access different banks.
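
(A minimal sketch of that swizzle applied to the atom, assuming cute's Swizzle<BBits, MBase, SShift> parameter order and the 128x32 tile from the earlier example:)

auto swizzled_atom_B = composition(Swizzle<2, 4, 2>{}, smem_layout_B_atom); // BBits=2, MBase=4, SShift=2
auto smem_layout_B   = tile_to_shape(swizzled_atom_B, Shape<_128, _32>{});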

I am not sure what would need to be modified or considered for the stores from global memory. I use a fairly regular, contiguous, non-swizzled GmemCopy, which uses SM80_CP_ASYNC_CACHEGLOBAL. It doesn't seem that the GmemCopy can affect the SmemCopy. Is there anything I am missing here? Thanks!

@phantaurus

Hello, ccecka!

The solution is therefore some shared memory layout engineering.

I modified my MMA's permutation such that the output of the MMA can be directly used as operand A of a subsequent MMA. Basically, the output register indices for thread i are exactly the input register indices for thread i in the next MMA.

using TiledMma = TiledMMA<MMA_Atom_Arch,
                          Layout<Shape<_4, _1, _1>>,
                          Layout<Shape<_1, _4, _1>>,
                          Tile<Underscore, Layout<Shape<_2, _2, _4, _2>, Stride<_1, _8, _2, _16>>, Underscore>>;

brand_new_mma.pdf

This works like a charm. However, I didn't expect this change to affect smem_tiled_copy_B.
I used the same atom copy:

using SmemCopyAtomTransposed = Copy_Atom<SM75_U16x8_LDSM_T, int8_t>;
auto smem_tiled_copy_B = make_tiled_copy_B(SmemCopyAtomTransposed{}, tiled_mma);  

But now, smem_tiled_copy_B becomes rather weird:
brand_new_tiled_copy_B.pdf

I think the SmemCopy would just need 4 destination registers, and my MMA does provide 4 registers there for each thread. Considering U16 vs Int8, I can understand the source being partitioned into 8x2. However, this time the source is partitioned even further.

Could you help me understand the reason behind this, and what I could do to perform a correct SmemCopy? It seems that I need to push the Smem layout engineering a bit harder?

Thank you so much!


phantaurus commented Nov 25, 2024

auto smem_layout_B_atom = Layout<Shape <Shape <_8, _4>,Shape <_2,_16>>,
                                 Stride<Stride<_2,_256>,Stride<_1,_16>>>{};   // 32N x 32K
auto smem_layout_B = tile_to_shape(smem_layout_B_atom, Shape<_128,_32>{});    // 128N x 32K

I’m still a bit unclear on changing SmemLayout. Here's my current understanding: previously, the data was stored in Smem in row-major order. However, we now want the data to follow a different pattern in Smem, where two values belonging to consecutive columns are stored first, and then we proceed in a row-major fashion.

What I'm not sure about is how the Gmem to Smem copy will handle this new storage pattern. Currently, I'm using SM80_CP_ASYNC_CACHEGLOBAL for the GmemCopy. This operation uses one source address and one destination address per transfer, and each copy moves 128 bits. It doesn't seem that this instruction can adapt to the new SmemLayout.
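
(For reference, a purely illustrative sketch of the kind of 128-bit cp.async tiled copy being described; the thread/value layouts below are assumptions, not the actual configuration: 128 threads, each moving 16 contiguous int8 values along K per transfer.)

using GmemCopyAtom = Copy_Atom<SM80_CP_ASYNC_CACHEGLOBAL<cute::uint128_t>, int8_t>;
auto gmem_tiled_copy_B = make_tiled_copy(GmemCopyAtom{},
                                         Layout<Shape<_64, _2>, Stride<_2, _1>>{}, // 128 threads
                                         Layout<Shape< _1, _16>>{});               // 16 x int8 = 128 bits per thread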

Thank you so much!


ccecka commented Nov 25, 2024

Looks to me like you won't be able to use 128b GMEM->SMEM copy as you only have 16 contiguous bits in each. This is what I meant by engineering the SMEM Layout to consider the loads from GMEM and the stores into RMEM.

You can make the SmemLayout anything you like and it shouldn't affect correctness -- it will only affect where each logical value is stored and, therefore, the access patterns including the number of bits possible in a vectorization and the banks accessed by each thread. Sacrifices can be made in the GMEM->SMEM copy, but LDSM is in general more strict about its granularity and requirements.

EDIT: For reference, your new tiled_copy_B with LDSM wants a smem atom like this:

auto smem_layout_B_atom = Layout<Shape <Shape <_2,  _2, _4>, _2>,
                                 Stride<Stride<_2, _16, _4>, _1>>{};   // 16x2 atom to account for contiguous T0 and T16
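
(As with the earlier atom, this can be tiled out to the full B tile; a sketch using the same 128x32 example shape as before:)

auto smem_layout_B = tile_to_shape(smem_layout_B_atom, Shape<_128, _32>{});   // 128N x 32K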


phantaurus commented Nov 25, 2024

Thank you so much for your reply! I guess I can also choose to use DefaultCopy from Smem to Register.

What I'm currently struggling with is that ldmatrix does not seem to be compatible with changing the permutation of TiledMMA:

ldmatrix demands that the first row must go to 4 registers belonging to threads 0-3, the second row to 4 registers belonging to threads 4-7, the third row to threads 8-11, the fourth row to threads 12-15, etc.

In my example, changing the tiledMMA permutation on the N dimension consequently changes the TiledCopy Src layout for B.
To simplify the scenario, let's focus solely on pure FP16, using a similar configuration:

using MMA_Atom_Arch = MMA_Atom<SM80_16x8x16_F32F16F16F32_TN>;
using ValLayoutMNK = Layout<Shape<_1, _4, _1>>;
using PermutationsMNK = Tile<Underscore, Layout<Shape<_2, _2, _4, _2>, Stride<_1, _8, _2, _16>>, Underscore>;

using TiledMma = TiledMMA<MMA_Atom_Arch,
                          Layout<Shape<_4, _1, _1>>,
                          ValLayoutMNK,
                          PermutationsMNK>;

Here is the generated tiledMMA with and without PermutationsMNK:
fp16_mma_with_permutation.pdf
fp16_mma_no_perm.pdf

We use the TiledMMA and SmemCopyAtomTransposed to generate SmemTiledCopyB:

using SmemCopyAtomTransposed = Copy_Atom<SM75_U16x8_LDSM_T, cutlass::half_t>;
auto smem_tiled_copy_B = make_tiled_copy_B(SmemCopyAtomTransposed{}, TiledMma{});

Here is the TiledCopyB with and without permutation:
fp16_copyB_with_permutation.pdf
fp16_copyB_no_perm.pdf

Without modifying the permutation, everything works perfectly. On the source side, ldmatrix processes 8 rows, each assigned to a single thread. On the destination side, all 32 threads receive their corresponding portion of values in the 8x8 matrix.
Image

With the permutation on the N dimension, however, the destination cannot form an 8x8 mapping that maps to all 32 threads:
Image

In this case, make_tiled_copy_B does not produce an error. Instead, it maps a single ldmatrix instruction to both a strided input and a strided output. However, it seems that ldmatrix cannot inherently handle a strided input address for a single input row.

Image

I'm trying to understand the meaning of the make_tiled_copy_B output and what options I have. Does this imply that I need to choose between the following:
(1) Using DefaultCopy for Smem to register transfers,
(2) Modifying SmemLayout, but potentially compromising the efficiency of Gmem to Smem copy,
(3) Not doing the permutation, which could mean less efficiency on the output side.

It seems that I need to make trade-offs to optimize certain aspects at the expense of others and select the most performance-efficient solution. For example, if Gmem to Smem copy happens only once, while the data in Smem is reused multiple times when copying to registers, then (2) would be more advantageous than (1).

Thank you so much!


ccecka commented Nov 25, 2024

ldmatrix demands that the first row must go to 4 registers belonging to threads 0-3, the second row to 4 registers belonging to threads 4-7, the third row to threads 8-11, the fourth row to threads 12-15, etc.

This is absolutely not true. The TiledCopy shows you how the threads will be accessing the source tensor; it's just that LDSM in particular puts extra constraints on the SMEM Layout that each thread will be accessing. If you permute the TiledMMA, the TiledCopy will be permuted, the access patterns will be permuted, and the SMEM Layout needs to also be permuted to satisfy the LDSM constraints. I've transcribed a new "SMEM Layout Atom" that ensures the necessary values are contiguous -- that atom can be tiled into any larger layout that you like.

(1) Using DefaultCopy for shared memory (Smem) to register transfers,
(2) Modifying SmemLayout, but potentially compromising the efficiency of Gmem to Smem copy,
(3) Not doing the permutation, which could mean less efficiency on the output side.

It seems that I need to make trade-offs to optimize certain aspects at the expense of others and select the most performance-efficient solution. For example, if Gmem to Smem copy happens only once, while the data in Smem is reused multiple times when copying to registers, then (2) would be more advantageous than (1).

This is correct :-D But whether you permute or not in (3) does not affect your ability to use LDSM.

@phantaurus

Awesome! I didn’t phrase the ldmatrix part correctly. I completely agree with your point and really appreciate your help!

Back to the headache that ldmatrix does not support Int8: would it work if I reinterpret-cast the Int8 B tensor to Fp16 during the memcpy, then reinterpret it back to Int8 to do the MMAs?

Say we create a dedicated "Fp16 MMA", which serves only to generate and perform TiledCopyB. This Fp16 MMA would mimic the structure and layout of B in the actual Int8 MMA, but each Fp16 element represents two contiguous Int8 values in the original Int8 MMA.

The layout for B in the actual Int8 MMA:
Image

The layout for B in the manufactured Fp16 MMA, where each element can be seen as two contiguous Int8 values:
Image

Then we use this Fp16 MMA with Copy_Atom<SM75_U16x8_LDSM_T, cutlass::half_t> to produce the SmemTiledCopyB and perform the Smem to Register copy, assuming all data is treated as Fp16.

The beautiful Fp16 TiledCopy, where each Fp16 value now represents two contiguous Int8 values.
Image

Then we reinterpret the register's rmem_ptr back to Int8 once the copy is done, and proceed with the actual int8 MMA computation.

It seems like a reasonable workaround to the initial problem discussed in this post, that LDSM_T generates a 2x8 layout for int8 values instead of 1x16. I am updating my implementation with this to see if there is anything that I didn't consider.
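
(A minimal sketch of the reinterpretation idea using cute::recast; sB and the thread-copy names are hypothetical.)

auto sB_as_half = recast<cutlass::half_t>(sB);             // view 2 x int8 as 1 x half_t
auto tCsB_half  = smem_thr_copy_B.partition_S(sB_as_half); // source for the fp16 tiled copy
// cute::copy(smem_tiled_copy_B, tCsB_half, tCrB_half_view);
// auto tCrB_int8 = recast<int8_t>(tCrB_half);             // view the registers back as int8 for the MMA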


phantaurus commented Nov 26, 2024

Never mind, it doesn't work.
It's as if I wanted to perform a SmemCopy from 64x16 to 32x32 for the following FP16 SmemTiledCopy.
Essentially, even though there is a .trans in ldmatrix, this operation is not a matrix transpose...

Image
