The A and B layouts contain projections across threads, which are difficult to depict in these diagrams. T64 is "missing" from the A layout because it reads the same values that T0 reads in A. Likewise, T32 is "missing" from the B layout because it reads the same values that T0 reads in B. Your understanding is correct: all threads hold parts of the data of matrices A, B, and C, but that data may be replicated across multiple threads.
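To make the replication concrete, here is a minimal sketch in plain C++ that mirrors the arithmetic of ThrLayoutVMNK (_32,_2,_2,_1):(_1,_32,_64,_0). It does not use the CuTe API. The helper names `ThrCoord`, `thr_coord`, `a_representative`, and `b_representative` are hypothetical, and the sketch assumes that the A partition depends only on the (V, M, K) coordinates and the B partition only on (V, N, K), which is consistent with the T64 and T32 cases described above.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; none of these are CuTe names.
// The layout (_32,_2,_2,_1):(_1,_32,_64,_0) maps (v, m, n, k) to
//   thread index = v*1 + m*32 + n*64 + k*0,  for thread indices 0..127.
struct ThrCoord { int v, m, n, k; };

// Invert the layout: recover (v, m, n, k) from a thread index.
constexpr ThrCoord thr_coord(int tid) {
    return ThrCoord{tid % 32, (tid / 32) % 2, (tid / 64) % 2, 0};
}

// Assumption: A is indexed by (v, m, k), so the n coordinate is projected out.
// Dropping the n*64 term gives the lowest-numbered thread with the same A data.
constexpr int a_representative(int tid) {
    return tid % 64;
}

// Assumption: B is indexed by (v, n, k), so the m coordinate is projected out.
// Dropping the m*32 term gives the lowest-numbered thread with the same B data.
constexpr int b_representative(int tid) {
    return (tid % 32) + 64 * (tid / 64);
}

// T64 shares T0's A values; T32 shares T0's B values.
static_assert(a_representative(64) == 0, "T64 mirrors T0 in A");
static_assert(b_representative(32) == 0, "T32 mirrors T0 in B");
// Threads 64-127 map onto 0-63 for A; threads 32-63 and 96-127 map onto
// 0-31 and 64-95 for B.
static_assert(a_representative(127) == 63, "T127 mirrors T63 in A");
static_assert(b_representative(127) == 95, "T127 mirrors T95 in B");
```

Under these assumptions, the diagrams show only the representative threads (0-63 for A, and 0-31 plus 64-95 for B), while every one of the 128 threads still holds a fragment of each operand.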
I have read all of the CuTe documentation, but I am still puzzled by the TiledMMA thread layout ThrLayoutVMNK (_32,_2,_2,_1):(_1,_32,_64,_0). When I print it with print_latex, the data of matrix A appears to be distributed only among threads 0-31 and 32-63. Does this mean that the two warps with thread indices 64-127 do not hold any data of matrix A? Similarly, matrix B appears to be distributed among the threads of only two warps (0-31 and 64-95), while the data of matrix C is distributed across all four warps (0-127). My current understanding is that all threads hold parts of the data of matrices A, B, and C, and that print_latex simply cannot show all of them. I would be very grateful if someone could answer this!
The LaTeX output is as follows: