This is a follow-up question from #1315. While I managed to make a program with SM90_TMA_LOAD_MULTICAST, I still have some questions about it. Below, I describe my small experiment with sample outputs. I put my questions at the end, and also include the full reproducing program. Hope to receive some help here. Thank you in advance.

I am trying a very simple task: copy a matrix of layout (16, 8) from gmem to smem, with each of the two CTAs in the cluster holding an (8, 8) smem tile. I created the cluster shape and the TMA copy like this:

using ClusterShape = Shape<_2, _1, _1>;
auto tma = make_tma_copy(
    SM90_TMA_LOAD_MULTICAST{}, gX, smem_layout, size(ClusterShape{}));

Then, inside the kernel, I have this block of code, which allows me to vary the mcast_mask:

cute::cluster_arrive_relaxed();
cute::cluster_wait();
if (warp_idx == 0 && lane_predicate) {
constexpr int k_tma_transaction_bytes = size(sX) * sizeof(T);
tma_mbar[0] = 0;
cute::initialize_barrier(tma_mbar[0], 1 /*numThreads*/);
cute::set_barrier_transaction_bytes(tma_mbar[0], k_tma_transaction_bytes);
uint16_t mcast_mask = 0b01; // I could use 0b00, 0b10, or 0b11
cute::copy(tma.with(tma_mbar[0], mcast_mask), tXgX, tXsX);
}
cute::cluster_arrive();
cute::cluster_wait();

Here's what I have seen:
So, two questions that stand out to me are:
Finally, here's the full (minimal, self-contained) program.

/**********
Self-contained example to study SM90_TMA_MULTICAST
Usage:
$ nvcc main.cu \
--expt-relaxed-constexpr \
--generate-code=arch=compute_90a,code=sm_90a \
-lcuda \
-w \
-Xcompiler=-Wconversion \
-Xcompiler=-fno-strict-aliasing \
-Xcompiler=-Wfatal-errors \
-Xcompiler=-Wno-abi \
-Xcompiler=-Wfatal-errors \
-std=c++17 \
-arch=sm_90 \
-I/usr/local/cuda/include \
-I"${CUTLASS_PATH}/include" \
-I"${CUTLASS_PATH}/tools/util/include"
$ ./a.out
**********/
#include <cstdio>
#include "thrust/device_vector.h"
#include "thrust/host_vector.h"
#include "cutlass/cutlass.h"
#include "cutlass/cluster_launch.hpp"
#include "cute/tensor.hpp"
#include "cute/arch/cluster_sm90.hpp"
template <
class T,
class TensorX,
class GmemLayout,
class SmemLayout,
class Tma
>
__global__ static void
tma_kernel(
TensorX tX,
GmemLayout gmem_layout,
SmemLayout smem_layout,
CUTE_GRID_CONSTANT Tma const tma
) {
using namespace cute;
__shared__ T smem[cosize_v<SmemLayout>];
__shared__ uint64_t tma_mbar[1];
auto sX = make_tensor(make_smem_ptr(smem), smem_layout);
auto mX = tma.get_tma_tensor(shape(gmem_layout));
auto gX = local_tile(mX, shape(sX), make_coord(blockIdx.x, blockIdx.y)); // (CTA_TILE_M,CTA_TILE_N)
auto block_rank_in_cluster = cute::block_rank_in_cluster();
auto cta_tma = tma.get_slice(block_rank_in_cluster);
auto tXgX = cta_tma.partition_S(gX); // (TMA,TMA_M,TMA_N,REST_M,REST_N)
auto tXsX = cta_tma.partition_D(sX); // (TMA,TMA_M,TMA_N)
auto warp_idx = cutlass::canonical_warp_idx_sync();
auto lane_predicate = cute::elect_one_sync();
cute::cluster_arrive_relaxed();
cute::cluster_wait();
if (warp_idx == 0 && lane_predicate) {
constexpr int k_tma_transaction_bytes = size(sX) * sizeof(T);
tma_mbar[0] = 0;
cute::initialize_barrier(tma_mbar[0], 1 /*numThreads*/);
cute::set_barrier_transaction_bytes(tma_mbar[0], k_tma_transaction_bytes);
uint16_t mcast_mask = 0b01;
cute::copy(tma.with(tma_mbar[0], mcast_mask), tXgX, tXsX);
}
cute::cluster_arrive();
cute::cluster_wait();
if (thread(0, 0)) {
printf("----------\n");
print("tma: "); print(tma); print("\n");
printf("----------\n");
print("smem_layout: "); print(smem_layout); print("\n");
printf("----------\n");
print("tX: "); print_tensor(tX); print("\n");
print("mX: "); print_tensor(mX); print("\n");
print("gX: "); print_tensor(gX); print("\n");
print("sX: "); print_tensor(sX); print("\n");
}
}
int main() {
using namespace cute;
using T = float;
constexpr int m = 16;
constexpr int n = 8;
// create data
thrust::host_vector<T> cpu_data(m * n);
for (int i = 0; i < m*n; ++i) {
cpu_data[i] = static_cast<T>(i / 10.f);
}
thrust::device_vector<T> gpu_data = cpu_data;
cudaDeviceSynchronize();
// cluster shape
using ClusterShape = Shape<_2, _1, _1>;
// create tensors
auto gmem_layout = Layout<Shape< Int<m>, Int<n>>>{};
auto smem_layout = Layout<Shape<Int<m/2>, Int<n>>>{};
auto pX = reinterpret_cast<const T*>(gpu_data.data().get());
auto gX = make_tensor(make_gmem_ptr(pX), gmem_layout);
// create the TMA object
auto tma = make_tma_copy(
SM90_TMA_LOAD_MULTICAST{}, gX, smem_layout, size(ClusterShape{}));
// launch the kernel
dim3 grid_dims{2, 1, 1};
dim3 block_dims{1, 1, 1};
dim3 cluster_dims{size<0>(ClusterShape{}),
size<1>(ClusterShape{}),
size<2>(ClusterShape{})};
cutlass::ClusterLaunchParams launch_params{grid_dims, block_dims, cluster_dims};
void const* kernel_ptr = (void const*)tma_kernel<
T,
decltype(gX),
decltype(gmem_layout),
decltype(smem_layout),
decltype(tma)
>;
cutlass::launch_kernel_on_cluster(
launch_params, kernel_ptr, gX, gmem_layout, smem_layout, tma);
auto result = cudaDeviceSynchronize();
if (result != cudaSuccess) {
CUTLASS_TRACE_HOST("Kernel launch FAILED.\n");
cudaError_t error = cudaGetLastError();
std::cout << error << std::endl;
}
return 0;
}
Your concept of the parameters appears correct, but the Multicast TMAs would not be used to copy a 16x8 gmem tensor to two 8x8 smem tensors. In your example case, each copy appears to be completely independent.

Instead, the Multicast TMAs are used to copy a single 8x8 gmem tensor to two 8x8 smem tensors in a broadcasted fashion, where the broadcast is performed across all participating CTAs in the mcast_mask. This is useful in GEMMs, for example, because the A tiles can be broadcast across each "row" of CTAs and the B tiles can be broadcast across each "column" of CTAs.
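
To make that broadcast pattern concrete, here is a minimal sketch modeled on the SM90 GEMM collective mainloops in CUTLASS (e.g. cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp). The variable names (block_layout, cluster_local_block_id) and the exact shape of the loops are for illustration, and I am assuming cute::block_id_in_cluster() to get this CTA's (x, y, z) coordinate within the cluster; treat it as a sketch to compare against the CUTLASS source rather than verified library code.

// Hedged sketch: each CTA builds a mask whose set bits are the cluster ranks
// of every CTA that should receive the tile this CTA is about to load.
auto block_layout = Layout<ClusterShape>{};              // (m, n, l) -> CTA rank in cluster
dim3 cluster_local_block_id = cute::block_id_in_cluster();
uint16_t mcast_mask_a = 0;
uint16_t mcast_mask_b = 0;
// A tile: broadcast along the N-mode, i.e. to every CTA in this CTA's cluster "row".
for (int n = 0; n < size<1>(block_layout); ++n) {
  mcast_mask_a |= (uint16_t(1) << block_layout(cluster_local_block_id.x, n, Int<0>{}));
}
// B tile: broadcast along the M-mode, i.e. to every CTA in this CTA's cluster "column".
for (int m = 0; m < size<0>(block_layout); ++m) {
  mcast_mask_b |= (uint16_t(1) << block_layout(m, cluster_local_block_id.y, Int<0>{}));
}
// The masks are then passed to the copies, e.g.
// copy(tma_a.with(barrier, mcast_mask_a), tAgA, tAsA);

With the (2, 1, 1) cluster from your example, the B-style mask evaluates to 0b11 in both CTAs (the single tile is delivered to both, with each CTA issuing the load for its own slice of it), while the A-style mask degenerates to each CTA's own bit (0b01 or 0b10), i.e. no actual multicast in that direction.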