
[QUESTION] compilation time increases significantly after vec/mat/quat assign autograd fix commit #332

Open
g1n0st opened this issue Oct 25, 2024 · 7 comments

g1n0st commented Oct 25, 2024

After commit 5a88c4b, I have observed a significant increase in compilation time for some Warp kernels that previously compiled efficiently in my application. Based on a bisection of the commit history, the increase seems to be related to the changes to vec/mat/quat autograd assignment.

If needed, I would be happy to try to provide a minimal reproducible case by extracting a kernel from my application.

Thank you for any insights or suggestions!

g1n0st added the "question: The issue author requires information" label on Oct 25, 2024
shi-eric added the "bug: Something isn't working" label on Oct 25, 2024
nvlukasz (Contributor) commented

Thanks @g1n0st, how significant is the slowdown you are seeing? A minimal repro would be great if you have the cycles.

Do you also see a perf regression when running the kernels or is it just the compilation time?

g1n0st (Author) commented Oct 26, 2024

Hi @nvlukasz, thank you for the quick reply! To reproduce, you can use this minimal code snippet extracted from my application:

import warp as wp

wp.init()

@wp.kernel
def compute_hess(y: wp.array(dtype=wp.vec(length=12, dtype=wp.float64))):
    tid = wp.tid()
    dim = 3
    # 3x3 matrix of position differences built from the packed 12-vector
    F = wp.mat33d()
    F[0, 0] = y[tid][3 + 0] - y[tid][0]
    F[1, 0] = y[tid][3 + 1] - y[tid][1]
    F[2, 0] = y[tid][3 + 2] - y[tid][2]
    F[0, 1] = y[tid][6 + 0] - y[tid][0]
    F[1, 1] = y[tid][6 + 1] - y[tid][1]
    F[2, 1] = y[tid][6 + 2] - y[tid][2]
    F[0, 2] = y[tid][9 + 0] - y[tid][0]
    F[1, 2] = y[tid][9 + 1] - y[tid][1]
    F[2, 2] = y[tid][9 + 2] - y[tid][2]
    # 9x9 matrix, zero-initialized element by element below
    H_F = wp.mat(shape=(9, 9), dtype=wp.float64)
    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] = wp.float64(0.0)
    for i in range(dim):
        for j in range(dim):
            aiaj = F[0, i] * F[0, j] + F[1, i] * F[1, j] + F[2, i] * F[2, j]
            if i == j:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(8.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * (aiaj - wp.float64(1.0))
                for k in range(dim):
                    if k != j:
                        for p in range(3):
                            for q in range(3):
                                H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, k] * F[q, k]
            else:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * aiaj

    coeff = wp.float64(1.0)
    # 12x12 matrix, likewise zero-initialized element by element
    local_hessian = wp.mat(shape=(12, 12), dtype=wp.float64)
    for i in range(12):
        for j in range(12):
            local_hessian[i, j] = wp.float64(0.0)
    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] *= coeff
    for i in range(dim * dim):
        for j in range(dim * dim):
            ii = i + dim
            jj = j + dim
            local_hessian[ii, jj] += H_F[i, j]

    for i in range(dim):
        for j in range(dim * dim):
            ii = i
            jj = j + dim
            local_hessian[ii, jj] += -H_F[i, j] - H_F[i + dim, j] - H_F[i + dim * 2, j]

    for i in range(dim * dim):
        for j in range(dim):
            ii = i + dim
            jj = j
            local_hessian[ii, jj] += -H_F[j, i] - H_F[j + dim, i] - H_F[j + dim * 2, i]

    for i in range(dim):
        for j in range(dim):
            ii = i
            jj = j
            local_hessian[ii, jj] += (
                H_F[i, j]
                + H_F[i + dim, j]
                + H_F[i + dim * 2, j]
                + H_F[i, j + dim]
                + H_F[i + dim, j + dim]
                + H_F[i + dim * 2, j + dim]
                + H_F[i, j + dim * 2]
                + H_F[i + dim, j + dim * 2]
                + H_F[i + dim * 2, j + dim * 2]
            )

    mat3 = wp.mat33d()
    for bi in range(4):
        for bj in range(4):
            for di in range(3):
                for dj in range(3):
                    mat3[di, dj] = local_hessian[bi * 3 + di, bj * 3 + dj]

y = wp.zeros(shape=1, dtype=wp.vec(length=12, dtype=wp.float64))
wp.launch(kernel=compute_hess, dim=1, inputs=[y])
wp.synchronize()
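
To make sure the timings below are truly from scratch, the kernel cache should be empty before the run; a minimal sketch of one way to do that, assuming wp.build.clear_kernel_cache() from the Warp API:

import warp as wp

wp.init()
# clear previously compiled modules so the next launch recompiles from scratch
wp.build.clear_kernel_cache()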

Before that commit, in released version 1.3.0, the compilation time from scratch is:

Warp 1.3.0 initialized:
   CUDA Toolkit 12.5, Driver 12.2
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA H100 NVL" (93 GiB, sm_90, mempool enabled)
   Kernel cache:
     /home/chang/.cache/warp/1.3.0
Module __main__ 72e77da load on device 'cuda:0' took 21989.06 ms  (compiled)

After commit 5a88c4b, the compilation time from scratch is:

Warp 1.3.1 initialized:
   CUDA Toolkit 12.1, Driver 12.2
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA H100 NVL" (93 GiB, sm_90, mempool enabled)
   Kernel cache:
     /home/chang/.cache/warp/1.3.1
Module __main__ 72e77da load on device 'cuda:0' took 230008.95 ms  (compiled)

which is over 10x slower. In other Warp kernels, it is even worse when more matrix computation is involved. Thank you for any insights or suggestions about this!

g1n0st (Author) commented Oct 26, 2024

So far I have only monitored the compilation time and have not yet tested for a runtime performance regression, since the compilation time prevents me from running a complete test. Thanks again for your help!

nvlukasz (Contributor) commented Oct 26, 2024

Thanks for the repro! A 10x slowdown sounds pretty bad; we will definitely take a look.

nvlukasz (Contributor) commented

Update: Looks like the slowdown happens in the NVRTC and LLVM compilers, rather than Warp codegen per se. This is likely due to the addition of several templated functions in our builtin headers (assign and adj_assign), which cause the native compilers to take longer. Will keep looking for a workaround.

In the meantime, disabling backward code generation seems to help a lot, if you don't need differentiability.

wp.config.enable_backward = False
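
If gradients are needed only for some modules, a per-module override may also work; a minimal sketch, assuming wp.set_module_options accepts the enable_backward key as described in the Warp docs:

import warp as wp

# disable backward codegen only for the current module,
# leaving kernels in other modules differentiable
wp.set_module_options({"enable_backward": False})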

g1n0st (Author) commented Nov 15, 2024

Hi,

I noticed that Warp 1.4.2 was released two days ago, but this issue seems to persist. I’m more than willing to try the latest version, so I’d like to follow up on this thread.

In my application, autodiff is still required at times, so simply disabling backward code generation may not be the best workaround for me (and it also seems ineffective in some cases). I would greatly appreciate any further assistance or suggestions on resolving this.

Thank you!
Chang

daedalus5 (Contributor) commented

We're looking into getting you some additional compilation time improvements, but in the meantime, you should see a good speedup if you pre-declare your large matrix types outside the kernel, e.g.:

mat99 = wp.mat(shape=(9, 9), dtype=wp.float64)
mat12_12 = wp.mat(shape=(12, 12), dtype=wp.float64)

Then, inside the kernel, you can use a constructor rather than looping through the matrices and initializing each element. For example, you can replace

    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] = wp.float64(0.0)

with

H_F = mat99(wp.float64(0.0))

and the same for your 12x12 matrix.
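
Putting both suggestions together, a minimal sketch of how the initialization in the repro kernel could look (body truncated; only the changed initialization is shown):

import warp as wp

# pre-declare the large matrix types once, outside the kernel
mat99 = wp.mat(shape=(9, 9), dtype=wp.float64)
mat12_12 = wp.mat(shape=(12, 12), dtype=wp.float64)

@wp.kernel
def compute_hess(y: wp.array(dtype=wp.vec(length=12, dtype=wp.float64))):
    tid = wp.tid()
    # zero-fill via the constructors instead of elementwise loops
    H_F = mat99(wp.float64(0.0))
    local_hessian = mat12_12(wp.float64(0.0))
    # ... the rest of the computation proceeds as before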
