
[QUESTION] compilation time increases significantly after vec/mat/quat assign autograd fix commit #332

Open
g1n0st opened this issue Oct 25, 2024 · 7 comments

g1n0st commented Oct 25, 2024

After commit 5a88c4b, I have observed a significant increase in compilation time for some Warp kernels that previously compiled efficiently in my application. Based on a bisection of the commit history, the increase seems to be related to the changes to vec/mat/quat autograd assignment.

If needed, I would be happy to try to provide a minimal reproducible case by extracting a kernel from my application.

Thank you for any insights or suggestions!

g1n0st added the "question: The issue author requires information" label on Oct 25, 2024
shi-eric added the "bug: Something isn't working" label on Oct 25, 2024
nvlukasz (Contributor) commented

Thanks @g1n0st, how significant is the slowdown you are seeing? A minimal repro would be great if you have the cycles.

Do you also see a perf regression when running the kernels or is it just the compilation time?

g1n0st (Author) commented Oct 26, 2024

Hi @nvlukasz, thank you for the quick reply! To reproduce, you can use this minimal code snippet extracted from my application:

import warp as wp

wp.init()

@wp.kernel
def compute_hess(y: wp.array(dtype=wp.vec(length=12, dtype=wp.float64))):
    tid = wp.tid()
    dim = 3
    # 3x3 matrix of position differences built from the packed 12-vector
    F = wp.mat33d()
    F[0, 0] = y[tid][3 + 0] - y[tid][0]
    F[1, 0] = y[tid][3 + 1] - y[tid][1]
    F[2, 0] = y[tid][3 + 2] - y[tid][2]
    F[0, 1] = y[tid][6 + 0] - y[tid][0]
    F[1, 1] = y[tid][6 + 1] - y[tid][1]
    F[2, 1] = y[tid][6 + 2] - y[tid][2]
    F[0, 2] = y[tid][9 + 0] - y[tid][0]
    F[1, 2] = y[tid][9 + 1] - y[tid][1]
    F[2, 2] = y[tid][9 + 2] - y[tid][2]
    # 9x9 matrix, zero-initialized element by element below
    H_F = wp.mat(shape=(9, 9), dtype=wp.float64)
    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] = wp.float64(0.0)
    for i in range(dim):
        for j in range(dim):
            aiaj = F[0, i] * F[0, j] + F[1, i] * F[1, j] + F[2, i] * F[2, j]
            if i == j:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(8.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * (aiaj - wp.float64(1.0))
                for k in range(dim):
                    if k != j:
                        for p in range(3):
                            for q in range(3):
                                H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, k] * F[q, k]
            else:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * aiaj

    coeff = wp.float64(1.0)
    # 12x12 matrix, likewise zero-initialized element by element
    local_hessian = wp.mat(shape=(12, 12), dtype=wp.float64)
    for i in range(12):
        for j in range(12):
            local_hessian[i, j] = wp.float64(0.0)
    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] *= coeff
    for i in range(dim * dim):
        for j in range(dim * dim):
            ii = i + dim
            jj = j + dim
            local_hessian[ii, jj] += H_F[i, j]

    for i in range(dim):
        for j in range(dim * dim):
            ii = i
            jj = j + dim
            local_hessian[ii, jj] += -H_F[i, j] - H_F[i + dim, j] - H_F[i + dim * 2, j]

    for i in range(dim * dim):
        for j in range(dim):
            ii = i + dim
            jj = j
            local_hessian[ii, jj] += -H_F[j, i] - H_F[j + dim, i] - H_F[j + dim * 2, i]

    for i in range(dim):
        for j in range(dim):
            ii = i
            jj = j
            local_hessian[ii, jj] += (
                H_F[i, j]
                + H_F[i + dim, j]
                + H_F[i + dim * 2, j]
                + H_F[i, j + dim]
                + H_F[i + dim, j + dim]
                + H_F[i + dim * 2, j + dim]
                + H_F[i, j + dim * 2]
                + H_F[i + dim, j + dim * 2]
                + H_F[i + dim * 2, j + dim * 2]
            )

    mat3 = wp.mat33d()
    for bi in range(4):
        for bj in range(4):
            for di in range(3):
                for dj in range(3):
                    mat3[di, dj] = local_hessian[bi * 3 + di, bj * 3 + dj]

y = wp.zeros(shape=1, dtype=wp.vec(length=12, dtype=wp.float64))
wp.launch(kernel=compute_hess, dim=1, inputs=[y])
wp.synchronize()
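
To make sure the timings below are truly from scratch, the kernel cache should be empty before the run; a minimal sketch of one way to do that, assuming wp.build.clear_kernel_cache() from the Warp API:

import warp as wp

wp.init()
# clear previously compiled modules so the next launch recompiles from scratch
wp.build.clear_kernel_cache()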

Before that commit, in released version 1.3.0, the compilation time from scratch is:

Warp 1.3.0 initialized:
   CUDA Toolkit 12.5, Driver 12.2
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA H100 NVL" (93 GiB, sm_90, mempool enabled)
   Kernel cache:
     /home/chang/.cache/warp/1.3.0
Module __main__ 72e77da load on device 'cuda:0' took 21989.06 ms  (compiled)

After commit 5a88c4b, the compilation time from scratch is:

Warp 1.3.1 initialized:
   CUDA Toolkit 12.1, Driver 12.2
   Devices:
     "cpu"      : "x86_64"
     "cuda:0"   : "NVIDIA H100 NVL" (93 GiB, sm_90, mempool enabled)
   Kernel cache:
     /home/chang/.cache/warp/1.3.1
Module __main__ 72e77da load on device 'cuda:0' took 230008.95 ms  (compiled)

which is over 10x slower. In other Warp kernels, it is even worse when more matrix computation is involved. Thank you for any insights or suggestions about this!

g1n0st (Author) commented Oct 26, 2024

So far I have only monitored the compilation time and have not yet tested for a runtime performance regression, since the compilation time prevents me from running a complete test. Thanks again for your help!

nvlukasz (Contributor) commented Oct 26, 2024

Thanks for the repro! A 10x slowdown sounds pretty bad; we will definitely take a look.

nvlukasz (Contributor) commented

Update: Looks like the slowdown happens in the NVRTC and LLVM compilers, rather than Warp codegen per se. This is likely due to the addition of several templated functions in our builtin headers (assign and adj_assign), which cause the native compilers to take longer. Will keep looking for a workaround.

In the meantime, disabling backward code generation seems to help a lot, if you don't need differentiability.

wp.config.enable_backward = False
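
If gradients are needed only for some modules, a per-module override may also work; a minimal sketch, assuming wp.set_module_options accepts the enable_backward key as described in the Warp docs:

import warp as wp

# disable backward codegen only for the current module,
# leaving kernels in other modules differentiable
wp.set_module_options({"enable_backward": False})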

g1n0st (Author) commented Nov 15, 2024

Hi,

I noticed that Warp 1.4.2 was released two days ago, but this issue seems to persist. I’m more than willing to try the latest version, so I’d like to follow up on this thread.

In my application, autodiff is still required at times, so simply disabling backward code generation may not be the best workaround for me (and it also seems ineffective in some cases). I would greatly appreciate any further assistance or suggestions on resolving this.

Thank you!
Chang

daedalus5 (Contributor) commented

We're looking into getting you some additional compilation time improvements, but in the meantime, you should see a good speedup if you pre-declare your large matrix types outside the kernel, e.g.:

mat99 = wp.mat(shape=(9, 9), dtype=wp.float64)
mat12_12 = wp.mat(shape=(12, 12), dtype=wp.float64)

Then, inside the kernel, you can use a constructor rather than looping through the matrices and initializing each element. For example, you can replace

    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] = wp.float64(0.0)

with

H_F = mat99(wp.float64(0.0))

and the same for your 12x12 matrix.
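
Putting both suggestions together, a minimal sketch of how the initialization in the repro kernel could look (body truncated; only the changed initialization is shown):

import warp as wp

# pre-declare the large matrix types once, outside the kernel
mat99 = wp.mat(shape=(9, 9), dtype=wp.float64)
mat12_12 = wp.mat(shape=(12, 12), dtype=wp.float64)

@wp.kernel
def compute_hess(y: wp.array(dtype=wp.vec(length=12, dtype=wp.float64))):
    tid = wp.tid()
    # zero-fill via the constructors instead of elementwise loops
    H_F = mat99(wp.float64(0.0))
    local_hessian = mat12_12(wp.float64(0.0))
    # ... the rest of the computation proceeds as before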
