[QUESTION] compilation time increases significantly after vec/mat/quat assign autograd fix commit #332
Comments
Thanks @g1n0st, how significant is the slowdown you are seeing? A minimal repro would be great if you have the cycles. Do you also see a perf regression when running the kernels or is it just the compilation time?
Hi @nvlukasz, thank you for the quick reply! To reproduce, you can refer to this minimal code snippet extracted from my code:

```python
import warp as wp

wp.init()


@wp.kernel
def compute_hess(y: wp.array(dtype=wp.vec(length=12, dtype=wp.float64))):
    tid = wp.tid()
    dim = 3

    F = wp.mat33d()
    F[0, 0] = y[tid][3 + 0] - y[tid][0]
    F[1, 0] = y[tid][3 + 1] - y[tid][1]
    F[2, 0] = y[tid][3 + 2] - y[tid][2]
    F[0, 1] = y[tid][6 + 0] - y[tid][0]
    F[1, 1] = y[tid][6 + 1] - y[tid][1]
    F[2, 1] = y[tid][6 + 2] - y[tid][2]
    F[0, 2] = y[tid][9 + 0] - y[tid][0]
    F[1, 2] = y[tid][9 + 1] - y[tid][1]
    F[2, 2] = y[tid][9 + 2] - y[tid][2]

    H_F = wp.mat(shape=(9, 9), dtype=wp.float64)
    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] = wp.float64(0.0)

    for i in range(dim):
        for j in range(dim):
            aiaj = F[0, i] * F[0, j] + F[1, i] * F[1, j] + F[2, i] * F[2, j]
            if i == j:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(8.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * (aiaj - wp.float64(1.0))
                for k in range(dim):
                    if k != j:
                        for p in range(3):
                            for q in range(3):
                                H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, k] * F[q, k]
            else:
                for p in range(3):
                    for q in range(3):
                        H_F[i * dim + p, j * dim + q] += wp.float64(4.0) * F[p, j] * F[q, i]
                for p in range(3):
                    H_F[i * dim + p, j * dim + p] += wp.float64(4.0) * aiaj

    coeff = wp.float64(1.0)

    local_hessian = wp.mat(shape=(12, 12), dtype=wp.float64)
    for i in range(12):
        for j in range(12):
            local_hessian[i, j] = wp.float64(0.0)

    for i in range(dim * dim):
        for j in range(dim * dim):
            H_F[i, j] *= coeff

    for i in range(dim * dim):
        for j in range(dim * dim):
            ii = i + dim
            jj = j + dim
            local_hessian[ii, jj] += H_F[i, j]
    for i in range(dim):
        for j in range(dim * dim):
            ii = i
            jj = j + dim
            local_hessian[ii, jj] += -H_F[i, j] - H_F[i + dim, j] - H_F[i + dim * 2, j]
    for i in range(dim * dim):
        for j in range(dim):
            ii = i + dim
            jj = j
            local_hessian[ii, jj] += -H_F[j, i] - H_F[j + dim, i] - H_F[j + dim * 2, i]
    for i in range(dim):
        for j in range(dim):
            ii = i
            jj = j
            local_hessian[ii, jj] += (
                H_F[i, j]
                + H_F[i + dim, j]
                + H_F[i + dim * 2, j]
                + H_F[i, j + dim]
                + H_F[i + dim, j + dim]
                + H_F[i + dim * 2, j + dim]
                + H_F[i, j + dim * 2]
                + H_F[i + dim, j + dim * 2]
                + H_F[i + dim * 2, j + dim * 2]
            )

    mat3 = wp.mat33d()
    for bi in range(4):
        for bj in range(4):
            for di in range(3):
                for dj in range(3):
                    mat3[di, dj] = local_hessian[bi * 3 + di, bj * 3 + dj]


y = wp.zeros(shape=1, dtype=wp.vec(length=12, dtype=wp.float64))
wp.launch(kernel=compute_hess, dim=1, inputs=[y])
wp.synchronize()
```

I measured the compilation time from scratch both in the released version before that commit and after commit 5a88c4b; after the commit it is about 10x slower. And in other Warp kernels it is even worse if more computation on matrices is involved. Thank you for any insights or suggestions about this!
So far I have only monitored the compilation time and have not tested for a runtime performance regression yet, since the compilation time prevents me from running a complete test. Thanks again for your help!
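For reference, one way to measure the from-scratch compilation time is to clear Warp's kernel cache and time the first launch, which triggers compilation. This is a minimal sketch, assuming `wp.build.clear_kernel_cache()` is available in the installed Warp version; the tiny kernel is only a stand-in for the real one:

```python
import time

import warp as wp

wp.init()
wp.build.clear_kernel_cache()  # force a full recompile on the next launch (assumed API)


@wp.kernel
def scale(a: wp.array(dtype=wp.float64), s: wp.float64):
    tid = wp.tid()
    a[tid] = a[tid] * s


a = wp.zeros(8, dtype=wp.float64)

t0 = time.perf_counter()
wp.launch(scale, dim=8, inputs=[a, wp.float64(2.0)])  # first launch compiles the module
wp.synchronize()
print(f"first launch (compile + run): {time.perf_counter() - t0:.2f} s")
```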
Thanks for the repro! A 10x slowdown sounds pretty bad, we will definitely take a look.
Update: It looks like the slowdown happens in the NVRTC and LLVM compilers rather than in Warp codegen per se. This is likely due to the addition of several templated functions in our builtin headers. In the meantime, disabling backward code generation seems to help a lot if you don't need differentiability:

```python
wp.config.enable_backward = False
```
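A minimal sketch of where this could be applied, assuming the flag takes effect for modules compiled after it is set; `wp.set_module_options` is shown as the per-module alternative when only some modules can give up differentiability:

```python
import warp as wp

# Global switch: skip backward (adjoint) code generation for modules
# compiled after this point.
wp.config.enable_backward = False

wp.init()

# Per-module alternative: disable backward codegen only for the calling module,
# keeping other modules differentiable.
wp.set_module_options({"enable_backward": False})
```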
Hi, I noticed that Warp 1.4.2 was released two days ago, but this issue still seems to persist. I’m more than willing to try the latest version, so I’d like to follow up on this thread. In my application, autodiff is still required at times, so simply disabling backward code generation may not be the best workaround for me (and it also seems ineffective in some cases). I would greatly appreciate any further assistance or suggestions on resolving this. Thank you!
We're looking into getting you some additional compilation-time improvements, but in the meantime you should see a good speedup if you pre-declare your large matrix types outside the kernel, e.g.:
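A sketch of such pre-declared types at module scope; the alias names `vec12d`, `mat99d`, and `mat1212d` are illustrative:

```python
# Declare the large vector/matrix types once at module scope so the kernel
# can refer to them by name instead of constructing the types inline.
vec12d = wp.vec(length=12, dtype=wp.float64)
mat99d = wp.mat(shape=(9, 9), dtype=wp.float64)
mat1212d = wp.mat(shape=(12, 12), dtype=wp.float64)
```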
And then inside the kernel, you can use a constructor rather than loop through the matrices and initialize each element. So for example, you can replace
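(the element-wise zero-initialization from the repro kernel above)

```python
H_F = wp.mat(shape=(9, 9), dtype=wp.float64)
for i in range(dim * dim):
    for j in range(dim * dim):
        H_F[i, j] = wp.float64(0.0)
```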
with
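(a sketch, assuming the `mat99d` alias declared above and that the default constructor zero-initializes the matrix)

```python
# One constructor call replaces the 81 element-wise assignments
# (mat99d is the illustrative alias from the module-scope declarations).
H_F = mat99d()
```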
and the same for your 12x12 matrix.
After commit 5a88c4b, I have observed a significant increase in compilation time for some warp kernels that previously compiled efficiently in my application. The increase seems to be related to changes in the vec/mat/quat autograd assignment, based on my bisection search of the commit history.
If needed, I would be happy to try to provide a minimal reproducible case by extracting a kernel from my application.
Thank you for any insights or suggestions!