
[BUG]: cuda::ptx takes long to compile #2933

Open

ahendriksen opened this issue Nov 22, 2024 · 3 comments

@ahendriksen
Contributor


Type of Bug

Performance

Component

libcu++

Describe the bug

Including <cuda/ptx> adds roughly 800 ms of compile time on my workstation.

How to Reproduce

Compare the time to compile an empty file, a file including <cuda/ptx>, and a file including <cuda/std/__type_traits/integral_constant.h> (which is included transitively from <cuda/ptx>):

$ echo "" > empty.cu
$ echo "#include <cuda/ptx>" > cuda_ptx.cu
$ echo "#include <cuda/std/__type_traits/integral_constant.h>" > cuda_std_integral_constant.cu
$ hyperfine --warmup 1 'nvcc -arch sm_90a -x cu -c empty.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o'

Benchmark 1: nvcc -arch sm_90a -x cu -c empty.cu -o test.o
  Time (mean ± σ):      1.434 s ±  0.012 s    [User: 1.070 s, System: 0.368 s]
  Range (min … max):    1.414 s …  1.455 s    10 runs

Benchmark 2: nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o
  Time (mean ± σ):      2.299 s ±  0.022 s    [User: 1.861 s, System: 0.442 s]
  Range (min … max):    2.275 s …  2.339 s    10 runs

Benchmark 3: nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o
  Time (mean ± σ):      2.131 s ±  0.023 s    [User: 1.709 s, System: 0.426 s]
  Range (min … max):    2.098 s …  2.167 s    10 runs

Expected behavior

This should not be a heavy header.

Reproduction link

No response

Operating System

Ubuntu Linux 22.04

nvidia-smi output

NA

NVCC version

The benchmark was performed using a prerelease version of nvcc, but the results should be reproducible with any recent version.

@ahendriksen ahendriksen added the bug Something isn't working right. label Nov 22, 2024
@ahendriksen ahendriksen self-assigned this Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 22, 2024
@ahendriksen
Contributor Author

I have attached a trace of the compile time. It can be viewed at perfetto.dev.

It turns out that a large portion of the time is spent preprocessing the CUDA fp16 and bf16 headers, which are transitively included as follows:

  • cuda/ptx
  • cuda/std/detail/__config (82 ms)
  • cuda/std/detail/libcxx/include/__config
  • cuda/__cccl_config
  • cuda/std/__cccl/extended_floating_point.h
  • cuda_fp16.h + cuda_bf16.h (80 ms)

cuda_ptx.json

@bernhardmgruber
Contributor

Yep, looks like the extended FP type headers are quite expensive, but since they are included as part of the CCCL config, they will affect each translation unit. @miscco could we consider only defining _CCCL_HAS_NVFP16 and _CCCL_HAS_NVBF16 in the CCCL config headers and leaving it up to downstream libraries and users to include the corresponding headers themselves?

@miscco
Collaborator

miscco commented Nov 22, 2024

yeah that would definitely be better
