
[BUG]: cuda::ptx takes long to compile #2933

Open

ahendriksen opened this issue Nov 22, 2024 · 3 comments

@ahendriksen
Contributor


Type of Bug

Performance

Component

libcu++

Describe the bug

Including <cuda/ptx> adds roughly 800 ms of compile time on my workstation.

How to Reproduce

Compare the time to compile an empty file, a file including <cuda/ptx>, and a file including <cuda/std/__type_traits/integral_constant.h> (which is included transitively from <cuda/ptx>):

$ echo "" > empty.cu
$ echo "#include <cuda/ptx>" > cuda_ptx.cu
$ echo "#include <cuda/std/__type_traits/integral_constant.h>" > cuda_std_integral_constant.cu
$ hyperfine --warmup 1 'nvcc -arch sm_90a -x cu -c empty.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o'  'nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o'

Benchmark 1: nvcc -arch sm_90a -x cu -c empty.cu -o test.o
  Time (mean ± σ):      1.434 s ±  0.012 s    [User: 1.070 s, System: 0.368 s]
  Range (min … max):    1.414 s …  1.455 s    10 runs

Benchmark 2: nvcc -arch sm_90a -x cu -c cuda_ptx.cu -o test.o
  Time (mean ± σ):      2.299 s ±  0.022 s    [User: 1.861 s, System: 0.442 s]
  Range (min … max):    2.275 s …  2.339 s    10 runs

Benchmark 3: nvcc -arch sm_90a -x cu -c cuda_std_integral_constant.cu -o test.o
  Time (mean ± σ):      2.131 s ±  0.023 s    [User: 1.709 s, System: 0.426 s]
  Range (min … max):    2.098 s …  2.167 s    10 runs

Expected behavior

This should not be a heavy header.

Reproduction link

No response

Operating System

Ubuntu Linux 22.04

nvidia-smi output

NA

NVCC version

The benchmark was performed using a prerelease version of nvcc, but the results should be reproducible with any recent version.

@ahendriksen ahendriksen added the bug Something isn't working right. label Nov 22, 2024
@ahendriksen ahendriksen self-assigned this Nov 22, 2024
@github-project-automation github-project-automation bot moved this to Todo in CCCL Nov 22, 2024
@ahendriksen
Contributor Author

I have attached a trace of the compile time. It can be viewed at perfetto.dev.

It turns out that a large portion of the time is spent preprocessing the CUDA fp16 and bf16 headers, which are transitively included as follows:

  • cuda/ptx
  • cuda/std/detail/__config (82 ms)
  • cuda/std/detail/libcxx/include/__config
  • cuda/__cccl_config
  • cuda/std/__cccl/extended_floating_point.h
  • cuda_fp16.h + cuda_bf16.h (80 ms)

cuda_ptx.json

@bernhardmgruber
Contributor

Yep, looks like the extended FP type headers are quite expensive, but since they are included as part of the CCCL config, they will affect each translation unit. @miscco could we consider only defining _CCCL_HAS_NVFP16 and _CCCL_HAS_NVBF16 in the CCCL config headers and leaving it up to downstream libraries and users to include the corresponding headers themselves?

@miscco
Collaborator

miscco commented Nov 22, 2024

yeah that would definitely be better
