This repo tries to answer the question "for a specific input size and a (simple) computation, which framework/accelerator should I choose?".
By implementing the same computation on all available frameworks/accelerators (CPU FPU baseline, SIMD, threads, GPU, FPGA, multi-machine, ...) and running with input sizes from 1 float to 1 billion (or more) floats, we'll see which framework is optimal for which input size.
Frameworks:
- single-threaded scalar/FPU-only
- SIMD
  - compiler auto-vectorisation
  - explicit SIMD via `std::experimental::simd` (equivalent to gcc/clang vector intrinsics)
  - `-fopenmp-simd` OpenMP SIMD
- threading
  - `-fopenmp` OpenMP threading
  - C++11 threads
  - `std::execution::par_unseq` clang parallel STL (pstl)
- GPU
  - `-fopenmp -fopenmp-targets=nvptx64` OpenMP GPU offloading
  - CUDA/HIP: HIP is a subset of CUDA that stock clang can compile for both NVIDIA and AMD GPUs (see https://github.com/ROCm/HIP#what-is-this-repository-for and https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/faq.html); also automatic CUDA-to-HIP translation
  - GPU compute shaders via Vulkan, using Kompute
  - Halide?
  - Kokkos?
  - SYCL
- multi-machine
  - OpenMP remote offloading
  - manual messaging
  - MPI
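To make the comparison concrete, here is a minimal sketch (illustrative function names, not this repo's actual kernels) of the same vector addition under three of the frameworks above:

```cpp
#include <cstddef>

// Baseline: single-threaded, scalar/FPU-only.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// SIMD: the `omp simd` pragma asks the compiler to vectorise this
// loop (compile with -fopenmp-simd).
void add_omp_simd(const float* a, const float* b, float* c, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Threading: iterations are split across cores (compile with -fopenmp).
void add_omp_threads(const float* a, const float* b, float* c, std::size_t n) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```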
TODO:
- flops/byte estimation for the two different calculations
- flops/byte (maximum FLOPS throughput / maximum memory bandwidth) estimation for all the different hardware (CPU, SIMD, threaded, GPU, multi-machine)
- is it possible to create a flops/byte measurement tool, and a set of stress tests to measure it?
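As a starting point for the first two items, a back-of-the-envelope sketch of the flops/byte arithmetic for the two kernels (assumptions: 4-byte floats, vector add streams every operand from memory, matmul gets ideal cache reuse):

```cpp
// Arithmetic intensity = FLOPs / bytes moved.

// Vector add c[i] = a[i] + b[i]: 1 FLOP per element,
// 3 floats moved (2 loads + 1 store) = 12 bytes.
// => ~0.083 flops/byte: memory-bound at every input size.
constexpr double vector_add_intensity = 1.0 / (3 * sizeof(float));

// Naive N x N matrix multiply: 2*N^3 FLOPs (one multiply + one add
// per inner step); with ideal cache reuse each of the 3 matrices is
// moved once, i.e. 3*N^2 floats. Intensity grows linearly with N.
constexpr double matmul_intensity(double n) {
    return (2.0 * n * n * n) / (3.0 * n * n * sizeof(float));
}
// e.g. matmul_intensity(1024) ~ 170 flops/byte: compute-bound.
```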
Similar projects

A parallel programming ecosystem needs:
- language/language-extensions/compiler for describing parallelism, tasks, async, dependencies, etc
- backends for SIMD, multi-core, GPU, and multi-machine
- standard algorithms (blas, sort, reduce, etc)
- tools for debugging and profiling
ecosystem | compiler | SIMD | multi-core | GPU | multi-machine | sort | reduce | BLAS |
---|---|---|---|---|---|---|---|---|
C++ STL | any (plain c++) | std::ex::simd | std::thread | std::executors (future) | ❔ asio? (future) | std::sort par_unseq | std::ex::parallel::reduce | stdblas (future) |
OpenMP | gcc, clang, icc (pragma extended c++) | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ | ❔ OpenBLAS? Eigen? |
sandia Kokkos | any (plain c++) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ | ✔️ | ✔️ stdblas |
intel oneAPI | intel dpc++ (sycl) | ✔️ | ✔️ | ✔️ | ✔️ MPI | ✔️ TBB | ✔️ TBB | ✔️ MKL |
nvidia CUDA | clang, nvc++ (extended c++) | ❌ | ❌ | ✔️ | ✔️ NCCL | ✔️ thrust / libcu++ | ✔️ thrust / libcu++ | ✔️ cutlass / cuBLAS |
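As a concrete example of one table cell, the C++ STL "reduce" entry corresponds to code like the following minimal sketch (it needs a toolchain with parallel algorithms enabled, e.g. libstdc++ with TBB, or clang's pstl):

```cpp
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    // Parallel + vectorised reduction via the standard library; which
    // backend runs it (TBB, OpenMP, serial) is the toolchain's choice.
    float sum = std::reduce(std::execution::par_unseq,
                            v.begin(), v.end(), 0.0f);
    return sum == static_cast<float>(1 << 20) ? 0 : 1;
}
```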
Install the dependencies (Arch Linux), then run the benchmarks:

```sh
sudo pacman -S \
    cmake clang \
    benchmark \
    python-matplotlib python-pandas \
    openmp \
    vulkan-tools vulkan-driver vulkan-headers glslang \
    eigen

./test.sh
```
For O(N) vector addition on specific hardware, this graph answers the question in the following way:
- for input size <= 2^8 floats, use CPU SIMD
- for input size between 2^12 and 2^20 floats, use OpenMP
- for input size >= 2^24 floats (and negligible device/host memory transfer cost) use the GPU
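A minimal sketch of how those crossover points could drive a size-based backend dispatch; the thresholds come from the graph above, the gaps between the measured regimes (2^8 to 2^12, 2^20 to 2^24) are interpolated, and the names here are hypothetical rather than this repo's API:

```cpp
#include <cstddef>

enum class Backend { Simd, OpenMp, Gpu };

// Pick a backend from the measured crossover points. The thresholds
// are hardware-specific; re-measure before relying on them.
Backend choose_backend(std::size_t n_floats) {
    if (n_floats <= (std::size_t{1} << 8))
        return Backend::Simd;    // tiny inputs: SIMD only
    if (n_floats < (std::size_t{1} << 24))
        return Backend::OpenMp;  // medium inputs: threads
    return Backend::Gpu;         // huge inputs, assuming negligible
                                 // host/device transfer cost
}
```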
For O(N^3) matrix multiplication: