docs : [ CUDA C++ Programming Guide, CUDA toolkit, CUDA zone, CUDA C++ Best Practices Guide, NVIDIA CUDA Compiler Driver, CUDA Math API Reference Manual ]
libraries : [ cuBLAS, cuFFT, cuRAND Library, cuSOLVER Library, cuSPARSE Library, NVIDIA cuDSS, NVIDIA Collective Communications Library (NCCL) ]
architecture whitepapers : [ NVIDIA TESLA V100 GPU ARCHITECTURE, Volta Tuning Guide, NVIDIA A100 Tensor Core GPU Architecture, Ampere Tuning Guide, Hopper H100 White Paper (pdf), Hopper Tuning Guide ]
@github/cuda-samples.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) developed by NVIDIA. It allows developers to harness the power of NVIDIA GPUs (Graphics Processing Units) for high-performance computing tasks beyond graphics rendering. CUDA is widely used in scientific computing, machine learning, image processing, and other fields where massive parallelism can significantly accelerate computations.
- Parallel Computing Architecture: At the heart of CUDA is the concept of parallelism. Traditional CPUs are optimized for fast serial execution of a few threads, while GPUs are built for parallel processing, executing thousands of threads simultaneously. CUDA enables developers to tap into this parallel computing power.
- CUDA-enabled GPUs: CUDA runs only on NVIDIA GPUs. To leverage CUDA, a GPU must have CUDA cores, which are specialized processing units designed for parallel tasks. Modern GPUs have thousands of these cores, making them powerful parallel processors.
- CUDA Programming Model: CUDA extends the C and C++ programming languages with additional constructs for parallel programming. Developers write CPU-hosted code (the host) and GPU-executed code (the device) to work together. The host and device communicate through kernel launches, memory transfers, and synchronization mechanisms.
- Kernels and Threads: In CUDA, developers define parallel tasks as kernels, which are functions that run in parallel on the GPU. Kernels are executed by thousands of lightweight threads, which are organized into thread blocks. Threads within a block can cooperate and communicate via shared memory.
- Memory Hierarchy: CUDA provides a memory hierarchy with different types of memory:
  - Global Memory: Large, device-wide memory accessible by all threads.
  - Shared Memory: Low-latency, per-thread-block memory for fast inter-thread communication.
  - Constant Memory: Read-only memory optimized for constant data.
  - Texture Memory: Specialized memory for texture access in graphics applications.
  Efficient memory management is crucial for optimizing CUDA applications.
- Synchronization: Developers can use synchronization primitives, such as barriers, to coordinate thread execution and ensure data consistency. Proper synchronization is essential for avoiding race conditions and ensuring correct results.
- CUDA Libraries: NVIDIA provides CUDA libraries that offer optimized functions for common tasks like linear algebra (cuBLAS), Fast Fourier Transforms (cuFFT), and deep neural network primitives (cuDNN). These libraries save developers time and effort in optimizing low-level code.
- CUDA Toolchain: NVIDIA provides a comprehensive toolchain for CUDA development:
  - CUDA Compiler (nvcc): Compiles CUDA code into executable binaries.
  - CUDA Debugger (cuda-gdb): Debugging tool for CUDA applications.
  - CUDA Profiler (nvprof): Profiles GPU performance and memory usage (superseded by Nsight Systems and Nsight Compute in recent toolkits).
  - Visual Profiler: Legacy GUI-based profiling tool for detailed performance analysis.
- GPU Virtualization: CUDA supports GPU virtualization, allowing multiple virtual machines to share a single GPU. This is crucial for cloud computing and data center environments.
- CUDA Ecosystem: The CUDA ecosystem includes a large community of developers, researchers, and third-party libraries. Frameworks like TensorFlow and PyTorch, and CUDA-based language extensions like CUDA C++ and CUDA Python, simplify GPU-accelerated development in machine learning and scientific computing.
A CUDA program consists of two kinds of code:
- Device Code: Code that runs on the GPU. This code is defined as kernels and is executed in parallel by multiple threads.
- Host Code: Code that runs on the CPU (host). It manages the GPU, transfers data between the CPU and GPU, and launches kernels.
Kernels are the core of CUDA programming. A kernel is a C/C++ function that executes in parallel on the GPU and is declared with the __global__ qualifier in CUDA C/C++. Here's a simple CUDA kernel that adds two arrays element-wise:
__global__ void add(int* a, int* b, int* c, int size) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < size) {
c[idx] = a[idx] + b[idx];
}
}
In this kernel, blockIdx.x and threadIdx.x are built-in variables that give the block's index within the grid and the thread's index within the block, respectively; blockDim.x is the number of threads per block.
Launching Kernels: Kernels are launched from host code. To launch the add kernel from the previous example:
int main() {
    int size = 1000;
    // Allocate and initialize memory on the CPU (host)
    int* h_a = new int[size];
    int* h_b = new int[size];
    int* h_c = new int[size];
    // ... fill h_a and h_b ...
    // Allocate memory on the GPU (device)
    int* d_a, * d_b, * d_c;
    cudaMalloc(&d_a, size * sizeof(int));
    cudaMalloc(&d_b, size * sizeof(int));
    cudaMalloc(&d_c, size * sizeof(int));
    // Copy the input arrays from the CPU to the GPU
    cudaMemcpy(d_a, h_a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size * sizeof(int), cudaMemcpyHostToDevice);
    // Launch the kernel with a grid of blocks and threads
    int threadsPerBlock = 256;
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;
    add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, size);
    // Copy the result (array c) from GPU to CPU
    cudaMemcpy(h_c, d_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    // Free GPU memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    // Perform further processing with the result h_c, then free host memory
    // ...
    delete[] h_a;
    delete[] h_b;
    delete[] h_c;
    return 0;
}
- Grid: The highest-level organization of threads; a grid consists of multiple blocks.
- Block: Each grid is divided into blocks. Blocks are executed independently on the GPU's Streaming Multiprocessors (SMs).
- Thread: Each block is divided into threads, and threads within a block can cooperate and share data using shared memory.
Synchronization: Developers can use synchronization barriers (__syncthreads()) within a block to coordinate thread execution. Synchronization is important for avoiding race conditions.
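As a minimal sketch (not from the example above; it assumes a block size of 256 and an array length that is a multiple of the block size), the kernel below stages data in shared memory and uses __syncthreads() so that no thread reads a slot before the thread responsible for it has written it:
// Reverse the elements handled by each block, using shared memory as a staging area.
__global__ void reverseInBlock(int* data) {
    __shared__ int tile[256];                   // one slot per thread in the block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = data[base + t];                   // each thread writes its own slot
    __syncthreads();                            // barrier: all writes are now visible
    data[base + t] = tile[blockDim.x - 1 - t];  // safe to read another thread's slot
}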
Memory Management: CUDA provides functions for allocating and transferring data between the CPU and GPU memory (cudaMalloc, cudaMemcpy, etc.). It's crucial to manage memory efficiently to avoid memory leaks.
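Every CUDA runtime call returns a cudaError_t, and a common pattern (a sketch, not required by the API) is to wrap calls in a small checking macro so failures are reported immediately:
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                   \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_a, size * sizeof(int)));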
Libraries and Ecosystem: CUDA offers libraries like cuBLAS, cuFFT, and cuDNN for optimized numerical computations. Additionally, popular machine learning frameworks like TensorFlow and PyTorch provide GPU support through CUDA.
Efficient scheduling and memory access are crucial aspects of CUDA programming.
CUDA executes threads in groups called warps. A warp typically contains 32 threads. The GPU scheduler selects warps to run on the SM (Streaming Multiprocessor), and all threads in a warp execute the same instruction simultaneously.
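Because the 32 threads of a warp execute in lockstep, CUDA also exposes warp-level primitives; as an aside not covered above, the sketch below sums a value across the lanes of a full warp with __shfl_down_sync (available since CUDA 9):
// After the loop, lane 0 of the warp holds the sum of `val` across all 32 lanes.
__device__ int warpSum(int val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);  // read `val` from the lane `offset` higher
    }
    return val;
}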
Thread divergence occurs when threads within a warp take different execution paths within a kernel. For example, if an if condition is evaluated differently among threads, some threads may follow one path, and others a different path. Thread divergence can lead to performance degradation because the warp executes both paths serially.
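A small illustrative kernel (a sketch; the arithmetic is arbitrary): even and odd threads in the same warp take different branches, so the hardware runs the two branches one after the other with the inactive lanes masked off:
__global__ void divergentKernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        if (idx % 2 == 0) {
            data[idx] = data[idx] * 2.0f;   // even lanes take this path first...
        } else {
            data[idx] = data[idx] + 1.0f;   // ...then odd lanes take this one
        }
    }
}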
Efficient memory access is vital in CUDA programming. Global memory access is most efficient when threads within a warp access consecutive memory locations. This ensures memory coalescing, reducing memory access latency.
Memory access coalescing is a fundamental optimization technique in CUDA programming, essential for achieving high memory bandwidth and efficient memory access patterns.
Why Memory Access Matters: Efficient memory access is crucial for achieving high GPU performance because memory latency is a significant bottleneck. Coalesced memory access minimizes the number of memory transactions and reduces memory latency.
Coalescing refers to the combining of multiple memory accesses into a single transaction, reducing the number of memory requests to global memory. In CUDA, memory transactions are performed in chunks of 32, 64, or 128 bytes, depending on the GPU architecture. Coalesced memory access tries to ensure that threads in a warp access memory locations within these chunks.
Coalescing Rules: CUDA GPUs have specific rules for achieving coalesced memory access:
- Rule 1 - Contiguous Memory Access: Threads in a warp should access contiguous memory locations. For example, if thread 0 accesses A[0], thread 1 should access A[1], thread 2 should access A[2], and so on.
- Rule 2 - Aligned Data Types: Coalescing is more efficient when data types are aligned to their size. For example, use int for 4-byte data, float for 4-byte floating-point data, etc.
- Rule 3 - Proper Indexing: The index calculations for memory access should ensure that threads in a warp access locations that are within the same chunk of memory.
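Before the matrix-multiplication example below, a minimal contrast (a sketch; the stride parameter is illustrative): the first kernel lets consecutive threads read consecutive elements, so each warp's loads coalesce into a few transactions, while the second spreads a warp's loads across distant addresses and many transactions:
// Coalesced: neighboring threads read neighboring addresses.
__global__ void coalescedRead(const float* a, float* out, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = a[idx];
}

// Strided: neighboring threads read addresses `stride` elements apart.
__global__ void stridedRead(const float* a, float* out, int n, int stride) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = a[(idx * stride) % n];
}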
Below is a code example demonstrating coalesced memory access in a matrix multiplication kernel:
__global__ void matrixMultiply(float* A, float* B, float* C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {   // guard against threads outside the matrix
        float sum = 0;
        for (int i = 0; i < width; ++i) {
            // B[i * width + col] is coalesced: threads in a warp differ only in col,
            // so they read consecutive elements of B. A[row * width + i] is the same
            // address for every thread in the warp, which the hardware broadcasts.
            sum += A[row * width + i] * B[i * width + col];
        }
        C[row * width + col] = sum;
    }
}
CUDA GPUs have multiple levels of memory, each with different characteristics:
- Global Memory: The largest, but slowest, memory accessible by all threads. Global memory accesses are typically cached.
- Shared Memory: Fast, per-thread-block memory that is shared among threads within a block. Shared memory is manually managed and explicitly allocated by the programmer.
- Registers: Per-thread storage, the fastest memory on the GPU but limited in quantity; values that do not fit in registers spill to slower local memory.
Shared memory is a fast, on-chip memory shared among threads within a thread block. It can be used to store data that needs to be shared and accessed quickly among threads. Proper utilization of shared memory can significantly improve performance.
Matrix multiplication is a common example used to illustrate CUDA scheduling and memory access. Below is a simplified CUDA code example for matrix multiplication:
// Matrix multiplication kernel
__global__ void matrixMultiply(int* A, int* B, int* C, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < width && col < width) {   // guard: the grid may be slightly larger than the matrix
        int sum = 0;
        for (int i = 0; i < width; ++i) {
            sum += A[row * width + i] * B[i * width + col];
        }
        C[row * width + col] = sum;
    }
}
int main() {
    int width = 1024;                         // Matrix dimension (width x width)
    int size = width * width * sizeof(int);
    // Allocate and initialize host matrices A and B; allocate C for the result
    int* A = (int*)malloc(size);
    int* B = (int*)malloc(size);
    int* C = (int*)malloc(size);
    // ... initialize A and B ...
    int* d_A, * d_B, * d_C; // Device memory pointers
    // Allocate memory on the GPU
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);
    // Copy data from host to device
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    // Define thread block and grid dimensions
    dim3 blockDim(16, 16);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (width + blockDim.y - 1) / blockDim.y);
    // Launch the kernel
    matrixMultiply<<<gridDim, blockDim>>>(d_A, d_B, d_C, width);
    // Copy the result from device to host
    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    // Perform further processing with the result C, then free host memory
    // ...
    free(A);
    free(B);
    free(C);
    return 0;
}
In this code example:
- The matrixMultiply kernel is launched with a grid of blocks and threads.
- Each thread calculates one element of the result matrix C by iterating over the corresponding row and column.
- Proper indexing and memory access patterns are essential for efficient memory coalescing.
Optimizing memory access patterns and minimizing thread divergence are essential for achieving high performance in CUDA programs. Properly tuned CUDA kernels can significantly accelerate parallel computing tasks, especially those involving large datasets like matrix multiplication.
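As a sketch of one such optimization (a variant not shown above; it assumes width is a multiple of the 16x16 tile size and a matching 16x16 launch configuration), the kernel below tiles the computation: each block loads sub-blocks of A and B into shared memory, synchronizes, and reuses them for many partial products, cutting global-memory traffic roughly by the tile width:
#define TILE 16

__global__ void matrixMultiplyTiled(const int* A, const int* B, int* C, int width) {
    __shared__ int tileA[TILE][TILE];
    __shared__ int tileB[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    int sum = 0;
    for (int t = 0; t < width / TILE; ++t) {
        // Each thread loads one element of the current A tile and one of the B tile.
        tileA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];
        __syncthreads();  // make sure the whole tile is loaded before using it
        for (int i = 0; i < TILE; ++i) {
            sum += tileA[threadIdx.y][i] * tileB[i][threadIdx.x];
        }
        __syncthreads();  // make sure everyone is done before the tiles are overwritten
    }
    C[row * width + col] = sum;
}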
cuBLAS (CUDA Basic Linear Algebra Subroutines) is a GPU-accelerated library developed by NVIDIA that provides a collection of high-performance linear algebra routines for use in CUDA applications. cuBLAS allows developers to harness the computational power of NVIDIA GPUs to perform essential linear algebra operations efficiently.
- Matrix Multiplication: cuBLAS provides highly optimized functions for matrix-matrix multiplication (GEMM - General Matrix Multiply), which is a fundamental operation in many scientific and machine learning applications.
- Linear Solvers: It offers routines for solving linear systems of equations (e.g., LU decomposition, triangular solvers) and computing matrix inverses.
- Matrix Factorization: cuBLAS supports Cholesky decomposition and QR factorization.
- Vector Operations: Basic vector operations like dot product, vector-vector addition, and scaling are available.
- Batched Operations: cuBLAS supports batched versions of many routines, allowing you to efficiently perform the same operation on multiple sets of matrices or vectors.
- Stream Support: cuBLAS can operate asynchronously with CUDA streams, which is important for overlapping computation and data transfers (see the sketch after this list).
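For stream support, the usual pattern (a sketch; the function, handle, and stream names are illustrative) is to attach a CUDA stream to the cuBLAS handle so that subsequent cuBLAS calls are issued asynchronously on that stream:
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemmOnStream() {
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All subsequent cuBLAS calls on this handle are enqueued on `stream`
    // and can overlap with work on other streams (e.g., cudaMemcpyAsync).
    cublasSetStream(handle, stream);

    // ... cublasSgemm(handle, ...) calls go here ...

    cudaStreamSynchronize(stream);  // wait for the queued work to finish
    cudaStreamDestroy(stream);
    cublasDestroy(handle);
}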
Below is a simple code example demonstrating matrix multiplication using cuBLAS:
#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>
int main() {
int m = 4; // Number of rows in A
int n = 3; // Number of columns in B
int k = 2; // Number of columns in A and rows in B
// Host matrices
float* h_A = new float[m * k];
float* h_B = new float[k * n];
float* h_C = new float[m * n];
// Initialize host matrices A and B
// ...
// Device matrices
float* d_A, * d_B, * d_C;
cudaMalloc((void**)&d_A, sizeof(float) * m * k);
cudaMalloc((void**)&d_B, sizeof(float) * k * n);
cudaMalloc((void**)&d_C, sizeof(float) * m * n);
// Copy data from host to device
cudaMemcpy(d_A, h_A, sizeof(float) * m * k, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, sizeof(float) * k * n, cudaMemcpyHostToDevice);
// cuBLAS initialization
cublasHandle_t handle;
cublasCreate(&handle);
// Perform matrix multiplication with cuBLAS.
// cuBLAS assumes column-major storage; passing B and A in swapped order with
// dimensions (n, m, k) yields the row-major product C = alpha * A * B + beta * C
// in d_C without any explicit transposes.
float alpha = 1.0f;
float beta = 0.0f;
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, d_B, n, d_A, k, &beta, d_C, n);
// Copy the result from device to host
cudaMemcpy(h_C, d_C, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
// Free device and host memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
delete[] h_A;
delete[] h_B;
delete[] h_C;
// Destroy cuBLAS handle
cublasDestroy(handle);
return 0;
}
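For the vector operations listed above, the call pattern is similar; a minimal sketch (the helper vectorOps and the device arrays d_x and d_y, assumed to be allocated and filled elsewhere, are illustrative):
// Sketch: y = alpha * x + y and a dot product on device vectors of length n.
void vectorOps(cublasHandle_t handle, int n, float* d_x, float* d_y) {
    float alpha = 2.0f;
    float dotResult = 0.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);     // d_y = alpha * d_x + d_y
    cublasSdot(handle, n, d_x, 1, d_y, 1, &dotResult);  // dotResult = d_x . d_y (host pointer mode)
}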
resources : CUDA C++ Programming Guide, py-numba, A beginner's guide to GPU programming and parallel computing with CUDA, Introduction to GPU Programming with CUDA and Thrust, OpenCL, CUDA Crash Course, Parallel Computing with Nvidia CUDA, CUDA Programming - C/C++ Basics, CUDA Programming Tutorials NPTEL, GPU Architectures and Programming, CUDA programming in Python with numba and cupy, CUDA Programming on Python, Getting Started With CUDA for Python Programmers, How CUDA Programming Works | GTC 2022, CUDA Simply Explained - GPU vs CPU Parallel Computing for Beginners.