University of Pennsylvania, CIS 565: GPU Programming and Architecture Final Project
- Introduction
- Gaussian Mixture Models
- ICP Misalignment
- GMM Registration with Noisy Targets
- Implementation
- Use Cases
- Performance Analysis
- Requirements for Code
- References
## Introduction
Representing continuous geometry with discrete point clouds introduces artifacts and gives us no way to handle noise or uncertainty in the data; discrete representations are also non-differentiable. A probabilistic representation of point clouds, in contrast, can be used for up-sampling, mesh reconstruction, and dealing effectively with noise and outliers. In this project, we focus on training Gaussian Mixture Models (GMMs), a class of generative models, on 3D point clouds. We then use the learned GMM for Point Cloud Registration (PCR). PCR is used extensively in computer vision and robotics for 3D object mapping, SLAM, dense reconstruction, 3D pose estimation, etc. In autonomous driving, massive amounts of 3D point cloud data are captured by various sensors (e.g. LiDAR) in different coordinate systems; PCR aligns a pair of point clouds into a common coordinate frame.
The most widely used registration algorithm, Iterative Closest Point (ICP), performs poorly in the presence of noise or outliers, or when the point cloud has uneven density or occlusion artifacts. Statistical approaches based on GMMs address these drawbacks but have historically been too slow and inefficient for real-time applications. With the help of the GPU and recent advances from Accelerated Generative Models for 3D Point Cloud Data and Fast and Accurate Point Cloud Registration using Trees of Gaussian Mixtures, we can efficiently scale GMM training and registration on 3D point clouds.
## Gaussian Mixture Models
Gaussian mixtures model the joint probability distribution of data as a mixture of normal distributions (called components or clusters). The parameters of a GMM are estimated by maximizing the log-likelihood via the Expectation-Maximization (EM) algorithm.
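Concretely, a GMM with $K$ components models the density of a point $x$ as

$$
p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\left(x \mid \mu_k, \Sigma_k\right), \qquad \sum_{k=1}^{K} \pi_k = 1 .
$$

EM then alternates between the E step, which computes each point's responsibility $\gamma_{ik} = \pi_k \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \,/\, \sum_{j} \pi_j \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$, and the M step, which re-estimates $\pi_k$, $\mu_k$, and $\Sigma_k$ from the responsibility-weighted points.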
In the figures below, we cluster a set of 2D points with a GMM. Unlike KMeans, a GMM supports soft clustering, where a point can belong to multiple clusters with a corresponding probability; a small code sketch of soft assignments follows the figures.
Before Clustering | After Clustering |
---|---|
Another example,
Before Clustering | After Clustering |
---|---|
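As a minimal sketch of soft clustering with scikit-learn (which we also use as a CPU reference; the 2D data here is synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping synthetic 2D blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, (200, 2)),
               rng.normal([3.0, 0.0], 1.0, (200, 2))])

# Fit a 2-component GMM via EM
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)

# Soft assignment: each row is a probability distribution over components
probs = gmm.predict_proba(X)
print(probs[0])  # a point between the blobs gets weights near [0.5, 0.5]
```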
We also trained Gaussian Mixture Models on the Stanford Bunny to visualize GMMs on 3D point cloud data. The two GIFs below show 100 and 800 components respectively.
100 Components | 800 Components |
---|---|
On the Dragon,
10 Components | 50 Components |
---|---|
The bottleneck in conventional GMM modelling is the linear search over all components when fitting the dataset: every EM iteration repeatedly computes the likelihood of each point against clusters that are very far from it, which wastes computation and reduces efficiency. One way to avoid searching over all clusters is to build a Hierarchical Gaussian Mixture Model (HGMM) tree, which gives us the benefits of an octree-like structure. Arranging the Gaussians in a tree also provides adaptive scaling: some points can be fitted with small-variance Gaussians while others are clustered with large-variance Gaussians at the same time. The construction algorithm is GPU-accelerated and parallelizable, extracting maximum benefit from the GPU. As in the reference, the complexity of the algorithms is measured in terms of Associative Complexity, the complexity of the data-association problem over all N points (the E step in EM-based methods), and Optimization Complexity, the size of the optimization problem (the M step in EM-based methods).
Source Link: GPU-Accelerated-3d-Point-Cloud-Processing-with-Hierarchical-Gaussian-Mixtures
For N points and J clusters, the complexities of the ICP-, GMM-, and HGMM-based registration methods are given below:
Algorithm | Associative Complexity | Optimization Complexity | Construction Complexity |
---|---|---|---|
EM-ICP | O(N log N) | O(N²) | --- |
GMM-Reg | O(N) | O(N²) | O(NJ) |
HGMM-Reg | O(N log J) | O(log J) | O(N log J) |
In our implementation, we chose 8 nodes per parent, so each node is defined as the weighted sum of 8 child Gaussians. Once the EM algorithm assigns a point to a parent node's cluster, the point subsequently checks its likelihood only against the children of that parent; the search space therefore shrinks exponentially as we descend the tree, which leads to high efficiency. A simplified sketch of this recursive construction follows.
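The sketch below is a rough CPU illustration of the idea, with scikit-learn's `GaussianMixture` standing in for a single node's EM (our actual implementation runs the EM steps in GPU kernels; `build_hgmm` and `min_points` are hypothetical names):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def build_hgmm(points, depth, branching=8, min_points=64):
    """Fit an 8-way GMM tree: every node hard-assigns its points to the
    most likely child, and each child is refined on its own subset only,
    so a point is never compared against far-away clusters."""
    node = GaussianMixture(n_components=branching).fit(points)
    if depth == 1:
        return {"gmm": node, "children": None}
    labels = node.predict(points)  # data association restricted to this subtree
    children = []
    for k in range(branching):
        subset = points[labels == k]
        if len(subset) >= max(min_points, branching):
            children.append(build_hgmm(subset, depth - 1, branching, min_points))
        else:
            children.append(None)  # too few points to refine further
    return {"gmm": node, "children": children}

# Example: a level-2 tree over random points (8 + 64 = 72 components)
tree = build_hgmm(np.random.rand(10_000, 3), depth=2)
```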
## ICP Misalignment
Iterative Closest Point (ICP) estimates correspondences and uses point-to-point comparison for registration. The algorithm tries to align each source point to its nearest target point, using an SVD-based optimization to obtain the transformation parameters (rotation and translation, for a rigid transformation), and the process repeats iteratively until the loss converges to a minimum. A drawback of this approach is that it tries to align the source's center of mass with the target's; in the process, the optimization can converge to a local minimum where the point clouds are not properly aligned, as shown in the figures below.
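For reference, a single ICP iteration can be sketched as follows (assuming NumPy and SciPy; `icp_step` is an illustrative helper, not part of our codebase):

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(source, target):
    """One ICP iteration: nearest-neighbour correspondences, then the
    closed-form rigid transform (rotation + translation) via SVD."""
    # 1. Correspondences: each source point matched to its nearest target point
    _, idx = cKDTree(target).query(source)
    matched = target[idx]
    # 2. Centre both sets on their centroids
    mu_s, mu_t = source.mean(axis=0), matched.mean(axis=0)
    S, T = source - mu_s, matched - mu_t
    # 3. SVD of the cross-covariance yields the optimal rotation (Kabsch)
    U, _, Vt = np.linalg.svd(S.T @ T)
    d = np.sign(np.linalg.det(U @ Vt))           # guard against reflections
    R = (U @ np.diag([1.0, 1.0, d]) @ Vt).T
    t = mu_t - R @ mu_s
    return R, t  # iterate: source = source @ R.T + t, until the loss converges
```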
## GMM Registration with Noisy Targets
Some alignments produced by the probabilistic GMM registration with noisy points in the target (source in red, target in green):
Bunny Dataset | Dragon Dataset |
---|---|
## Implementation
Initially, we started with a pure C++ approach for the GMM, but realized that rapid iteration and integration with other components would not be easy. So we settled on the following configuration:
- Implement in Python and use CuPy for NumPy-style arrays on the GPU. This allows accelerated matrix operations (multiply, dot product, reduction, etc.) on the GPU while providing a NumPy-like interface.
- For fine-grained control and custom kernels, we use Numba. Numba enables us to write CUDA kernels in Python that are JIT-compiled to CUDA C. An example Numba CUDA kernel is illustrated below (source):
```python
from numba import cuda

@cuda.jit
def increment_by_one(an_array):
    # Thread id in a 1D block
    tx = cuda.threadIdx.x
    # Block id in a 1D grid
    ty = cuda.blockIdx.x
    # Block width, i.e. number of threads per block
    bw = cuda.blockDim.x
    # Flattened position of this thread within the array
    pos = tx + ty * bw
    if pos < an_array.size:  # guard against out-of-range threads
        an_array[pos] += 1
```
Also, CuPy inter-operates with Numba CUDA device arrays, so we can shift seamlessly between Numba kernels and CuPy code; a small interop sketch follows this list.
- We use Open3D to process and visualize 3D point clouds
- For a reference CPU implementation and pre-defined transformations, we use ProbReg
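As a small sketch of that CuPy/Numba interoperability (the kernel and array sizes here are illustrative):

```python
import cupy as cp
from numba import cuda

@cuda.jit
def scale(arr, factor):
    # One thread per element of the flat array
    i = cuda.grid(1)
    if i < arr.size:
        arr[i] *= factor

x = cp.arange(1_000_000, dtype=cp.float32)  # allocated on the GPU by CuPy
threads = 128
blocks = (x.size + threads - 1) // threads
scale[blocks, threads](x, 2.0)  # Numba consumes the CuPy array directly
y = x.sum()                     # back to CuPy ops, no host round-trip
```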
## Use Cases
In supervised image segmentation, architectures generally assign to each pixel a label denoting the cluster that pixel belongs to. In the unsupervised scenario, however, no training images or ground-truth pixel labels are available beforehand; once a target image is input, cluster labels are predicted jointly with feature representations by minimizing a cost function. In our case, clustering can be used to segment alike features into the same Gaussian: the more components, the finer the groups become. We use our GPU implementation to speed up the clustering and segment the scene faster. The reference for unsupervised image segmentation can be seen here.
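A minimal sketch of this segmentation-by-clustering on a point cloud (the file name is a placeholder, and scikit-learn stands in for our GPU GMM here):

```python
import numpy as np
import open3d as o3d
from sklearn.mixture import GaussianMixture

# Load any point cloud Open3D can read (file name is hypothetical)
pcd = o3d.io.read_point_cloud("lounge.ply")
pts = np.asarray(pcd.points)

# Cluster the raw 3D coordinates; more components give finer segments
n = 50
labels = GaussianMixture(n_components=n).fit(pts).predict(pts)

# Colour every point by its hard cluster assignment
palette = np.random.rand(n, 3)
pcd.colors = o3d.utility.Vector3dVector(palette[labels])
o3d.visualization.draw_geometries([pcd])
```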
On the Lounge (10k points with 10 and 50 components)
In the image on the left, we can see how different objects in the scene are neatly grouped together. This can serve as a starting point for semantic segmentation.
We can also visualize a GMM trained on a sequence of LiDAR point clouds from the Waymo Open Dataset
Starting with just 1k points and 50 Gaussian components (GMM is re-trained every 10 frames),
With 10k points,
With 50k points,
We ran into three recurring issues while training GMMs on moving point clouds:
- Since the Open3D visualizer runs on the CPU, we are forced to initiate a GPU-to-CPU transfer every frame, which becomes a bottleneck for fast real-time visualization
- GMM convergence depends heavily on the initial parameters. Ideally they are initialized using KMeans, but that introduces additional overhead, so we initialize them randomly; as a result, the algorithm converges to a poor local optimum on some frames
- Every time the GMM is re-trained on an incoming point cloud, we have to assign new colours based on the new cluster assignment of each point. While semantically correct, the changing colours are a bit jarring from a visualization perspective.
We can use PCR for state estimation and localization of a self-driving car or mobile robot. Aligning successive frames of sensor data (LiDAR point clouds) lets us locate the robot and trace its path; the rotation and translation parameters obtained from each point cloud registration give the frame-to-frame motion, as sketched below.
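A minimal sketch of chaining per-frame registration results into a trajectory (`accumulate_trajectory` is an illustrative helper; each `(R, t)` pair is assumed to come from registering one frame against the next):

```python
import numpy as np

def accumulate_trajectory(transforms):
    """Compose frame-to-frame rigid transforms (R, t) from PCR into a
    global pose, recording the sensor position at every frame."""
    pose = np.eye(4)
    path = [pose[:3, 3].copy()]
    for R, t in transforms:
        step = np.eye(4)
        step[:3, :3], step[:3, 3] = R, t
        pose = pose @ step          # accumulate motion in the world frame
        path.append(pose[:3, 3].copy())
    return np.array(path)           # (num_frames + 1, 3) positions to plot
```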
The following slide from State-Estimation and Localization for Self-Driving Cars neatly illustrates this,
On the Waymo Open Dataset, we use PCR to localize the car. The path of the car is illustrated below.
The visual above uses a CPU implementation. Unfortunately, due to logistical issues, we couldn't recreate it with our GPU implementation.
## Performance Analysis
We compare the CPU and GPU implementations of GMM and HGMM (level 2, with 72 components) trained on 3D point clouds. The GMM CPU and GPU versions were trained for a fixed number of iterations, while the HGMM was trained until convergence.
There are two variations to consider:
- Performance while varying the total number of points
- Performance while varying the total number of components
Time vs. No. of Points | Time vs. No. of Components |
---|---|
From the graphs, it is clear that HGMM is the fastest. We also measured the performance of HGMM while varying the number of levels of the GMM tree.
Note that as we increase the depth of the GMM tree, the total number of Gaussian components grows as follows:
Tree Level | Number of Components |
---|---|
2 | 72 |
3 | 584
4 | 4680 |
5 | 37448 |
We also evaluate the FPS on Waymo's LiDAR data and compare CPU vs. GPU.
While the GPU is much faster than the CPU, we are still not close to real-time performance.
- For 10k points, while we achieve close to 10 FPS, we are bottlenecked by the Open3D visualizer rendering via the CPU.
- For larger point clouds (> 50k points), the GPU-to-CPU data transfer is no longer the bottleneck; our GMM implementation itself is not fast enough.
## Requirements for Code
- Python 3.5 or above
- Numba 0.43.1
- CuPy 7.0.0
- Open3D
- Scikit-learn
- Probreg 0.1.7
- Waymo Open Dataset
- Stanford Bunny Dataset
- Lounge Dataset
- For running the GMM modelling on the Waymo Dataset, run:

```
python ./src/python/gmm_waymo/run_gmm_waymo_gpu.py
```

- For running GMM PCR on the Bunny, Lounge, and Dragon datasets, look at the file:

```
/src/python/python/gmm_waymo/run_gmm_static.py
```

- For running HGMM PCR, run:

```
python src/python/hgmm/hgmm_gpu.py
```