-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FBP filtering on GPU #423
base: master
Are you sure you want to change the base?
FBP filtering on GPU #423
Conversation
98fa77f
to
262ac8f
Compare
Worked. (a39e6b2) Results of python test codeConditions0: NVIDIA GeForce GTX 1060 6GB BeforeCPU: 24.03577409999707 s AfterCPU: 24.188547600002494 s |
@AnderBiguri Do you think we should keep the option to use/not to use GPU for FBP filtering for |
@tsadakane Fantastic work, honestly! I am actually surprised that GPU is not much faster, but still the speedup achieved is quite significant. Perhaps you are running in a CPU with many cores? I will test this on my side in a couple of machines (old/new) and report speed (sometime next week). I think if we can more or less verify that the filtering is always faster on GPU, we can just remove the CPU option from FBP/FDK. We can always leave the matlab/python code there, deprecated, for people to be able to read what the CUDA part does, but in an easier mode. What do you think? |
@AnderBiguri Thank you for your reply. I think it would be better to remove the old code if possible. Let me know the result and I can remove the option before submitting this PR.
The CPU is i7-11700 (8 cores, 16 threads). |
@tsadakane fantastic, I think its not a bad idea, it will live in git's history anyway, so you are right. In your time comments, you are saying that most of the time (7s) is on the FFT maths, and the rest is memory and context management? |
In my i7-7700HQ/GTX 1070 laptop, GPU is often slower: MATLAB: geo.nDetect 256^2, nagles=100, niter=10:
geo.nDetect 512^2, nagles=100, niter=10:
geo.nDetect 1024^2, nagles=100, niter=10:
I still am quite surprised by this performance, I really did think it would be much faster. I'll try to dig a bit further, I am not sure if there is something extra we can optimize in the code that we may not be aware of, or if my assumption that it would be much faster is seriously flawed. |
Sorry for unclear English.
(*) In (**) Returning at the beginning of |
@tsadakane I see, thanks. So if we were to try to optimize this we'd need to look at the python<->cuda interoperability. I need to sit down and read the code more carefully, as this seems surprisingly high? What do you think? |
I have no idea so far.
to reduce the number of calling cuda function to 1/32, but the total time did not change (12s). |
I wrote multi-gpu version and cudaMemcpyAsync version, but as expected, the both were no effect. |
Just removing padding from
I'm worried that by moving "padding" to cuda, the "if GPU" code is not parallel to CPU. What do you think? I have not yet run this modification on Matlab and not yet commited. |
@tsadakane interesting! Sorry, I am a bit busy this week, perhaps next week, so I don't have time to look at it with the detail it deserves. |
About aa51ec6"CPU" uses CPU for convolution. TestsPythontest code
Result
Malab
Test code
Result
|
The table below is the summary of the results above.
Here, two GPUs are used. |
I ran the python test program above after adding Python
It seems that the overhead of the copying is larger than the benefit of doubling the GPU resource. |
@tsadakane that makes sense. Stil surprised about the times, but I think its my own false assumptions of how fast it should be. Still really good :) |
I am also a bit baffled by the MATLAB slowness, maybe something to do with extra copies? let me check. |
I think matlab code (copies &) transposes the matrix; the fastest variable in the memory is |
@tsadakane ah! We may be able to do then the same that we do with the other cuda code? Thanks again, your work is really appreciated! |
I can do it, but I am not sure if it works, because cuda is not good at transposing. It was not necessary to transpose the projection in ax and atb, because it is copied to the 3d texture, where there is no difference in the memory access efficiency between u and v axes. On the other hand, the convolution must be applied to the u-direction, so transposing in the memory layout is necessary. That's said, I think, from the point of view of symmetry, permutation line should be moved to the filtering function if possible, i.e Lines 61 to 77 in f9a38ed
=>
This modification makes the code a little bit beautiful, but the specification of the filtering function will be changed (the first arg What do you think? |
@tsadakane I see what you mean with the filtering. My convolution/fft maths are a bit rusty, so I may be completely wrong, but isn't there a way to change the shape of the filter itself so it is equivalent to transposing the projections? That would be the ideal solution, if possible, right? About changing the transpose to inside |
Hi @tsadakane, sorry I have been very busy with my life. How is this going? do you need any help? Is it done? |
Hi @AnderBiguri, I've been busy lately too. I've implemented the code to transpose matrices, but I am not satisfied with the result and trying several granularities of processing to pass to mex/cuda. |
* Added Common files
* Modified and added Matlab files
* Modified and added Python files
* Use cufftExecR2C and cufftExecC2R to omit slow copy.
* Padding and FBP filtration are applied in `apply_filtration2` function
* Modified matlab files
* Modified python files
* Moved transposing projections into filtering
bb7e0d7
to
7c97471
Compare
Aparently making the |
Under the condition of
On
Not bad. |
I think the GPU implementation can be valuable too. I think the fastest FDK would be one that does interleave filtering and backprojection. I read a paper about it at some point. |
Hi @tsadakane |
Hi, @AnderBiguri , It's difficult to answer, but I'd like to say no.
|
@tsadakane makes sense! Flexibility is a bit less important because we can add a flag with default CPU, but I think the rest of the points make total sense. It may be worth leaving the PR here (if you are happy with that), in case anyone in the future wants to use the code? |
Summary
Implemented FBP filtering on GPU using cuFFT.
See #312
Test
Condition
Python
Matlab
Discussion
This code consumes more time than CPU even when the projections are size of 2048x2048.
TIGRE/Common/CUDA/FbpFiltration.cu
Lines 109 to 113 in 98fa77f
and
TIGRE/Common/CUDA/FbpFiltration.cu
Lines 158 to 162 in 98fa77f
consumes time, while FFT and multiplying the filter in the Fourier space is fast enough.
These are necessary as long as we use FFT C2C.
R2C and C2R may work.
As for now, the CUDA codes are common for Matlab and Python because
filtering.m
transposes the matrix before calling fft.Only one GPU is used.
Only one stream is used.