Gunicorn Workers Not Using GPU in Parallel #2985
vibhas-singh started this conversation in General
Replies: 3 comments
-
I moved it to a discussion since it's more likely an OS issue than directly related to gunicorn.
-
I'm having a similar issue. @vibhas-singh did you resolve this issue?
-
Any updates on this @vibhas-singh or @Irtiza17? I am also facing a similar issue. Any help will be greatly appreciated.
-
I am trying to deploy a PyTorch image classification model wrapped in Flask on g4dn.xlarge (4 vCPU, 16 GB RAM, T4 GPU with 16 GB memory) instances on AWS. For selecting the optimal number of workers I performed some experiments:
Experiment 1:
Concurrent Requests: 1
Total Time To Process 15 Requests By A Client: 15.87s (`model.forward` takes 14.98s)

Experiment 2:
Concurrent Requests: 2 (2 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 29.35s (`model.forward` takes 28.34s, 2x of a single request, every other step taking a similar time)

Experiment 3:
Concurrent Requests: 3 (3 clients sending requests in parallel)
Total Time To Process 15 Requests By A Client: 43.82s (`model.forward` takes 41.81s, 3x of a single request, every other step taking a similar time)

Using 3 workers lets me process 3 requests in parallel, but the overall processing time of those requests also becomes 3x, so there is no improvement in real terms.
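The pattern in the experiments above can be reproduced with a pure-Python sketch (no GPU or PyTorch required; the lock and sleep below are hypothetical stand-ins for the single T4 and one `model.forward` call): if every worker's forward pass contends for one shared device, the passes serialize, so with N concurrent clients each request takes roughly N times the single-client latency and total throughput stays flat.

```python
# Simulate N concurrent clients whose "forward passes" must all go
# through one shared resource (a stand-in for the single GPU).
import threading
import time
from concurrent.futures import ThreadPoolExecutor

gpu = threading.Lock()       # stand-in for the single T4 GPU
FORWARD_TIME = 0.05          # stand-in for one model.forward call (seconds)

def forward():
    # Only one "forward pass" can hold the device at a time,
    # so concurrent calls serialize just like in the experiments.
    with gpu:
        time.sleep(FORWARD_TIME)

def client(n_requests):
    # One client sending n_requests sequentially; returns its wall time.
    start = time.perf_counter()
    for _ in range(n_requests):
        forward()
    return time.perf_counter() - start

for n_clients in (1, 2, 3):
    with ThreadPoolExecutor(n_clients) as pool:
        times = list(pool.map(client, [5] * n_clients))
    # Per-client wall time grows roughly linearly with n_clients,
    # while total requests completed per second stays flat.
    print(f"{n_clients} concurrent client(s): "
          f"{max(times):.2f}s for 5 requests each")
```

This is only a model of the observed behavior, not a claim about the root cause; it just shows that the 2x/3x latency numbers are exactly what full serialization of `model.forward` on one device would produce.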
I initially thought CPU or I/O was the bottleneck in the app, but after intensively logging the time taken at each step, I found that the bottleneck is the GPU processing (`model.forward` starts taking 2x-3x as long). By checking the process IDs of the workers for each request, I can also confirm that all the workers are receiving requests in parallel, but they are not able to perform the GPU processing in parallel at the same time.
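The per-step logging described above can be sketched like this (pure Python; `handle_request` and the sleeps are hypothetical stand-ins for the real preprocessing, `model.forward`, and postprocessing stages): each stage is wrapped in a timer that also records the process ID, which is how one can tell which gunicorn worker handled a request and where its time went.

```python
# Per-stage timing with the worker PID attached to every log line,
# so logs from multiple gunicorn workers can be told apart.
import os
import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    # Times the enclosed block and prints the elapsed time with the
    # current process ID (each gunicorn worker is its own process).
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"[pid {os.getpid()}] {stage}: {elapsed:.3f}s")

def handle_request():
    with timed("preprocess"):
        time.sleep(0.01)     # stand-in for image decoding/resizing
    with timed("model.forward"):
        time.sleep(0.05)     # stand-in for the GPU forward pass
    with timed("postprocess"):
        time.sleep(0.01)     # stand-in for building the response

handle_request()
```

With this kind of logging, seeing different PIDs with overlapping timestamps but a `model.forward` duration that grows with concurrency is what points the finger at the GPU rather than at CPU or I/O.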
Any guidance on what the bottleneck might be would be very helpful.
Also, is there a recommended worker type for this kind of GPU-dependent processing?