Testing & Benchmark #5
Thanks for your comments. We have indeed used the TensorFlow benchmarks in our evaluation and have experimented with both. We will commit our TensorFlow experimental setup in due time. Our Docker container should already be sufficient to re-run our ResNet-50 experiments; I will document this shortly too.
Thanks for your prompt answer ;) I will be very happy to report back the results I obtain on 8x Tesla V100. Thanks a lot once again, all the best.
Experiments with ResNet-50 on 8x V100 certainly align with our course of action - I am about to give them a go. I am more than happy to share this setup with you. Re: paper improvements, besides the variable-update and all-reduce strategies used, what else would you consider missing from the experimental setup? Feel free to ping me with additional comments. I will leave this issue open to keep you informed as I make progress with your requests.
Thanks for the additional information, much appreciated. If you want, we can set up a call so that I can launch experiments with your help on DGX-1 and DGX-2 systems (8x Tesla V100 16GB, 8x Tesla V100 32GB, and 16x Tesla V100 32GB).

A few practical points:

- Make sure you use the TFRecords version of ImageNet; it allows you to maximise throughput.
- This kind of system can really be interesting, but you need to compare on all factors, e.g. average CPU load for TF.Distributed/Horovod/Crossbow. It is quite likely that your approach is much more intensive on CPU/RAM, for example, as you launch more threads ;) So it is important to highlight that point, or people may report that they cannot reproduce your results because they do not have as much RAM as you.
- Metrics: imgs/sec seems the best performance proxy to measure throughput.

I genuinely think this is really interesting and want to try it as soon as possible. However, the publication felt a little unclear at first read.
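Since imgs/sec is the proposed throughput proxy, a small helper for aggregating it from benchmark logs may be handy when comparing runs. A minimal sketch in Python; the log-line format assumed here (`... images/sec: 362.1 +/- 0.5 ...`) follows the style printed by tf_cnn_benchmarks and is an assumption, so adjust the pattern to your actual logs.

```python
import re
from statistics import mean

# Matches per-step throughput lines in the style printed by
# tf_cnn_benchmarks, e.g. "100  images/sec: 362.1 +/- 0.5 (jitter = 2.3)".
# The exact log format is an assumption; tweak the regex if yours differs.
STEP_RE = re.compile(r"images/sec:\s*([0-9]+(?:\.[0-9]+)?)")

def mean_images_per_sec(log_lines):
    """Return the mean imgs/sec over all matching lines, or None if none match."""
    rates = [float(m.group(1))
             for line in log_lines
             if (m := STEP_RE.search(line))]
    return mean(rates) if rates else None

if __name__ == "__main__":
    sample = [
        "10  images/sec: 360.0 +/- 0.5 (jitter = 2.3)",
        "20  images/sec: 364.0 +/- 0.4 (jitter = 2.1)",
    ]
    print(mean_images_per_sec(sample))  # prints 362.0
```

Averaging over the logged steps (rather than quoting a single peak number) makes comparisons across TF.Distributed, Horovod, and Crossbow less noisy.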
Hi,
Very interesting work. I have some remarks:
In your paper, you often speak about performance "compared to TensorFlow", but which distribution strategy do you mean? Which strategy: xring? nccl?
I guess you ran the experiments on vanilla TF; could we see the code you used to collect these numbers? By the way, if I understood correctly, you used a non-official implementation that is not the best one available; comparing against https://github.com/tensorflow/benchmarks would be a lot more interesting.
It offers different tf.distributed strategies plus Horovod.
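For a sweep over those strategies, the tf_cnn_benchmarks driver in tensorflow/benchmarks accepts a `--variable_update` flag. A hedged sketch that only builds the command lines, assuming the flag names from that repository's README; the data path and batch size are placeholders, not values from the paper:

```python
# Build tf_cnn_benchmarks command lines for several distribution
# strategies. Flag names follow tensorflow/benchmarks' tf_cnn_benchmarks;
# the data_dir and batch size below are hypothetical placeholders.

def benchmark_cmd(variable_update, num_gpus=8, all_reduce_spec=None):
    """Return the argv list for one ResNet-50 benchmark run."""
    cmd = [
        "python", "tf_cnn_benchmarks.py",
        "--model=resnet50",
        "--data_name=imagenet",
        "--data_dir=/data/imagenet-tfrecords",  # hypothetical TFRecords path
        f"--num_gpus={num_gpus}",
        "--batch_size=64",  # placeholder; match the paper's setting
        f"--variable_update={variable_update}",
    ]
    if all_reduce_spec:
        cmd.append(f"--all_reduce_spec={all_reduce_spec}")
    return cmd

# Strategies exposed by tf_cnn_benchmarks; horovod runs need an
# mpirun/horovodrun wrapper around the command.
STRATEGIES = [
    ("parameter_server", None),
    ("replicated", "nccl"),  # NCCL all-reduce
    ("distributed_replicated", None),
    ("horovod", None),
]

if __name__ == "__main__":
    for vu, spec in STRATEGIES:
        print(" ".join(benchmark_cmd(vu, all_reduce_spec=spec)))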
And by the way, if we could have a Docker container and a script to test your results with Crossbow, that would be interesting ;) => RN50 seems a decent benchmark, as you pointed out ;)
Thanks for your help, and congrats on this interesting project.