
Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster


Abstract

This paper presents a novel “Distributed Deep Learning Framework” for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach that combines parameter-server and all-reduce schemes to address the performance degradation that arises when deep learning applications run on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, supported by enhanced MPI-based collective communication. We implement the proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that the proposed framework improves computing performance by reducing I/O bottlenecks and effectively increases resource utilization in the heterogeneous multi-GPU cluster.
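To make the hybrid aggregation idea concrete, the following is a minimal, hypothetical sketch using mpi4py and NumPy: gradients are first averaged with an all-reduce inside each homogeneous GPU group, and one leader rank per group then exchanges the group average with a parameter-server rank. The rank layout, the two-group assignment, and the parameter-server update rule are illustrative assumptions only, not the paper's implementation; the actual framework is built on TensorFlow with its own asynchronous large mini-batch mechanism.

# Hypothetical sketch (not the authors' code): hybrid gradient aggregation that
# combines an intra-group all-reduce with a cross-group parameter-server exchange.
# Group assignment, PS update rule, and rank layout are illustrative assumptions.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()          # run with at least 5 ranks (2 x 2 workers + 1 PS)

PS_RANK = size - 1              # assumed: the last rank acts as the parameter server
is_ps = (rank == PS_RANK)

# Assumed: worker ranks are split into two homogeneous GPU groups.
group_id = MPI.UNDEFINED if is_ps else (0 if rank < (size - 1) // 2 else 1)
group_comm = comm.Split(color=group_id, key=rank)   # PS receives MPI.COMM_NULL

DIM = 4                         # stand-in for the flattened gradient length


def worker_step(local_grad):
    # Step 1: all-reduce inside the homogeneous group (similar-speed GPUs).
    group_avg = np.empty_like(local_grad)
    group_comm.Allreduce(local_grad, group_avg, op=MPI.SUM)
    group_avg /= group_comm.Get_size()

    # Step 2: the group leader trades the group average with the parameter server.
    if group_comm.Get_rank() == 0:
        comm.Send(group_avg, dest=PS_RANK, tag=0)
        comm.Recv(group_avg, source=PS_RANK, tag=1)

    # Step 3: broadcast the reconciled gradient back to every rank in the group.
    group_comm.Bcast(group_avg, root=0)
    return group_avg


def ps_step(num_groups=2):
    # Naive parameter-server rule for illustration: average the per-group
    # gradients and echo the result back to each group leader.
    buf = np.empty(DIM)
    acc = np.zeros(DIM)
    leaders = []
    for _ in range(num_groups):
        status = MPI.Status()
        comm.Recv(buf, source=MPI.ANY_SOURCE, tag=0, status=status)
        acc += buf
        leaders.append(status.Get_source())
    acc /= num_groups
    for leader in leaders:
        comm.Send(acc, dest=leader, tag=1)


if __name__ == "__main__":
    if is_ps:
        ps_step(num_groups=2)
    else:
        grad = np.full(DIM, float(rank))   # stand-in for a locally computed gradient
        print(f"rank {rank}: aggregated gradient {worker_step(grad)}")

Running this sketch with, for example, mpiexec -n 5 python hybrid_aggregate.py assigns ranks 0–1 and 2–3 to two worker groups and rank 4 to the parameter server; in the paper's setting, each group would correspond to a set of GPUs with similar performance, so the fast all-reduce is confined to homogeneous hardware while the slower cross-group reconciliation goes through the parameter server.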




Acknowledgements

This research was supported by the Basic Science Research Program and the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (2020R1F1A1072696, 2015M3C4A7065646), by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-01305, Development of AI Deep-Learning Processor and Module for 2,000 TFLOPS Server), by the GRRC program of Gyeonggi Province (No. GRRC-KAU-2020-B01, “Study on the Video and Space Convergence Platform for 360VR Services”), and by the ITRC (Information Technology Research Center) support program (IITP-2020-2018-0-01423).

Author information

Corresponding author

Correspondence to Jaehwan Lee.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Kim, Y., Choi, H., Lee, J. et al. Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster. Cluster Comput 23, 2287–2300 (2020). https://doi.org/10.1007/s10586-020-03144-9
