Abstract
As applications grow in size and supercomputer architectures grow in complexity, topology-aware process mapping becomes increasingly important. High communication cost has become a dominant constraint on the performance of applications running on supercomputers. To avoid poor mappings, which can severely degrade communication performance, we propose an optimized heuristic topology-aware mapping algorithm (OHTMA). The algorithm attempts to minimize the hop-byte metric, which we use to evaluate mapping results. OHTMA combines a new greedy heuristic method with pair-exchange-based optimization. It reduces the number of long-distance communications and effectively enhances communication locality. Experimental results on the Tianhe-3 exascale supercomputer prototype indicate that OHTMA can significantly reduce communication costs.
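To make the hop-byte metric and the pair-exchange idea concrete, the following is a minimal sketch and not the authors' OHTMA implementation. It assumes the common definition of hop-bytes (bytes exchanged between two processes multiplied by the number of network hops between the nodes hosting them, summed over all communicating pairs). The names `hop_bytes`, `pair_exchange`, `comm`, `dist`, and the toy line topology are hypothetical, and the greedy initial placement is omitted because the abstract does not describe it.

```python
from itertools import combinations


def hop_bytes(comm, dist, mapping):
    """Sum of bytes(i, j) * hops(node(i), node(j)) over communicating pairs.

    comm:    dict {(i, j): bytes exchanged between processes i and j}
    dist:    dist[u][v] = hop count between nodes u and v
    mapping: mapping[i] = node that hosts process i
    """
    return sum(vol * dist[mapping[i]][mapping[j]] for (i, j), vol in comm.items())


def pair_exchange(comm, dist, mapping):
    """Swap the placements of two processes whenever that lowers hop-bytes.

    Repeats full sweeps over all process pairs until no swap improves the
    cost, so the result is a local optimum, not necessarily a global one.
    """
    mapping = list(mapping)
    best = hop_bytes(comm, dist, mapping)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(range(len(mapping)), 2):
            mapping[a], mapping[b] = mapping[b], mapping[a]
            cost = hop_bytes(comm, dist, mapping)
            if cost < best:
                best = cost
                improved = True
            else:
                # Undo a swap that did not help.
                mapping[a], mapping[b] = mapping[b], mapping[a]
    return mapping, best


# Toy example (assumed data): 4 processes on a 4-node line, hops = |u - v|.
comm = {(0, 3): 100, (1, 2): 100, (0, 1): 1}
dist = [[abs(u - v) for v in range(4)] for u in range(4)]
initial = [0, 1, 2, 3]
refined, cost = pair_exchange(comm, dist, initial)
```

In this toy case the identity placement puts the heavily communicating processes 0 and 3 at opposite ends of the line, and the exchange step moves them next to each other. Recomputing the full cost after every swap keeps the sketch short; a practical implementation would more likely update only the terms that involve the two swapped processes.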
Author information
Contributions
Yi-shui LI designed the research. Jie LIU guided the research. Xin-hai CHEN helped perform experiments. Yi-shui LI drafted the manuscript. Bo YANG, Chun-ye GONG, and Xin-biao GAN helped organize the manuscript. Sheng-guo LI and Han XU helped modify the manuscript. Yi-shui LI revised and finalized the paper.
Additional information
Compliance with ethics guidelines
Yi-shui LI, Xin-hai CHEN, Jie LIU, Bo YANG, Chun-ye GONG, Xin-biao GAN, Sheng-guo LI, and Han XU declare that they have no conflict of interest.
Project supported by the National Key Research and Development Program of China (No. 2017YFB0202104).
About this article
Cite this article
Li, Ys., Chen, Xh., Liu, J. et al. OHTMA: an optimized heuristic topology-aware mapping algorithm on the Tianhe-3 exascale supercomputer prototype. Front Inform Technol Electron Eng 21, 939–949 (2020). https://doi.org/10.1631/FITEE.1900075
DOI: https://doi.org/10.1631/FITEE.1900075