QTMS: A quadratic time complexity topology-aware process mapping method for large-scale parallel applications on shared HPC system
Introduction
High performance computing (HPC) systems are generally used to execute parallel applications that solve complex science and engineering problems, such as numerical simulation, high-energy physics, and energy exploration [1]. As the number of computing nodes in HPC systems keeps growing, the impact of communication operations on overall application performance approaches or even surpasses that of arithmetic operations [2], [3]. Fig. 1 shows the relationship between the communication ratio and the number of processes for several representative parallel applications. Improving application communication performance has therefore become an important challenge.
Topology-aware process mapping aims to minimize the communication cost by embedding the application communication topology onto the physical topology graph of the computing resources (or the underlying network topology). This can effectively reduce the communication cost without changing the system hardware, the communication protocols, or the application codes. Hence, this type of method is widely used to optimize the communication performance of parallel applications [1]. However, finding an optimal mapping scheme is an NP-hard problem [7].
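The objective described above can be illustrated with a small sketch. It assumes one common cost model, the sum of communication weight times core-to-core distance over all communicating pairs; the function name `mapping_cost`, the weights, and the distance matrix are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def mapping_cost(comm: Dict[Tuple[int, int], float],
                 dist: List[List[float]],
                 mapping: List[int]) -> float:
    """Sum of weight * distance over all communicating process pairs.

    comm    maps an undirected process pair (p, q) to its communication weight.
    dist    is a symmetric core-to-core distance matrix (assumed given).
    mapping gives, for each process index, the core it is placed on.
    """
    return sum(w * dist[mapping[p]][mapping[q]] for (p, q), w in comm.items())


# Four processes; the weights are hypothetical message volumes.
comm = {(0, 1): 10.0, (1, 2): 1.0, (2, 3): 10.0, (0, 3): 1.0}

# Four cores on two nodes: cores 0 and 1 share a node, cores 2 and 3 share another.
dist = [[0, 1, 5, 5],
        [1, 0, 5, 5],
        [5, 5, 0, 1],
        [5, 5, 1, 0]]

good = [0, 1, 2, 3]  # heavy pairs (0,1) and (2,3) stay inside a node
bad = [0, 2, 1, 3]   # heavy pairs are split across nodes

print(mapping_cost(comm, dist, good))  # 30.0
print(mapping_cost(comm, dist, bad))   # 110.0
```

Evaluating one candidate mapping is cheap; the difficulty lies in the number of candidate mappings, which grows factorially with the number of processes.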
To address the mapping problem, numerous methods have been proposed over the past several decades. In the early stages, the topology mapping problem was regarded as a quadratic assignment problem (QAP) [8], and the mapping scheme was determined through heuristic searches, such as [9], [10], [11], [12], [13], [14], [15]. These methods usually do not need to assume that the allocated computing nodes form a regular structure. However, they are often excessively expensive for large-scale parallel applications because of the very large solution space and the high algorithmic cost.
Fortunately, many existing applications have significant communication locality, and many applications can benefit from improving it [16]. Numerous graph-partitioning based studies [7], [17], [18], [19], [20] have been proposed to improve the communication locality, and these methods commonly have a lower algorithmic cost than the heuristic-search based methods. Nevertheless, some drawbacks remain that weaken the applicability and optimization effect of these methods. Some graph-partitioning based methods, such as [21] and [16], assume that the allocated computing nodes form a regular structure. In practice, this assumption is rarely satisfied on most HPC systems, which are often shared by multiple users and applications [22], [23], and the performance of these methods degrades as a result. Further, the majority of these methods do not consider improving the placement of processes onto the computing cores within process-groups. Other graph-partitioning based mapping methods do not depend on the assumption of a regular computing allocation, but they still have a high cost or require users to provide accurate network structure information [17]. For example, MPIPP adopts the k-way [24] algorithm to solve the topology mapping problem, and the time complexity of each pass in MPIPP reaches O(n^3). The method of Wu et al. [20] is based on recursive bisection partitioning and needs the inter-node hops to construct a tree topology. However, the hops and other network structure information are hard for ordinary users to obtain on most current HPC systems, especially shared ones. Therefore, how to fit the shared HPC system environment while keeping the algorithmic cost low is an important challenge for topology-aware process mapping.
To tackle the above issues, this study proposes a quadratic time complexity topology-aware process mapping method, named QTMS. Instead of requiring users to provide the network structure information, QTMS constructs the node-groups by considering the neighbor-feature among the allocated nodes. The neighbor-feature is represented by the latency information. To avoid the restriction imposed by irregular computing node locations, QTMS improves CNM [25] to partition processes according to the constructed node-groups while maintaining a low time cost. In addition, QTMS employs RCM [26] to improve the placement of processes onto cores in each process-group. QTMS has a time complexity between O(d log(d) n) and O(n^2), where n is the number of processes and Vp is the set of processes. All definitions in this paper are listed in Table 2. Four real-world scientific applications from the Mantevo Project [4] are executed with different mapping schemes on the TianHe-2 HPC system [27], [28] to evaluate the proposed mapping method. The experimental results show that QTMS is more than 32.2 times faster than TreeMatch at a scale of 16,384 processes, while achieving a similar or even better optimization effect than other state-of-the-art methods. The main contributions of this study are as follows:
- A quadratic time complexity topology-aware process mapping method is proposed, which can efficiently improve parallel application communication performance on shared HPC systems at a low time cost.
- Intra-process-group communication cost is considered to reorder the processes in each process-group to further improve communication performance.
- The optimization effect and algorithmic cost of the proposed method are verified for real-world scientific applications on a real shared HPC system and a general workstation.
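The reordering of processes inside a process-group relies on RCM (reverse Cuthill-McKee) [26]. The following is a minimal pure-Python sketch of the textbook RCM ordering, not the paper's implementation: it starts each breadth-first traversal from a minimum-degree vertex (a simplification of the usual pseudo-peripheral start), and the helper `bandwidth` and the example graph are hypothetical. How QTMS feeds the resulting order into core placement is not described in this excerpt.

```python
from collections import deque
from typing import Dict, List, Set


def reverse_cuthill_mckee(adj: Dict[int, Set[int]]) -> List[int]:
    """Return an RCM ordering of the vertices of an undirected graph."""
    order: List[int] = []
    visited: Set[int] = set()
    # Handle every connected component, lowest-degree start vertex first.
    for start in sorted(adj, key=lambda v: (len(adj[v]), v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Visit unvisited neighbours in order of increasing degree.
            for u in sorted(adj[v] - visited, key=lambda x: (len(adj[x]), x)):
                visited.add(u)
                queue.append(u)
    order.reverse()  # the "reverse" in reverse Cuthill-McKee
    return order


def bandwidth(adj: Dict[int, Set[int]], order: List[int]) -> int:
    """Largest distance in the ordering between two adjacent vertices."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


# A chain of five processes whose ranks are scrambled: 0-3, 3-1, 1-4, 4-2.
adj = {0: {3}, 1: {3, 4}, 2: {4}, 3: {0, 1}, 4: {1, 2}}

order = reverse_cuthill_mckee(adj)
print(order)                              # [2, 4, 1, 3, 0]
print(bandwidth(adj, [0, 1, 2, 3, 4]))    # 3
print(bandwidth(adj, order))              # 1
```

A smaller bandwidth means that processes which communicate with each other sit closer together in the ordering, which is the property a within-group placement can exploit.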
Related works
Numerous topology-aware process mapping methods have been proposed to improve parallel application communication performance. To expose their advantages and drawbacks, existing methods are briefly reviewed in this section. The topology mapping problem is the assignment of n processes onto m computing cores. In the past, this problem was solved with heuristic search algorithms. However, methods of this type commonly have extremely high time complexity and time cost. Some graph
Application communication topology
The application communication topology is an abstract representation of the communication behavior of a parallel application. In existing methods, the topology is defined as a weighted undirected graph (Vp, Ep, Wp), where Vp is the set of processes, Ep represents the communication among processes, and Wp is the set of edge weights. The communication volume or the communication count is taken as the edge weight in different mapping methods [7], [29], [36], [37]. In QTMS, the weight of communication topology is
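As an illustration of the graph definition in this section, the sketch below builds the edge weights Wp of a weighted undirected communication graph from point-to-point message records. It assumes communication volume as the weight; the record format `(src, dst, nbytes)` and the function name `build_comm_graph` are hypothetical, and the weight that QTMS itself uses is not recoverable from this excerpt.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def build_comm_graph(
    records: Iterable[Tuple[int, int, float]]
) -> Dict[Tuple[int, int], float]:
    """Accumulate message volume per undirected process pair.

    Each record is (source rank, destination rank, bytes sent).
    Messages in both directions are folded into one edge (p, q) with p < q.
    """
    weights: Dict[Tuple[int, int], float] = defaultdict(float)
    for src, dst, nbytes in records:
        if src == dst:
            continue  # self-communication does not create an edge
        edge = (min(src, dst), max(src, dst))
        weights[edge] += nbytes
    return dict(weights)


# Hypothetical trace: ranks 0 and 1 exchange data both ways, 1 sends to 2.
records = [(0, 1, 100.0), (1, 0, 50.0), (1, 2, 8.0), (2, 2, 4.0)]
graph = build_comm_graph(records)
print(graph)  # {(0, 1): 150.0, (1, 2): 8.0}
```

Replacing `weights[edge] += nbytes` with `weights[edge] += 1` would give the communication-count variant of the edge weight.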
QTMS Method
To address the above three limitations of the existing mapping methods, a quadratic time complexity topology-aware process mapping method (QTMS) is proposed. The proposed method has four main novelties.
- QTMS has a quadratic time complexity and low algorithmic cost.
- QTMS does not require users to provide the network structure information to group the allocated computing nodes.
- QTMS improves the CNM algorithm to partition the processes according to the
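The idea of grouping allocated nodes from measured latency, without any user-supplied network structure, can be sketched as follows. This is only an assumed illustration: the excerpt does not describe how QTMS actually builds its node-groups, so the fixed-threshold rule, the union-find merging, the function name `group_nodes_by_latency`, and the latency values are all hypothetical.

```python
from typing import Dict, List


def group_nodes_by_latency(latency: List[List[float]],
                           threshold: float) -> List[List[int]]:
    """Merge nodes whose pairwise latency is at most `threshold`.

    latency is a symmetric node-to-node latency matrix (assumed measured,
    for example with a ping-pong test). The pair scan costs O(m^2) for m nodes.
    """
    n = len(latency)
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if latency[i][j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Hypothetical latencies in microseconds: nodes 0,1 are close, as are nodes 2,3.
latency = [
    [0.0, 1.1, 2.9, 3.0],
    [1.1, 0.0, 3.1, 2.8],
    [2.9, 3.1, 0.0, 1.2],
    [3.0, 2.8, 1.2, 0.0],
]
groups = group_nodes_by_latency(latency, threshold=1.5)
print(groups)  # [[0, 1], [2, 3]]
```

Because the rule only needs pairwise latencies, it works on irregular allocations where the nodes assigned to a job are scattered across the machine.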
Experimental evaluation
In this section, the performance of QTMS is evaluated in terms of the algorithm execution time and the real-world application execution time. QTMS is compared with six other mapping methods and the system default placement scheme. Four real-world scientific application proxies from Mantevo Project 3.0 [4] are executed with the above placement schemes on the TianHe-2 HPC system [27], [28]. In these experiments, the TianHe-2 HPC system is shared.
Conclusion and future works
Improving parallel application communication performance has been one of the most important challenges in the field of parallel computing. Topology-aware process mapping methods can efficiently reduce the execution time of parallel applications. However, the heuristic-search based mapping methods take excessive execution time to find the mapping scheme for the large-scale process mapping problem. Some graph-partitioning based mapping methods assume the allocated computing nodes form a regular
Declaration of Competing Interest
We declare that no conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the authors listed have approved the manuscript that is enclosed.
Acknowledgment
The work presented in this study was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1010000, the National Natural Science Foundation of China under Grant No. 61772053, and the Science Challenge Project, No. TZ2016002. The final version has benefited greatly from the many detailed comments and suggestions of the reviewers. The authors sincerely acknowledge these comments and suggestions.
References (44)
- et al. Deploying a large petascale system: the blue waters experience. Procedia Comput. Sci. (2014)
- et al. Rank reordering for MPI communication optimization. Comput. Fluids (2013)
- et al. Automatically optimized core mapping to subdomains of domain decomposition method on multicore parallel environments. Comput. Fluids (2013)
- Encyclopedia of Parallel Computing (2011)
- Top Ten Exascale Research Challenges (2014)
- et al. Automatic topology mapping of diverse large-scale parallel applications. Proceedings of the International Conference on Supercomputing (2017)
- et al. Improving Performance via Mini-Applications. Technical Report (2009)
- NAS parallel benchmarks. Encycl. Parallel Comput. (2011)
- et al. Massively parallel FDTD program JEMS-FDTD and its applications in platform coupling simulation. 2014 International Symposium on Electromagnetic Compatibility (2014)
- et al. Generic topology mapping strategies for large-scale parallel architectures. International Conference on Supercomputing, Tucson, AZ, USA, May 31 - June (2011)
- A review of the placement and quadratic assignment problems. SIAM Rev.
- On the mapping problem. IEEE Trans. Comput.
- A mapping strategy for parallel processing. IEEE Trans. Comput.
- Heuristic technique for processor and link assignment in multicomputers.
- An evaluative study on the effect of contention on message latencies in large supercomputers. 2009 IEEE International Symposium on Parallel Distributed Processing
- Automated mapping of regular communication graphs on mesh interconnects. 2010 International Conference on High Performance Computing
- Topology mapping of irregular parallel applications on torus-connected supercomputers. J. Supercomput.
- Placement of Communicating Processes on Multiprocessor Networks.
- Near-optimal placement of MPI processes on hierarchical NUMA architectures. International Euro-Par Conference on Parallel Processing
- Process placement in multicore clusters: algorithmic issues and practical techniques. IEEE Trans. Parallel Distrib. Syst.
- MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters. International Conference on Supercomputing
- Randomized heuristic for the mapping problem. Int. J. High Speed Comput.