QTMS: A quadratic time complexity topology-aware process mapping method for large-scale parallel applications on shared HPC system

https://doi.org/10.1016/j.parco.2020.102637

Highlights

  • A quadratic time complexity topology-aware process mapping method is proposed.

  • The method is effective for process mapping on shared HPC systems.

  • The method further considers inter-process-group communication improvement.

Abstract

Communication degrades the performance of parallel applications that run on thousands of CPU cores and exchange large quantities of data. The high communication cost is usually attributed to the mismatch between the communication patterns of parallel applications and the physical topology graphs of the computing resources (or the underlying network topologies). Topology-aware process mapping methods can usually obtain a better embedding scheme and thereby improve communication performance. Many existing heuristic-search based mapping methods have high execution time for large-scale applications. Some low-cost graph-partitioning based mapping methods depend on the allocated resources forming a regular structure, which is usually impractical in high performance computing systems shared by multiple users and applications; this weakens their performance. Other graph-partitioning based mapping methods come at a high cost or require users to provide the network structure information. To address these issues, a quadratic time complexity topology-aware process mapping method is presented in this paper. The experimental results show that the proposed method often achieves better application communication performance than several state-of-the-art mapping methods on a shared HPC system, while maintaining a significantly lower execution cost. Moreover, the real-world scientific application proxies gain an execution time reduction of up to 14.60% at the 512-process scale compared to the system default process placement on the TianHe-2 HPC system.

Introduction

High performance computing (HPC) systems are generally used to execute parallel applications that solve complex science and engineering problems, such as numerical simulation, high-energy physics, and energy exploration [1]. As the number of computing nodes in HPC systems keeps increasing, the impact of communication operations on overall application performance approaches or even surpasses that of arithmetic operations [2], [3]. Fig. 1 shows the relationship between the communication ratio and the number of processes for several representative parallel applications. Improving application communication performance has therefore become an important challenge.

Topology-aware process mapping aims to minimize the communication cost by embedding the application communication topology into the physical topology graph of the computing resources (or the underlying network topology). This can effectively reduce the communication cost without changing the system hardware, the communication protocols, or the application codes. Hence, this type of method is widely used to optimize the communication performance of parallel applications [1]. However, finding an optimal mapping scheme is in general an NP-hard problem [7].
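To make the objective concrete, the following minimal sketch scores a placement as the sum, over all communicating process pairs, of the edge weight multiplied by the distance between the cores the two processes are placed on. This weighted-distance cost is a common way to formulate the mapping objective and is an assumption here, not the cost function defined in the paper; the names `mapping_cost`, `comm`, `dist`, and `placement`, and all numeric values, are hypothetical.

```python
def mapping_cost(comm, dist, placement):
    """Return sum of weight * distance over all communicating pairs.

    comm:      dict mapping (process_u, process_v) -> communication weight
    dist:      matrix where dist[a][b] is the distance between cores a and b
    placement: dict mapping process -> core
    """
    return sum(
        weight * dist[placement[u]][placement[v]]
        for (u, v), weight in comm.items()
    )


# Four processes communicating in a ring with uniform (hypothetical) traffic.
comm = {(0, 1): 10, (1, 2): 10, (2, 3): 10, (0, 3): 10}

# Four cores on two nodes: cores 0 and 1 share a node, cores 2 and 3 share
# another. Distance 1 inside a node, 5 across nodes (hypothetical values).
dist = [
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
]

block = {0: 0, 1: 1, 2: 2, 3: 3}      # neighbours in the ring share a node
scattered = {0: 0, 1: 2, 2: 1, 3: 3}  # every ring edge crosses nodes

print(mapping_cost(comm, dist, block))      # 120
print(mapping_cost(comm, dist, scattered))  # 200
```

Under these assumed values, the placement that keeps ring neighbours on the same node has the lower cost, which is the kind of improvement a mapping method searches for.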

To address the mapping problem, numerous methods have been proposed over the past several decades. In the early stages, the topology mapping problem was regarded as a quadratic assignment problem (QAP) [8], and the mapping scheme was determined through heuristic searches, such as [9], [10], [11], [12], [13], [14], [15]. These methods usually do not need to assume that the allocated computing nodes form a regular structure. However, they are often excessively expensive for large-scale parallel applications because of the vast solution space and high algorithmic cost.

Fortunately, many existing applications have significant communication locality, and many applications can benefit from improving it [16]. Numerous graph-partitioning based studies [7], [17], [18], [19], [20] have been proposed to improve the communication locality, and these methods commonly have lower algorithmic cost than the heuristic-search based methods. Nevertheless, some drawbacks remain, weakening the applicability and optimization effect of these methods. Some graph-partitioning based methods, such as [21] and [16], assume that the allocated computing nodes form a regular structure. In fact, this assumption is rarely satisfied on most HPC systems, which are often shared by multiple users and applications [22], [23], and the performance of these methods degrades as a result. Further, the majority of these methods do not consider improving the placement of processes onto the computing cores within process-groups. Although other graph-partitioning based mapping methods do not depend on the assumption of a regular computing allocation, they still have a high cost or require users to provide accurate network structure information [17]. For example, MPIPP adopts the k-way [24] algorithm to solve the topology mapping problem, and the time complexity of each pass in MPIPP reaches $O(n^3)$. The method of Wu et al. [20] is based on recursive bisection partitioning and needs the inter-node hops to construct a tree topology. However, the hops and other network structure information can hardly be obtained by ordinary users on most current HPC systems, especially shared ones. Therefore, fitting the shared HPC system environment while keeping the algorithmic cost low is an important challenge for topology-aware process mapping.

To tackle the above issues, a quadratic time complexity topology-aware process mapping method, named QTMS, is proposed in this study. Instead of requiring users to provide the network structure information, QTMS constructs the node-groups by considering the neighbor-feature among the allocated nodes. The neighbor-feature is represented by the latency information. To avoid the restriction imposed by irregular computing node locations, QTMS improves CNM [25] to partition processes according to the constructed node-groups while maintaining a low time cost. In addition, QTMS employs RCM [26] to improve the placement of processes onto cores in each process-group. QTMS has a time complexity between $O(d\log(d)\,n)$ and $O(n^2)$, where $n$ is the number of processes, $d=\max\{\mathrm{degree}(v) \mid v \in V_p\}$, and $V_p$ is the set of processes. All definitions in this paper are listed in Table 2. Four real-world scientific applications from the Mantevo Project [4] are executed with different mapping schemes on the TianHe-2 HPC system [27], [28] to evaluate the proposed mapping method. The experimental results show that QTMS is more than 32.2 times faster than TreeMatch at the 16,384-process scale, while achieving a similar or even better optimization effect than other state-of-the-art methods. The main contributions of this study are as follows:

  • A quadratic time complexity topology-aware process mapping method is proposed, which can efficiently improve parallel application communication performance on shared HPC systems and has a lower time cost.

  • Intra-process-group communication cost is considered to reorder the processes in each process-group to further improve communication performance.

  • The optimization effect and algorithm cost of the proposed method are verified for real-world scientific applications on a real shared HPC system and a general workstation.
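The intra-process-group reordering in the second contribution relies on RCM [26]. Assuming RCM denotes the reverse Cuthill-McKee ordering, the sketch below illustrates the general idea on a small graph: vertices are visited breadth-first in order of increasing degree and the visit order is then reversed, which tends to give communicating vertices nearby indices. This is a simplified textbook variant (it starts from a minimum-degree vertex rather than a pseudo-peripheral one) and is not the authors' implementation; how QTMS applies RCM to core placement is not shown in this excerpt. The function names and the example graph are hypothetical.

```python
from collections import deque


def reverse_cuthill_mckee(adj):
    """Return a reverse Cuthill-McKee ordering of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours.
    """
    degree = {v: len(neighbours) for v, neighbours in adj.items()}
    visited = set()
    order = []
    # Start each connected component from its lowest-degree unvisited vertex.
    for start in sorted(adj, key=lambda v: (degree[v], v)):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbours in order of increasing degree.
            for u in sorted(adj[v] - visited, key=lambda x: (degree[x], x)):
                visited.add(u)
                queue.append(u)
    order.reverse()
    return order


def bandwidth(adj, order):
    """Largest index distance between two adjacent vertices under `order`."""
    pos = {v: i for i, v in enumerate(order)}
    return max((abs(pos[u] - pos[v]) for u in adj for v in adj[u]), default=0)


# A path 0-3-1-4-2 whose labels hide the fact that it is a simple chain.
adj = {0: {3}, 3: {0, 1}, 1: {3, 4}, 4: {1, 2}, 2: {4}}

order = reverse_cuthill_mckee(adj)
print(order)                              # [2, 4, 1, 3, 0]
print(bandwidth(adj, [0, 1, 2, 3, 4]))    # 3 with the original labelling
print(bandwidth(adj, order))              # 1 after reordering
```

If the position of a process in such an ordering is read as a core index, processes that communicate end up on cores with close indices, which is one plausible way to interpret the reordering step.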


Related works

Numerous topology-aware process mapping methods have been proposed to improve parallel application communication performance. To expose their advantages and drawbacks, existing methods are briefly reviewed in this section. The topology mapping problem is the assignment of n processes onto m computing cores. In the past, this problem has been solved by heuristic search algorithms. However, this type of method commonly has extremely high time complexity and time cost. Some graph

Application communication topology

Application communication topology is the abstract representation of parallel application communication behaviors. In existing methods, the topology is defined as a weighted undirected graph $G_p=(V_p,E_p,W_p)$, where $V_p$ is the set of processes, $E_p$ represents the communication among processes, and $W_p$ is the set of edge weights. The communication volume or communication count is taken as the edge weight in different mapping methods [7], [29], [36], [37]. In QTMS, the weight of communication topology is
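As an illustration of the weighted undirected graph described above, the following sketch builds the edge weights from a list of point-to-point message records, using either the communication volume or the communication count as the weight, and computes the maximum vertex degree d. The record format `(src, dst, nbytes)`, the function names, and the trace values are hypothetical; the weight definition that QTMS itself uses is not given in this excerpt.

```python
from collections import defaultdict


def build_comm_graph(messages, use_volume=True):
    """Build an undirected weighted communication graph.

    messages:   iterable of (src, dst, nbytes) records
    use_volume: weight edges by bytes exchanged if True, else by message count
    Returns (vertices, weights) with weights keyed by (low_rank, high_rank).
    """
    vertices = set()
    weights = defaultdict(int)
    for src, dst, nbytes in messages:
        vertices.update((src, dst))
        if src == dst:
            continue  # self-messages add no edge
        edge = (min(src, dst), max(src, dst))  # direction is ignored
        weights[edge] += nbytes if use_volume else 1
    return vertices, dict(weights)


def max_degree(vertices, weights):
    """Return the largest number of distinct neighbours of any vertex."""
    degree = dict.fromkeys(vertices, 0)
    for u, v in weights:
        degree[u] += 1
        degree[v] += 1
    return max(degree.values(), default=0)


# Hypothetical trace: (source rank, destination rank, bytes).
trace = [(0, 1, 100), (1, 0, 50), (1, 2, 10), (0, 1, 25), (2, 3, 5)]

vertices, by_volume = build_comm_graph(trace, use_volume=True)
_, by_count = build_comm_graph(trace, use_volume=False)

print(by_volume)                        # {(0, 1): 175, (1, 2): 10, (2, 3): 5}
print(by_count)                         # {(0, 1): 3, (1, 2): 1, (2, 3): 1}
print(max_degree(vertices, by_volume))  # 2
```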

QTMS Method

To address the above three limitations of the existing mapping methods, a quadratic time complexity topology-aware process mapping method (QTMS) is proposed. The proposed method comprises four main novelties.

  • QTMS has a quadratic time complexity and low algorithmic cost.

  • QTMS does not require users to provide the network structure information to group the allocated computing nodes.

  • QTMS improves the CNM algorithm to partition the processes according to the
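The idea of grouping allocated nodes from measured latencies, without user-supplied network structure information, can be sketched as follows. This is a hypothetical threshold-based grouping, not the node-group construction defined in the paper: any two nodes whose pairwise latency is at most `threshold` are linked, and the groups are the resulting connected components (tracked with a union-find structure). The function name, the threshold, and the latency values are all assumptions made for illustration.

```python
def group_nodes_by_latency(latency, threshold):
    """Group nodes whose pairwise latency is at most `threshold`.

    latency: symmetric matrix (list of lists) of pairwise latencies
    Returns a sorted list of groups, each a sorted list of node indices.
    """
    n = len(latency)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Examine every node pair once: quadratic in the number of nodes.
    for i in range(n):
        for j in range(i + 1, n):
            if latency[i][j] <= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical pairwise latencies in microseconds for five allocated nodes.
latency = [
    [0.0, 1.1, 1.2, 2.9, 3.0],
    [1.1, 0.0, 1.0, 3.1, 2.8],
    [1.2, 1.0, 0.0, 3.0, 3.2],
    [2.9, 3.1, 3.0, 0.0, 1.1],
    [3.0, 2.8, 3.2, 1.1, 0.0],
]

print(group_nodes_by_latency(latency, threshold=1.5))  # [[0, 1, 2], [3, 4]]
```

The choice of threshold determines how coarse the groups are; in this assumed data set, nodes 0 to 2 and nodes 3 to 4 behave like two neighbourhoods of the network.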

Experimental evaluation

In this section, the performance of QTMS is evaluated through the algorithm execution time and the real-world application execution time. QTMS is compared with six other mapping methods and the system default placement scheme. Four real-world scientific application proxies from Mantevo Project 3.0 [4] are executed with these placement schemes on the TianHe-2 HPC system [27], [28]. In these experiments, the TianHe-2 HPC system is shared.

Conclusion and future works

Improving parallel application communication performance has been one of the most important challenges in the field of parallel computing. Topology-aware process mapping methods can efficiently reduce the execution time of parallel applications. However, the heuristic-search based mapping methods take excessive execution time to find the mapping scheme for the large-scale process mapping problem. Some graph-partitioning based mapping methods assume the allocated computing nodes form a regular

Declaration of Competing Interest

We declare that no conflict of interest exists in the submission of this manuscript, and that the manuscript is approved by all authors for publication. I would like to declare on behalf of my co-authors that the work described is original research that has not been published previously and is not under consideration for publication elsewhere, in whole or in part. All the listed authors have approved the enclosed manuscript.

Acknowledgment

The work presented in this study was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1010000, the National Natural Science Foundation of China under Grant No. 61772053, and the Science Challenge Project under Grant No. TZ2016002. The final version has benefited greatly from the many detailed comments and suggestions of the reviewers, which the authors gratefully acknowledge.

References (44)

  • M. Hanan et al., A review of the placement and quadratic assignment problems, SIAM Rev. (1972)

  • S. H. Bokhari, On the mapping problem, IEEE Trans. Comput. (1981)

  • S.-Y. Lee et al., A mapping strategy for parallel processing, IEEE Trans. Comput. (1987)

  • S.W. Bollinger et al., Heuristic technique for processor and link assignment in multicomputers (1991)

  • A. Bhatele et al., An evaluative study on the effect of contention on message latencies in large supercomputers, 2009 IEEE International Symposium on Parallel Distributed Processing (2009)

  • A. Bhatele et al., Automated mapping of regular communication graphs on mesh interconnects, 2010 International Conference on High Performance Computing (2010)

  • J. Wu et al., Topology mapping of irregular parallel applications on torus-connected supercomputers, J. Supercomput. (2016)

  • C.S. Steele, Placement of Communicating Processes on Multiprocessor Networks (1985)

  • E. Jeannot et al., Near-optimal placement of MPI processes on hierarchical NUMA architectures, International Euro-Par Conference on Parallel Processing (2010)

  • E. Jeannot et al., Process placement in multicore clusters: algorithmic issues and practical techniques, IEEE Trans. Parallel Distrib. Syst. (2014)

  • H. Chen et al., MPIPP: an automatic profile-guided parallel process placement toolset for SMP clusters and multiclusters, International Conference on Supercomputing (2006)

  • S. Arunkumar et al., Randomized heuristic for the mapping problem, Int. J. High Speed Comput. (2012)