Asynchronous runtime with distributed manager for task-based programming models
Introduction
The popularization of multicore processors started with the end of Dennard scaling, which states that the power density of an integrated circuit stays constant as transistors shrink. Until 2006, Dennard scaling and Moore's law guided processor manufacturers to periodically reduce transistor size and increase the clock frequency, which also increased processor performance. However, the leakage current grows much faster at small transistor sizes; therefore, the clock frequency cannot keep increasing without raising the overall power consumption. Since transistor sizes still shrink periodically as Moore's law states, processor manufacturers started to introduce multiple cores in their processors to keep increasing processor performance.
As multicore processors have become popular, parallel programming has become a necessity to take advantage of them. Instead of dealing with complex applications programmed for one specific processor architecture, parallel programming models decouple applications from hardware. Their goal is to allow programmers to indicate the potential parallelism in the application source code without managing it directly. There are several examples, such as MapReduce [1], OpenMP [2], OpenCL [3], StarSs [4], etc. The exposed parallelism is then managed by a runtime library that coordinates the application execution transparently to the application programmer. Similarly, there are parallelized libraries, callable from sequential applications, that implement parallel skeletons for commonly used operations. Some examples of these libraries are Spark [5], OpenBLAS [6], Intel MKL, etc.
The task-based paradigm is a powerful way to express potential parallelism in an application. Programmers only have to annotate code regions, called tasks, that can run in parallel. Additionally, developers can provide extra task information such as data requirements. This information defines the task execution order, which the runtime library enforces at execution time. The OpenMP standard introduced task dependences in version 4.0, greatly influenced by the OmpSs programming model, which extends the standard syntax with additional features.
The runtimes of these models are responsible for guaranteeing the correctness of the task execution order defined by the task data requirements. Therefore, the runtime updates a task graph when a task is created and when a task finishes its execution. Usually, these modifications require reading and writing the task-graph information atomically to ensure the correctness of the order.
In a processor with many cores, the probability of collisions between threads trying to access the task dependence graph increases. Each collision implies that a thread wastes computation time waiting for another thread's modifications. This problem, which has already started to arise, is expected to become an important bottleneck as the number of cores in future processors keeps growing [7]. Thereby, the access contention on some runtime structures will kill application performance if runtimes do not redesign their internals to tackle the problem.
To improve the current task-based parallel programming runtimes and avoid the contention expected in many-core processors, we propose an asynchronous runtime organization where the runtime threads do not update the runtime structures directly. Instead, the threads submit the needed actions as requests to the runtime, which handles them later. This asynchronous approach avoids actively waiting for exclusive access and allows the threads to return immediately to the application code. Moreover, such a structure tries to maximize the utilization of the processor cores to run application code and to avoid active waiting on locks.
The thread requests to the runtime are handled by a runtime manager that updates the runtime structures. Initially, we proposed a centralized implementation based on an extra thread (DAS Thread, DAST) together with a mechanism to avoid the manager saturation [7]. In this work, we present a new distributed implementation of the runtime manager (Distributed DAST, DDAST) based on a mechanism where any thread may become a runtime manager thread. Therefore, the runtime tries to use all the available threads in a smart way while restricting the accesses to the runtime structures.
The proposed asynchronous runtime model provides performance similar to the original runtime when the application has a small number of tasks or when the execution uses a reduced number of threads. However, when the number of tasks and/or threads is large, the new runtime achieves better performance due to better thread utilization, data locality and contention reduction. All the changes proposed here are transparent to application developers and are general enough to be used in a wide range of task-based parallel programming models. Moreover, the design could be adapted to particular heterogeneous architectures, like big.LITTLE [8], by allowing only a subset of the worker threads to become manager threads.
The remainder of the paper is organized as follows. Section 2 describes OmpSs, as an example of a task-based programming model whose runtime takes care of dependence management. Section 3 describes the design and implementation of the new distributed runtime manager (DDAST). Section 4 presents the experimental setup. Section 5 presents the tuning results for the manager internal parameters. Section 6 shows the performance of the new runtime and analyzes its behavior during the executions. Finally, Section 7 presents the related work and Section 8 concludes.
Section snippets
Background
The implementation of the asynchronous runtime model has been developed using the OmpSs programming model, which is a forerunner of the standard OpenMP parallel programming model. Therefore, the following sections introduce the OmpSs programming model (Section 2.1) and Nanos++ (Section 2.2), which is the runtime library used to run OmpSs applications. The OmpSs programming model is also supported by the source-to-source Mercurium compiler for C, C++ and Fortran. However, the change proposed in
Distributed runtime manager
The design of the asynchronous runtime with the distributed manager is based on the idea that any worker thread can become a manager thread and start executing only runtime code. With this approach, all threads can cooperate to execute the pending runtime operations when there are several of them. Correspondingly, all the threads can execute application tasks when the number of pending runtime operations is small. The implementation of the new design is based on the knowledge acquired in our
Experimental setup
The evaluation of the asynchronous runtime implementation has been done on different architectures and with different benchmarks to show its adaptability and capabilities in different contexts. These are introduced and explained in the following sections: Section 4.1 for the machines/architectures and Section 4.2 for the benchmarks. However, common criteria for all the architectures and benchmarks have been followed to enhance the evaluation quality, avoid external
DDAST Tuning
The initial executions with the new runtime structure were intended to find good default values for the callback parameters explained in Section 3.3. To this end, some initial values are defined, based on a reasonable approximation to the expected optimal ones, and the same execution is repeated changing only one parameter value at a time. The executions for each parameter are done with two benchmarks that have different task dependence patterns: Matmul and Sparse LU. Each execution set is doubling the
Performance comparison
The following sections present the performance comparison of the new runtime model against other task-based parallel programming models. Section 6.1 shows the scalability results for the different runtimes, and Section 6.2 presents the most relevant executions, where the main differences between the two runtime approaches can be seen.
Related work
Several works exist on the characterization and improvement of parallel programming models. They cover different models, working at different levels and with different approaches. The OmpSs tools (Mercurium and Nanos++), which are open source and are the ones used to test our model, can execute inter-node and intra-node applications [20] and are under constant development introducing new features. Moreover, several researchers use this programming model as a base to develop different prototypes or
Conclusions
Multicore processors have become popular and are present in almost every electronic device nowadays. Task-based parallel programming models, like OmpSs, make it easier for programmers to use such processor architectures by simply annotating the sequential application source code. However, the runtime libraries that support such models present a contention problem when the number of threads grows to some tens. Like current many-core processors, future processors are expected to have many cores;
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work is partially supported by the European Union H2020 Research and Innovation Action (projects 801051, 754337 and 780681), by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P, grant BES-2016-078046), and by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328).
References (33)
- et al., A dynamically tuned sorting library, Proceedings of the International Symposium on Code Generation and Optimization: Feedback-Directed and Runtime Optimization (2004)
- et al., Nexus: a distributed hardware task manager for task-based programming models, Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (2015)
- et al., MapReduce: simplified data processing on large clusters, Commun. ACM (2008)
- et al., OpenMP: an industry standard API for shared-memory programming, Computational Science & Engineering, IEEE (1998)
- et al., OpenCL: a parallel programming standard for heterogeneous computing systems, Computing in Science & Engineering (2010)
- et al., Hierarchical task-based programming with StarSs, Int. J. High Perform. Comput. Appl. (2009)
- Apache Spark - Unified Analytics Engine for Big Data, 2018,...
- Z. Xianyi, W. Qian, W. Saar, OpenBLAS: an optimized BLAS library, 2018,...
- et al., Characterizing and improving the performance of many-core task-based parallel programming runtimes, Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (2017)
- Big.LITTLE processing with ARM Cortex-A15 & Cortex-A7, ARM White Paper (2011)
- OmpSs: a proposal for programming heterogeneous multi-core architectures, Parallel Process. Lett.
- Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor, Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS)
- ThunderX rattles server market, Microprocessor Report