Parallel Computing

Volume 97, September 2020, 102664

Asynchronous runtime with distributed manager for task-based programming models

https://doi.org/10.1016/j.parco.2020.102664

Highlights

  • Characterization of runtime overheads based on task dependence management.

  • Runtime design suitable to reduce runtime overheads on many-core architectures.

  • Speedup of runtime structures management using an asynchronous approach.

  • Insights on how to adapt the runtime behavior to the underlying architecture.

Abstract

Parallel task-based programming models, like OpenMP, allow application developers to easily create a parallel version of their sequential codes. The OpenMP 4.0 standard introduced the possibility of describing a set of data dependences per task, which the runtime uses to order the task execution. This order is computed using shared graphs, which are updated by all threads in exclusive access, using synchronization mechanisms (locks) to ensure the correctness of the dependence management. The contention in the access to these structures becomes critical in many-core systems because several threads may be wasting computation resources while waiting for their turn.

This paper proposes an asynchronous management of the runtime structures, like task dependence graphs, suitable for the runtimes of task-based programming models. In this organization, the threads request actions from the runtime instead of performing them directly. The requests are then handled by a distributed runtime manager (DDAST) that does not require dedicated resources. Instead, the manager uses the idle threads to modify the runtime structures. The paper also presents an implementation, analysis and performance evaluation of such a runtime organization. The performance results show that the proposed asynchronous organization achieves higher speedups than the original runtime for different benchmarks and different many-core architectures.

Introduction

The popularization of multicore processors started with the end of Dennard scaling, which states that the power density of an integrated circuit can stay constant as transistors get smaller. Until 2006, Dennard's law and Moore's law guided processor manufacturers to periodically reduce the transistor length and increase the clock frequency, which also increased processor performance. However, the leakage current grows much faster at small transistor sizes; therefore the clock frequency cannot increase without impacting the overall power consumption. Since transistor sizes keep shrinking periodically, as Moore's law states, processor manufacturers started to introduce multiple cores in their processors to keep increasing processor performance.

As multicore processors have become popular, parallel programming has become a necessity to take advantage of them. Instead of dealing with complex applications programmed for one specific processor architecture, parallel programming models decouple applications from hardware. Their goal is to allow programmers to indicate the potential parallelism in the application source code without directly managing it. There are several examples, like MapReduce [1], OpenMP [2], OpenCL [3], StarSs [4], etc. The exposed parallelism is then managed by a runtime library that coordinates the application execution transparently to the application programmer. Similarly, there are parallel libraries, usable from sequential applications, that implement parallel skeletons for commonly used operations. Some examples of these libraries are Spark [5], OpenBLAS [6], Intel MKL, etc.

The task-based paradigm is a powerful way to define the potential parallelism of an application. Programmers only have to annotate code regions, called tasks, that can run in parallel. Additionally, developers can provide extra task information, like data requirements. This information defines the task execution order enforced by the runtime library at execution time. The OpenMP standard introduced task dependences in version 4.0, greatly influenced by the OmpSs programming model, which extends the standard syntax with additional features.
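
As an illustration (our own minimal example, not code from the paper), the following C++ fragment uses OpenMP depend clauses to declare two tasks; from the annotations, the runtime infers that the second task must wait for the first:

    #include <cstdio>

    int main() {
        int a = 0, b = 0;

        #pragma omp parallel
        #pragma omp single
        {
            // Task 1 produces 'a'; task 2 consumes 'a' and produces 'b'.
            // The depend clauses let the runtime derive the execution
            // order: task 2 cannot start until task 1 has finished.
            #pragma omp task depend(out: a)
            a = 42;

            #pragma omp task depend(in: a) depend(out: b)
            b = a + 1;

            #pragma omp taskwait
            std::printf("a=%d b=%d\n", a, b);
        }
        return 0;
    }

Compiled with, e.g., g++ -fopenmp. The OmpSs syntax is analogous, using in(...) and out(...) clauses on its task directive.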

The runtimes of these models are responsible for guaranteeing the correctness of the task execution order defined by the task data requirements. Therefore, the runtime updates a task graph when a task is created and when a task finishes its execution. Usually, these modifications require reading and writing the information in the task graph atomically to ensure the order correctness.
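
A minimal sketch of this pattern (illustrative only, not Nanos++ code; the class and member names are ours) shows how every task creation and finalization serializes on the same lock:

    #include <mutex>
    #include <unordered_map>
    #include <vector>

    struct Task;   // opaque task descriptor

    // Sketch of a dependence graph where every update is guarded by one
    // global mutex. On many-core machines this lock becomes the
    // contention point described in the text.
    class DepGraph {
        std::mutex mtx_;                               // exclusive access
        std::unordered_map<void*, Task*> last_writer_; // address -> producer

    public:
        // Called when a task is created: link it after the last writer
        // of each of its input addresses.
        void add_task(Task* t, const std::vector<void*>& ins, void* out) {
            std::lock_guard<std::mutex> guard(mtx_);   // threads wait here
            for (void* addr : ins) {
                auto it = last_writer_.find(addr);
                if (it != last_writer_.end()) {
                    // record the edge it->second -> t (elided)
                }
            }
            last_writer_[out] = t;
        }

        // Called when a task finishes: release its ready successors.
        void finish_task(Task* t) {
            std::lock_guard<std::mutex> guard(mtx_);
            (void)t;  // removal and successor wake-up elided
        }
    };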

In a processor with many cores, the probability of collisions between threads trying to access the task dependence graph increases. Each collision implies that a thread is wasting its computation time waiting for another thread's modifications. This problem, which has already started to arise, is expected to become an important bottleneck as the number of cores in future processors keeps growing [7]. Thereby, the access contention on some runtime structures will severely limit application performance if runtimes do not redesign their internals to tackle the problem.

To improve current task-based parallel programming runtimes and avoid the contention expected in many-core processors, we propose an asynchronous runtime organization where the runtime threads do not update the runtime structures directly. Instead, the threads submit requests for the needed actions to the runtime, and these requests are handled later. This asynchronous approach avoids the problem of actively waiting for exclusive access and allows the threads to return immediately to the application code. Moreover, such a structure tries to maximize the utilization of the processor cores to run application code and to avoid active waiting on the locks.

The thread requests to the runtime are handled by a runtime manager that updates the runtime structures. Initially, we proposed a centralized implementation based on an extra thread (DAS Thread, DAST) together with a mechanism to avoid saturating the manager [7]. In this work, we present a new distributed implementation of the runtime manager (Distributed DAST, DDAST) based on a mechanism where any thread may become a runtime manager thread. Therefore, the runtime tries to use all the available threads in a smart way to restrict the accesses to the runtime structures.
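
A minimal sketch of the asynchronous organization (our own illustration; the names are not the Nanos++ API, and a real implementation would likely use a lock-free queue rather than a mutex) is the following: worker threads enqueue an operation and return immediately to application code, while any idle thread may drain the queue, temporarily acting as the manager:

    #include <deque>
    #include <functional>
    #include <mutex>

    class RequestQueue {
        std::mutex mtx_;                              // protects the queue only
        std::deque<std::function<void()>> pending_;   // deferred runtime ops

    public:
        // Fast path for worker threads: enqueue the runtime operation and
        // return immediately instead of waiting for exclusive access to
        // the task graph.
        void request(std::function<void()> op) {
            std::lock_guard<std::mutex> guard(mtx_);
            pending_.push_back(std::move(op));
        }

        // Executed by an idle thread acting as manager: apply one pending
        // operation on behalf of the requesting thread.
        bool drain_one() {
            std::function<void()> op;
            {
                std::lock_guard<std::mutex> guard(mtx_);
                if (pending_.empty()) return false;
                op = std::move(pending_.front());
                pending_.pop_front();
            }
            op();   // run the operation outside the queue lock
            return true;
        }
    };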

The proposed asynchronous runtime model provides performance similar to the original runtime when the application has a small number of tasks or when the execution uses a reduced number of threads. However, when the number of tasks and/or the number of threads is large, the new runtime achieves better performance due to better thread utilization, data locality and contention reduction. All the changes proposed here are transparent to application developers and are general enough to be used in a wide range of task-based parallel programming models. Moreover, the design could be adapted to particular heterogeneous architectures, like big.LITTLE [8], allowing a subset of the worker threads to become manager threads.

The remainder of the paper is organized as follows. Section 2 describes the OmpSs task-based programming model, as an example of a task-based programming model whose runtime takes care of the dependence management. Section 3 describes the design and implementation of the new distributed runtime manager (DDAST). Section 4 presents the experimental setup. Section 5 presents the tuning results for the manager internal parameters. Section 6 shows the performance of the new runtime and analyzes its behavior during the executions. Finally, Section 7 presents the related work and Section 8 concludes.

Section snippets

Background

The implementation of the asynchronous runtime model has been developed using the OmpSs programming model, which is a forerunner of the standard OpenMP parallel programming model. Therefore, the following sections introduce the OmpSs programming model (Section 2.1) and Nanos++ (Section 2.2), which is the runtime library used to run OmpSs applications. The OmpSs programming model is also supported by the source-to-source Mercurium compiler for C, C++ and Fortran. However, the change proposed in

Distributed runtime manager

The design of the asynchronous runtime with the distributed manager is based on the idea that any worker thread can become a manager thread and start executing only runtime code. With this approach, all threads can cooperate to execute the pending runtime operations when there are several of them. Correspondingly, all the threads can execute application tasks when the number of pending runtime operations is small. The implementation of the newer design is based on the knowledge acquired in our
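
Although the full policy is detailed later in the section, a hypothetical sketch of such role switching could look as follows (the thresholds and names are our assumptions, not the paper's values):

    #include <atomic>
    #include <cstddef>

    // Number of runtime operations waiting in the request queue.
    std::atomic<std::size_t> pending_ops{0};

    constexpr std::size_t kBecomeManager = 64; // assumed promotion threshold
    constexpr std::size_t kBackToWorker  = 8;  // assumed demotion threshold

    enum class Role { Worker, Manager };

    // Each thread periodically re-evaluates its role: it promotes itself
    // to manager when many runtime operations are pending, and returns to
    // executing application tasks when few remain.
    Role next_role(Role current) {
        const std::size_t n = pending_ops.load(std::memory_order_relaxed);
        if (current == Role::Worker  && n >= kBecomeManager) return Role::Manager;
        if (current == Role::Manager && n <= kBackToWorker)  return Role::Worker;
        return current;
    }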

Experimental setup

The evaluation of the asynchronous runtime implementation has been done on different architectures and with different benchmarks to show its adaptability and capabilities in different contexts. They are introduced and explained in the following sections: Section 4.1 for the different machines/architectures and Section 4.2 for the different benchmarks. However, a common criterion for all the architectures and benchmarks has been followed to enhance the evaluation quality, avoid external

DDAST Tuning

The initial executions with the new runtime structure were intended to find good default values for the callback parameters explained in Section 3.3. To this end, some initial values are defined, based on a reasonable approximation to the expected optimal ones, and the same execution is repeated changing only one parameter value. The executions for each parameter are done with two benchmarks that have different task dependence patterns: Matmul and Sparse LU. Each execution set is doubling the
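
As a hypothetical illustration of this one-parameter-at-a-time methodology (the environment variable name, value range and benchmark binary are assumptions, not the paper's setup), a sweep that doubles a single parameter between otherwise identical runs could be driven as:

    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main() {
        // Double one tuning parameter per run while every other
        // parameter keeps its initial value.
        for (unsigned value = 1; value <= 1024; value *= 2) {
            const std::string cmd =
                "NX_DDAST_PARAM=" + std::to_string(value) + " ./matmul";
            std::printf("running: %s\n", cmd.c_str());
            std::system(cmd.c_str());  // one execution per parameter value
        }
        return 0;
    }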

Performance comparison

The following sections present the performance comparison of the new runtime model against other task-based parallel programming models. Section 6.1 shows the scalability results for the different runtimes and Section 6.2 presents the most relevant executions where the main differences between both runtime approaches can be seen.

Related work

Several works exist on the characterization and improvement of parallel programming models. They cover different models, working at different levels and with different approaches. The OmpSs tools (Mercurium and Nanos++), which are open source and are the ones used to test our model, can execute inter-node and intra-node applications [20] and are under constant development, introducing new features. Moreover, several people use this programming model as a base to develop different prototypes or

Conclusions

Multicore processors have become popular and are present in almost any electronic device nowadays. Task-based parallel programming models, like OmpSs, make it easier for programmers to use such processor architectures by simply annotating the sequential application source code. However, the runtime libraries that support such models present a contention problem when the number of threads grows to some tens. Like current many-core processors, the future processors are expected to have several cores;

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work is partially supported by the European Union H2020 Research and Innovation Action (projects 801051, 754337 and 780681), by the Spanish Government (projects SEV-2015-0493 and TIN2015-65316-P, grant BES-2016-078046), and by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328).

References

  • A. Duran et al., OmpSs: a proposal for programming heterogeneous multi-core architectures, Parallel Process. Lett. (2011)
  • J. Bueno Hedo, Run-time Support for Multi-level Disjoint Memory Address Spaces...
  • Programming Models Group BSC, OmpSs User Guide, 2017,...
  • Programming Models Group BSC, Nanos++ Runtime Library, 2017,...
  • A. Sodani, Knights Landing (KNL): 2nd generation Intel® Xeon Phi processor, Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS) (2015)
  • L. Gwennap, ThunderX rattles server market, Microprocessor Report (2014)