Elsevier

Performance Evaluation

Volume 134, October 2019, 102003

An analytical model for thread-core mapping for tiled CMPs

https://doi.org/10.1016/j.peva.2019.102003

Abstract

Modern computing chips are composed of multiple simple, low-power processing cores. Increasing the number of processing cores on a single chip brings the opportunity to exploit the inherent massive thread-level parallelism and further improve performance. However, efficiently allocating applications (threads) to the available cores is a complicated process, and a poor mapping can become the limiting factor for performance on a tiled chip-multiprocessor (CMP). In this paper, we propose a mathematical formulation based on a mixed-integer linear program (MILP) to map application threads onto cores in the worst-case scenario, taking into account the spatial topology of a two-dimensional mesh (2D-mesh) Network-on-Chip (NoC). Our model allows the performance of different mapping and routing algorithms to be evaluated in absolute terms. The proposed analytical model is general enough to accommodate different optimisation policies, from energy to latency, and different numbers of memory controllers. Our experiments show that the proposed approach can achieve up to a 40% reduction over the traditional zig-zag heuristic, showing that there is room for improving application mapping.

Introduction

The computing platform paradigm has recently shifted towards communication-centric design, optimised for both performance and power. Over the past few years, researchers from industry and academia have started to develop chips containing more than a hundred cores (e.g., the PEZY-SC processor [1]) or more than a thousand cores (e.g., the Epiphany-V chip [2]). Stacking many low-power processing cores together offers a high degree of parallelism, but requires solving multiple execution-management issues. In fact, for chips with a large core count, the chance of resource contention is higher because of the significant amount of data exchanged between applications.

An on-chip packet-switched micro-network of interconnects (i.e., a NoC [3]) provides the physical substrate that processing cores use to communicate with each other (and with memory). In general, NoC design choices (such as channel width, buffer size, thread-to-core mapping, and routing algorithm) are critical for supporting data traffic well [4]. The latency and energy cost of data exchange inside the NoC must be optimised to achieve higher performance. A growing core count and growing communication between threads inside the NoC may lead to hotspots, so on-chip communication may become a barrier to higher performance. In general, the main design goals for NoC based interconnects are high bandwidth, low energy cost, and low latency. However, NoC designs with several hundred processing cores on the chip suffer from large energy consumption. In fact, the on-chip network can consume a significant fraction of the whole chip power budget. For instance, experiments [5] have shown that for a large core count (i.e., a 256-core CMP) a conventional 2D-mesh NoC can consume up to 45% of the total energy, while the NoC of the MIT Raw processor [6] can consume up to 36% of total system power. In another work, Vangal et al. [7] showed that on the 80-core Intel TeraFLOPS chip, the NoC uses up to 28% of tile power.

The thread-to-core mapping problem (T2CMP) in a NoC is the problem of assigning the threads of given applications to the available processing cores so as to optimise user-given performance metrics (such as energy cost, latency, or throughput) [8], [9], [10]. In general, NoC based research works focus mainly on reducing either overall energy consumption [11], [12] or communication cost [13], [14] by placing application threads on cores using various complex methodologies. At execution time, application threads exchange data with threads running on other cores or with primary memory (to fetch or write data). Clearly, threads that exchange a large number of data packets (or have high memory requirements) should be placed as close as possible to the threads or memory they communicate with, to improve performance and reduce energy consumption [15]. The presence and position of multiple memory controllers (MCs) should also be considered, since they can manage different memory modules/banks to serve the requests from cores.
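To make the intuition concrete, the following minimal sketch scores a mapping by the total traffic volume weighted by hop count on a 2D mesh. It is an illustration under stated assumptions, not the model of this paper: the names `hops` and `comm_cost`, the traffic volumes, and the use of Manhattan distance as the hop count (which holds for minimal routing such as XY routing) are all hypothetical choices made for the example.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column) of a tile in the mesh


def hops(a: Coord, b: Coord) -> int:
    """Hop count between two tiles, assuming minimal routing on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def comm_cost(mapping: Dict[str, Coord],
              traffic: Dict[Tuple[str, str], int]) -> int:
    """Sum of (traffic volume x hop count) over all communicating pairs."""
    return sum(vol * hops(mapping[src], mapping[dst])
               for (src, dst), vol in traffic.items())


# Hypothetical traffic volumes between four threads (arbitrary units).
traffic = {("t0", "t1"): 10, ("t2", "t3"): 10, ("t0", "t3"): 1}

# Heavy communicators placed on diagonally opposite tiles of a 2x2 mesh.
far = {"t0": (0, 0), "t1": (1, 1), "t2": (0, 1), "t3": (1, 0)}
# Heavy communicators placed on adjacent tiles.
near = {"t0": (0, 0), "t1": (0, 1), "t2": (1, 0), "t3": (1, 1)}

print(comm_cost(far, traffic), comm_cost(near, traffic))
```

In this toy instance the second mapping roughly halves the weighted hop count, because both heavy pairs sit one hop apart. A memory controller could be treated in the same way, as an additional endpoint with a fixed coordinate.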

In this paper, we propose a mixed-integer linear program (MILP) based solution for thread-to-core (T2C) mapping so that the overall energy consumption can be reduced (in the worst-case traffic scenario). Our analytical model places the application threads on the available cores to optimise (i) energy consumption or (ii) overall average latency. The proposed model is generic enough to be applied to any specific set of tasks by appropriately setting the simulation parameters. Our primary aim is to introduce a theoretical technique for evaluating the best practically achievable thread-to-core mapping performance, even if its computational requirements may prevent real-time use of the proposed approach. The main contributions of our work are:

  • We propose an MILP model that maps application threads onto the available cores to optimise power cost and latency.

  • Our model supports near-data processing (NDP), a feature unique to our model. We also show how changing the number of MCs affects latency and energy consumption inside the chip.

  • We provide worst-case quality of service (QoS) with respect to energy cost and latency.

  • We compare our proposed model with existing approaches (such as the zig-zag heuristic) and show that our model achieves a 13% reduction in average energy consumption and a 27% reduction in average packet latency.

  • The MILP technique provides bounds on the optimal solution, which allows the model to be used as a yardstick (i) to compare different thread-to-core mapping policies proposed by users; (ii) to identify the performance gap (in either power cost or latency) with respect to the best practically possible thread-to-core mapping; and (iii) to identify the influence of different network topologies and configurations.
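The yardstick idea in the last contribution can be sketched on a toy instance. The example below is only an illustration: exhaustive search stands in for the MILP solver (feasible only for a handful of tiles), the zig-zag order is assumed to be a boustrophedon row scan (left-to-right on even rows, right-to-left on odd rows), and the function names and traffic volumes are hypothetical.

```python
from itertools import permutations


def hops(a, b):
    # Hop count assuming minimal routing on a 2D mesh (Manhattan distance).
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(placement, traffic):
    # placement[i] is the (row, col) tile hosting thread i.
    return sum(v * hops(placement[s], placement[d])
               for (s, d), v in traffic.items())


def zigzag_tiles(rows, cols):
    # Assumed zig-zag order: even rows left-to-right, odd rows right-to-left.
    tiles = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        tiles.extend((r, c) for c in cs)
    return tiles


def best_placement(rows, cols, n_threads, traffic):
    # Exhaustive search over all placements; a stand-in for an exact solver.
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    return min(permutations(tiles, n_threads),
               key=lambda p: cost(p, traffic))


# Hypothetical traffic volumes between threads 0..3.
traffic = {(0, 2): 8, (1, 3): 8, (0, 1): 1}

heuristic = zigzag_tiles(2, 2)            # thread i goes to the i-th tile
optimum = best_placement(2, 2, 4, traffic)
gap = 1 - cost(optimum, traffic) / cost(heuristic, traffic)
print(cost(heuristic, traffic), cost(optimum, traffic), round(gap, 3))
```

On this instance the heuristic placement costs 33 weighted hops against an optimum of 17, so the gap measures how much a better mapping policy could still gain. The numbers say nothing about the results reported in the paper.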

The paper is organised as follows. Section 2 reviews the literature on the topic. Section 3 formally introduces the core mapping problem, and Section 4 presents the mathematical formulation based on MILP. In Section 5, we present and discuss the results of our thorough simulation campaign. Finally, Section 6 concludes the paper and outlines future research.

Section snippets

Related work

In past years, NoC based research has gained popularity. Broadly, the research topics discussed in [16], [17], [18], [19] could be classified into multiple research streams: (i) micro-architectural domain (mainly deals with network topology, architecture, capacity management); (ii) the communication infrastructure (mainly proposing the models, switching techniques, congestion control, power management, fault tolerance); (iii) analytical methods for evaluating proposed NoC’s performance; and (iv)

Problem description and assumptions

The main idea behind increasing the core count together with a robust NoC architecture is to host multiple applications on a single chip and so increase overall system utilisation. Usually, threads communicate with other threads or with other resource components (such as local memory banks, the last-level cache (LLC), or DRAM controllers). A good criterion is to place frequently communicating threads close to each other to minimise (average) latency and energy consumption. However, after threads terminate

Mathematical formulation

In this section, we propose a mathematical formulation for the T2CMP. To ease the understanding of the model, the section is divided into three sub-sections dedicated to the variables, the optimisation functions, and the constraints.
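For orientation only, a generic assignment-style MILP with the same three ingredients (variables, objective, constraints) can be written as follows. This is a textbook linearisation of a quadratic assignment objective and is not claimed to be the formulation of this paper. All symbols are assumptions introduced for the illustration: $T$ is the set of threads, $C$ the set of cores, $w_{tu} \ge 0$ the traffic volume between threads $t$ and $u$, $d_{ce}$ the hop distance between cores $c$ and $e$, $x_{tc}$ a binary variable equal to 1 if thread $t$ is placed on core $c$, and $y_{tcue}$ an auxiliary variable that is forced to 1 whenever $t$ is on $c$ and $u$ is on $e$.

```latex
\begin{align}
\min \quad & \sum_{t \in T} \sum_{u \in T} \sum_{c \in C} \sum_{e \in C}
             w_{tu}\, d_{ce}\, y_{tcue} \\
\text{s.t.} \quad
  & \sum_{c \in C} x_{tc} = 1        && \forall t \in T
    \quad \text{(each thread on exactly one core)} \\
  & \sum_{t \in T} x_{tc} \le 1      && \forall c \in C
    \quad \text{(at most one thread per core)} \\
  & y_{tcue} \ge x_{tc} + x_{ue} - 1 && \forall t, u \in T,\; c, e \in C \\
  & x_{tc} \in \{0, 1\}, \quad y_{tcue} \ge 0
\end{align}
```

Because the weights $w_{tu} d_{ce}$ are non-negative and the objective is minimised, each $y_{tcue}$ settles at $\max(0,\, x_{tc} + x_{ue} - 1)$, which equals the product $x_{tc} x_{ue}$ for binary $x$.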

Simulation results

In this section, we first introduce a deterministic and a non-deterministic heuristic approach, and then describe the test-instance generation procedure. A detailed report on our results and a discussion of the practical applicability of the proposed approach follow in the subsequent subsections.

Conclusion and future work

Recently, power-aware computing has become popular because of the increasing power demand of computation. There are processor chips with a large number of low-power cores offering a huge potential for higher performance. Efficient interconnects are required to support the information exchange among such a vast number of processing cores, while still providing low latency, high network throughput, and scalability at better area and power costs. In this paper, we have proposed an MILP based mathematical

Acknowledgements

The authors wish to thank the editor and the anonymous referees for their helpful and insightful comments, which significantly improved the quality of this paper.

Declaration of competing interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.peva.2019.102003.

Marco Pranzo has been Assistant Professor in Operations Research at the University of Siena since 2006. In 2003 he obtained his PhD in Operations Research from University La Sapienza (Rome), and in 1999 he graduated in Computer Science Engineering from Roma Tre University. His main research interests focus on scheduling theory with applications to complex systems such as public transportation, production, computer systems and healthcare management.

References (47)

  • Wang H. et al. Power-driven design of router microarchitectures in on-chip networks.

  • Vangal S.R. et al. An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J. Solid-State Circuits (2008)

  • Hesham S. et al. Survey on real-time networks-on-chip. IEEE Trans. Parallel Distrib. Syst. (2017)

  • Karkar A. et al. A survey of emerging interconnects for on-chip efficient multicast and broadcast in many-cores. IEEE Circuits Syst. Mag. (2016)

  • Hu J. et al. Energy- and performance-aware mapping for regular NoC architectures. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. (2005)

  • Shin D. et al. Power-aware communication optimization for networks-on-chips with voltage scalable links.

  • Poplavko P. et al. Task-level timing models for guaranteed performance in multiprocessor networks-on-chip.

  • Chen G. et al. Compiler-directed application mapping for NoC based chip multiprocessors. ACM SIGPLAN Not. (2007)

  • Marculescu R. et al. Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst. (2009)

  • Bjerregaard T. et al. A survey of research and practices of network-on-chip. ACM Comput. Surv. (2006)

  • Ogras U.Y. et al. Key research problems in NoC design: a holistic perspective.

  • Aleliunas R. et al. On embedding rectangular grids in square grids. IEEE Trans. Comput. (1982)

  • Garey M.R. et al. Computers and Intractability: A Guide to the Theory of NP-Completeness (1979)

Somnath Mazumdar received the MS degree in Distributed Computing and Networking from University of Nice Sophia Antipolis, France and the PhD degree in Computing Systems from the University of Siena, Italy, in 2011 and 2017, respectively. His main research interests are power efficient heterogeneous HP/T computing, computer architectures, performance analysis. He is also interested in Big Data-MultiCloud-Fog architectures and in-memory computing. He was/is associated primarily with multiple international European research projects.

Both authors contributed equally.
