An analytical model for thread-core mapping for tiled CMPs☆
Introduction
The computing platform paradigm has recently shifted towards communication-centric design, optimised for both performance and power. Over the past few years, researchers from industry and academia have started to develop chips containing hundreds of cores (e.g., the PEZY-SC processor [1]) or more than a thousand cores (e.g., the Epiphany-V chip [2]). Stacking many low-power processing cores together offers a high degree of parallelism but requires solving several execution management issues. In fact, chips with a large core count face a higher chance of resource contention because of the significant amount of data exchanged between applications.
An on-chip packet-switched micro-network of interconnects (i.e., a NoC [3]) provides the physical substrate that processing cores use to communicate with each other and with memory. In general, a NoC’s design parameters (such as channel width, buffer size, thread-to-core mapping, and routing algorithm) are critical for supporting data traffic efficiently [4]. The latency and energy cost of data exchange inside the NoC must be optimised to achieve higher performance. A growing core count and growing communication between threads may lead to hotspots, so on-chip communication may become a barrier to higher performance. In general, the main design goals for NoC-based interconnects are high bandwidth, low energy cost, and low latency. However, NoC designs with several hundred processing cores suffer from large energy consumption; the on-chip network can consume a significant fraction of the whole chip power budget. For instance, experiments [5] have shown that for a large core count (a 256-core CMP) a conventional 2D-mesh NoC can consume up to 45% of the total energy, while the NoC of the MIT Raw processor [6] can consume up to 36% of total system power. In another work, Vangal et al. [7] showed that on the 80-core Intel TeraFLOPS chip the NoC uses up to 28% of tile power.
The thread-to-core mapping problem (T2CMP) in a NoC is the problem of assigning the threads of given applications to the available processing cores so as to optimise user-given performance metrics (such as energy cost, latency, or throughput) [8], [9], [10]. In general, NoC research focuses on reducing either overall energy consumption [11], [12] or communication cost [13], [14] by placing application threads on cores using various complex methodologies. At execution time, application threads exchange data with threads running on other cores or with primary memory (to fetch or write data). Clearly, threads that exchange a large number of data packets (or have high memory requirements) should be placed as close as possible to their communication partners (or to memory) to improve performance and reduce energy consumption [15]. The presence and position of multiple memory controllers (MCs) should also be considered, since they can manage different memory modules/banks to serve requests from cores.
In this paper, we propose a mixed-integer linear program (MILP) based solution for thread-to-core (T2C) mapping that reduces the overall energy consumption (in the worst-case traffic scenario). Our analytical model places the application threads on the available cores to optimise (i) energy consumption or (ii) overall average latency. The proposed model is generic enough to be applied to any specific set of tasks by appropriately setting the simulation parameters. Our primary aim is to introduce a theoretical technique for evaluating the best practically achievable thread-to-core mapping performance, even if the computational requirements may prevent real-time use of the proposed approach. The main contributions of our work are:
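The intuition that heavily communicating threads should sit close together can be made concrete with a small sketch. It assumes a 2D mesh with minimal (XY-style) routing, so that the distance between two tiles is their Manhattan distance, and it scores a placement as traffic volume times hop count. The names `hops`, `communication_cost`, the thread labels, and the tile hosting the memory controller are hypothetical and chosen only for illustration; this is not the cost model of the paper.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (row, column) of a tile in the mesh


def hops(a: Coord, b: Coord) -> int:
    """Manhattan distance between two tiles (assumes minimal routing on a 2D mesh)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(traffic: Dict[Tuple[str, str], int],
                       placement: Dict[str, Coord]) -> int:
    """Sum of (packets exchanged) x (hops travelled) over all communicating pairs."""
    return sum(volume * hops(placement[src], placement[dst])
               for (src, dst), volume in traffic.items())


# Hypothetical traffic volumes (packets); "mc" stands for a memory controller
# attached to a fixed tile.
traffic = {("t0", "t1"): 100, ("t1", "t2"): 10, ("t0", "mc"): 50}

# Two placements on a 3x3 mesh, both with the memory controller at tile (0, 0).
near = {"mc": (0, 0), "t0": (0, 1), "t1": (1, 1), "t2": (2, 1)}
far = {"mc": (0, 0), "t0": (2, 2), "t1": (0, 2), "t2": (1, 1)}

near_cost = communication_cost(traffic, near)
far_cost = communication_cost(traffic, far)
print(near_cost, far_cost)
```

Under these assumptions the placement that keeps `t0` next to both `t1` and the memory controller has a clearly lower cost than the scattered one, which is the effect a mapping policy tries to exploit.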
- We propose an MILP model to map application threads onto available cores, optimising power cost and latency.
- Our model supports near data processing (NDP), which is unique to our model. We also show how changing the number of MCs affects latency and energy consumption inside the chip.
- We provide worst-case quality of service (QoS) with respect to energy cost and latency.
- We compare our proposed model with existing approaches (such as the zig-zag heuristic) and show that our model achieves a 13% reduction in average energy consumption and a 27% reduction in average packet latency.
- The MILP technique provides bounds on the optimal solution, so the model can be used as a yardstick (i) to compare different thread-to-core mapping policies proposed by users; (ii) to identify the performance gap (in either power cost or latency) with respect to the best practically possible thread-to-core mapping; (iii) to identify the influence of different network topologies and configurations.
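The yardstick idea can be illustrated on a toy instance. The sketch below is not the MILP of this paper: it replaces the solver with exhaustive enumeration, which is only feasible for a handful of threads, and it assumes a hop-weighted traffic cost on a 2D mesh. All names (`best_mapping`, `gap_percent`, the traffic values) are hypothetical.

```python
from itertools import permutations


def hops(a, b):
    """Manhattan distance between two mesh tiles (assumed routing model)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(traffic, placement):
    """Traffic volume times hop count, summed over communicating pairs."""
    return sum(v * hops(placement[s], placement[d]) for (s, d), v in traffic.items())


def best_mapping(threads, tiles, traffic):
    """Exhaustive search standing in for an exact solver; tiny instances only."""
    best_cost, best_placement = None, None
    for perm in permutations(tiles, len(threads)):
        placement = dict(zip(threads, perm))
        c = cost(traffic, placement)
        if best_cost is None or c < best_cost:
            best_cost, best_placement = c, placement
    return best_cost, best_placement


def gap_percent(candidate_cost, optimal_cost):
    """Relative distance of a candidate policy from the optimum, in percent."""
    return 100.0 * (candidate_cost - optimal_cost) / optimal_cost


threads = ["t0", "t1", "t2", "t3"]
tiles = [(r, c) for r in range(2) for c in range(3)]  # 2x3 mesh
traffic = {("t0", "t2"): 8, ("t1", "t3"): 8, ("t0", "t1"): 1}

# Candidate policy under evaluation: fill tiles in row-major order.
row_major = dict(zip(threads, tiles))
candidate_cost = cost(traffic, row_major)

optimal_cost, optimal_placement = best_mapping(threads, tiles, traffic)
gap = gap_percent(candidate_cost, optimal_cost)
print(candidate_cost, optimal_cost, round(gap, 1))
```

The gap reported for the candidate policy is the kind of quantity use (ii) refers to; an exact model plays the role of `best_mapping` on instances where enumeration is out of reach.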
The paper is organised as follows. Section 2 reviews the literature on the topic. Section 3 formally introduces the core mapping problem, and Section 4 presents the mathematical formulation based on MILP. Section 5 presents and discusses the results of our thorough simulation campaign. Finally, Section 6 concludes the paper and outlines future research.
Section snippets
Related work
In past years, NoC based research has gained popularity. Broadly, the research topics discussed in [16], [17], [18], [19] could be classified into multiple research streams: (i) micro-architectural domain (mainly deals with network topology, architecture, capacity management); (ii) the communication infrastructure (mainly proposing the models, switching techniques, congestion control, power management, fault tolerance); (iii) analytical methods for evaluating proposed NoC’s performance; and (iv)
Problem description and assumptions
The main reason for increasing the core count, together with a robust NoC architecture, is to host multiple applications on a single chip and thereby increase overall system utilisation. Usually, threads communicate with other threads or with other resource components (such as local memory banks, last-level cache (LLC), DRAM controllers). A good criterion is to place frequently communicating threads close to each other to minimise (average) latency and energy consumption. However, after threads terminate
Mathematical formulation
In this section, we propose a mathematical formulation for the T2CMP. To ease understanding of the model, the formulation is divided into three sub-sections dedicated to the variables, the optimisation functions, and the constraints.
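For orientation, the three ingredients named above (variables, objective, constraints) can be seen in a generic, textbook-style assignment model. The block below is only a sketch under stated assumptions and is not the formulation of this paper: $T$ is the set of threads, $C$ the set of cores, $E$ the set of communicating thread pairs, $w_{sd}$ an assumed traffic volume between threads $s$ and $d$, and $h_{cc'}$ an assumed hop count between cores $c$ and $c'$. The binary variable $x_{tc}$ is 1 when thread $t$ runs on core $c$, and the auxiliary variable $y_{scdc'}$ linearises the product $x_{sc}\,x_{dc'}$.

```latex
% Illustrative only: a generic linearised assignment model, not the model of this paper.
\begin{align}
\min \quad & \sum_{(s,d) \in E} w_{sd} \sum_{c \in C} \sum_{c' \in C} h_{cc'} \, y_{s c d c'} \\
\text{s.t.} \quad & \sum_{c \in C} x_{tc} = 1 && \forall t \in T \\
& \sum_{t \in T} x_{tc} \le 1 && \forall c \in C \\
& y_{s c d c'} \ge x_{sc} + x_{dc'} - 1 && \forall (s,d) \in E,\; c, c' \in C \\
& y_{s c d c'} \le x_{sc}, \quad y_{s c d c'} \le x_{dc'} && \forall (s,d) \in E,\; c, c' \in C \\
& x_{tc} \in \{0,1\}, \quad 0 \le y_{s c d c'} \le 1
\end{align}
```

The first constraint places every thread on exactly one core, the second allows at most one thread per core, and the remaining inequalities force $y_{scdc'}$ to equal the product of the two assignment variables, which keeps the objective linear.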
Simulation results
In this section, we first introduce a deterministic and a non-deterministic heuristic approach, then we describe the test instance generation procedure. A detailed report on our results and a discussion of the practical applicability of the proposed approach follow in the next subsections.
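This excerpt does not define the two heuristics, so the following sketch uses hypothetical stand-ins to show the distinction: a deterministic placement that fills a 2D mesh in snake (boustrophedon) order, and a non-deterministic placement that draws tiles at random. The function names and the snake-order reading of a zig-zag style placement are assumptions made for illustration, not the procedures evaluated in the paper.

```python
import random


def snake_order(rows, cols):
    """Tiles in boustrophedon order: left-to-right on even rows, right-to-left on odd rows."""
    tiles = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        tiles.extend((r, c) for c in columns)
    return tiles


def deterministic_placement(threads, rows, cols):
    """Assign threads to tiles in snake order; the same input always gives the same mapping."""
    if len(threads) > rows * cols:
        raise ValueError("more threads than tiles")
    return dict(zip(threads, snake_order(rows, cols)))


def random_placement(threads, rows, cols, seed=None):
    """Assign threads to distinct tiles drawn uniformly at random."""
    rng = random.Random(seed)  # pass a seed to make a run reproducible
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    return dict(zip(threads, rng.sample(tiles, len(threads))))


threads = ["t0", "t1", "t2", "t3"]
snake = deterministic_placement(threads, 2, 3)
rand_a = random_placement(threads, 2, 3, seed=7)
rand_b = random_placement(threads, 2, 3, seed=7)
print(snake)
print(rand_a)
```

Consecutive threads land on adjacent tiles under the snake order, which is why such placements are a common deterministic baseline; the random variant is typically run many times and its results averaged.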
Conclusion and future work
Recently, power-aware computing has become popular because of the increasing power demand of computation. There are processor chips with a large number of low-power cores offering huge potential for higher performance. Efficient interconnects are required to support the information exchange among such a vast number of processing cores, while still providing low latency, high network throughput, and scalability at better area and power costs. In this paper, we have proposed an MILP based mathematical
Acknowledgements
The authors wish to thank the editor and the anonymous referees for their helpful and insightful comments, which significantly improved the quality of this paper.
Declaration of competing interest
No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.peva.2019.102003.
Marco Pranzo is Assistant Professor in Operations Research at University of Siena from 2006. In 2003 he obtained the PhD in Operations Research from University La Sapienza (Rome), and in 1999 he graduated in Computer Science Engineering from Rome Tre University. His main research interests focus on scheduling theory with applications to complex systems as public transportation, production, computer systems and healthcare management.
References (47)
- et al. A survey on energy-efficient methodologies and architectures of network-on-chip. Comput. Electr. Eng. (2014)
- et al. Resource-efficient routing and scheduling of time-constrained streaming communication on networks-on-chip. J. Syst. Archit. (2008)
- et al. A survey on application mapping strategies for network-on-chip design. J. Syst. Archit. (2013)
- Cluster-based application mapping method for network-on-chip. Adv. Eng. Softw. (2011)
- et al. Efficient routing techniques in heterogeneous 3D networks-on-chip. Parallel Comput. (2013)
- Pezy Computing, Pezy Computing Cores....
- Parallella million cores (2017)
- et al. Principles and Practices of Interconnection Networks (2004)
- et al. A survey of research and practices of network-on-chip. ACM Comput. Surv. (2006)
- et al. Energy and performance benefits of active messages (2012)
- Power-driven design of router microarchitectures in on-chip networks
- An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE J. Solid-State Circuits
- Survey on real-time networks-on-chip. IEEE Trans. Parallel Distrib. Syst.
- A survey of emerging interconnects for on-chip efficient multicast and broadcast in many-cores. IEEE Circuits Syst. Mag.
- Energy- and performance-aware mapping for regular NoC architectures. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.
- Power-aware communication optimization for networks-on-chips with voltage scalable links
- Task-level timing models for guaranteed performance in multiprocessor networks-on-chip
- Compiler-directed application mapping for NoC based chip multiprocessors. ACM SIGPLAN Not.
- Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst.
- A survey of research and practices of network-on-chip. ACM Comput. Surv.
- Key research problems in NoC design: a holistic perspective
- On embedding rectangular grids in square grids. IEEE Trans. Comput.
- Computers and Intractability: A Guide to the Theory of NP-completeness
Somnath Mazumdar received the MS degree in Distributed Computing and Networking from University of Nice Sophia Antipolis, France and the PhD degree in Computing Systems from the University of Siena, Italy, in 2011 and 2017, respectively. His main research interests are power efficient heterogeneous HP/T computing, computer architectures, performance analysis. He is also interested in Big Data-MultiCloud-Fog architectures and in-memory computing. He was/is associated primarily with multiple international European research projects.
☆ Both authors contributed equally.