Improving multitask performance and energy consumption with partial-ISA multicores
Introduction
The Internet of Things (IoT) domain is composed of systems of different complexity: from very simple nodes driven by ultra-low power microcontrollers forming wireless sensor networks [20] to complex wearables that demand high-performance processors [10]. It is natural that, as these systems evolve, computation will be brought closer to the user, which makes edge computing important for IoT applications [25]. Nonetheless, although edge computing requires high-performance processors, power and area will always be strong constraints in IoT systems. Therefore, in an environment of fast-paced application evolution, the ability to execute different applications provided by General Purpose Processors (GPPs), allied with techniques for reducing energy consumption, will be essential for IoT devices.
Current embedded systems implement a variety of strategies to efficiently deliver high-performance throughput at small power budgets, which can inspire IoT processor designs. One strategy that directly targets processor core efficiency is the single-ISA (Instruction Set Architecture) heterogeneous processor [14]. Examples adopted by the industry are the ARM big.LITTLE [3] and, more recently, ARM DynamIQ [8], which comprise distinct cores with different performance and energy characteristics on the same die. As they all implement the same ISA, threads can transparently migrate between processors, allowing a scheduler to allocate jobs according to the applications' needs and non-functional requirements, such as performance and energy.
The ISA of these GPPs has been incrementally tailored to increase the performance of emerging applications. Each architectural iteration adds newer instructions in the form of extensions (e.g., SSE and AVX in the x86, and NEON and SVE in the ARM), increasing the complexity of the microarchitecture. However, not all applications will take advantage of such instructions. For instance, x86 AVX SIMD (Single Instruction Multiple Data) instructions are specifically used for highly vectorized applications.
As Fig. 1 shows, this is no different for NEON instructions in ARM architectures, the instruction extension that contains both the Floating Point (FP) and the Single Instruction Multiple Data (SIMD) instructions. This figure shows the percentage of dynamic instructions executed in a wide range of workloads from different benchmark sets (details on the experiments are in Section 3). It demonstrates how NEON instructions (i.e., both SIMD and FP operations) are underused, with many of the analyzed benchmarks not issuing any Floating Point or SIMD instructions at all. Besides, the NEON functional unit adds considerable area overhead, as one can observe in Fig. 2, which shows the area breakdown of components for two ARM processors. According to our experiments, the ARM A7, an in-order processor, has a single NEON pipeline that occupies 26% of its total core area, while the A15, an out-of-order processor, has two larger NEON pipelines that fill 69% of its core area. These data suggest that, even though some kind of hardware support is needed to execute these extensions, as they can be important for specific applications, that support comes at a high implementation cost. Furthermore, implementing such support in every core of a multicore system is very likely a poor choice both performance- and energy-wise.
Given this scenario, we propose the PHISA (Partially Heterogeneous ISA) Multicore. A PHISA system comprises cores that partially implement an ISA, removing selected instruction extensions and all the physical components that are specific to the execution of those instructions. The best candidates for removal are the instructions that require vast portions of the processor area, as shown in Fig. 2, and that are not commonly used, as demonstrated in Fig. 1. Although full-ISA cores may still be required to execute the extensions, partial-ISA cores can replace a portion of these full cores, reducing the area and the power dissipation of the entire multicore system. With the resulting freed area and power, the designer can introduce new cores, which may significantly improve performance in multi-workload environments. The basic reasoning behind the PHISA system is that the designer can trade part of the performance provided by the ISA extension units (in the form of processed Instructions Per Cycle (IPC) for specific applications) for more cores that can increase the system throughput (in the form of Task Parallelism (TP) for general applications). Therefore, instead of adding more area and power to the highly constrained IoT system environment, we use the resources once given to the expensive instruction extensions to improve our system.
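To make the area side of this trade-off concrete, the following minimal sketch estimates how many partial-ISA cores fit into the area freed by replacing full-ISA cores. It only uses the NEON area shares quoted above (26% for the A7 and 69% for the A15). Everything else is an assumption made for illustration: the names `phisa_layout` and `Layout` are hypothetical, the area of one full core is taken as the unit, and power, caches beyond the core, and interconnect are ignored, so this is not the evaluation model used in this paper.

```python
from dataclasses import dataclass

# NEON share of total core area, in percent, as quoted in the text.
NEON_AREA_PERCENT = {"A7": 26, "A15": 69}


@dataclass(frozen=True)
class Layout:
    full_cores: int
    partial_cores: int

    @property
    def total_cores(self) -> int:
        return self.full_cores + self.partial_cores


def phisa_layout(core: str, baseline_full_cores: int, kept_full_cores: int) -> Layout:
    """Spend the area of `baseline_full_cores` full-ISA cores on a PHISA mix.

    Assumption (illustrative only): a partial-ISA core costs exactly the
    full-core area minus the NEON share, and nothing else changes.
    """
    if not 0 <= kept_full_cores <= baseline_full_cores:
        raise ValueError("kept_full_cores must be between 0 and the baseline")

    # Area of one partial core, in hundredths of a full core.
    partial_area = 100 - NEON_AREA_PERCENT[core]
    # Area released by the full cores that are not kept, same units.
    freed_area = (baseline_full_cores - kept_full_cores) * 100
    # Only whole cores can be instantiated, hence integer division.
    return Layout(kept_full_cores, freed_area // partial_area)


if __name__ == "__main__":
    for core in ("A7", "A15"):
        layout = phisa_layout(core, baseline_full_cores=4, kept_full_cores=1)
        print(core, layout, "total:", layout.total_cores)
```

Under these assumptions, the larger the area share of the removed extension, the more extra cores the same silicon budget can hold, which is the IPC-for-TP exchange described above.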
As in any single-ISA heterogeneous system, correctly allocating threads according to their needs is also essential in our partial-ISA system. Nonetheless, our system introduces an extra challenge: the scheduler must deal with tasks that require removed instruction extensions and allocate them to full-ISA cores. Therefore, we also propose different approaches for scheduling and emulation of such instructions, aiming for both performance and energy optimizations.
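The allocation constraint described here can be sketched as follows. This is a hypothetical illustration, not the scheduler proposed in this paper: the `Core`, `Task`, and `allocate` names are assumptions, each task is assumed to be known in advance to need (or not need) the removed extension, and emulation on partial-ISA cores is not modeled, so tasks that cannot be placed simply wait.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Core:
    name: str
    full_isa: bool  # True if the core still implements the removed extension


@dataclass(frozen=True)
class Task:
    name: str
    needs_extension: bool  # True if the task issues removed-extension instructions


def allocate(tasks: List[Task], cores: List[Core]) -> Tuple[Dict[str, str], List[str]]:
    """Return (task name -> core name, names of tasks left waiting)."""
    free_full = [c for c in cores if c.full_isa]
    free_partial = [c for c in cores if not c.full_isa]
    mapping: Dict[str, str] = {}
    waiting: List[str] = []

    # Tasks that need the extension can only run natively on full-ISA
    # cores, so they are placed first.
    for task in (t for t in tasks if t.needs_extension):
        if free_full:
            mapping[task.name] = free_full.pop(0).name
        else:
            waiting.append(task.name)

    # The remaining tasks prefer partial-ISA cores and fall back to any
    # full-ISA core that is still free.
    for task in (t for t in tasks if not t.needs_extension):
        if free_partial:
            mapping[task.name] = free_partial.pop(0).name
        elif free_full:
            mapping[task.name] = free_full.pop(0).name
        else:
            waiting.append(task.name)

    return mapping, waiting


if __name__ == "__main__":
    cores = [Core("full0", True), Core("part0", False), Core("part1", False)]
    tasks = [Task("fft", True), Task("crc", False), Task("jpeg", True), Task("sha", False)]
    print(allocate(tasks, cores))
```

Placing extension-dependent tasks first keeps the scarce full-ISA cores from being occupied by tasks that could run anywhere; a real policy would also have to weigh migration cost, emulation, and energy.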
We evaluate the proposed system with different ratios of full- to partial-ISA cores, using as a case study the ARM architecture with and without the NEON instruction set, even though PHISA can be generalized to most ISAs and extensions. Furthermore, our evaluation extends to different organization scenarios, comparing systems of symmetric and asymmetric performance, built from both partial- and full-ISA cores. We show that a PHISA multicore can improve performance and reduce energy in traditional edge computing scenarios when compared to its full-ISA counterpart, considering the same power budget. Furthermore, we show how PHISA compares to existing heterogeneous processors (such as DynamIQ designs) and how it can also improve performance and energy consumption in these scenarios using different scheduling policies.
The remainder of this article is organized as follows. In Section 2 we present details on the implementation of PHISA multicores and their scheduler requirements. Section 3 describes the methodology used to evaluate the proposed system, while Section 4 presents the results for the different scenarios. Section 6 discusses recent works related to this paper. Finally, Section 7 concludes this work.
Proposed system
The PHISA multicore design removes hardware components specifically used by an ISA extension while leaving the remaining microarchitecture of a core unaltered. This core, which we call a Partial ISA core, keeps its ability to execute instructions from its base ISA. Therefore, performance is only affected for the removed instructions, as parameters such as issue-width, execution order, and branch prediction are all kept the same. With the area and power freed from removing these ISA extensions,
Methodology
Modeling and Simulation: We have used the gem5 simulator [4] to model the different versions of the ARM’s A7 and A15 processors. For area and power measurements, we have modeled the same processors in McPAT [17] using a node technology of , with both running at the same frequency of 2 GHz. Our models consider the entire core (including MMU and instruction and data L1 caches) without L2 caches. Although McPAT models its components according to an A9 processor, we have used an approach
Results
In this section, we present the results for the Setups described in Table 3. In Sections 4.1 (Impact of partial ISA cores), 4.2 (Full core vs PHISA multicore: sharing a power budget), and 4.3 (PHISA vs traditional heterogeneous systems (DynamIQ)), we present all results using the simple scheduler described in Section 2.2. This scheduler is meant to provide the behavior of a PHISA system without necessarily optimizing it for any particular requirement. Later, in Section 4.4, we introduce new results
Analysis of PHISA on high NEON usage
Our experiments so far have used scenarios with some of the single-threaded workloads presented in Fig. 1. Although the selected set of workloads covers a wide range of applications from the embedded system and IoT market, one may question the behavior of the system when exposed to higher amounts of NEON instructions. Considering that the number of instructions from removed extensions will directly influence the behavior of a PHISA multicore, we now use an analytic model of hypothetical
Related work
In this section, we present state-of-the-art works related to this paper. We discuss works that have explored single-ISA heterogeneous processors, as well as research on the impact of different ISAs on a system. We also present works that have explored the concept of partial (or overlapping) ISAs, on both the software (and scheduling) and hardware sides. Finally, we summarize the novelty of our work when compared to the state-of-the-art.
Single-ISA heterogeneous processors have
Conclusions
In this paper, we have proposed PHISA multicores as a means to reduce area and power in multicore systems by removing ISA extensions that are not constantly used. We show that it is possible to use partial ISA cores in edge computing systems without incurring large performance impacts. Furthermore, we show how the extra area and power can be used to add more cores to the multicore system and further increase performance and decrease energy consumption. When coupled with a specialized
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior, Brasil (CAPES) - Finance Code 001, the Fundação de Amparo à Pesquisa do Estado do RS (FAPERGS), Brazil, and the Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil.
Jeckson Dellagostin Souza received his M.Sc. and Ph.D. degrees from UFRGS, Brazil, in 2015 and 2020, respectively. His primary research interests include binary compatibility, heterogeneous processors and multicore environments, particularly focusing on power reduction techniques. For more information, please visit http://www.inf.ufrgs.br/ jdsouza/.
References (30)
- MediaBench II video: Expediting the next generation of video systems research. Microprocess. Microsyst. (2009)
- Microprocessor optimizations for the internet of things: A survey. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. (2018)
- A. Annamalai, et al. An opportunistic prediction-based thread scheduling to maximize throughput/watt in AMPs, in:...
- (2019)
- The gem5 simulator. ACM SIGARCH Comput. Archit. News (2011)
- E. Blem, et al. Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures, in:...
- ISA Wars: Understanding the relevance of ISA being RISC or CISC to performance, power, and energy on modern architectures. ACM Trans. Comput. Syst. (2015)
- Performance implications of single thread migration on a chip multi-core. SIGARCH Comput. Archit. News (2005)
- (2019)
- F. Endo, et al. Micro-architectural simulation of embedded core heterogeneity with gem5 and McPAT, in: RAPIDO ’15,...
- Exynos 7 dual 7270 processor: Specs, features - samsung exynos
- MiBench: A free, commercially representative embedded benchmark suite
- Single-ISA heterogeneous multi-core architectures for multithreaded workload performance
- Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction
Pedro Henrique Exenberger Becker received his B.Sc. degree in Computer Engineering in 2018 and his M.Sc. degree in Computer Science in 2019, both from Universidade Federal do Rio Grande do Sul (Brazil). In January 2020 he joined Universitat Politècnica de Catalunya (Spain), where he is currently pursuing his Ph.D. His research focuses on the area of computer architecture for performance- and energy-constrained applications, particularly targeting hardware support for autonomous driving systems.
Antonio Carlos Schneider Beck received his Dr. degree from UFRGS, Brazil, in 2008. Currently, he is an associate professor at the Applied Informatics Department at the Informatics Institute of UFRGS, in charge of Embedded Systems and Computer Organization disciplines at the undergraduate and graduate levels. His primary research interests include computer architectures and embedded systems design, focusing on power consumption. For more information, visit www.inf.ufrgs.br/ caco/.