Performance optimization for parallel systems with shared DWM via retiming, loop scheduling, and data placement

https://doi.org/10.1016/j.sysarc.2020.101842

Abstract

Domain Wall Memory (DWM), an ideal candidate for replacing traditional memories especially in parallel systems, has many desirable characteristics, such as low leakage power, high density and low access latency. However, due to the tape-like architecture of DWM, shift operations have a vital impact on performance. For data-intensive applications with massive loops and arrays, increasing loop parallelism together with appropriate loop scheduling and data placement on DWM can significantly improve the performance of parallel systems. This paper explores optimizing the performance of parallel systems through retiming, loop scheduling and data placement, especially when the data are arrays. It proposes an Integer Linear Programming (ILP) formulation and a Scheduling While Placing (SWP) algorithm to generate optimal or nearly optimal loop schedules and data placements with minimum execution time. The experimental results show that SWP and ILP effectively reduce execution time compared with the greedy List Scheduling First Access First Place (LF) algorithm. In addition, this paper proposes the Threshold Retiming Repetition (TRR) algorithm to combine the retiming technique with SWP and ILP. The experimental results show that SWP+TRR and ILP+TRR further reduce the execution time compared to the results without retiming.

Introduction

With the development of big data and artificial intelligence techniques, memories with large capacity and high access performance are required in modern computing systems. Domain Wall Memory (DWM) is a novel non-volatile memory with high access performance, high density and low leakage power [1]. For data-intensive applications, DWM can meet the requirements of access speed, capacity and power consumption [2]. Compared with traditional memories such as DRAM, DWM offers the same high density with close-to-zero standby power, which makes it an ideal candidate for future memory systems.

In previous works, since the write lifetime of NVM is limited, many studies aim at reducing the number of write operations when NVM is used in hybrid memory [3], [4]. DWM has no write endurance problem [18]; however, before data in DWM can be accessed, the data in the nanowire need to be shifted to align with a read/write port, which is called a shift operation. Shift operations decrease the access speed of DWM and thus degrade system performance. Therefore, reducing the shift operations of DWM can significantly improve performance. From the hardware perspective, adding extra read/write ports to reduce the average distance between data and the closest head is considered in [5], [6]. This enables higher access speed, but more ports lower the density, which is a primary advantage of DWM. From the software perspective, given a fixed number of ports, [7] designs a compiler technique that reallocates the accessed data to reduce the access time on DWM. The improvement of this technique is limited since the data access sequence is fixed. A more recent work [17] presents a heuristic that reschedules the sequence of instructions and reallocates data according to the affinity between accessed data groups and DWM ports. It greatly reduces the number of shift operations, but it only considers a single-core system for simplicity. In a multi-core system, at most one core shifts domains to align data with the read/write port at any moment, while the other cores perform on-chip operations, such as computing on registers, or wait to shift the DWM later. Moreover, different cores may access the same data at the same time, which introduces complexity when rescheduling instructions in programs. Furthermore, big data and artificial intelligence systems often repeat large numbers of computations on matrices [20], [21], [22]. We can regard these operations as loops, and each matrix as a composition of a series of arrays.
It is therefore necessary to consider how arrays are organized on DWM when data are accessed repeatedly in loop programs. Taking these observations into account, in this paper we target parallel systems that may have multiple processing units or multiple processors. In such systems, a single-head DWM is used as a shared scratch-pad memory. We explore improving system performance through retiming, loop scheduling and data placement when the data are arrays.
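Because every access on a single-head tape costs shifts proportional to the distance between the head and the accessed cell, the total shift count is a direct function of data placement and access order. The following minimal sketch makes this concrete; the placement, data names and access sequence are hypothetical illustrations, not the paper's benchmarks:

```python
def count_shifts(placement, access_sequence):
    """Count shift operations on a single-head DWM tape.

    placement: dict mapping a data name to its cell index on the tape.
    access_sequence: ordered list of data names accessed.
    The head starts aligned to cell 0; each access shifts the tape by
    the distance between the current head position and the target cell.
    """
    head = 0
    shifts = 0
    for name in access_sequence:
        cell = placement[name]
        shifts += abs(cell - head)
        head = cell
    return shifts

# The same access sequence under two placements: placement matters.
seq = ["A", "B", "A", "C", "B"]
good = {"A": 0, "B": 1, "C": 2}
bad = {"A": 0, "B": 4, "C": 2}
print(count_shifts(good, seq))  # 5 shifts
print(count_shifts(bad, seq))   # 12 shifts
```

Keeping frequently alternating data in adjacent cells more than halves the shift count here, which is exactly the effect that joint scheduling and placement try to exploit.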

The main contributions of this paper are: 1) We explore retiming, loop scheduling and data placement simultaneously for loop programs in parallel systems with a shared single-head DWM. 2) We propose an Integer Linear Programming (ILP) formulation to generate an optimal instruction schedule and data placement with minimum execution time. 3) We design a heuristic algorithm, the Scheduling While Placing (SWP) algorithm, to generate a nearly optimal loop schedule and data placement. 4) We propose the Threshold Retiming Repetition (TRR) algorithm, which can be combined with ILP or SWP to further reduce the execution time of loop programs.

The experimental results show that our ILP algorithm and heuristic SWP algorithm can effectively reduce execution time. When these two algorithms are combined with the proposed TRR algorithm, SWP shows more room for improvement than ILP on average, and the result of SWP+TRR comes close to the optimal result of the ILP+TRR algorithm.

The rest of this paper is organized as follows: the architecture of DWM and related definitions are presented in Section II. Section III introduces a motivational example to show the main ideas of this paper. We propose the ILP formulation, the TRR algorithm, and the SWP algorithm in the following three sections. Section VII analyzes the experimental results. Finally, we conclude this paper in Section VIII.

Section snippets

Architectural view and related works

In this section, we first give an architectural view of Domain Wall Memory (DWM). Then, we introduce the retiming technique along with some definitions used in this paper.

Motivation example

In this section, we use a motivational example to show the main contributions of this paper. In this example, we assume the target system consists of 2 processing units and a shared single-head DWM. A load or shift operation takes 1 time unit (about 5 ns), an addition costs 1 time unit, and a multiplication costs 2.
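With these unit costs, the finish time of a schedule on 2 processing units can be estimated by always dispatching the next operation to the earliest-free unit. The sketch below is a deliberately simplified illustration of that cost model: it ignores data dependences and the serialization of DWM shifts, and the operation list is hypothetical, not the loop of Fig. 2:

```python
import heapq

# Unit costs from the example: load/shift = 1, addition = 1, multiplication = 2.
COST = {"load": 1, "shift": 1, "add": 1, "mul": 2}

def makespan(ops, num_units=2):
    """Greedy list scheduling: assign each operation, in the given order,
    to the processing unit that becomes free earliest, and return the
    overall finish time. Dependences and shift serialization are ignored
    to keep the illustration short."""
    free = [0] * num_units      # next free time of each processing unit
    heapq.heapify(free)
    finish = 0
    for op in ops:
        t = heapq.heappop(free) + COST[op]
        finish = max(finish, t)
        heapq.heappush(free, t)
    return finish

print(makespan(["load", "mul", "add", "mul", "add"]))  # 4 time units
```

Even this crude model shows why a multiplication-heavy unit can become the bottleneck while the other unit idles, which is what careful loop scheduling tries to avoid.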

Fig. 2 (a) depicts one loop portion of a program. Among them, A, B, C, D, E are all one-dimensional arrays. Fig. 2(c) is the DFG of this loop. When we need

ILP formulation for loop scheduling and data placement

In this section, we first define some notations, which are shown in Table 1. Then, we present the details of the Integer Linear Programming (ILP) formulation to generate a loop schedule and data placement for a multiple-processing-unit system with a shared single-head DWM.
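The ILP searches the joint space of schedules and placements for the minimum execution time. For a toy instance, the same optimum can be found by exhaustive enumeration, which the following sketch does for the placement dimension alone (a fixed, hypothetical access sequence stands in for the schedule; this is an illustration of the objective, not the paper's formulation):

```python
from itertools import permutations

def shifts(placement, seq):
    """Total shift operations for an access sequence under a placement;
    the head starts at cell 0."""
    head, total = 0, 0
    for name in seq:
        total += abs(placement[name] - head)
        head = placement[name]
    return total

def best_placement(names, seq):
    """Enumerate every assignment of names to tape cells 0..n-1 and
    return the placement that minimizes the total shift count, mirroring
    the exponential search space the ILP solver explores."""
    best = None
    for perm in permutations(range(len(names))):
        p = dict(zip(names, perm))
        if best is None or shifts(p, seq) < shifts(best, seq):
            best = p
    return best

seq = ["A", "C", "A", "B", "A", "C"]
opt = best_placement(["A", "B", "C"], seq)
print(opt, shifts(opt, seq))  # the hot array A lands in the middle cell
```

The exhaustive search visits n! placements, which motivates both the ILP solver for exact answers on small inputs and the polynomial-time heuristic of Section VI for larger ones.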

Threshold retiming repetition algorithm

In this section, we propose the Threshold Retiming Repetition (TRR) algorithm to minimize the maximum delay of a single iteration of a loop program. This technique is applied before ILP or SWP to find a better DFG that fits the data organization on DWM. Details are shown in Algorithm 1.

Suppose there are Z nodes in the original DFG G, and the variable flag indicates whether there exists a node on which we can perform retiming (lines 1–2). First of all, we assume flag is true to enter the while loop. Every time
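The core retiming move that TRR builds on redistributes edge delays: retiming a node v by r[v] changes every edge delay to d'(u,v) = d(u,v) + r[v] - r[u], which can break a long zero-delay path inside one iteration. The sketch below shows that effect on a small hypothetical cyclic DFG (the graph, node times and retiming values are illustrative, not Algorithm 1 itself):

```python
def retimed_delay(d, r):
    """Edge delays after retiming r: d'(u,v) = d[(u,v)] + r[v] - r[u]."""
    return {(u, v): w + r[v] - r[u] for (u, v), w in d.items()}

def critical_delay(times, d, r):
    """Delay of one iteration under retiming r: the longest chain of
    zero-delay edges, weighted by node computation times. A legal
    retiming never creates a zero-delay cycle, so repeated relaxation
    converges on these small graphs."""
    dp = retimed_delay(d, r)
    best = dict(times)            # longest path ending at each node
    for _ in times:               # relax |V| times
        for (u, v), w in dp.items():
            if w == 0:
                best[v] = max(best[v], best[u] + times[v])
    return max(best.values())

# Hypothetical 3-node cyclic DFG: node compute times and edge delay counts.
times = {"a": 1, "b": 2, "c": 1}
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
no_retiming = {"a": 0, "b": 0, "c": 0}
retiming = {"a": 0, "b": 1, "c": 2}   # pushes delays onto the long path
print(critical_delay(times, edges, no_retiming))  # 4: zero-delay path a->b->c
print(critical_delay(times, edges, retiming))     # 2: critical path broken up
```

Halving the per-iteration critical delay gives the later scheduling and placement stages a DFG with more freedom, which is why TRR runs first.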

Scheduling while placing algorithm

The ILP formulation takes exponential time to complete. When the input is large, ILP is not a practical technique. In this section, we propose a heuristic algorithm, the Scheduling While Placing (SWP) algorithm, which can generate a near-optimal loop schedule and data placement in polynomial time. The main idea of SWP is to generate a good schedule while placing data reasonably, so as to reduce shift operations as much as possible. The details of SWP are shown in Algorithm 2.
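The "schedule while place" idea can be sketched as a greedy loop: at each step, pick the ready operation whose operand is cheapest to reach from the current head position, placing any not-yet-placed operand in the nearest free cell. The function below is a deliberately simplified toy version of that idea (one operand per operation, no dependences, hypothetical operation and array names), not Algorithm 2 itself:

```python
def swp_like(ready_ops, data_of, tape_size):
    """Toy 'schedule while placing' heuristic. Repeatedly choose the
    ready operation whose operand costs the fewest shifts from the
    current head position; unplaced operands go to the nearest free
    cell. Returns the operation order, the placement, and the total
    shift count."""
    placement, free = {}, set(range(tape_size))
    head, shifts, order = 0, 0, []
    pending = list(ready_ops)
    while pending:
        def cost(op):
            name = data_of[op]
            if name in placement:
                return abs(placement[name] - head)
            return min(abs(c - head) for c in free)
        op = min(pending, key=cost)
        name = data_of[op]
        if name not in placement:
            cell = min(free, key=lambda c: abs(c - head))
            placement[name] = cell
            free.remove(cell)
        shifts += abs(placement[name] - head)
        head = placement[name]
        order.append(op)
        pending.remove(op)
    return order, placement, shifts

order, placement, total = swp_like(
    ["op1", "op2", "op3", "op4"],
    {"op1": "A", "op2": "B", "op3": "A", "op4": "C"},
    tape_size=4)
print(order, placement, total)
```

Note how the heuristic reorders op3 (a second access to A) directly after op1, so the head does not move at all between them; deciding schedule and placement together is what makes that saving possible.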

Set able introduced in line 1 contains all

Experiments

In this section, we first introduce the experimental setups. Then we present the experimental results and analysis.

We compare the proposed Scheduling While Placing (SWP) algorithm and the ILP algorithm with the List Scheduling First Access First Place (LF) algorithm. The benchmarks are from MediaBench [12] and the data-flow graphs (DFGs) are extracted by SimpleScalar [13]. In our experiments, the number of nodes in “pegwit” is the smallest, followed by “rasta”, “pgp”, “mesa”, “mpeg2”, “ghostscript”, “jpeg”,

Conclusion

In this work, we propose a heuristic algorithm to reduce the execution time of loop programs by scheduling instructions on two FUs while generating a specific data placement on DWM. Experimental results show that it can effectively reduce the finish time, and that, when combined with the retiming technique, the execution time can be further reduced to be much closer to its optimal result. In the future, we will study data placement on more DWM tapes to further take advantage of the relevance of

Declaration of competing interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled “Performance Optimization for Parallel Systems with Shared DWM via Retiming, Loop Scheduling, and Data Placement”.


References (22)

  • X. Chen et al.

    Optimizing data placement for reducing shift operations on domain wall memories

    52nd ACM/EDAC/IEEE Design Automation Conference

    (2015)
  • Parkin et al.

    Magnetic domain-wall racetrack memory

    Science

    (2008)
  • Y. Wang et al.

    An energy-efficient nonvolatile in-memory computing architecture for extreme learning machine by domain-wall nanowire devices

    IEEE Trans. Nanotechnol.

    (2015)
  • Z. Sun et al.

    AIMR: an adaptive page management policy for hybrid memory architecture with NVM and DRAM

    2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems

    (2015)
  • Y. Zhang, J. Zhan, et al., Dynamic memory management for hybrid DRAM-NVM main memory systems, in: 13th International...
  • W. Yuhao, H. Yu, An ultralow-power memory-based big-data computing platform by nonvolatile domain-wall nanowire...
  • Venkatesan et al.

    TapeCache: a high density, energy efficient cache based on domain wall memory

    Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design

    (2012)
  • L. Thomas et al.

    Racetrack memory: a high-performance, low-cost, non-volatile memory based on magnetic domain walls

    2011 International Electron Devices Meeting

    (2011)
  • C. Liang-Fang et al.

    Scheduling data-flow graphs via retiming and unfolding

    IEEE Trans. Parallel Distrib. Syst.

    (1997)
  • Leiserson et al.

    Retiming synchronous circuitry

    Algorithmica

    (1991)
  • C. Liang-Fang

    Scheduling and Behavioral Transformations for Parallel Systems

    (1993)
Siyuan Gao

Shouzhen Gu

Edwin Hsing-Mean Sha

Qingfeng Zhuge
