Current journal: IEEE Transactions on Computers
  • Per-Operation Reusability Based Allocation and Migration Policy for Hybrid Cache
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-27
    Minsik Oh; Kwangsu Kim; Duheon Choi; Hyuk-Jun Lee; Eui-Young Chung

    Recently, a hybrid cache consisting of SRAM and STT-RAM has attracted much attention as a future memory by complementing each other with different memory characteristics. Prior works focused on developing data allocation and migration techniques considering write-intensity to reduce write energy at STT-RAM. However, these works often neglect the impact of operation-specific reusability of a cache line. In this paper, we propose an energy-efficient per-operation reusability-based allocation and migration policy (ORAM) with a unified LRU replacement policy. First, to select an adequate memory type for allocation, we propose a cost function based on per-operation reusability – gain from an allocated cache line and loss from an evicted cache line for different memory types – which exploits the temporal locality. Besides, we present a migration policy, victim and target cache line selection scheme, to resolve memory type inconsistency between replacement policy and the allocation policy, with further energy reduction. Experiment results show an average energy reduction in the LLC and the main memory by 12.3 and 21.2 percent, and the improvement of latency and execution time by 21.2 and 8.8 percent, respectively, compared with a baseline hybrid cache management. In addition, the Energy-Delay Product (EDP) is improved by 36.9 percent over the baseline.

  • Footprint-Based DIMM Hotplug
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-04
    Shinobu Miwa; Masaya Ishihara; Hayato Yamaki; Hiroki Honda; Martin Schulz

    Power-efficiency has become one of the most critical concerns for HPC as we continue to scale computational capabilities. A significant fraction of system power is spent on large main memories, mainly caused by the substantial amount of DIMM standby power needed. However, while necessary for some workloads, for many workloads large memory configurations are too rich, i.e., these workloads only make use of a fraction of the available memory, causing unnecessary power usage. This observation opens new opportunities for power reduction by powering DIMMs on and off depending on the current workload. In this article, we propose footprint-based DIMM hotplug that enables a compute node to adjust the number of DIMMs that are powered on depending on the memory footprint of a running job. Our technique relies on two main subcomponents—memory footprint monitoring and DIMM management—which we both implement as part of an optimized page management system with small control overhead. Using Linux's memory hotplug capabilities, we implement our approach on a real system, and our results show that our proposed technique can save 50.6–52.1 percent of the DIMM standby energy and the CPU+DRAM energy of up to 1.50 Wh for various small-memory-footprint applications without loss of performance.

  • Collaborative Adaptation for Energy-Efficient Heterogeneous Mobile SoCs
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-04
    Amit Kumar Singh; Karunakar Reddy Basireddy; Alok Prakash; Geoff V. Merrett; Bashir M. Al-Hashimi

    Heterogeneous Mobile System-on-Chips (SoCs) containing CPU and GPU cores are becoming prevalent in embedded computing, and they need to execute applications concurrently. However, existing run-time management approaches do not perform adaptive mapping and thread-partitioning of applications while exploiting both CPU and GPU cores at the same time. In this paper, we propose an adaptive mapping and thread-partitioning approach for energy-efficient execution of concurrent OpenCL applications on both CPU and GPU cores while satisfying performance requirements. To start execution of concurrent applications, the approach makes mapping (number of cores and operating frequencies) and partitioning (distribution of threads between CPU and GPU) decisions to satisfy performance requirements for each application. The mapping and partitioning decisions are made through collaboration between the CPU and GPU cores’ processing capabilities such that balanced execution can be performed. During execution, adaptation is triggered when new application(s) arrive, or when an executing one finishes and frees cores. The adaptation process identifies a new mapping and thread-partitioning in a similar collaborative manner for the remaining applications, provided it leads to an improvement in energy efficiency. The proposed approach is experimentally validated on the Odroid-XU3 hardware platform with a varying set of applications. Results show an average energy saving of 37% compared to existing approaches while satisfying the performance requirements.

  • Optimal Metastability-Containing Sorting via Parallel Prefix Computation
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-07
    Johannes Bund; Christoph Lenzen; Moti Medina

    Friedrichs et al. (TC 2018) showed that metastability can be contained when sorting inputs arising from time-to-digital converters, i.e., measurement values can be correctly sorted without resolving metastability using synchronizers first. However, this work left open whether this can be done by small circuits. We show that this is indeed possible, by providing a circuit that sorts Gray code inputs (possibly containing a metastable bit) and has asymptotically optimal depth and size. Our solution utilizes the parallel prefix computation (PPC) framework (JACM 1980). We improve this construction by bounding its fan-out by an arbitrary $f\geq 3$, without affecting depth and increasing circuit size by a small constant factor only. Thus, we obtain the first PPC circuits with asymptotically optimal size, constant fan-out, and optimal depth. To show that applying the PPC framework to the sorting task is feasible, we prove that the latter can, despite potential metastability, be decomposed such that the core operation is associative. We obtain asymptotically optimal metastability-containing sorting networks. We complement these results with simulations, independently verifying the correctness as well as small size and delay of our circuits. Proofs are omitted in this version; the article with full proofs is provided online at http://arxiv.org/abs/1911.00267.
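
    To make the parallel prefix computation idea concrete, here is a minimal software sketch, not the circuit from the paper. It computes every prefix of an associative operator with the classic pair-then-recurse structure; the function name `parallel_prefix` is hypothetical, and the sketch models neither metastability nor the fan-out bound.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def parallel_prefix(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [x0, x0∘x1, x0∘x1∘x2, ...] for an associative operator `op`.

    Each list comprehension / loop level below corresponds to one layer of
    independent operations, which is what a circuit would run in parallel.
    """
    n = len(xs)
    if n <= 1:
        return list(xs)

    # Layer 1: combine adjacent pairs (all pairs are independent).
    pairs = [op(xs[i], xs[i + 1]) for i in range(0, n - 1, 2)]

    # Recurse on the half-size problem.
    sub = parallel_prefix(pairs, op)

    # Final layer: odd positions are read directly from the recursion,
    # even positions need one more application of `op`.
    out: List[T] = []
    for i in range(n):
        if i == 0:
            out.append(xs[0])
        elif i % 2 == 1:
            out.append(sub[i // 2])
        else:
            out.append(op(sub[i // 2 - 1], xs[i]))
    return out


# String concatenation is associative but not commutative, so it also
# checks that operand order is preserved.
print(parallel_prefix(list("abcde"), lambda a, b: a + b))
```

    Only associativity is required of `op`, which is why the abstract stresses that the core sorting operation must be shown to be associative.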

  • Optimizing Parallel I/O Accesses through Pattern-Directed and Layout-Aware Replication
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-08
    Shuibing He; Yanlong Yin; Xian-He Sun; Xuechen Zhang; Zongpeng Li

    As the performance gap between processors and storage devices keeps increasing, I/O performance becomes a critical bottleneck of modern high-performance computing systems. In this paper, we propose a pattern-directed and layout-aware data replication design, named PDLA, to improve the performance of parallel I/O systems. PDLA includes an HDD-based scheme, H-PDLA, and an SSD-based scheme, S-PDLA. For applications with relatively low I/O concurrency, H-PDLA identifies access patterns of applications and makes a reorganized data replica for each access pattern on HDD-based servers with an optimized data layout. Moreover, to accommodate applications with high I/O concurrency, S-PDLA replicates critical access patterns that can bring performance benefits on SSD-based servers or on HDD-based and SSD-based servers. We have implemented the proposed replication scheme under the MPICH2 library on top of the OrangeFS file system. Experimental results show that H-PDLA can significantly improve the original parallel I/O system performance and demonstrate the advantages of S-PDLA over H-PDLA.

  • Secure and Efficient Control Data Isolation with Register-Based Data Cloaking
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-11
    Xiayang Wang; Fuqian Huang; Haibo Chen

    Attackers often exploit memory corruption vulnerabilities to overwrite control data and further gain control over victim applications. Despite progress in advanced defensive techniques, such attacks still remain a major security threat. In this article, we present Niffler, a new technique that provides lightweight and practical defense against such attacks. Niffler eliminates the threat of memory corruption over control data by cloaking all control data in registers along its execution and only spilling them into a dedicated read-only area in memory upon a shortage of registers. As an attacker cannot directly overwrite any register or read-only memory pages, no direct memory corruption on control data is feasible. Niffler is made efficient by compactly encoding return address, balancing register allocation, dynamically determining register spilling and leveraging the recent Intel Memory Protection Extensions (MPX) for control data lookup during register restoring. We implement Niffler based on LLVM and conduct a set of evaluations on SPECCPU 2006 and real-world applications. Performance evaluation shows that Niffler introduces an average of only 6.3 percent overhead on SPECCPU 2006 C programs and an average of 28.2 percent overhead on C++ programs.

  • Adaptive-Length Coding of Image Data for Low-Cost Approximate Storage
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-11
    Qianqian Fan; David J. Lilja; Sachin S. Sapatnekar

    In the past few years, ever-increasing amounts of image data have been generated by users globally, and these images are routinely stored in cold storage systems in compressed formats. This article investigates the use of approximate storage that leverages the use of cheaper, lower reliability memories that can have higher error rates. Since traditional JPEG-based schemes based on variable-length coding are extremely sensitive to error, the direct use of approximate storage results in severe quality degradation. We propose an error-resilient adaptive-length coding (ALC) scheme that divides all symbols into two classes, based on their frequency of occurrence, where each class has a fixed-length codeword. This provides a balance between the reliability of fixed-length coding schemes, which have a high storage overhead, and the storage-efficiency of Huffman coding schemes, which show high levels of error on low-reliability storage platforms. Further, we use data partitioning to determine which bits are stored in approximate or reliable storage to lower the overall cost of storage. We show that ALC can be used with general non-volatile storage, and can substantially reduce the total cost compared to traditional JPEG-based storage.
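
    The two-class idea can be sketched as follows. This is an illustration under stated assumptions, not the ALC scheme itself: it assumes a one-bit class flag followed by a fixed-width index within the class, and the names `build_code`, `encode`, and `decode` are hypothetical.

```python
from collections import Counter


def bits_for(n):
    """Bits needed to index n items (at least 1)."""
    return max(1, (n - 1).bit_length())


def build_code(symbols, n_frequent):
    """Split symbols into a frequent and a rare class by occurrence count.

    Assumption: codeword = 1 class-flag bit + fixed-width index in the class.
    """
    ranked = [s for s, _ in Counter(symbols).most_common()]
    frequent, rare = ranked[:n_frequent], ranked[n_frequent:]
    fb, rb = bits_for(len(frequent)), bits_for(len(rare))
    table = {}
    for i, s in enumerate(frequent):
        table[s] = "0" + format(i, f"0{fb}b")
    for i, s in enumerate(rare):
        table[s] = "1" + format(i, f"0{rb}b")
    return table, fb, rb


def encode(symbols, table):
    return "".join(table[s] for s in symbols)


def decode(bits, table, fb, rb):
    inverse = {code: sym for sym, code in table.items()}
    out, pos = [], 0
    while pos < len(bits):
        # The flag bit alone determines the codeword length.
        width = 1 + (fb if bits[pos] == "0" else rb)
        out.append(inverse[bits[pos:pos + width]])
        pos += width
    return out


data = list("aaaaaaabbbbbcccde")
table, fb, rb = build_code(data, n_frequent=2)
bits = encode(data, table)
assert decode(bits, table, fb, rb) == data
```

    Because only the flag bit decides how many bits to read next, a flipped index bit corrupts a single symbol rather than desynchronizing the rest of the stream, which is the failure mode that makes Huffman-style variable-length codes fragile on unreliable storage.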

  • A New Class of Single Burst Error Correcting Codes with Parallel Decoding
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-15
    Abhishek Das; Nur A. Touba

    With technology scaling, burst errors or clustered errors are becoming increasingly common in different types of memories. Multiple bit upsets due to particle strikes, write disturbance errors, and magnetic field coupling are a few of the mechanisms which cause clustered errors. In this article, a new class of single burst error correcting codes is presented which corrects a single burst of any size b within a codeword. A code construction methodology is presented which enables us to construct the proposed scheme from existing codes, e.g., Hamming codes. A new single step decoding methodology for the proposed class of codes is also presented which enables faster decoding. Different code constructions using Hamming codes and BCH codes are presented in this paper, and a comparison is made with existing schemes in terms of decoding complexity and data redundancy. The proposed scheme in all cases reduces the decoder complexity for little to no increase in data redundancy, specifically for higher burst error sizes.

  • WAL-SSD: Address Remapping-Based Write-Ahead-Logging Solid-State Disks
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-16
    Kyuhwa Han; Hyukjoong Kim; Dongkun Shin

    Recent advances in flash memory technology have reduced the cost-per-bit of flash storage devices such as solid-state drives (SSDs), thereby enabling the development of large-capacity SSDs for enterprise-scale storage. However, two major concerns arise in designing SSDs. First, the size of the address mapping table is increasing in proportion to the capacity of the SSD. The SSD-internal firmware, called flash translation layer (FTL), must maintain the address mapping table in the internal DRAM. Although the previously proposed demand map loading technique uses a small size of cached map table, the technique aggravates poor random performance. Second, there are many redundant writes in storage workloads, which have an adverse effect on the performance and lifetime of the SSD. For example, many transaction-supporting applications use the write-ahead-log (WAL) scheme, which writes the same data twice. To resolve these problems, we propose a novel transaction-supporting SSD, called WAL-SSD, which logs transaction data at the internally-managed WAL area and relocates the data atomically via the FTL-level remap operation at the transaction checkpointing. It can also be used to transform random write requests to sequential requests. We implemented a prototype of WAL-SSD with a real SSD device. Experiments demonstrate the performance improvement by WAL-SSD with three use cases: remap-journaling, atomic multi-block update, and random write logging.
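
    The benefit of an FTL-level remap can be illustrated with a toy mapping table. This is a sketch under assumptions, not the WAL-SSD firmware: the class `ToyFTL` and its methods are hypothetical, and atomicity across multiple blocks and garbage collection are not modeled.

```python
class ToyFTL:
    """Toy flash translation layer: logical block address -> physical page."""

    def __init__(self):
        self.l2p = {}          # logical-to-physical mapping table
        self.flash = []        # append-only list of physical pages
        self.page_writes = 0   # number of physical page programs

    def write(self, lba, data):
        self.flash.append(data)
        self.page_writes += 1
        self.l2p[lba] = len(self.flash) - 1

    def read(self, lba):
        return self.flash[self.l2p[lba]]

    def remap(self, src_lba, dst_lba):
        """Point dst_lba at the page holding src_lba's data; no data copy."""
        self.l2p[dst_lba] = self.l2p.pop(src_lba)


# Conventional write-ahead logging: data is written to the log, then again
# to its home location at checkpoint time (two page writes).
conventional = ToyFTL()
conventional.write(1000, b"new")                      # log write
conventional.write(7, conventional.read(1000))        # checkpoint copy

# Remap-based checkpoint: the logged page simply becomes the home page.
remapped = ToyFTL()
remapped.write(1000, b"new")                          # log write
remapped.remap(1000, 7)                               # checkpoint, no copy

print(conventional.page_writes, remapped.page_writes)
```

    In this toy model the checkpoint changes only the mapping table, so the same data is programmed to flash once instead of twice.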

  • Low Latency Floating-Point Division and Square Root Unit
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-16
    Javier D. Bruguera

    Digit-recurrence algorithms are widely used in actual microprocessors to compute floating-point division and square root. These iterative algorithms present a good trade-off in terms of performance, area and power. We present a floating-point division and square root unit, which implements a radix-64 floating-point division and a radix-16 floating-point square root. To have an affordable implementation, each radix-64 division iteration and radix-16 square root iteration are made of simpler radix-4 iterations: 3 radix-4 iterations in division and 2 in square root. Speculation is used between consecutive radix-4 iterations to get a reduced timing. There are three different parts in digit-recurrence implementations: initialization, digit iterations, and rounding. The digit iteration is the iterative part and it uses the same logic for several cycles. Division and square root share partially the initialization and rounding stages, whereas each one has different logic for the digit iterations. The result is a low-latency floating-point divider and square root, requiring 11, 6, and 4 cycles for double, single and half-precision division with normalized operands and result, and 15, 8 and 5 cycles for square root. One or two additional cycles are needed in case of subnormal operand(s) or result.
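
    The claim that a radix-64 iteration can be built from three radix-4 iterations can be checked with a simplified recurrence. The sketch below uses a plain restoring recurrence with exact rational arithmetic; it is an assumption-laden illustration, not the unit described above, which would use redundant digit sets, selection logic, and speculation. The function name is hypothetical.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, radix, iterations):
    """Restoring digit recurrence for x / d with 0 <= x < d.

    Each iteration retires one radix-`radix` quotient digit:
        w <- radix * w;  digit <- floor(w / d);  w <- w - digit * d
    Returns (quotient approximation, final scaled remainder).
    """
    w = Fraction(x)
    d = Fraction(d)
    q = Fraction(0)
    for j in range(1, iterations + 1):
        w *= radix
        digit = w // d            # integer in 0 .. radix - 1
        w -= digit * d
        q += Fraction(digit, radix ** j)
    return q, w


# Three radix-4 digits (2 bits each) retire the same 6 quotient bits as
# one radix-64 digit.
q4, _ = digit_recurrence_divide(1, 3, radix=4, iterations=3)
q64, _ = digit_recurrence_divide(1, 3, radix=64, iterations=1)
assert q4 == q64
print(q4)
```

    The same bookkeeping gives 2 × 2 = 4 bits per radix-16 square root iteration, matching the two radix-4 steps mentioned in the abstract.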

  • NV-Journaling: Locality-Aware Journaling Using Byte-Addressable Non-Volatile Memory
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-17
    Cheng Chen; Qingsong Wei; Weng-Fai Wong; Chundong Wang

    Modern file systems rely on the journaling mechanism to maintain crash consistency. The use of non-volatile memory (NVM) significantly improves the performance of journaling file systems. However, the superior performance of NVM will increase the likelihood of the journal filling up more often, thereby increasing the frequency of checkpointing. Together with the large amount of random checkpointing I/O found in most use cases, the checkpointing process becomes a new performance bottleneck. This paper proposes NV-Journaling, a strategy that reduces the frequency of checkpointing as well as reshapes the I/O pattern of checkpointing from one of random I/O to that which is more sequential I/O. NV-Journaling introduces fine-grained commits along with a cache-friendly NVM journaling layout that exploits the idiosyncrasies of NVM technology. Under this scheme, only the modified portion of a block, rather than the entire block, is written into the NVM journal device. Doing so significantly reduces checkpoint frequency and achieves better space utilization. NV-Journaling further reshapes the I/O pattern of checkpoint using a locality-aware checkpointing process. Checkpointed blocks are classified into hot and cold blocks. NV-Journaling maintains a hot block list to absorb repeated updates, and a cold bucket list to group blocks by their proximity on disk. When a checkpoint is required, cold buckets are selected such that blocks are sequentially flushed to the hard disk. We built a prototype of NV-Journaling by modifying the JBD2 layer in the Linux kernel and evaluated it using different workloads. Our experimental results show that NV-Journaling can improve performance by up to 4.3× compared to traditional journaling.

  • Comparing Neural Network Based Decoders for the Surface Code
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-23
    Savvas Varsamopoulos; Koen Bertels; Carmen Garcia Almudever

    Matching algorithms can be used for identifying errors in quantum systems, the most famous being the Blossom algorithm. Recent works have shown that small distance quantum error correction codes can be efficiently decoded by employing machine learning techniques based on neural networks (NN). Various NN-based decoders have been proposed to enhance the decoding performance and the decoding time. Their implementation differs in how the decoding is performed, at logical or physical level, as well as in several neural network related parameters. In this work, we implement and compare two NN-based decoders, a low level decoder and a high level decoder, and study how different NN parameters affect their decoding performance and execution time. Crucial parameters such as the size of the training dataset, the structure and the type of the neural network, and the learning rate used during training are discussed. After performing this comparison, we conclude that the high level decoder based on a Recurrent NN shows a better balance between decoding performance and execution time and is much easier to train. We then test its decoding performance for different code distances, probability datasets and under the depolarizing and circuit error models.

  • Power- and Cache-Aware Task Mapping with Dynamic Power Budgeting for Many-Cores
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-20
    Martin Rapp; Mark Sagi; Anuj Pathania; Andreas Herkersdorf; Jörg Henkel

    Two factors primarily affect the performance of multi-threaded tasks on many-core processors with logically-shared and physically-distributed Last-Level Cache (LLC): the LLC latencies of threads running on different cores and the per-core power budgets that aim to guarantee thermally safe operation. Two knobs affect these factors: First, the mapping of threads to cores affects both the LLC latencies and the power budgets. Second, dynamic power budgeting refines the power budgets during task execution. A mapping that spatially distributes threads across the many-core increases the power budgets, but unfortunately also increases the LLC latencies. Contrarily, mapping all threads near the center of the many-core minimizes the LLC latencies, but unfortunately also decreases the power budgets. Consequently, both metrics cannot be simultaneously optimal, which leads to a Pareto-optimization for task mapping that has formerly not been exploited. Dynamic power budgeting reallocates the power budgets according to the tasks’ execution phases. This results in a dynamically changing non-uniform power budget, which further increases the performance. We are the first to present a run-time algorithm PCGov combining task-agnostic task mapping and task-aware dynamic power budgeting for many-cores with shared distributed LLC. PCGov yields up to 21 percent lower response time and 13 percent lower energy consumption compared to the state-of-the-art, with a low overhead of less than 0.5 percent.

  • Lightweight Power Monitoring Framework for Virtualized Computing Environments
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-23
    James Phung; Young Choon Lee; Albert Y. Zomaya

    The pervasive use of virtualization techniques in today's datacenters poses challenges in power monitoring since it is not possible to directly measure the power consumption of a virtual entity such as a virtual machine (VM) and a container. In this paper, we present cWatts++, a lightweight virtual power meter that enables accurate power usage measurement in virtualized computing environments such as VMs and containers of Cloud data centers. At the core of cWatts++ is its application-agnostic power model. To this end, we devise two power models (eventModel and raplModel) that are driven by CPU event counters and the Running Average Power Limit (RAPL) feature of modern Intel CPUs, respectively. While eventModel is more generic and, thus, applicable to a wide range of workloads, raplModel is particularly good for CPU-bound workloads. We have evaluated cWatts++ with its two power models in a real system using the PARSEC benchmark suite and our in-house benchmarks. Our evaluation study demonstrates that these power models have an average error of 4.55 and 1.25 percent, respectively, compared with actual power usage measurements of a real power meter, Cabac Power-Mate.

  • New Flexible Multiple-Precision Multiply-Accumulate Unit for Deep Neural Network Training and Inference
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-05
    Hao Zhang; Dongdong Chen; Seok-Bum Ko

    In this paper, a new flexible multiple-precision multiply-accumulate (MAC) unit is proposed for deep neural network training and inference. The proposed MAC unit supports both fixed-point operations and floating-point operations. For floating-point format, the proposed unit supports one 16-bit MAC operation or sum of two 8-bit multiplications plus a 16-bit addend. To make the proposed MAC unit more versatile, the bit-width of exponent and mantissa can be flexibly exchanged. By setting the bit-width of exponent to zero, the proposed MAC unit also supports fixed-point operations. For fixed-point format, the proposed unit supports one 16-bit MAC or sum of two 8-bit multiplications plus a 16-bit addend. Moreover, the proposed unit can be further divided to support sum of four 4-bit multiplications plus a 16-bit addend. At the lowest precision, the proposed MAC unit supports accumulating of eight 1-bit logic AND operations to enable the support of binary neural networks. Compared to the standard 16-bit half-precision MAC unit, the proposed MAC unit provides more flexibility with only 21.8 percent area overhead. Compared to a standard 32-bit single-precision MAC unit, the proposed MAC unit requires much less hardware cost but still provides 8-bit exponent in the numerical format to maintain large dynamic range for deep learning computing.
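
    The fixed-point operating modes listed in the abstract can be summarized with a small functional model. This is a behavioral sketch only, assuming unsigned lanes packed into 16-bit words, no overflow handling, and an addend in every mode (the abstract does not say whether the 1-bit mode takes one); `flexible_mac` and `split_lanes` are hypothetical names.

```python
# Lane width in bits -> number of lanes, following the modes in the abstract:
# one 16-bit, two 8-bit, four 4-bit, or eight 1-bit products per operation.
MODES = {16: 1, 8: 2, 4: 4, 1: 8}


def split_lanes(word, lane_bits, n_lanes):
    """Unpack n_lanes unsigned fields of lane_bits each, lowest lane first."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(n_lanes)]


def flexible_mac(a, b, addend, lane_bits):
    """Sum of lane-wise products plus addend (unsigned, unbounded result)."""
    n_lanes = MODES[lane_bits]
    lanes_a = split_lanes(a, lane_bits, n_lanes)
    lanes_b = split_lanes(b, lane_bits, n_lanes)
    # For 1-bit lanes the product x * y equals the logic AND x & y.
    return sum(x * y for x, y in zip(lanes_a, lanes_b)) + addend


print(flexible_mac(3, 4, 1, lane_bits=16))             # 3*4 + 1
print(flexible_mac(0x0302, 0x0504, 10, lane_bits=8))   # 2*4 + 3*5 + 10
print(flexible_mac(0b10110101, 0b11110000, 0, lane_bits=1))
```

    The floating-point modes, in which exponent and mantissa widths are exchanged, are not modeled here.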

  • Utilization-Tensity Bound for Real-Time DAG Tasks under Global EDF Scheduling
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-20
    Xu Jiang; Jinghao Sun; Yue Tang; Nan Guan

    The utilization bound is a well-known concept in real-time scheduling theory for sequential periodic tasks, which can be used both for quantifying the performance of scheduling algorithms and as an efficient schedulability test. However, the schedulability of parallel real-time task graphs depends not only on utilization but also on another parameter, tensity, the ratio between the longest path length and the period. In this paper, we use utilization-tensity bounds to better characterize the schedulability of parallel real-time tasks. In particular, we derive utilization-tensity bounds for parallel DAG tasks under global EDF scheduling, which facilitate significantly more precise schedulability analysis than the state-of-the-art analysis techniques based on capacity augmentation bound and response time analysis. Moreover, we apply the above results to the federated scheduling paradigm to improve the system schedulability by choosing proper scheduling strategies for tasks with different workload and structure features.
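
    The two parameters can be computed for a single DAG task as in the sketch below. Tensity follows the definition in the abstract (longest path length over period); utilization is assumed to be the usual total worst-case execution time over period. The function name and the example task are hypothetical, and the derived bounds themselves are not reproduced here.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def utilization_and_tensity(wcet, preds, period):
    """Return (utilization, tensity) of one DAG task.

    wcet:   vertex -> worst-case execution time
    preds:  vertex -> iterable of predecessor vertices
    period: task period (same time unit as wcet)
    """
    graph = {v: tuple(preds.get(v, ())) for v in wcet}
    finish = {}
    # Longest path ending at each vertex, computed in topological order.
    for v in TopologicalSorter(graph).static_order():
        finish[v] = wcet[v] + max((finish[p] for p in graph[v]), default=0)
    longest_path = max(finish.values())
    total_work = sum(wcet.values())
    return total_work / period, longest_path / period


# Diamond-shaped task: a -> {b, c} -> d
wcet = {"a": 1, "b": 2, "c": 3, "d": 1}
preds = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
u, t = utilization_and_tensity(wcet, preds, period=10)
print(u, t)
```

    Two tasks with equal utilization can differ widely in tensity: a long sequential chain leaves little slack before the deadline, while a wide, shallow graph leaves a lot, which is why utilization alone is a coarse schedulability indicator for parallel tasks.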

  • TTADF: Power Efficient Dataflow-Based Multicore Co-Design Flow
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-27
    Ilkka Hautala; Jani Boutellier; Olli Silvén

    The era of mobile communications and the Internet of Things (IoT) has introduced numerous challenges for mobile processing platforms that are responsible for increasingly complex signal processing tasks from different application domains. In recent years, the power efficiency of computing has been improved by adding more parallelism and workload-specific computing resources to such platforms. However, programming of parallel systems can be time-consuming and challenging if only low-level programming methods are used. This work presents a dataflow-based co-design framework TTADF that reduces the design effort of both software and hardware design for mobile processing platforms. The paper presents three application examples from the fields of video coding, machine vision, and wireless communications. The application examples are mapped and profiled both on a pipelined and a shared-memory multicore platform that is generated by TTADF. The results of the TTADF co-design-based solutions are compared against previous manually created designs and a recent dataflow-based design flow, showing that TTADF provides very high energy efficiency together with a high level of automation in software and hardware design.

  • KnightSim: A Fast Discrete Event-Driven Simulation Methodology for Computer Architectural Simulation
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-30
    Christopher E. Giles; Christina L. Peterson; Mark A. Heinrich

    In this paper we introduce a fast discrete event-driven simulation methodology, called KnightSim, that is intended for use in the development of future computer architectural simulations. KnightSim extends an older event-driven simulation library by (1) incorporating corrections to functional issues that were introduced by the recent additions of stack protection, pointer mangling, and source fortification in the Linux software stack, (2) incorporating optimizations to the event engine, and (3) introducing a novel parallel implementation. KnightSim implements events as independently executable x86 “KnightSim Contexts”. KnightSim Contexts comprise a mechanism for fast context execution and automatically model occupancy and contention, which readily lends itself to use in computer architectural simulations. We present the implementation methodologies of KnightSim and Parallel KnightSim with a detailed performance analysis. Our performance analysis makes direct comparisons between KnightSim, Parallel KnightSim, and the discrete event-driven simulation engines found in three different mainstream computer architectural simulators. Our results show that on average KnightSim achieves speedups of 2.8 to 11.9 over the other discrete event-driven simulation engines. Our results also show that on average Parallel KnightSim can achieve speedups over KnightSim of 1.89, 3.33, 5.84, and 9.24 for 2, 4, 8, and 16 threaded executions respectively.
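
    The reported Parallel KnightSim speedups can be turned into parallel efficiency, here taken as speedup divided by thread count. Efficiency is a derived figure and does not appear in the abstract; only the four speedup values are taken from it.

```python
# Average speedups of Parallel KnightSim over KnightSim, from the abstract.
speedups = {2: 1.89, 4: 3.33, 8: 5.84, 16: 9.24}


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling that is actually achieved."""
    return speedup / threads


efficiencies = {t: parallel_efficiency(s, t) for t, s in speedups.items()}
for threads, eff in efficiencies.items():
    print(f"{threads:2d} threads: efficiency {eff:.3f}")
```

    Under this definition, efficiency falls from about 0.95 at 2 threads to about 0.58 at 16 threads, so the absolute speedup keeps growing while the per-thread return diminishes.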

  • Maximizing I/O Throughput and Minimizing Performance Variation via Reinforcement Learning Based I/O Merging for SSDs
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-03
    Chao Wu; Cheng Ji; Qiao Li; Congming Gao; Riwei Pan; Chenchen Fu; Liang Shi; Chun Jason Xue

    The merging technique is widely adopted by I/O schedulers to maximize system I/O throughput. However, I/O merging could increase the latency of individual I/O requests, thus incurring prolonged I/O latencies and enlarged performance variations. Even with better system throughput, higher worst-case latency experienced by some requests could block the SSD storage system, which violates the QoS (Quality of Service) requirement. In order to improve QoS performance while providing higher I/O throughput, this paper proposes a reinforcement learning based I/O merging approach. By learning the characteristics of various I/O patterns, the proposed approach makes merging decisions adaptively based on different I/O workloads. Evaluation results show that the proposed scheme is capable of reducing the standard deviation of I/O latency by 19.1 percent on average and worst-case latency by 7.3–60.9 percent at the 99.9th percentile compared with the latest I/O merging scheme, while maximizing system throughput.

  • A Novel Sequence Generation Approach to Diagnose Faults in Reconfigurable Scan Networks
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-03
    Riccardo Cantoro; Aleksa Damljanovic; Matteo Sonza Reorda; Giovanni Squillero

    With the complexity of nanoelectronic devices rapidly increasing, an efficient way to handle a large number of embedded instruments has become a necessity. The IEEE 1687 standard was introduced to provide flexibility in accessing and controlling such instrumentation through a reconfigurable scan chain. Nowadays, together with testing the system for defects that may affect the scan chains themselves, the diagnosis of such faults is also important. This article proposes a method for generating stimuli to precisely identify permanent high-level faults in an IEEE 1687 reconfigurable scan chain: the system is modeled as a finite state automaton where faults correspond to multiple incorrect transitions; then, a dynamic greedy algorithm is used to select a sequence of inputs able to distinguish between all possible faults. Experimental results on the widely-adopted ITC'02 and ITC'16 benchmark suites, as well as on synthetically generated circuits, clearly demonstrate the applicability and effectiveness of the proposed approach: generated sequences are two orders of magnitude shorter compared to previous methodologies, while the computational resources required remain acceptable even for larger benchmarks.

  • Signal Strength-Aware Adaptive Offloading with Local Image Preprocessing for Energy Efficient Mobile Devices
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-03
    Young Geun Kim; Young Seo Lee; Sung Woo Chung

    To prolong battery life of mobile devices, image processing applications often exploit offloading techniques which run some or all of the computations on remote servers. Unfortunately, the existing offloading techniques do not consider the fact that data transmission time and energy consumption of wireless network interfaces exponentially increase when signal strength decreases. In this paper, we propose an adaptive offloading technique for image processing applications, which considers wireless signal strength. To improve performance and energy efficiency of offloading, we also propose to adaptively exploit local preprocessing (executing image preprocessing on local mobile devices), considering wireless signal strength; the local preprocessing usually reduces the size of the transmitted image in offloading. Our proposed technique estimates performance and energy consumption of the following three methods, depending on the wireless signal strength: 1) local execution (executing all the computations on the local mobile devices), 2) offloading without local preprocessing, and 3) offloading with local preprocessing. Based on the estimated performance and energy consumption, our technique employs one among the three methods, which is expected to result in the best performance or energy efficiency. In our evaluation on an off-the-shelf smartphone, when a user prefers performance to energy, our proposed technique improves performance by 27.1 percent, compared to the conventional offloading technique that does not consider the signal strength. On the other hand, when a user prefers energy to performance, our proposed technique saves system-wide (not just CPU or wireless network interface) energy consumption by 26.3 percent, on average, compared to the conventional offloading technique.
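
    The final selection step among the three methods can be sketched as below. The estimation models that depend on signal strength are not given in the abstract, so the estimates are plain inputs here; the names `Estimate` and `choose_method` and all numbers are hypothetical placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    method: str       # "local", "offload", or "offload+preprocess"
    time_s: float     # estimated execution time in seconds
    energy_j: float   # estimated system-wide energy in joules


def choose_method(estimates, prefer):
    """Pick the method with the best estimate for the user's preference."""
    if prefer == "performance":
        return min(estimates, key=lambda e: e.time_s).method
    if prefer == "energy":
        return min(estimates, key=lambda e: e.energy_j).method
    raise ValueError("prefer must be 'performance' or 'energy'")


# Made-up estimates for one signal-strength level (illustration only).
weak_signal = [
    Estimate("local", time_s=4.0, energy_j=6.0),
    Estimate("offload", time_s=5.5, energy_j=9.0),
    Estimate("offload+preprocess", time_s=3.2, energy_j=6.5),
]
print(choose_method(weak_signal, "performance"))
print(choose_method(weak_signal, "energy"))
```

    With these placeholder numbers the two preferences lead to different choices, which is the situation the adaptive scheme is meant to handle.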

  • Scrabble: A Fine-Grained Cache with Adaptive Merged Block
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-06
    Chao Zhang; Yuan Zeng; Xiaochen Guo

    A large fraction of the microprocessor energy is consumed by the data movement in the system. One of the reasons is the inefficiency in the conventional cache design. Cache blocks larger than a word are used in conventional caches to exploit spatial locality. However, many applications only use a small part of a cache block before its eviction. Transferring and storing unused data wastes bandwidth, energy, and limited cache space. Prior work on fine-grained caches can reduce data access and storage granularity to reduce the amount of unused data. However, small data blocks typically require greater metadata and control overhead. Sharing the common bits among tags of fine-grained blocks can reduce the metadata overhead but the constraints on which fine-grained blocks can share tag bits can cause fragmentation. This work proposes Scrabble, a fine-grained cache that can merge multiple non-contiguous fine-grained blocks into a variable size merged block. The length of the shared tag is maximized to reduce the metadata overhead. The space utilization is improved by supporting merged blocks with variable size. The control overhead can be reduced by moving the merged block together from memory to the last level cache. For applications with poor spatial locality, Scrabble cache can achieve a performance improvement of more than 40 percent. Even for applications with good spatial locality, the speedup is still more than 7 percent. In general, for an evaluated set of benchmarks, Scrabble cache achieves an average of 2.41× effective capacity over the baseline cache with the same cache capacity, which leads to a 16.7 percent performance improvement and an 11 percent on-chip energy reduction. As compared to a state-of-the-art fine-grained cache, Scrabble cache achieves a 1.25× effective capacity, a 7.9 percent speedup, and a 5.8 percent on-chip energy reduction.

  • FACCT: FAst, Compact, and Constant-Time Discrete Gaussian Sampler over Integers
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-12
    Raymond K. Zhao; Ron Steinfeld; Amin Sakzad

    The discrete Gaussian sampler is one of the fundamental tools in implementing lattice-based cryptosystems. However, a naive discrete Gaussian sampling implementation suffers from side-channel vulnerabilities, and the existing countermeasures usually introduce significant overhead in either the running speed or the memory consumption. In this paper, we propose a fast, compact, and constant-time implementation of the binary sampling algorithm, originally introduced in the BLISS signature scheme. Our implementation adapts the Rényi divergence and the transcendental function polynomial approximation techniques. The efficiency of our scheme is independent of the standard deviation, and we show evidence that our implementations are either faster or more compact than several existing constant-time samplers. In addition, we show the performance of our implementation techniques applied to and integrated with two existing signature schemes: qTesla and Falcon. On the other hand, the convolution theorems are typically adapted to sample from larger standard deviations, by combining samples with much smaller standard deviations. As an additional contribution, we show better parameters for the convolution theorems.

  • Fast and Efficient Convolutional Accelerator for Edge Computing
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-09-16
    Arash Ardakani; Carlo Condo; Warren J. Gross

    Convolutional neural networks (CNNs) are a vital approach in machine learning. However, their high complexity and energy consumption make them challenging to embed in mobile applications at the edge requiring real-time processes such as smart phones. In order to meet the real-time constraint of edge devices, recently proposed custom hardware CNN accelerators have exploited parallel processing elements (PEs) to increase throughput. However, this straightforward parallelization of PEs and high memory bandwidth require high data movement, leading to large energy consumption. As a result, only a certain number of PEs can be instantiated when designing bandwidth-limited custom accelerators targeting edge devices. While most bandwidth-limited designs claim a peak performance of a few hundred giga operations per second, their average runtime performance is substantially lower than their roofline when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, as a result of low resource utilization and arithmetic intensity. In this work, we propose a zero-activation-skipping convolutional accelerator (ZASCA) that avoids noncontributory multiplications with zero-valued activations. ZASCA employs a dataflow that minimizes the gap between its average and peak performances while maximizing its arithmetic intensity for both sparse and dense representations of activations, targeting the bandwidth-limited edge computing scenario. More precisely, ZASCA achieves a performance efficiency of up to 94 percent over a set of state-of-the-art CNNs for image classification with dense representation where the performance efficiency is the ratio between the average runtime performance and the peak performance. Using its zero-skipping feature, ZASCA can further improve the performance efficiency of the state-of-the-art CNNs by up to 1.9× depending on the sparsity degree of activations. 
    The implementation results in 65-nm TSMC CMOS technology show that, compared to the most energy-efficient accelerator, ZASCA can process convolutions from 5.5× to 17.5× faster, and is between 2.1× and 4.5× more energy efficient while occupying 2.1× less silicon area.
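
The idea of skipping multiplications with zero-valued activations, and the performance-efficiency ratio defined in this abstract, can be illustrated in software. This is only a sketch under stated assumptions: `conv1d_zero_skip` and `performance_efficiency` are hypothetical names, the 1-D sliding-window form is chosen for brevity, and nothing here reflects ZASCA's actual dataflow or hardware.

```python
def conv1d_zero_skip(activations, weights):
    """Valid 1-D sliding-window convolution that skips zero activations.

    Computes out[o] = sum_k activations[o + k] * weights[k] and returns
    (outputs, number_of_multiplications_performed).
    """
    n_out = len(activations) - len(weights) + 1
    outputs = [0] * n_out
    mults = 0
    for i, a in enumerate(activations):
        if a == 0:
            continue  # a zero activation contributes nothing: skip it
        for k, w in enumerate(weights):
            o = i - k  # output position that uses activation i with weight k
            if 0 <= o < n_out:
                outputs[o] += a * w
                mults += 1
    return outputs, mults


def performance_efficiency(average_ops_per_s, peak_ops_per_s):
    """Ratio between average runtime performance and peak performance."""
    return average_ops_per_s / peak_ops_per_s
```

For the input `[1, 0, 2, 0, 0, 3]` with weights `[1, 2]`, a dense computation would need 10 multiplications, while the sketch performs 4 and produces the same outputs.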

  • 2019 Reviewers List
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2020-01-03

    Presents the reviewers who contributed to this publication in 2019.

  • 2019 Index IEEE Transactions on Computers Vol. 68
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2020-01-03

    Presents the 2019 subject/author index for this publication.

  • TAP: Reducing the Energy of Asymmetric Hybrid Last-Level Cache via Thrashing Aware Placement and Migration
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-05-16
    Jing-Yuan Luo; Hsiang-Yun Cheng; Ing-Chao Lin; Da-Wei Chang

    Emerging non-volatile memories (NVMs) have favorable properties, such as low leakage and high density, and have attracted a lot of attention in recent years. Among them, spin-transfer torque magnetoresistive random access memory (STT-MRAM) with SRAM-comparable read speed is a good candidate to build large last-level caches (LLCs). However, STT-MRAM suffers from long write latency and high write energy. To mitigate the impact of asymmetric read/write energy and latency, hybrid cache designs have been proposed to combine the merits of STT-MRAM and SRAM. In such a hybrid SRAM/STT-MRAM LLC, intelligent block placement and migration policies are needed to improve the energy efficiency. Prior studies map write-intensive blocks to SRAM and keep read-intensive blocks in STT-MRAM for reducing the energy consumption of hybrid LLCs. The write-intensive/read-intensive blocks are usually captured by sampling the address (PC) of memory access instructions or adding simple access counters in each cache line. Nevertheless, these prior approaches cannot fully capture the energy-harmful access behavior in STT-MRAM, especially the writes caused by repetitive data transfer between the LLC and upper-level caches. In this paper, we find that conflict misses in L2 often generate thrashing blocks which move back and forth between L2 and LLC. If dirty thrashing blocks that incur extensive writes are placed in STT-MRAM, energy consumption would excessively increase, especially when running memory-bound workloads. Thus, we propose a thrashing aware placement and migration policy (TAP) to tackle the challenge. TAP places dirty thrashing blocks into SRAM and migrates clean thrashing blocks from SRAM to STT-MRAM. Evaluation results show that TAP can provide significant energy savings with minimal performance loss.
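
The placement rule stated at the end of this abstract can be written as a small decision function. This is a hedged sketch: `Region` and `tap_target` are hypothetical names, the thrashing and dirty flags are assumed to be supplied by some detector the abstract does not describe, and leaving non-thrashing blocks where they are is an assumption of this sketch rather than a claim about TAP.

```python
from enum import Enum


class Region(Enum):
    SRAM = "SRAM"
    STT_MRAM = "STT-MRAM"


def tap_target(is_thrashing, is_dirty, current):
    """Return the region a block should occupy under the sketched rule.

    Dirty thrashing blocks go to SRAM (writes to STT-MRAM are costly);
    clean thrashing blocks go to STT-MRAM. Non-thrashing blocks are left
    in place, which is an assumption of this sketch.
    """
    if not is_thrashing:
        return current
    return Region.SRAM if is_dirty else Region.STT_MRAM
```

For example, a clean thrashing block currently in SRAM maps to `Region.STT_MRAM`, which corresponds to the migration case mentioned in the abstract.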

  • Resilience of Randomized RNS Arithmetic with Respect to Side-Channel Leaks of Cryptographic Computation
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-06-25
    Jérôme Courtois; Lokman Abbas-Turki; Jean-Claude Bajard

    In this paper, we want to promote the influence of randomized arithmetic on the leaks during a code execution. When somebody wants to extract some specific information from these leaks, one can observe different emanations of the device, such as power consumption. These leaks mostly come from the variations of the Hamming distances of the successive states of the system. This phenomenon is particularly critical for cryptographic devices. Our work evaluates the resilience of randomized moduli in Residue Number System (RNS) against Correlation Power Analysis (CPA) and Differential Power Analysis (DPA). Our analysis is illustrated through the evaluation of scalar multiplication on an elliptic curve using the Montgomery Powering Ladder (MPL) algorithm, which protects from Simple Power Analysis (SPA). We also propose an evaluation based on the Maximum Likelihood Estimator (MLE), which crosses the information of the whole state vector, instead of analysing only the current state as with CPA or DPA. Furthermore, MLE gives better performance and smooths the results, allowing a better evaluation of the behaviour of the leakage. Our experimental evaluation suggests that the number of observations needed to perform exploitable information leakage is proportional to the number of possible RNS bases.

  • A New Cube Attack on MORUS by Using Division Property
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-07-17
    Tao Ye; Yongzhuang Wei; Willi Meier

    MORUS is an authenticated encryption algorithm and one of the candidates in the CAESAR competition. The security of MORUS has recently received extensive attention. In this paper, a new method for detecting existent terms in the superpoly recovery phase of the cube attack is proposed. More precisely, the upper bound on the degree of the superpoly is first estimated by using the cube attack based on the division property with a Mixed Integer Linear Programming tool. Moreover, the t-degree monomials that may be involved in the superpoly are divided into two groups, where the elements of the first group can be directly determined without using the solver via the embedded property. Compared with previous methods, the time consumed by the solvers in our new method is reduced significantly. In particular, the truth table from only the existent terms can be used to recover the superpoly in the offline phase of the cube attack. Therefore, the time complexity of the cube attack can be further reduced. As an illustrative example, the security of the reduced-step variants of MORUS-640-128 against the cube attack is evaluated by using this new method. It is demonstrated that the key recovery attacks can be applied to 6/7-step MORUS-640-128. Furthermore, some integral distinguishers of 7-step MORUS-640-128/MORUS-1280-256 are achieved.

  • DC-PCM: Mitigating PCM Write Disturbance with Low Performance Overhead by Using Detection Cells
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-07-25
    Jungwhan Choi; Jaemin Jang; Lee-Sup Kim

    As DRAM scaling becomes ever more difficult, Phase Change Memory (PCM) is attracting attention as a new memory or storage class memory. Unfortunately, PCM cell data can be changed by frequently writing '0' to adjacent cells. This phenomenon is called Write Disturbance (WD). To mitigate WD errors with low performance overhead, we propose a Detection Cell PCM (DC-PCM). In the DC-PCM, additional cells called Detection Cells (DC) are allocated to a memory-line to pre-detect WD errors. For pre-detection, we propose schemes that give DCs higher WD-vulnerability than normal cells. However, additional time is needed to verify DCs. To hide the time needed to perform the verifications during a WRITE, DC-PCM enables the local word-lines of DCs to operate independently (Decoupled Word-line), and verifies different directions in parallel (Parallel DC-Verification). After verification, the DC-PCM increases the WD-vulnerability of the DCs, or restores the memory-line data (DC-Correction). In our simulation, DC-PCMs showed performance comparable to a WD-free PCM for all workloads.

  • Fast Coflow Scheduling via Traffic Compression and Stage Pipelining in Datacenter Networks
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-07-30
    Qihua Zhou; Kun Wang; Peng Li; Deze Zeng; Song Guo; Baoliu Ye; Minyi Guo

    Big data analytics in datacenters often involve scheduling of data-parallel jobs. Traditional scheduling techniques based on improving network resource utilization are subject to limited bandwidth in datacenter networks. To alleviate the shortage of bandwidth, some cluster frameworks employ techniques of traffic compression to reduce transmission consumption. However, they tackle scheduling in a coarse-grained manner at task level and do not perform well in terms of flow-level metrics due to high complexity. Fortunately, the abstraction of coflow pioneers a new perspective to facilitate scheduling efficiency. In this paper, we introduce a coflow compression mechanism to minimize the completion time in data-intensive applications. Due to the NP-hardness, we propose a heuristic algorithm called Fastest-Volume-Disposal-First (FVDF) to solve this problem. For online applicability, FVDF supports stage pipelining to accelerate scheduling and exploits recurrent neural networks (RNNs) to predict compression speed. Meanwhile, we build Swallow, an efficient scheduling system that implements our proposed algorithms. It minimizes coflow completion time (CCT) while guaranteeing resource conservation and starvation freedom. The results of both trace-driven simulations and real experiments show the superiority of our algorithm over existing ones. Specifically, Swallow speeds up CCT and job completion time (JCT) by up to 1.47× and 1.66× on average, respectively, over the SEBF in Varys, one of the most efficient coflow scheduling algorithms so far. Moreover, with coflow compression, Swallow reduces data traffic by up to 48.41 percent on average.

  • Integration and Boost of a Read-Modify-Write Module in Phase Change Memory System
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-08
    Hyokeun Lee; Moonsoo Kim; Hyunchul Kim; Hyun Kim; Hyuk-Jae Lee

    Phase-change memory (PCM) is a non-volatile memory device with favorable characteristics such as persistence, byte-addressability, and lower latency when compared to flash memory. However, it comprises memory cells that have limited lifetime and higher access latency than DRAM. The row buffer size of a PCM is preferred to be larger than 128B to fill the latency gap between two memories and to reduce the metadata overhead incurred by wear leveling. As the cache line size in a general-purpose processor is 64B, a read-modify-write (RMW) module is required to be placed between the processor and the PCM, which in turn induces a performance degradation. To reduce such an overhead and enhance the reliability of a device, this paper presents a new RMW architecture. The proposed model introduces a DRAM cache in the RMW module, which minimizes redundant read operations for write operations by pre-fetching the entire transaction unit instead of merely caching the 64B requested data. Furthermore, a typeless merge operation is performed with the proposed cache by gathering multiple commands accessing consecutive addresses, irrespective of whether they are READ or WRITE. Simulation results indicate that the proposed method enhances the speed by 3.2 times and the reliability by 49 percent as compared to the baseline model.

  • Improving Availability of Multicore Real-Time Systems Suffering Both Permanent and Transient Faults
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-14
    Junlong Zhou; Xiaobo Sharon Hu; Yue Ma; Jin Sun; Tongquan Wei; Shiyan Hu

    CMOS scaling has greatly increased concerns for both lifetime reliability due to permanent faults and soft-error reliability due to transient faults. Most existing works only focus on one of the two reliability concerns, but often times techniques used to increase one type of reliability may adversely impact the other type. A few efforts do consider both types of reliability together and use two different metrics to quantify the two types of reliability. However, for many systems, the user's concern is to maximize system availability by improving the mean time to failure (MTTF), regardless of whether the failure is caused by permanent or transient faults. Addressing this concern requires a uniform metric to measure the effect due to both types of faults. This paper introduces a novel analytical expression for calculating the MTTF due to transient faults. Using this new formula and an existing method to evaluate system MTTF, we tackle the problem of maximizing availability for multicore real-time systems with consideration of permanent and transient faults. A framework is proposed to solve the system availability maximization problem. Experimental results on a hardware board and simulation results of synthetic tasks show that our scheme significantly improves system MTTF (and hence availability) compared with existing techniques.

  • OverCome: Coarse-Grained Instruction Commit with Handover Register Renaming
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-20
    Ipoom Jeong; Changmin Lee; Keunsoo Kim; Won Woo Ro

    Coarse-grained instruction commit mechanisms enable the effective size of the instruction window to be as large as possible by committing a group of instructions atomically. Within a group, the reorder buffer (ROB) and physical register file (PRF) entries are conservatively managed, and thus the instruction window can handle more in-flight instructions beyond the hardware limit. However, previous approaches have suffered from high storage requirements for managing group information and unbalanced lifetime of instruction window resources, i.e., the ROB and PRF. In this paper, we propose an OverCome microarchitecture based on a history-based approach to address these problems. First, OverCome retains the conservative allocation of the ROB regardless of the group size limit, thereby providing high scalability. Second, it handles the information of numerous groups with a low storage cost. These two techniques achieve a significant reduction in the pressure on the ROB; thus, a new bottleneck arises: the pressure on the PRF. To address this issue, we propose a novel register renaming technique to reduce the lifetime of physical registers to a large extent, by tightly coupling the early release and lazy allocation schemes. Thus, the proposed design strikes a balance between the ROB and PRF requirements. Detailed evaluation of the proposed techniques on a state-of-the-art superscalar processor shows that our proposals augment the effective size of the instruction window by more than 4×, with a net overhead of less than 3 percent of the core area.

  • A Low-Power, High-Performance Speech Recognition Accelerator
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-08-26
    Reza Yazdani; Jose-Maria Arnau; Antonio González

    Automatic Speech Recognition (ASR) is becoming increasingly ubiquitous, especially in the mobile segment. Fast and accurate ASR comes at a high energy cost, which is not affordable for mobile devices with tiny power budgets. Hardware acceleration reduces the energy consumption of ASR systems while delivering high performance. In this paper, we present an accelerator for large-vocabulary, speaker-independent, continuous speech recognition. It focuses on the Viterbi search algorithm, which represents the main bottleneck in an ASR system. The proposed design consists of innovative techniques to improve the memory subsystem, since memory is the main bottleneck for performance and power in the design of these accelerators. It includes a prefetching scheme tailored to the needs of ASR systems that hides main memory latency for a large fraction of the memory accesses, with negligible impact on area. Additionally, we introduce a novel bandwidth-saving technique that reduces off-chip memory accesses by 20 percent. Finally, we present a power-saving technique that significantly reduces the leakage power of the accelerator's scratchpad memories, providing between 8.5 and 29.2 percent reduction in total power dissipation. Overall, the proposed design outperforms implementations running on the CPU by orders of magnitude, and achieves speedups between 1.7× and 5.9× for different speech decoders over a highly optimized CUDA implementation running on a GeForce GTX 980 GPU, while reducing the energy by 123-454×.

  • VLSI architectures for computing multiplications and inverses in GF(2^m).
    IEEE Trans. Comput. (IF 3.131) Pub Date : 1985-08-01
    C C Wang; T K Truong; H M Shao; L J Deutsch; J K Omura; I S Reed

    Finite field arithmetic logic is central in the implementation of Reed-Solomon coders and in some cryptographic algorithms. There is a need for good multiplication and inversion algorithms that can be easily realized on VLSI chips. Massey and Omura recently developed a new multiplication algorithm for Galois fields based on a normal basis representation. In this paper, a pipeline structure is developed to realize the Massey-Omura multiplier in the finite field GF(2^m). With the simple squaring property of the normal basis representation used together with this multiplier, a pipeline architecture is developed for computing inverse elements in GF(2^m). The designs developed for the Massey-Omura multiplier and the computation of inverse elements are regular, simple, expandable, and therefore, naturally suitable for VLSI implementation.
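
Inversion built from squarings and multiplications can be sketched in software using the identity a^(-1) = a^(2^m - 2) = a^2 · a^4 · … · a^(2^(m-1)). Note the swapped-in representation: the sketch below uses a polynomial basis for GF(2^8) with the modulus x^8 + x^4 + x^3 + x + 1 (an assumption chosen for familiarity), not the normal basis or the Massey-Omura multiplier of the paper. In a normal basis each squaring would reduce to a cyclic shift of the coordinate vector, which is the simple squaring property the abstract mentions. The names `gf_mul` and `gf_inv` are hypothetical.

```python
M = 8            # field is GF(2^M); assumption for this sketch
MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, polynomial basis (assumption)


def gf_mul(a, b):
    """Multiply two GF(2^M) elements stored as bit vectors."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1              # multiply a by x
        if a >> M:
            a ^= MODULUS     # reduce when the degree reaches M
    return result


def gf_inv(a):
    """Invert a nonzero element as the product of a^(2^i), i = 1..M-1."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^m)")
    result = 1
    power = a
    for _ in range(M - 1):
        power = gf_mul(power, power)     # repeated squaring: a^(2^i)
        result = gf_mul(result, power)   # accumulate the product
    return result
```

The loop uses M - 1 squarings and M - 1 multiplications; the exponents 2 + 4 + … + 2^(M-1) sum to 2^M - 2, so the result is the inverse for every nonzero element.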

  • Better Circuits for Binary Polynomial Multiplication.
    IEEE Trans. Comput. (IF 3.131) Pub Date : 2019-10-04
    Magnus Gaudal Find; René Peralta

    We develop a new and simple way to describe Karatsuba-like algorithms for multiplication of polynomials over F_2. We restrict the search of small circuits to a class of circuits we call symmetric bilinear. These are circuits in which AND gates only compute functions of the form (∑_{i∈S} a_i) ⋅ (∑_{i∈S} b_i), where S ⊆ {0, …, n − 1}. These techniques yield improved recurrences for M(kn), the number of gates used in a circuit that multiplies two kn-term polynomials, for k = 4, 5, 6, and 7. We built and verified the circuits for n-term binary polynomial multiplication for values of n of practical interest. Circuits for n up to 100 are posted at http://cs-www.cs.yale.edu/homes/peralta/CircuitStuff/BinPolMult.tar.gz.
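
For orientation, the classic two-way Karatsuba recursion for polynomials over F_2 can be sketched as below. This is the textbook algorithm, not the improved recurrences or circuits of the paper. Polynomials are stored as Python integers (bit i is the coefficient of x^i), addition is XOR, and the names `clmul_schoolbook` and `clmul_karatsuba` as well as the base-case threshold of 4 are assumptions of this sketch.

```python
def clmul_schoolbook(a, b):
    """Carry-less (F_2[x]) product by shift-and-XOR."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def clmul_karatsuba(a, b, n):
    """Multiply two n-term polynomials over F_2 with three half-size products."""
    if n <= 4:
        return clmul_schoolbook(a, b)
    h = n // 2                       # low halves have h terms, high halves n - h
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    low = clmul_karatsuba(a0, b0, h)
    high = clmul_karatsuba(a1, b1, n - h)
    # (a0 + a1)(b0 + b1): products of sums over the same index set
    mid = clmul_karatsuba(a0 ^ a1, b0 ^ b1, n - h)
    # Over F_2, mid + low + high equals the cross term a0*b1 + a1*b0.
    return low ^ ((mid ^ low ^ high) << h) ^ (high << (2 * h))
```

The middle product multiplies a sum of a-coefficients by the sum of the b-coefficients over the same index set, which matches the product-of-sums form of the AND gates described in the abstract.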

  • A VLSI design of a pipeline Reed-Solomon decoder.
    IEEE Trans. Comput. (IF 3.131) Pub Date : 1985-05-01
    H M Shao; T K Truong; L J Deutsch; J H Yuen; I S Reed

    A pipeline structure of a transform decoder similar to a systolic array is developed to decode Reed-Solomon (RS) codes. An important ingredient of this design is a modified Euclidean algorithm for computing the error-locator polynomial. The computation of inverse field elements is completely avoided in this modification of Euclid's algorithm. The new coder is regular and simple, and naturally suitable for VLSI implementation. An example illustrating both the pipeline and systolic array aspects of this decoder structure is given for an RS code.

Contents have been reproduced by permission of the publishers.