Current journal: arXiv - CS - Hardware Architecture
  • ReversiSpec: Reversible Coherence Protocol for Defending Transient Attacks
    arXiv.cs.AR Pub Date : 2020-06-30
    You Wu; Xuehai Qian

    The recent works such as InvisiSpec, SafeSpec, and CleanupSpec, among others, provided promising solutions to defend against speculation-induced (transient) attacks. However, they introduce delay either when a speculative load becomes safe in the redo approach or when it is squashed in the undo approach. We argue that this is due to the lack of fundamental mechanisms for reversing the effects of speculation

  • Firmware Insider: Bluetooth Randomness is Mostly Random
    arXiv.cs.AR Pub Date : 2020-06-30
    Jörn Tillmanns; Jiska Classen; Felix Rohrbach; Matthias Hollick

    Bluetooth chips must include a Random Number Generator (RNG). This RNG is used internally within cryptographic primitives but also exposed to the operating system for chip-external applications. In general, it is a black box with security-critical authentication and encryption mechanisms depending on it. In this paper, we evaluate the quality of RNGs in various Broadcom and Cypress Bluetooth chips

  • SeMPE: Secure Multi Path Execution Architecture for Removing Conditional Branch Side Channels
    arXiv.cs.AR Pub Date : 2020-06-29
    Andrea Mondelli; Paul Gazzillo; Yan Solihin

    One of the most prevalent sources of side-channel vulnerabilities is the secret-dependent behavior of conditional branches (SDBCB). The state-of-the-art solution relies on Constant-Time Expressions, which require high programming effort and incur high performance overheads. In this paper, we propose SeMPE, an approach that relies on architecture support to eliminate SDBCB without requiring much programming
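
The Constant-Time Expressions baseline mentioned above can be illustrated with a branchless select, which replaces a secret-dependent branch by bitwise arithmetic that executes identically for either value of the secret. A minimal Python sketch (the function name and word width are our own, not from the paper):

```python
def ct_select(secret_bit: int, a: int, b: int, width: int = 32) -> int:
    """Branchless select: returns a if secret_bit == 1, else b.

    mask is all-ones when the secret bit is set and all-zeros
    otherwise, so the same operations run for either secret value.
    Illustrative sketch; names are not from the paper.
    """
    ones = (1 << width) - 1
    mask = (-secret_bit) & ones          # 0xFFFF... if bit set, else 0
    return (a & mask) | (b & ~mask & ones)
```

Both outcomes execute the same operation sequence; only the mask value depends on the secret, which is what removes the branch side channel.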

  • An Imitation Learning Approach for Cache Replacement
    arXiv.cs.AR Pub Date : 2020-06-29
    Evan Zheran Liu; Milad Hashemi; Kevin Swersky; Parthasarathy Ranganathan; Junwhan Ahn

    Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead and currently there is no known practical solution. As a result, current replacement
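
The replacement decision described above is classically formalized by Belady's clairvoyant policy, which evicts the line whose next use lies farthest in the future; imitation-learning approaches of this kind use such an oracle as the expert to imitate. A minimal Python sketch of the oracle (illustrative only, not the paper's learned policy):

```python
def belady_evict(cache, trace, t):
    """Belady's clairvoyant victim choice: among cached lines, pick the
    one whose next use lies farthest ahead (or that is never reused).
    Illustrative sketch; assumes the full future trace is known."""
    def next_use(line):
        for i in range(t, len(trace)):
            if trace[i] == line:
                return i
        return float("inf")  # never reused: ideal victim
    return max(cache, key=next_use)

def simulate(trace, capacity):
    """Count hits when the cache is managed by the oracle above."""
    cache, hits = set(), 0
    for t, line in enumerate(trace):
        if line in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.remove(belady_evict(cache, trace, t + 1))
            cache.add(line)
    return hits
```

In practice the future trace is unknown, which is exactly why a learned predictor that imitates this oracle is attractive.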

  • DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator
    arXiv.cs.AR Pub Date : 2020-06-26
    Nandan Kumar Jha; Shreyas Ravishankar; Sparsh Mittal; Arvind Kaushik; Dipan Mandal; Mahesh Chandra

    The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; whereas, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this, specialized dataflow and/or micro-architectural techniques have been proposed. However, due to the longer development cycle and the rapid

  • A Fast Finite Field Multiplier for SIKE
    arXiv.cs.AR Pub Date : 2020-06-25
    Yeonsoo Jeon; Dongsuk Jeon

    Various post-quantum cryptography algorithms have been recently proposed. Supersingular isogeny Diffie-Hellman key exchange (SIKE) is one of the most promising candidates due to its small key size. However, the SIKE scheme requires numerous finite field multiplications for its isogeny computation, and hence suffers from a slow encryption and decryption process. In this paper, we propose a fast finite

  • Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes
    arXiv.cs.AR Pub Date : 2020-06-25
    Pasquale Davide Schiavone; Davide Rossi; Alfio Di Mauro; Frank Gurkaynak; Timothy Saxe; Mao Wang; Ket Chong Yap; Luca Benini

    A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm

  • On the Difficulty of Designing Processor Arrays for Deep Neural Networks
    arXiv.cs.AR Pub Date : 2020-06-24
    Kevin Stehle; Günther Schindler; Holger Fröning

    Systolic arrays are a promising computing concept that is particularly in line with CMOS technology trends and the linear algebra operations found in the processing of artificial neural networks. The recent success of such deep learning methods in a wide set of applications has led to a variety of models which, albeit conceptually similar as they are based on convolutions and fully-connected layers, in detail show

  • Block-matching in FPGA
    arXiv.cs.AR Pub Date : 2020-06-24
    Rafael Pizarro Solar; Michal Pleskowicz

    Block-matching and 3D filtering (BM3D) is an image denoising algorithm that works in two similar steps. Both of these steps need to perform grouping by block-matching. We implement the block-matching in an FPGA, leveraging its ability to perform parallel computations. Our goal is to enable other researchers to use our solution in the future for real-time video denoising in video cameras that use FPGAs
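
The grouping step can be stated concretely: for a reference block, scan a search window and rank candidate blocks by their sum of squared differences; every candidate distance is independent, which is what makes the computation attractive for FPGA parallelization. A pure-Python sketch (names and parameters are our own):

```python
def ssd(img, y0, x0, y1, x1, bs):
    """Sum of squared differences between two bs x bs blocks."""
    return sum((img[y0 + dy][x0 + dx] - img[y1 + dy][x1 + dx]) ** 2
               for dy in range(bs) for dx in range(bs))

def match_blocks(img, ry, rx, bs, search, k):
    """Return the k candidate block positions most similar to the
    reference block at (ry, rx), scanning a +/- search window.
    Each candidate distance is independent, so an FPGA can compute
    them in parallel. Sketch only; not the paper's implementation."""
    h, w = len(img), len(img[0])
    cands = []
    for y in range(max(0, ry - search), min(h - bs, ry + search) + 1):
        for x in range(max(0, rx - search), min(w - bs, rx + search) + 1):
            cands.append((ssd(img, ry, rx, y, x, bs), (y, x)))
    cands.sort()
    return [pos for _, pos in cands[:k]]
```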

  • On Mitigating Random and Adversarial Bit Errors
    arXiv.cs.AR Pub Date : 2020-06-24
    David Stutz; Nandhini Chandramoorthy; Matthias Hein; Bernt Schiele

    The design of deep neural network (DNN) accelerators, i.e., specialized hardware for inference, has received considerable attention in past years due to savings in cost, area, and energy compared to mainstream hardware. We consider the problem of random and adversarial bit errors in quantized DNN weights stored on accelerator memory. Random bit errors arise when optimizing accelerators for energy efficiency

  • Fetch-Directed Instruction Prefetching Revisited
    arXiv.cs.AR Pub Date : 2020-06-24
    Truls Asheim; Rakesh Kumar; Boris Grot

    Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly

  • SCARE: Side Channel Attack on In-Memory Computing for Reverse Engineering
    arXiv.cs.AR Pub Date : 2020-06-23
    Sina Sayyah Ensan; Karthikeyan Nagarajan; Mohammad Nasim Imtiaz Khan; Swaroop Ghosh

    In-memory computing (IMC) architectures provide a much-needed solution to the energy-efficiency barriers posed by von Neumann computing, which stem from the movement of data between the processor and the memory. Functions implemented in such in-memory architectures are often proprietary and constitute confidential Intellectual Property. Our studies indicate that IMCs implemented using RRAM are susceptible to Side Channel

  • Optimizing Placement of Heap Memory Objects in Energy-Constrained Hybrid Memory Systems
    arXiv.cs.AR Pub Date : 2020-06-22
    Taeuk Kim; Safdar Jamil; Joongeon Park; Youngjae Kim

    Main memory (DRAM) significantly impacts the power and energy utilization of the overall server system. Non-Volatile Memory (NVM) devices, such as Phase Change Memory and Spin-Transfer Torque RAM, are suitable candidates for main memory to reduce energy consumption. But unlike DRAM, NVM access latencies are higher, and NVM writes are more energy-intensive than DRAM write operations. Thus

  • Compiler Directed Speculative Intermittent Computation
    arXiv.cs.AR Pub Date : 2020-06-20
    Jongouk Choi; Qingrui Liu; Changhee Jung

    This paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec leverages speculation assuming that power failure is not going to occur and thus holds all committed stores in a store buffer (SB), as if they were speculative

  • fault: A Python Embedded Domain-Specific Language For Metaprogramming Portable Hardware Verification Components
    arXiv.cs.AR Pub Date : 2020-06-20
    Lenny Truong; Steven Herbst; Rajsekhar Setaluri; Makai Mann; Ross Daly; Keyi Zhang; Caleb Donovick; Daniel Stanley; Mark Horowitz; Clark Barrett; Pat Hanrahan

    While hardware generators have drastically improved design productivity, they have introduced new challenges for the task of verification. To effectively cover the functionality of a sophisticated generator, verification engineers require tools that provide the flexibility of metaprogramming. However, flexibility alone is not enough; components must also be portable in order to encourage the proliferation

  • Design of a Near-Ideal Fault-Tolerant Routing Algorithm for Network-on-Chip-Based Multicores
    arXiv.cs.AR Pub Date : 2020-06-19
    Costas Iordanou; Vassos Soteriou; Konstantinos Aisopos

    With relentless CMOS technology downsizing, Networks-on-Chip (NoCs) are inescapably experiencing escalating susceptibility to wearout and reduced reliability. While faults in processors and memories may be masked via redundancy, or mitigated via techniques such as task migration, NoCs are especially vulnerable to hardware faults, as a single link breakdown may cause inter-tile communication to halt

  • ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis
    arXiv.cs.AR Pub Date : 2020-06-16
    Jie Zhang; Myoungsoo Jung

    We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address the performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU-internal DRAM with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating

  • Addressing Variability in Reuse Prediction for Last-Level Caches
    arXiv.cs.AR Pub Date : 2020-06-15
    Priyank Faldu

    The Last-Level Cache (LLC) represents the bulk of a modern CPU's transistor budget and is essential for application performance, as the LLC enables fast access to data in contrast to the much slower main memory. However, applications with large working sets often exhibit streaming and/or thrashing access patterns at the LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks

  • Architecture Support for FPGA Multi-tenancy in the Cloud
    arXiv.cs.AR Pub Date : 2020-06-14
    Joel Mandebi Mbongue; Alex Shuping; Pankaj Bhowmik; Christophe Bobda

    Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud.

  • A Unified Learning Platform for Dynamic Frequency Scaling in Pipelined Processors
    arXiv.cs.AR Pub Date : 2020-06-12
    Arash Fouman Ajirlou; Inna Partin-Vaisband

    A machine learning (ML) design framework is proposed for dynamically adjusting clock frequency based on propagation delay of individual instructions. A Random Forest model is trained to classify propagation delays in real-time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within
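
The scheme can be pictured as a per-instruction classifier feeding a clock-period selector. The hand-written rule below is only a stand-in for the trained Random Forest; the class boundaries and periods are assumptions for illustration, not values from the paper:

```python
# Assumed short/long delay classes and their clock periods (illustrative).
PERIODS_NS = {0: 0.6, 1: 1.0}

def delay_class(opcode: str, a: int, b: int) -> int:
    """Stand-in for the trained classifier: map (operation type,
    operand values) to a propagation-delay class. Here, wide-operand
    multiplies are flagged as the slow class because they exercise
    long carry chains; the real model is a learned Random Forest."""
    if opcode == "mul" and max(a.bit_length(), b.bit_length()) > 16:
        return 1
    return 0

def cycle_time(instructions) -> float:
    """Total time if each instruction's cycle runs at the clock period
    chosen for its predicted delay class."""
    return sum(PERIODS_NS[delay_class(op, a, b)] for op, a, b in instructions)
```

Compared with always clocking at the worst-case period (1.0 ns here), instructions classified as fast complete in less time, which is the source of the speedup the framework targets.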

  • STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators
    arXiv.cs.AR Pub Date : 2020-06-10
    Francisco Muñoz-Martínez; José L. Abellán; Manuel E. Acacio; Tushar Krishna

    The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appealing

  • Unified Characterization Platform for Emerging NVM Technology: Neural Network Application Benchmarking Using off-the-shelf NVM Chips
    arXiv.cs.AR Pub Date : 2020-06-10
    Supriya Chakraborty; Abhishek Gupta; Manan Suri

    In this paper, we present a unified FPGA-based electrical test-bench for characterizing different emerging Non-Volatile Memory (NVM) chips. In particular, we present detailed electrical characterization and benchmarking of multiple commercially available, off-the-shelf NVM chips, viz.: MRAM, FeRAM, CBRAM, and ReRAM. We investigate important NVM parameters such as: (i) current consumption patterns, (ii)

  • A GPU Register File using Static Data Compression
    arXiv.cs.AR Pub Date : 2020-06-10
    Alexandra Angerd; Erik Sintorn; Per Stenström

    GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands, designed to leverage advances in static analysis

  • Improving Dependability of Neuromorphic Computing With Non-Volatile Memory
    arXiv.cs.AR Pub Date : 2020-06-10
    Shihao Song; Anup Das; Nagarajan Kandasamy

    As process technology continues to scale aggressively, circuit aging in a neuromorphic hardware due to negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB) is becoming a critical reliability issue and is expected to proliferate when using non-volatile memory (NVM) for synaptic storage. This is because an NVM requires high voltage and current to access its synaptic

  • FP-Stereo: Hardware-Efficient Stereo Vision for Embedded Applications
    arXiv.cs.AR Pub Date : 2020-06-05
    Jieru Zhao; Tingyuan Liang; Liang Feng; Wenchao Ding; Sharad Sinha; Wei Zhang; Shaojie Shen

    Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source

  • Counting Cards: Exploiting Weight and Variance Distributions for Robust Compute In-Memory
    arXiv.cs.AR Pub Date : 2020-06-04
    Brian Crafton; Samuel Spetalnick; Arijit Raychowdhury

    Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data-intensive applications. It has found widespread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or
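
The crossbar computation at the heart of CIM is an analog matrix-vector product: weights are stored as cell conductances, input voltages drive the rows, and each bitline sums its cell currents by Kirchhoff's current law. An idealized Python sketch that omits the device variance the paper's statistics address:

```python
def crossbar_mvm(G, v):
    """Ideal eNVM crossbar matrix-vector multiply: column (bitline)
    current I_j = sum_i G[i][j] * v[i]. Device-to-device conductance
    variance, which the paper exploits weight statistics to tolerate,
    is omitted here. Illustrative sketch only."""
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * v[i] for i in range(rows)) for j in range(cols)]
```

A whole matrix-vector product thus completes in one analog step, which is why CIM minimizes data transport: the weights never leave the array.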

  • Operation Merging for Hardware Implementations of Fast Polar Decoders
    arXiv.cs.AR Pub Date : 2020-06-03
    Furkan Ercan; Thibaud Tonnellier; Carlo Condo; Warren J. Gross

    Polar codes are a class of linear block codes that provably achieve channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario in 5th-generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works

  • Hardware Security in Spin-Based Computing-In-Memory: Analysis, Exploits, and Mitigation Techniques
    arXiv.cs.AR Pub Date : 2020-06-02
    Xueyan Wang; Jianlei Yang; Yinglin Zhao; Xiaotao Jia; Gang Qu; Weisheng Zhao

    Computing-in-memory (CIM) is proposed to alleviate the processor-memory data-transfer bottleneck in traditional von Neumann architectures, and spintronics-based magnetic memory has shown great promise for implementing the CIM paradigm. Since hardware security has become one of the major concerns in circuit design, this paper, for the first time, investigates spin-based computing-in-memory (SpinCIM)

  • Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins
    arXiv.cs.AR Pub Date : 2020-06-01
    George Papadimitriou; Athanasios Chatzidimitriou; Dimitris Gizopoulos; Vijay Janapa Reddi; Jingwen Leng; Behzad Salami; Osman S. Unsal; Adrian Cristal Kestelman

    Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of

  • How to extend the Single-Processor Paradigm to the Explicitly Many-Processor Approach
    arXiv.cs.AR Pub Date : 2020-05-31
    János Végh

    The computing paradigm invented for processing a small amount of data on a single segregated processor cannot meet the challenges posed by present-day computing demands. The paper proposes a new computing paradigm (extending the old one to use several processors explicitly) and discusses some questions of its possible implementation. Some advantages of the implemented approach, illustrated with the

  • CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism
    arXiv.cs.AR Pub Date : 2020-05-30
    Riya Jain; Niraj Sharma; Farhad Merchant; Sachin Patkar; Rainer Leupers

    Many engineering and scientific applications require high precision arithmetic. IEEE 754-2008 compliant (floating-point) arithmetic is the de facto standard for performing these computations. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit data representation and arithmetic offer several absolute advantages over the floating-point format

  • A Unified Hardware Architecture for Convolutions and Deconvolutions in CNN
    arXiv.cs.AR Pub Date : 2020-05-29
    Lin Bai; Yecheng Lyu; Xinming Huang

    In this paper, a scalable neural network hardware architecture for image segmentation is proposed. By sharing the same computing resources, both convolution and deconvolution operations are handled by the same processing element array. In addition, access to on-chip and off-chip memories is optimized to alleviate the burden introduced by partial sums. As an example, SegNet-Basic has been implemented using

  • Dynamic Merge Point Prediction
    arXiv.cs.AR Pub Date : 2020-05-29
    Stephen Pruett; Yale Patt

    Despite decades of research, conditional branch mispredictions still pose a significant problem for performance. Moreover, limit studies on infinite size predictors show that many of the remaining branches are impossible to predict by current strategies. Our work focuses on mitigating performance loss in the face of impossible to predict branches. This paper presents a dynamic merge point predictor

  • Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques
    arXiv.cs.AR Pub Date : 2020-05-27
    Jeremie S. Kim; Minesh Patel; A. Giray Yaglikci; Hasan Hassan; Roknoddin Azizi; Lois Orosa; Onur Mutlu

    In order to shed more light on how RowHammer affects modern and future devices at the circuit-level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408x DDR3, 652x DDR4, and 520x LPDDR4) from 300 DRAM modules (60x DDR3, 110x DDR4, and 130x LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three

  • CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off
    arXiv.cs.AR Pub Date : 2020-05-26
    Haocong Luo; Taha Shahroodi; Hasan Hassan; Minesh Patel; Abdullah Giray Yaglikci; Lois Orosa; Jisung Park; Onur Mutlu

    DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance

  • Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation
    arXiv.cs.AR Pub Date : 2020-05-22
    Xuan Guo; Robert Mullins

    It has always been difficult to balance the accuracy and performance of instruction set simulators (ISSs). RTL simulators or systems such as gem5 are used to execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU can run large benchmarks to completion in a reasonable time, yet capture few performance metrics and fail to model complex interactions between multiple

  • Stack up your chips: Betting on 3D integration to augment Moore's Law scaling
    arXiv.cs.AR Pub Date : 2020-05-21
    Saurabh Sinha; Xiaoqing Xu; Mudit Bhargava; Shidhartha Das; Brian Cline; Greg Yeric

    3D integration, i.e., stacking of integrated circuit layers using parallel or sequential processing is gaining rapid industry adoption with the slowdown of Moore's law scaling. 3D stacking promises potential gains in performance, power and cost but the actual magnitude of gains varies depending on end-application, technology choices and design. In this talk, we will discuss some key challenges associated

  • Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time Systems
    arXiv.cs.AR Pub Date : 2020-05-21
    Michael Bechtel; Heechul Yun

    In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness of denial-of-service attacks on shared caches. Based on this insight, we introduce new cache DoS attacks, which can be mounted from user space and can cause extreme WCET impacts on cross-core victims (even if the shared cache is partitioned) by taking advantage of the platform's

  • A Way Around UMIP and Descriptor-Table Exiting via TSX-based Side-Channel Attack
    arXiv.cs.AR Pub Date : 2020-05-20
    Mohammad Sina Karvandi; Saleh Khalaj Monfared; Mohammad Sina Kiarostami; Dara Rahmati; Saeid Gorgin

    Nowadays, operating systems include numerous protection mechanisms that prevent or limit user-mode applications from accessing the kernel's internal information. This is regularly carried out by software-based defenses such as Address Space Layout Randomization (ASLR) and Kernel ASLR (KASLR). They play pronounced roles when the security of sandboxed applications such as Web browsers is considered. Armed with

  • The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
    arXiv.cs.AR Pub Date : 2020-05-19
    Nastaran Hajinazar; Pratyush Patel; Minesh Patel; Konstantinos Kanellopoulos; Saugata Ghose; Rachata Ausavarungnirun; Geraldo Francisco de Oliveira Jr.; Jonathan Appavoo; Vivek Seshadri; Onur Mutlu

    Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework

  • In-memory Implementation of On-chip Trainable and Scalable ANN for AI/ML Applications
    arXiv.cs.AR Pub Date : 2020-05-19
    Abhash Kumar; Jawar Singh; Sai Manohar Beeraka; Bharat Gupta

    Traditional von Neumann architecture-based processors become inefficient in terms of energy and throughput as they involve separate processing and memory units, a separation also known as the memory wall. The memory wall problem is further exacerbated when massive parallelism and frequent data movement are required between processing and memory units for real-time implementation of artificial neural network

  • Energy-Efficient On-Chip Networks through Profiled Hybrid Switching
    arXiv.cs.AR Pub Date : 2020-05-18
    Yuan He; Jinyu Jiao; Thang Cao; Masaaki Kondo

    Virtual channel flow control is the de facto choice for modern networks-on-chip to allow better utilization of the link bandwidth through buffering and packet switching, which are also the sources of large power footprint and long per-hop latency. On the other hand, bandwidth can be plentiful for parallel workloads under virtual channel flow control. Thus, dated but simpler flow controls such as circuit

  • A Lightweight Isolation Mechanism for Secure Branch Predictors
    arXiv.cs.AR Pub Date : 2020-05-17
    Lutan Zhao; Peinan Li; Rui Hou; Jiazhen Li; Michael C. Huang; Lixin Zhang; Xuehai Qian; Dan Meng

    Recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and is thus accessible across processes. This leaves attackers with opportunities for malicious training and malicious perception. Instead of flush-based

  • Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference
    arXiv.cs.AR Pub Date : 2020-05-16
    Zhi-Gang Liu; Paul N. Whatmough; Matthew Mattina

    Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs), with very efficient local data movement, well suited to accelerating GEMM, and widely deployed in industry. In this work, we describe two significant improvements
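
A systolic array computes GEMM by streaming skewed operands through the PE grid while partial sums stay put. The cycle-stepped Python sketch below models a scalar output-stationary array; the paper's tensor-PE variant replaces each scalar MAC with a small dense tile, but the dataflow idea is the same:

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array:
    A streams in from the left (row-skewed), B from the top
    (column-skewed), and PE (i, j) accumulates one c[i][j] with a
    multiply-accumulate per cycle. Illustrative sketch only."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k + n + m - 2):          # enough cycles to drain the array
        for i in range(n):
            for j in range(m):
                kk = t - i - j              # skewed arrival time at PE (i, j)
                if 0 <= kk < k:
                    C[i][j] += A[i][kk] * B[kk][j]
    return C
```

Because each operand moves only between neighboring PEs, data movement stays local, which is the property that makes systolic arrays energy-efficient for GEMM.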

  • SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors
    arXiv.cs.AR Pub Date : 2020-05-15
    Jawad Haj-Yahya; Mohammed Alser; Jeremie Kim; A. Giray Yaglıkçı; Nandita Vijaykumar; Efraim Rotem; Onur Mutlu

    There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and

  • ChewBaccaNN: A Flexible 223 TOPS/W BNN Accelerator
    arXiv.cs.AR Pub Date : 2020-05-12
    Renzo Andri; Geethan Karunaratne; Lukas Cavigelli; Luca Benini

    Binary Neural Networks enable smart IoT devices, as they significantly reduce the required memory footprint and computational complexity while retaining high network performance and flexibility. This paper presents ChewBaccaNN, a 0.7 mm$^2$ binary CNN accelerator designed in GlobalFoundries 22 nm technology. By exploiting efficient data re-use, data buffering, latch-based memories, and voltage

  • Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems
    arXiv.cs.AR Pub Date : 2020-05-10
    Alexis Asseman; Nicolas Antoine; Ahmet S. Ozcan

    Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution demonstrated competitive

  • Power and Accuracy of Multi-Layer Perceptrons (MLPs) under Reduced-voltage FPGA BRAMs Operation
    arXiv.cs.AR Pub Date : 2020-05-10
    Behzad Salami; Osman Unsal; Adrian Cristal

    In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e

  • Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
    arXiv.cs.AR Pub Date : 2020-05-10
    Shihao Song; Anup Das; Nagarajan Kandasamy

    Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that its NVM access latency is much higher than its DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic

  • Improving Phase Change Memory Performance with Data Content Aware Access
    arXiv.cs.AR Pub Date : 2020-05-10
    Shihao Song; Anup Das; Onur Mutlu; Nagarajan Kandasamy

    A prominent characteristic of write operation in Phase-Change Memory (PCM) is that its latency and energy are sensitive to the data to be written as well as the content that is overwritten. We observe that overwriting unknown memory content can incur significantly higher latency and energy compared to overwriting known all-zeros or all-ones content. This is because all-zeros or all-ones content is
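
The data-dependent cost can be made concrete: only bits that actually change need a SET (0 to 1) or RESET (1 to 0) pulse, so overwriting known all-zeros or all-ones content requires programming only the differing bits, while overwriting unknown content may require touching every bit. A simple counting sketch (our own cost model, not the paper's exact one):

```python
def pcm_write_ops(old: int, new: int, width: int = 64):
    """Count SET (0 -> 1) and RESET (1 -> 0) programming operations
    needed to overwrite `old` with `new` in a PCM line of `width` bits.
    When the overwritten content is known (e.g. all-zeros), only the
    bits that differ need programming, which is the effect the
    abstract describes. Illustrative cost sketch."""
    mask = (1 << width) - 1
    sets = bin(~old & new & mask).count("1")    # bits going 0 -> 1
    resets = bin(old & ~new & mask).count("1")  # bits going 1 -> 0
    return sets, resets
```

For example, writing 0b1010 over known all-zeros needs only two SETs and no RESETs, whereas over unknown content the controller must budget for the worst case.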

  • Benchmarking High Bandwidth Memory on FPGAs
    arXiv.cs.AR Pub Date : 2020-05-09
    Zeke Wang; Hongjing Huang; Jie Zhang; Gustavo Alonso

    FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual

  • Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
    arXiv.cs.AR Pub Date : 2020-05-07
    Marco Carreras; Gianfranco Deriu; Luigi Raffo; Luca Benini; Paolo Meloni

    Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multilayer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time-series and sequence classification and segmentation, as

  • A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug
    arXiv.cs.AR Pub Date : 2020-05-06
    Yuting Cao; Hao Zheng; Sandip Ray; Jin Yang

    Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on collaborative insights of the architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct

  • Comparing quaternary and binary multipliers
    arXiv.cs.AR Pub Date : 2020-05-06
    Daniel Etiemble

    We compare the implementation of an 8x8-bit multiplier with two different implementations of a 4x4-digit quaternary multiplier. Interfacing this binary multiplier with quaternary-to-binary decoders and binary-to-quaternary encoders leads to a 4x4 multiplier that outperforms the best direct implementation of a 4x4 quaternary multiplier. The far greater complexity of the 1-digit multipliers and 1-digit
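
The decoder/multiplier/encoder arrangement can be sketched functionally: each quaternary digit decodes to 2 bits, the values are multiplied with an ordinary binary multiplier, and the product is re-encoded as quaternary digits. The Python below checks the arithmetic only; the paper's comparison concerns circuit-level cost, which this sketch does not model:

```python
def quat_to_bits(digits):
    """Decode quaternary digits (most significant digit first) to a
    binary integer: each digit maps to 2 bits."""
    value = 0
    for d in digits:
        assert 0 <= d <= 3
        value = (value << 2) | d
    return value

def bits_to_quat(value, ndigits):
    """Encode a binary integer back to quaternary digits (MSD first)."""
    return [(value >> (2 * i)) & 3 for i in reversed(range(ndigits))]

def quat_mul(a_digits, b_digits):
    """4-digit x 4-digit quaternary multiply routed through a plain
    binary multiplier, mirroring the decoder/multiplier/encoder
    arrangement the abstract describes. Functional sketch only."""
    product = quat_to_bits(a_digits) * quat_to_bits(b_digits)
    return bits_to_quat(product, len(a_digits) + len(b_digits))
```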

  • Computing-in-Memory for Performance and Energy Efficient Homomorphic Encryption
    arXiv.cs.AR Pub Date : 2020-05-05
    Dayane Reis; Jonathan Takeshita; Taeho Jung; Michael Niemier; Xiaobo Sharon Hu

    Homomorphic encryption (HE) allows direct computations on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In this regard, the enormous size of ciphertexts involved in HE computations degrades computational efficiency. Near-memory Processing (NMP) and Computing-in-memory (CiM) - paradigms where computation is done within the memory boundaries

  • LiteX: an open-source SoC builder and library based on Migen Python DSL
    arXiv.cs.AR Pub Date : 2020-05-05
    Florent Kermarrec; Sébastien Bourdeauducq; Jean-Christophe Le Lann; Hannah Badier

    LiteX is a GitHub-hosted SoC builder, IP library, and set of utilities that can be used to create SoCs and full FPGA designs. Besides being open-source and BSD licensed, its originality lies in the fact that its IP components are entirely described using the Migen Python internal DSL, which simplifies its design in depth. LiteX already supports various softcore CPUs and essential peripherals, with no dependencies

  • Best implementations of quaternary adders
    arXiv.cs.AR Pub Date : 2020-05-05
    Daniel Etiemble

    The implementation of a quaternary 1-digit adder composed of a 2-bit binary adder, quaternary to binary decoders and binary to quaternary encoders is compared with several recent implementations of quaternary adders. This simple implementation outperforms all other implementations using only one power supply. It is equivalent to the best other implementation using three power supplies. The best quaternary

  • Testing Compilers for Programmable Switches Through Switch Hardware Simulation
    arXiv.cs.AR Pub Date : 2020-05-05
    Michael D. Wong; Aatish Varma; Anirudh Sivaraman

    Programmable switches have emerged as powerful and flexible alternatives to fixed function forwarding devices. But because of the unique hardware constraints of network switches, the design and implementation of compilers targeting these devices is tedious and error prone. Despite the important role that compilers play in software development, there is a dearth of tools for testing compilers within

  • Lupulus: A Flexible Hardware Accelerator for Neural Networks
    arXiv.cs.AR Pub Date : 2020-05-03
    Andreas Toftegaard Kristensen; Robert Giterman; Alexios Balatsoukas-Stimming; Andreas Burg

    Neural networks have become indispensable for a wide range of applications, but they suffer from high computational and memory requirements, requiring optimizations from the algorithmic description of the network down to the hardware implementation. Moreover, the high rate of innovation in machine learning makes it important that hardware implementations provide a high level of programmability to support

  • TIMELY: Pushing Data Movements and Interfaces in PIM Accelerators Towards Local and in Time Domain
    arXiv.cs.AR Pub Date : 2020-05-03
    Weitao Li; Pengfei Xu; Yang Zhao; Haitong Li; Yuan Xie; Yingyan Lin

    Resistive-random-access-memory (ReRAM) based processing-in-memory (R$^2$PIM) accelerators show promise in bridging the gap between Internet of Things devices' constrained resources and Convolutional/Deep Neural Networks' (CNNs/DNNs') prohibitive energy cost. Specifically, R$^2$PIM accelerators enhance energy efficiency by eliminating the cost of weight movements and improving the computational density

Contents have been reproduced by permission of the publishers.