• arXiv.cs.AR Pub Date : 2020-06-30
You Wu; Xuehai Qian

Recent works such as InvisiSpec, SafeSpec, and Cleanup-Spec, among others, provided promising solutions to defend against speculation-induced (transient) attacks. However, they introduce delay either when a speculative load becomes safe in the redo approach or when it is squashed in the undo approach. We argue that it is due to the lack of fundamental mechanisms for reversing the effects of speculation

Updated: 2020-07-01
• arXiv.cs.AR Pub Date : 2020-06-30
Jörn Tillmanns; Jiska Classen; Felix Rohrbach; Matthias Hollick

Bluetooth chips must include a Random Number Generator (RNG). This RNG is used internally within cryptographic primitives but also exposed to the operating system for chip-external applications. In general, it is a black box with security-critical authentication and encryption mechanisms depending on it. In this paper, we evaluate the quality of RNGs in various Broadcom and Cypress Bluetooth chips

Updated: 2020-07-01
• arXiv.cs.AR Pub Date : 2020-06-29
Andrea Mondelli; Paul Gazzillo; Yan Solihin

One of the most prevalent sources of side channel vulnerabilities is the secret-dependent behavior of conditional branches (SDBCB). The state-of-the-art solution relies on Constant-Time Expressions, which require high programming effort and incur high performance overheads. In this paper, we propose SeMPE, an approach that relies on architecture support to eliminate SDBCB without requiring much programming

Updated: 2020-07-01
• arXiv.cs.AR Pub Date : 2020-06-29
Evan Zheran Liu; Milad Hashemi; Kevin Swersky; Parthasarathy Ranganathan; Junwhan Ahn

Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead and currently there is no known practical solution. As a result, current replacement

Updated: 2020-06-30
• arXiv.cs.AR Pub Date : 2020-06-26
Nandan Kumar Jha; Shreyas Ravishankar; Sparsh Mittal; Arvind Kaushik; Dipan Mandal; Mahesh Chandra

The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; whereas, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this, specialized dataflow and/or micro-architectural techniques have been proposed. However, due to the longer development cycle and the rapid

Updated: 2020-06-29
• arXiv.cs.AR Pub Date : 2020-06-25
Yeonsoo Jeon; Dongsuk Jeon

Various post-quantum cryptography algorithms have been recently proposed. Supersingular isogeny Diffie-Hellman key exchange (SIKE) is one of the most promising candidates due to its small key size. However, the SIKE scheme requires numerous finite field multiplications for its isogeny computation, and hence suffers from slow encryption and decryption processes. In this paper, we propose a fast finite

Updated: 2020-06-26
• arXiv.cs.AR Pub Date : 2020-06-25
Pasquale Davide Schiavone; Davide Rossi; Alfio Di Mauro; Frank Gurkaynak; Timothy Saxe; Mao Wang; Ket Chong Yap; Luca Benini

A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm

Updated: 2020-06-26
• arXiv.cs.AR Pub Date : 2020-06-24
Kevin Stehle; Günther Schindler; Holger Fröning

Systolic arrays are a promising computing concept that is particularly in line with CMOS technology trends and the linear algebra operations found in the processing of artificial neural networks. The recent success of such deep learning methods in a wide set of applications has led to a variety of models which, although conceptually similar in being based on convolutions and fully-connected layers, in detail show

Updated: 2020-06-26
• arXiv.cs.AR Pub Date : 2020-06-24
Rafael Pizarro Solar; Michal Pleskowicz

Block-matching and 3D filtering (BM3D) is an image denoising algorithm that works in two similar steps. Both of these steps need to perform grouping by block-matching. We implement the block-matching in an FPGA, leveraging its ability to perform parallel computations. Our goal is to enable other researchers to use our solution in the future for real-time video denoising in video cameras that use FPGAs

Updated: 2020-06-26
• arXiv.cs.AR Pub Date : 2020-06-24
David Stutz; Nandhini Chandramoorthy; Matthias Hein; Bernt Schiele

The design of deep neural network (DNN) accelerators, i.e., specialized hardware for inference, has received considerable attention in past years due to saved cost, area, and energy compared to mainstream hardware. We consider the problem of random and adversarial bit errors in quantized DNN weights stored on accelerator memory. Random bit errors arise when optimizing accelerators for energy efficiency

Updated: 2020-06-26
• arXiv.cs.AR Pub Date : 2020-06-24
Truls Asheim; Rakesh Kumar; Boris Grot

Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly

Updated: 2020-06-25
• arXiv.cs.AR Pub Date : 2020-06-23
Sina Sayyah Ensan; Karthikeyan Nagarajan; Mohammad Nasim Imtia Khan; Swaroop Ghosh

In-memory computing architectures provide a much needed solution to energy-efficiency barriers posed by Von-Neumann computing due to the movement of data between the processor and the memory. Functions implemented in such in-memory architectures are often proprietary and constitute confidential Intellectual Property. Our studies indicate that IMCs implemented using RRAM are susceptible to Side Channel

Updated: 2020-06-24
• arXiv.cs.AR Pub Date : 2020-06-22
Taeuk Kim; Safdar Jamil; Joongeon Park; Youngjae Kim

Main memory (DRAM) significantly impacts the power and energy utilization of the overall server system. Non-Volatile Memory (NVM) devices, such as Phase Change Memory and Spin-Transfer Torque RAM, are suitable candidates for main memory to reduce energy consumption. However, NVM access latencies are higher than those of DRAM, and NVM writes are more energy sensitive than DRAM write operations. Thus

Updated: 2020-06-23
• arXiv.cs.AR Pub Date : 2020-06-20
Jongouk Choi; Qingrui Liu; Changhee Jung

This paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec leverages speculation assuming that power failure is not going to occur and thus holds all committed stores in a store buffer (SB), as if they were speculative

Updated: 2020-06-23
• arXiv.cs.AR Pub Date : 2020-06-20
Lenny Truong; Steven Herbst; Rajsekhar Setaluri; Makai Mann; Ross Daly; Keyi Zhang; Caleb Donovick; Daniel Stanley; Mark Horowitz; Clark Barrett; Pat Hanrahan

While hardware generators have drastically improved design productivity, they have introduced new challenges for the task of verification. To effectively cover the functionality of a sophisticated generator, verification engineers require tools that provide the flexibility of metaprogramming. However, flexibility alone is not enough; components must also be portable in order to encourage the proliferation

Updated: 2020-06-23
• arXiv.cs.AR Pub Date : 2020-06-19
Costas Iordanou; Vassos Soteriou; Konstantinos Aisopos

With relentless CMOS technology downsizing, Networks-on-Chips (NoCs) are inescapably experiencing escalating susceptibility to wearout and reduced reliability. While faults in processors and memories may be masked via redundancy, or mitigated via techniques such as task migration, NoCs are especially vulnerable to hardware faults as a single link breakdown may cause inter-tile communication to halt

Updated: 2020-06-22
• arXiv.cs.AR Pub Date : 2020-06-16
Jie Zhang; Myoungsoo Jung

We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU internal DRAMs with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating

Updated: 2020-06-16
• arXiv.cs.AR Pub Date : 2020-06-15
Priyank Faldu

Last-Level Cache (LLC) represents the bulk of a modern CPU processor's transistor budget and is essential for application performance as LLC enables fast access to data in contrast to much slower main memory. However, applications with large working set size often exhibit streaming and/or thrashing access patterns at LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks

Updated: 2020-06-15
• arXiv.cs.AR Pub Date : 2020-06-14
Joel Mandebi Mbongue; Alex Shuping; Pankaj Bhowmik; Christophe Bobda

Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud.

Updated: 2020-06-14
• arXiv.cs.AR Pub Date : 2020-06-12
Arash Fouman Ajirlou; Inna Partin-Vaisband

A machine learning (ML) design framework is proposed for dynamically adjusting clock frequency based on propagation delay of individual instructions. A Random Forest model is trained to classify propagation delays in real-time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within

Updated: 2020-06-12
• arXiv.cs.AR Pub Date : 2020-06-10
Francisco Muñoz-Martínez; José L. Abellán; Manuel E. Acacio; Tushar Krishna

The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appealing

Updated: 2020-06-10
• arXiv.cs.AR Pub Date : 2020-06-10
Supriya Chakraborty; Abhishek Gupta; Manan Suri

In this paper, we present a unified FPGA based electrical test-bench for characterizing different emerging NonVolatile Memory (NVM) chips. In particular, we present detailed electrical characterization and benchmarking of multiple commercially available, off-the-shelf, NVM chips viz.: MRAM, FeRAM, CBRAM, and ReRAM. We investigate important NVM parameters such as: (i) current consumption patterns, (ii)

Updated: 2020-06-10
• arXiv.cs.AR Pub Date : 2020-06-10
Alexandra Angerd; Erik Sintorn; Per Stenström

GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands designed to leverage advances in static analysis

Updated: 2020-06-10
• arXiv.cs.AR Pub Date : 2020-06-10
Shihao Song; Anup Das; Nagarajan Kandasamy

As process technology continues to scale aggressively, circuit aging in neuromorphic hardware due to negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB) is becoming a critical reliability issue and is expected to proliferate when using non-volatile memory (NVM) for synaptic storage. This is because an NVM requires high voltage and current to access its synaptic

Updated: 2020-06-10
• arXiv.cs.AR Pub Date : 2020-06-05
Jieru Zhao; Tingyuan Liang; Liang Feng; Wenchao Ding; Sharad Sinha; Wei Zhang; Shaojie Shen

Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source

Updated: 2020-06-05
• arXiv.cs.AR Pub Date : 2020-06-04
Brian Crafton; Samuel Spetalnick; Arijit Raychowdhury

Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data intensive applications. This has found widespread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or

Updated: 2020-06-04
• arXiv.cs.AR Pub Date : 2020-06-03
Furkan Ercan; Thibaud Tonnellier; Carlo Condo; Warren J. Gross

Polar codes are a class of linear block codes that provably achieves channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario for 5th generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works

Updated: 2020-06-03
• arXiv.cs.AR Pub Date : 2020-06-02
Xueyan Wang; Jianlei Yang; Yinglin Zhao; Xiaotao Jia; Gang Qu; Weisheng Zhao

Computing-in-memory (CIM) is proposed to alleviate the processor-memory data transfer bottleneck in traditional Von-Neumann architectures, and spintronics-based magnetic memory has demonstrated many advantages in implementing the CIM paradigm. Since hardware security has become one of the major concerns in circuit designs, this paper, for the first time, investigates spin-based computing-in-memory (SpinCIM)

Updated: 2020-06-02
• arXiv.cs.AR Pub Date : 2020-06-01
George Papadimitriou; Athanasios Chatzidimitriou; Dimitris Gizopoulos; Vijay Janapa Reddi; Jingwen Leng; Behzad Salami; Osman S. Unsal; Adrian Cristal Kestelman

Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of

Updated: 2020-06-01
• arXiv.cs.AR Pub Date : 2020-05-31
János Végh

The computing paradigm invented for processing a small amount of data on a single segregated processor cannot meet the challenges set by the present-day computing demands. The paper proposes a new computing paradigm (extending the old one to use several processors explicitly) and discusses some questions of its possible implementation. Some advantages of the implemented approach, illustrated with the

Updated: 2020-05-31
• arXiv.cs.AR Pub Date : 2020-05-30
Riya Jain; Niraj Sharma; Farhad Merchant; Sachin Patkar; Rainer Leupers

Many engineering and scientific applications require high precision arithmetic. IEEE 754-2008 compliant (floating-point) arithmetic is the de facto standard for performing these computations. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit data representation and arithmetic offer several absolute advantages over the floating-point format

Updated: 2020-05-30
• arXiv.cs.AR Pub Date : 2020-05-29
Lin Bai; Yecheng Lyu; Xinming Huang

In this paper, a scalable neural network hardware architecture for image segmentation is proposed. By sharing the same computing resources, both convolution and deconvolution operations are handled by the same process element array. In addition, access to on-chip and off-chip memories is optimized to alleviate the burden introduced by partial sum. As an example, SegNet-Basic has been implemented using

Updated: 2020-05-29
• arXiv.cs.AR Pub Date : 2020-05-29
Stephen Pruett; Yale Patt

Despite decades of research, conditional branch mispredictions still pose a significant problem for performance. Moreover, limit studies on infinite size predictors show that many of the remaining branches are impossible to predict by current strategies. Our work focuses on mitigating performance loss in the face of impossible to predict branches. This paper presents a dynamic merge point predictor

Updated: 2020-05-29
• arXiv.cs.AR Pub Date : 2020-05-27
Jeremie S. Kim; Minesh Patel; A. Giray Yaglikci; Hasan Hassan; Roknoddin Azizi; Lois Orosa; Onur Mutlu

In order to shed more light on how RowHammer affects modern and future devices at the circuit-level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408x DDR3, 652x DDR4, and 520x LPDDR4) from 300 DRAM modules (60x DDR3, 110x DDR4, and 130x LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three

Updated: 2020-05-27
• arXiv.cs.AR Pub Date : 2020-05-26
Haocong Luo; Taha Shahroodi; Hasan Hassan; Minesh Patel; Abdullah Giray Yaglikci; Lois Orosa; Jisung Park; Onur Mutlu

DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance

Updated: 2020-05-26
• arXiv.cs.AR Pub Date : 2020-05-22
Xuan Guo; Robert Mullins

It has always been difficult to balance the accuracy and performance of ISSs. RTL simulators or systems such as gem5 are used to execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU can run large benchmarks to completion in a reasonable time yet capture few performance metrics and fail to model complex interactions between multiple

Updated: 2020-05-22
• arXiv.cs.AR Pub Date : 2020-05-21
Saurabh Sinha; Xiaoqing Xu; Mudit Bhargava; Shidhartha Das; Brian Cline; Greg Yeric

3D integration, i.e., stacking of integrated circuit layers using parallel or sequential processing is gaining rapid industry adoption with the slowdown of Moore's law scaling. 3D stacking promises potential gains in performance, power and cost but the actual magnitude of gains varies depending on end-application, technology choices and design. In this talk, we will discuss some key challenges associated

Updated: 2020-05-21
• arXiv.cs.AR Pub Date : 2020-05-21
Michael Bechtel; Heechul Yun

In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness of performing denial-of-service attacks on shared cache. Based on this insight, we introduce new cache DoS attacks, which can be mounted from the user-space and can cause extreme WCET impacts to cross-core victims, even if the shared cache is partitioned, by taking advantage of the platform's

Updated: 2020-05-21
• arXiv.cs.AR Pub Date : 2020-05-20
Mohammad Sina Karvandi; Saleh Khalaj Monfared; Mohammad Sina Kiarostami; Dara Rahmati; Saeid Gorgin

Nowadays, in operating systems, numerous protection mechanisms prevent or limit user-mode applications from accessing the kernel's internal information. This is regularly carried out by software-based defenses such as Address Space Layout Randomization (ASLR) and Kernel ASLR (KASLR). They play pronounced roles when the security of sandboxed applications such as web browsers is considered. Armed with

Updated: 2020-05-20
• arXiv.cs.AR Pub Date : 2020-05-19
Nastaran Hajinazar; Pratyush Patel; Minesh Patel; Konstantinos Kanellopoulos; Saugata Ghose; Rachata Ausavarungnirun; Geraldo Francisco de Oliveira Jr.; Jonathan Appavoo; Vivek Seshadri; Onur Mutlu

Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework

Updated: 2020-05-19
• arXiv.cs.AR Pub Date : 2020-05-19
Abhash Kumar; Jawar Singh; Sai Manohar Beeraka; Bharat Gupta

Traditional von Neumann architecture based processors become inefficient in terms of energy and throughput as they involve separate processing and memory units, a problem also known as the memory wall. The memory wall problem is further exacerbated when massive parallelism and frequent data movement are required between processing and memory units for real-time implementation of artificial neural network

Updated: 2020-05-19
• arXiv.cs.AR Pub Date : 2020-05-18
Yuan He; Jinyu Jiao; Thang Cao; Masaaki Kondo

Virtual channel flow control is the de facto choice for modern networks-on-chip to allow better utilization of the link bandwidth through buffering and packet switching, which are also the sources of large power footprint and long per-hop latency. On the other hand, bandwidth can be plentiful for parallel workloads under virtual channel flow control. Thus, dated but simpler flow controls such as circuit

Updated: 2020-05-18
• arXiv.cs.AR Pub Date : 2020-05-17
Lutan Zhao; Peinan Li; Rui Hou; Jiazhen Li; Michael C. Huang; Lixin Zhang; Xuehai Qian; Dan Meng

Recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and thus accessible to each other. This leaves attackers with opportunities for malicious training and malicious perception. Instead of flush-based

Updated: 2020-05-17
• arXiv.cs.AR Pub Date : 2020-05-16
Zhi-Gang Liu; Paul N. Whatmough; Matthew Mattina

Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs), with very efficient local data movement, well suited to accelerating GEMM, and widely deployed in industry. In this work, we describe two significant improvements

Updated: 2020-05-16
• arXiv.cs.AR Pub Date : 2020-05-15
Jawad Haj-Yahya; Mohammed Alser; Jeremie Kim; A. Giray Yaglıkçı; Nandita Vijaykumar; Efraim Rotem; Onur Mutlu

There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and

Updated: 2020-05-15
• arXiv.cs.AR Pub Date : 2020-05-12
Renzo Andri; Geethan Karunaratne; Lukas Cavigelli; Luca Benini

Binary Neural Networks enable smart IoT devices, as they significantly reduce the required memory footprint and computational complexity while retaining a high network performance and flexibility. This paper presents ChewBaccaNN, a 0.7 mm² sized binary CNN accelerator designed in GlobalFoundries 22 nm technology. By exploiting efficient data re-use, data buffering, latch-based memories, and voltage

Updated: 2020-05-12
• arXiv.cs.AR Pub Date : 2020-05-10
Alexis Asseman; Nicolas Antoine; Ahmet S. Ozcan

Reinforcement learning augmented by the representational power of deep neural networks has shown promising results on high-dimensional problems, such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution demonstrated competitive

Updated: 2020-05-10
• arXiv.cs.AR Pub Date : 2020-05-10

In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e

Updated: 2020-05-10
• arXiv.cs.AR Pub Date : 2020-05-10
Shihao Song; Anup Das; Nagarajan Kandasamy

Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that it has NVM access latency much higher than DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic

Updated: 2020-05-10
• arXiv.cs.AR Pub Date : 2020-05-10
Shihao Song; Anup Das; Onur Mutlu; Nagarajan Kandasamy

A prominent characteristic of the write operation in Phase-Change Memory (PCM) is that its latency and energy are sensitive to the data to be written as well as the content that is overwritten. We observe that overwriting unknown memory content can incur significantly higher latency and energy compared to overwriting known all-zeros or all-ones content. This is because all-zeros or all-ones content is

Updated: 2020-05-10
• arXiv.cs.AR Pub Date : 2020-05-09
Zeke Wang; Hongjing Huang; Jie Zhang; Gustavo Alonso

FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual

Updated: 2020-05-09
• arXiv.cs.AR Pub Date : 2020-05-07
Marco Carreras; Gianfranco Deriu; Luigi Raffo; Luca Benini; Paolo Meloni

Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multilayer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time-series and sequence classification and segmentation, as

Updated: 2020-05-07
• arXiv.cs.AR Pub Date : 2020-05-06
Yuting Cao; Hao Zheng; Sandip Ray; Jin Yang

Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on collaborative insights of the architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct

Updated: 2020-05-06
• arXiv.cs.AR Pub Date : 2020-05-06
Daniel Etiemble

We compare the implementation of an 8x8-bit multiplier with two different implementations of a 4x4 quaternary digit multiplier. Interfacing this binary multiplier with quaternary to binary decoders and binary to quaternary encoders leads to a 4x4 multiplier that outperforms the best direct implementation of a 4x4 quaternary multiplier. The far greater complexity of the 1-digit multipliers and 1-digit

Updated: 2020-05-06
• arXiv.cs.AR Pub Date : 2020-05-05
Dayane Reis; Jonathan Takeshita; Taeho Jung; Michael Niemier; Xiaobo Sharon Hu

Homomorphic encryption (HE) allows direct computations on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In this regard, the enormous size of ciphertexts involved in HE computations degrades computational efficiency. Near-memory Processing (NMP) and Computing-in-memory (CiM) - paradigms where computation is done within the memory boundaries

Updated: 2020-05-05
• arXiv.cs.AR Pub Date : 2020-05-05
Florent Kermarrec; Sébastien Bourdeauducq; Jean-Christophe Le Lann; Hannah Badier

LiteX is a GitHub-hosted SoC builder / IP library and utilities that can be used to create SoCs and full FPGA designs. Besides being open-source and BSD licensed, its originality lies in the fact that its IP components are entirely described using the Migen Python internal DSL, which simplifies its design in depth. LiteX already supports various softcore CPUs and essential peripherals, with no dependencies

Updated: 2020-05-05
• arXiv.cs.AR Pub Date : 2020-05-05
Daniel Etiemble

The implementation of a quaternary 1-digit adder composed of a 2-bit binary adder, quaternary to binary decoders and binary to quaternary encoders is compared with several recent implementations of quaternary adders. This simple implementation outperforms all other implementations using only one power supply. It is equivalent to the best other implementation using three power supplies. The best quaternary

Updated: 2020-05-05
• arXiv.cs.AR Pub Date : 2020-05-05
Michael D. Wong; Aatish Varma; Anirudh Sivaraman

Programmable switches have emerged as powerful and flexible alternatives to fixed function forwarding devices. But because of the unique hardware constraints of network switches, the design and implementation of compilers targeting these devices is tedious and error prone. Despite the important role that compilers play in software development, there is a dearth of tools for testing compilers within

Updated: 2020-05-05
• arXiv.cs.AR Pub Date : 2020-05-03
Andreas Toftegaard Kristensen; Robert Giterman; Alexios Balatsoukas-Stimming; Andreas Burg

Neural networks have become indispensable for a wide range of applications, but they suffer from high computational and memory requirements, requiring optimizations from the algorithmic description of the network to the hardware implementation. Moreover, the high rate of innovation in machine learning makes it important that hardware implementations provide a high level of programmability to support

Updated: 2020-05-03
• arXiv.cs.AR Pub Date : 2020-05-03
Weitao Li; Pengfei Xu; Yang Zhao; Haitong Li; Yuan Xie; Yingyan Lin

Resistive-random-access-memory (ReRAM) based processing-in-memory (R²PIM) accelerators show promise in bridging the gap between Internet of Things devices' constrained resources and Convolutional/Deep Neural Networks' (CNNs/DNNs') prohibitive energy cost. Specifically, R²PIM accelerators enhance energy efficiency by eliminating the cost of weight movements and improving the computational density

Updated: 2020-05-03
Contents have been reproduced by permission of the publishers.
