• arXiv.cs.AR Pub Date : 2020-09-22
Franyell Silfa; Jose Maria Arnau; Antonio Gonzalez

Recurrent Neural Network (RNN) inference exhibits low hardware utilization due to the strict data dependencies across time-steps. Batching multiple requests can increase throughput. However, RNN batching requires a large amount of padding since the batched input sequences may largely differ in length. Schemes that dynamically update the batch every few time-steps avoid padding. However, they require

Updated: 2020-09-23
• arXiv.cs.AR Pub Date : 2020-09-22
Alberto Parravicini; Francesco Sgherzi; Marco D. Santambrogio

Sparse matrix-vector multiplication is often employed in many data-analytic workloads in which low latency and high throughput are more valuable than exact numerical convergence. FPGAs provide quick execution times while offering precise control over the accuracy of the results thanks to reduced-precision fixed-point arithmetic. In this work, we propose a novel streaming implementation of Coordinate

Updated: 2020-09-23
• arXiv.cs.AR Pub Date : 2020-09-21
Mahdi Taheri; Hamed Zandevakili; Ali Mahani

This paper presents a new reconfigurable sequencing method for handling differences in the read lengths of the input data, a crucial drawback that affects DNA sequencing methods.

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-21
Kamil Khan; Sudeep Pasricha; Ryan Gary Kim

Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become the bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-20
Altuğ Süral; E. Göksu Sezer; Ertuğrul Kolağasıoğlu; Veerle Derudder; Kaoutar Bertrand

This work presents an efficient ASIC implementation of a successive cancellation (SC) decoder for polar codes. SC is a low-complexity depth-first search decoding algorithm, favorable for beyond-5G applications that require extremely high throughput and low power. The ASIC implementation of SC in this work exploits many techniques, including pipelining and unrolling, to achieve Tb/s data throughput without

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Jawad Haj-Yahya; Mohammed Alser; Jeremie S. Kim; Lois Orosa; Efraim Rotem; Avi Mendelson; Anupam Chattopadhyay; Onur Mutlu

Modern client processors typically use one of three commonly used power delivery networks (PDNs): 1) motherboard voltage regulators (MBVR), 2) integrated voltage regulators (IVR), and 3) low dropout voltage regulators (LDO). We observe that the energy-efficiency of each of these PDNs varies with the processor power (e.g., thermal design power (TDP) and dynamic power-state) and workload characteristics

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Sung-Jin Kim; Zachary Myers; Steven Herbst; ByongChan Lim; Mark Horowitz

Using digital standard cells and digital place-and-route (PnR) tools, we created a 20 GS/s, 8-bit analog-to-digital converter (ADC) for use in high-speed serial link applications with an ENOB of 5.6, a DNL of 0.96 LSB, and an INL of 2.39 LSB, which dissipated 175 mW in 0.102 mm2 in a 16nm technology. The design is entirely described by HDL so that it can be ported to other processes with minimal effort

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Akash Sridhar; Nursultan Kabylkas; Jose Renau

Branch instructions dependent on hard-to-predict load data are the leading branch misprediction contributors. Current state-of-the-art history-based branch predictors have poor prediction accuracy for these branches. Prior research backs this observation by showing that increasing the size of a 256-KBit history-based branch predictor to its 1-MBit variant has just a 10% reduction in branch mispredictions

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Vidya A. Chhabria; Vipul Ahuja; Ashwath Prabhu; Nikhil Patil; Palkesh Jain; Sachin S. Sapatnekar

Computationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-20
Simeon Babatunde; Nirnay Jain; Vishwas Powar

The bulk of the existing Wireless Sensor Network (WSN) nodes are battery powered, stationary, and mostly designed for short-distance communication, with little to no consideration for constrained devices that operate solely on harvested energy. On many occasions, batteries and beefy super-capacitors are used to power these WSNs, but these systems are prone to service-life degradation and current leakages

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-19
Adarsha Balaji; Shihao Song; Anup Das; Jeffrey Krichmar; Nikil Dutt; James Shackleford; Nagarajan Kandasamy; Francky Catthoor

With growing model complexity, mapping Spiking Neural Network (SNN)-based applications to tile-based neuromorphic hardware is becoming increasingly challenging. This is because the synaptic storage resources on a tile, viz. a crossbar, can accommodate only a fixed number of pre-synaptic connections per post-synaptic neuron. For complex SNN models that have many pre-synaptic connections per neuron,

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Gururaj Saileshwar; Moinuddin Qureshi

Shared caches in modern processors are vulnerable to conflict-based attacks, whereby an attacker monitors the access pattern of a victim by engineering cache-set conflicts. Recent mitigations propose a randomized mapping of addresses to cache locations to obfuscate addresses that can conflict with a target address. Unfortunately, such designs continue to select eviction candidates from a small subset

Updated: 2020-09-22
• arXiv.cs.AR Pub Date : 2020-09-18
Yu-Sheng Lin; Hung Chang Lu; Yang-Bin Tsao; Yi-Min Chih; Wei-Chao Chen; Shao-Yi Chien

We propose GrateTile, an efficient, hardware-friendly data storage scheme for sparse CNN feature maps (activations). It divides data into uneven-sized sub-tensors and, with small indexing overhead, stores them in a compressed yet randomly accessible format. This design enables modern CNN accelerators to fetch and decompress sub-tensors on-the-fly in a tiled processing manner. GrateTile is suitable

Updated: 2020-09-21
• arXiv.cs.AR Pub Date : 2020-09-18
Siyuan Lu; Meiqi Wang; Shuang Liang; Jun Lin; Zhongfeng Wang

Designing hardware accelerators for deep neural networks (DNNs) has been much desired. Nonetheless, most of these existing accelerators are built for either convolutional neural networks (CNNs) or recurrent neural networks (RNNs). Recently, the Transformer model is replacing the RNN in the natural language processing (NLP) area. However, because of intensive matrix computations and complicated data

Updated: 2020-09-21
• arXiv.cs.AR Pub Date : 2020-09-17
Yaohua Wang; Lois Orosa; Xiangjun Peng; Yang Guo; Saugata Ghose; Minesh Patel; Jeremie S. Kim; Juan Gómez Luna; Mohammad Sadrosadati; Nika Mansouri Ghiasi; Onur Mutlu

DRAM main memory is a performance bottleneck for many applications due to its high access latency. In-DRAM caches work to mitigate this latency by augmenting regular-latency DRAM with small-but-fast regions of DRAM that serve as a cache for the data held in the regular-latency region of DRAM. While an effective in-DRAM cache can allow a large fraction of memory requests to be served from a fast DRAM

Updated: 2020-09-20
• arXiv.cs.AR Pub Date : 2020-09-17
Gagandeep Singh; Dionysios Diamantopoulos; Christoph Hagleitner; Juan Gomez-Luna; Sander Stuijk; Onur Mutlu; Henk Corporaal

Ongoing climate change calls for fast and accurate weather and climate modeling. However, when solving large-scale weather prediction simulations, state-of-the-art CPU and GPU implementations suffer from limited performance and high energy consumption. These implementations are dominated by complex irregular memory access patterns and low arithmetic intensity that pose fundamental challenges to acceleration

Updated: 2020-09-20
• arXiv.cs.AR Pub Date : 2020-09-17
Minesh Patel; Jeremie S. Kim; Taha Shahroodi; Hasan Hassan; Onur Mutlu

Increasing single-cell DRAM error rates have pushed DRAM manufacturers to adopt on-die error-correction coding (ECC), which operates entirely within a DRAM chip to improve factory yield. The on-die ECC function and its effects on DRAM reliability are considered trade secrets, so only the manufacturer knows precisely how on-die ECC alters the externally-visible reliability characteristics. Consequently

Updated: 2020-09-20
• arXiv.cs.AR Pub Date : 2020-09-17
Zecheng He; Guangyuan Hu; Ruby Lee

Spectre and Meltdown attacks and their variants exploit performance optimization features to cause security breaches. Secret information is accessed and leaked through micro-architectural covert channels. New attack variants keep appearing and we do not have a systematic way to capture the critical characteristics of these attacks and evaluate why they succeed. In this paper, we provide a new attack-graph

Updated: 2020-09-20
• arXiv.cs.AR Pub Date : 2020-09-16
Bilgesu Arif Bilgin; Phillip Stanley-Marbell

Computing systems that can tolerate effects of errors in their communicated data values can trade this tolerance for improved resource efficiency. Many important applications of computing, such as embedded sensor systems, can tolerate errors that are bounded in their distribution of deviation from correctness (distortion). We present a channel adaptation technique which modulates properties of I/O

Updated: 2020-09-20
• arXiv.cs.AR Pub Date : 2020-09-16
Nikolaos Charalampos Papadopoulos; Vasileios Karakostas; Konstantinos Nikas; Nectarios Koziris; Dionisios N. Pnevmatikatos

The Rocket Chip Generator uses a collection of parameterized processor components to produce RISC-V-based SoCs. It is a powerful tool that can produce a wide variety of processor designs ranging from tiny embedded processors to complex multi-core systems. In this paper we extend the features of the Memory Management Unit of the Rocket Chip Generator and specifically the TLB hierarchy. TLBs are essential

Updated: 2020-09-18
• arXiv.cs.AR Pub Date : 2020-09-16
Damla Senol Cali; Gurpreet S. Kalsi; Zülal Bingöl; Can Firtina; Lavanya Subramanian; Jeremie S. Kim; Rachata Ausavarungnirun; Mohammed Alser; Juan Gomez-Luna; Amirali Boroumand; Anant Nori; Allison Scibisz; Sreenivas Subramoney; Can Alkan; Saugata Ghose; Onur Mutlu

Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor

Updated: 2020-09-18
• arXiv.cs.AR Pub Date : 2020-09-16
Joseph Gravellier; Jean-Max Dutertre; Yannick Teglia; Philippe Loubet Moundi

To meet the ever-growing need for performance in silicon devices, SoC providers have been increasingly relying on software-hardware cooperation. By controlling hardware resources such as power or clock management from software, developers gain the ability to build more flexible and power-efficient applications. Despite the benefits, these hardware components are now exposed to software code

Updated: 2020-09-18
• arXiv.cs.AR Pub Date : 2020-09-15
Zephan M. Enciso; Seyed Hadi Mirfarshbafan; Oscar Castañeda; Clemens JS. Schaefer; Christoph Studer; Siddharth Joshi

Spatial linear transforms that process multiple parallel analog signals to simplify downstream signal processing find widespread use in multi-antenna communication systems, machine learning inference, data compression, audio and ultrasound applications, among many others. In the past, a wide range of mixed-signal as well as digital spatial transform circuits have been proposed---it is, however, a longstanding

Updated: 2020-09-18
• arXiv.cs.AR Pub Date : 2020-09-15
El Mehdi Benhani (LHC); Cuauhtemoc Mancillas Lopez (CINVESTAV-IPN); Lilian Bossuet (LHC)

Security in TrustZone-enabled heterogeneous systems-on-chip (SoCs) has been gaining increasing attention for several years, mainly because this type of SoC can be found in more and more applications in servers or in the cloud. The inside-SoC communication layer is one of the main elements of a heterogeneous SoC; indeed, all the data goes through it. Monitoring and controlling inside-SoC communications enables

Updated: 2020-09-16
• arXiv.cs.AR Pub Date : 2020-09-14
Drew Zagieboylo; G. Edward Suh; Andrew C. Myers

Virtual memory has been a standard hardware feature for more than three decades. At the price of increased hardware complexity, it has simplified software and promised strong isolation among colocated processes. In modern computing systems, however, the costs of virtual memory have increased significantly. With large memory workloads, virtualized environments, data center computing, and chips with

Updated: 2020-09-16
• arXiv.cs.AR Pub Date : 2020-09-15
Malik Imran; Zain Ul Abideen; Samuel Pagliarini

Security of currently deployed public key cryptography algorithms is foreseen to be vulnerable against quantum computer attacks. Hence, a community effort exists to develop post-quantum cryptography (PQC) algorithms, i.e., algorithms that are resistant to quantum attacks. In this work, we have investigated how lattice-based candidate algorithms from the NIST PQC standardization competition fare when

Updated: 2020-09-16
• arXiv.cs.AR Pub Date : 2020-09-14
Sara Hooker

Hardware, systems and algorithms research communities have historically had different incentive structures and fluctuating motivation to engage with each other explicitly. This historical treatment is odd given that hardware and software have frequently determined which research ideas succeed (and fail). This essay introduces the term hardware lottery to describe when a research idea wins because it

Updated: 2020-09-15
• arXiv.cs.AR Pub Date : 2020-09-14
Kanghyun Choi; Deokki Hong; Hojae Yoon; Joonsang Yu; Youngsok Kim; Jinho Lee

To cope with the ever-increasing computational demand of DNN execution, recent neural architecture search (NAS) algorithms take hardware cost metrics, such as GPU latency, into account. To further pursue fast, efficient execution, DNN-specialized hardware accelerators are being designed for multiple purposes, and these far exceed the efficiency of GPUs. However, those hardware-related metrics

Updated: 2020-09-15
• arXiv.cs.AR Pub Date : 2020-09-14
Philip Colangelo; Oren Segal; Alex Speicher; Martin Margala

State-of-the-art Neural Network Architectures (NNAs) are challenging to design and implement efficiently in hardware. In the past couple of years, this has led to an explosion in research and development of automatic Neural Architecture Search (NAS) tools. AutoML tools are now used to achieve state-of-the-art NNA designs and attempt to optimize for hardware usage and design. Much of the recent research

Updated: 2020-09-15
• arXiv.cs.AR Pub Date : 2020-09-13
Zishen Wan; Bo Yu; Thomas Yuang Li; Jie Tang; Yuhao Zhu; Yu Wang; Arijit Raychowdhury; Shaoshan Liu

Recent research on robotics has shown significant improvement, spanning from algorithms and mechanics to hardware architectures. Robots, including manipulators, legged robots, drones, and autonomous vehicles, are now widely applied in diverse scenarios. However, the high computation and data complexity of robotic algorithms poses great challenges to their application. On the one hand, CPU platform

Updated: 2020-09-15
• arXiv.cs.AR Pub Date : 2020-09-11
Andreas Kurth; Wolfgang Rönninger; Thomas Benz; Matheus Cavalcante; Fabian Schuiki; Florian Zaruba; Luca Benini

On-chip communication infrastructure is a central component of modern systems-on-chip (SoCs), and it continues to gain importance as the number of cores, the heterogeneity of components, and the on-chip and off-chip bandwidth continue to grow. Decades of research on on-chip networks enabled cache-coherent shared-memory multiprocessors. However, communication fabrics that meet the needs of heterogeneous

Updated: 2020-09-14
• arXiv.cs.AR Pub Date : 2020-09-11
Mahdi Taheri; Saeideh Sheikhpour; Mohammad Saeed Ansari; Ali Mahani

This paper presents a high-throughput fault-resilient hardware implementation of AES S-box, called HFS-box. If a transient natural or even malicious fault in each pipeline stage is detected, the corresponding error signal becomes high and as a result, the control unit holds the output of our proposed DMR voter till the fault effect disappears. The proposed low-cost HFS-box provides a high capability

Updated: 2020-09-14
• arXiv.cs.AR Pub Date : 2020-09-11
Suresh Krishna; Ravi Krishna

In today's era of "scale-out", this paper makes the case that a specialized hardware architecture based on "scale-in"--placing as many specialized processors as possible along with their memory systems and interconnect links within one or two boards in a rack--would offer the potential to boost large recommender system throughput by 12-62x for inference and 12-45x for training compared to the DGX-2

Updated: 2020-09-14
• arXiv.cs.AR Pub Date : 2020-09-10
Gokul Subramanian Ravi; Ramon Bertran; Pradip Bose; Mikko Lipasti

We present MicroGrad, a centralized automated framework that is able to efficiently analyze the capabilities, limits and sensitivities of complex modern processors in the face of constantly evolving application domains. MicroGrad uses Microprobe, a flexible code generation framework as its back-end and a Gradient Descent based tuning mechanism to efficiently enable the evolution of the test cases to

Updated: 2020-09-11
• arXiv.cs.AR Pub Date : 2020-09-09
Kirti Bhanushali; Chinmay Tembe; W. Rhett Davis

FinFETs are predicted to advance semiconductor scaling for sub-20nm devices. In order to support their introduction into research and universities, it is crucial to develop an open source predictive process design kit. This paper discusses in detail the design process for such a kit for 15nm FinFET devices, called the FreePDK15. The kit consists of a layer stack with thirteen metal layers based on hier

Updated: 2020-09-11
• arXiv.cs.AR Pub Date : 2020-09-09
Yunsong Wang; Charlene Yang; Steven Farrell; Yan Zhang; Thorsten Kurth; Samuel Williams

Deep learning applications are usually very compute-intensive and require a long runtime for training and inference. This has been tackled by researchers from both hardware and software sides, and in this paper, we propose a Roofline-based approach to performance analysis to facilitate the optimization of these applications. This approach is an extension of the Roofline model widely used in traditional

Updated: 2020-09-11
• arXiv.cs.AR Pub Date : 2020-09-08
Freddy Gabbay; Avi Mendelson

Reliability is a crucial requirement in any modern microprocessor to assure correct execution over its lifetime. As mission-critical components are becoming common in commodity systems (e.g., control of autonomous cars), the demand for reliable processing has heightened even further. The latest process technologies have worsened the situation further; thus, microprocessor design has become highly susceptible to

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-09-09
Mahmoud Khalafalla; Mahmoud A. Elmohr; Catherine Gebotys

This paper contributes to the study of PUF vulnerability to modeling attacks by evaluating the security of XOR BR PUFs, XOR TBR PUFs, and obfuscated architectures of XOR BR PUF using a simplified mathematical model and deep learning (DL) techniques. The results show that DL modeling attacks could easily break the security of 4-input XOR BR PUFs and 4-input XOR TBR PUFs with modeling accuracy

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-09-08
Javad Bagherzadeh; Vishishtha Bothra; Disha Gujar; Sugandha Gupta; Jinal Shah

The Rivest-Shamir-Adleman (RSA) cryptosystem uses modular multiplication for encryption and decryption, so the performance of RSA can be drastically improved by optimizing modular multiplication. This paper proposes a new parallel, high-radix Montgomery multiplier for a 1024-bit multi-core RSA processor. Each computation step operates in radix 4. The computation speed is increased by more than 4 times. We

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-09-08
Oscar Castañeda; Sven Jacobsson; Giuseppe Durisi; Tom Goldstein; Christoph Studer

All-digital basestation (BS) architectures enable superior spectral efficiency compared to hybrid solutions in massive multi-user MIMO systems. However, supporting large bandwidths with all-digital architectures at mmWave frequencies is challenging as traditional baseband processing would result in excessively high power consumption and large silicon area. The recently-proposed concept of finite-alphabet

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-09-08
Soham Chakraborty

Mapping programs from one architecture to another plays a key role in technologies such as binary translation, decompilation, emulation, virtualization, and application migration. Although multicore architectures are ubiquitous, the state-of-the-art translation tools do not handle concurrency primitives correctly. Doing so is rather challenging because of the subtle differences in the concurrency models

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-07-03
Chrysostomos Chatzigeorgiou; Dimitrios Garyfallou; George Floros; Nestor Evmorfopoulos; George Stamoulis

During the past decade, Model Order Reduction (MOR) has become a key enabler for the efficient simulation of large circuit models. MOR techniques based on moment-matching are well established due to their simplicity and computational performance in the reduction process. However, moment-matching methods based on the ordinary Krylov subspace are usually inadequate to accurately approximate the original

Updated: 2020-09-10
• arXiv.cs.AR Pub Date : 2020-09-07
Rolf Drechsler

Functional correctness can be ensured only by formal verification approaches. While fast verification is possible for many circuits, in other cases the approaches fail. In general, no efficient algorithms can be given, since the underlying verification problem is NP-complete. In this paper we prove that for different types of adder circuits polynomial verification can be ensured based on BDDs. While

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-07
Shaoshan Liu

Most business decisions are made with analysis, but some are judgment calls not susceptible to analysis due to time or information constraints. In this article, we present a real-life case study of critical business decision making at PerceptIn, an autonomous driving technology startup. In its early years, PerceptIn had to make a decision on the design of computing systems for its autonomous

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-04
Zhi-Gang Liu; Paul N. Whatmough; Matthew Mattina

Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). Exploiting data sparsity is a common approach to further accelerate GEMM for CNN inference, and in particular, structural sparsity has the advantages of predictable load balancing and very low index overhead. In this paper, we address a

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-05
Xuan He; Kui Cai; Liang Zhou

Consider the computations at a node in the message passing algorithms. Assume that the node has incoming and outgoing messages $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_n)$, respectively. In this paper, we investigate a class of structures that can be adopted by the node for computing $\mathbf{y}$ from $\mathbf{x}$, where each $y_j, j = 1, 2, \ldots, n$ is computed

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-05
Charlene Yang

This paper surveys a range of methods to collect necessary performance data on Intel CPUs and NVIDIA GPUs for hierarchical Roofline analysis. As of mid-2020, two vendor performance tools, Intel Advisor and NVIDIA Nsight Compute, have integrated Roofline analysis into their supported feature set. This paper fills the gap for when these tools are not available, or when users would like a more customized

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-04
Mohammed Nabeel; Mohammed Ashraf; Satwik Patnaik; Vassos Soteriou; Ozgur Sinanoglu; Johann Knechtel

For the first time, we leverage the 2.5D interposer technology to establish system-level security in the face of hardware- and software-centric adversaries. More specifically, we integrate chiplets (i.e., third-party hard intellectual property of complex functionality, like microprocessors) using a security-enforcing interposer. Such hardware organization provides a robust 2.5D root of trust for trustworthy

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-04
Mojan Javaheripi; Mohammad Samragh; Gregory Fields; Tara Javidi; Farinaz Koushanfar

We propose CLEANN, the first end-to-end framework that enables online mitigation of Trojans for embedded Deep Neural Network (DNN) applications. A Trojan attack works by injecting a backdoor in the DNN while training; during inference, the Trojan can be activated by the specific backdoor trigger. What differentiates CLEANN from the prior work is its lightweight methodology which recovers the ground-truth

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-04
Sheng-Chun Kao; Geonhwa Jeong; Tushar Krishna

DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) given a dataflow

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-04
Casey Duckering; Jonathan M. Baker; David I. Schuster; Frederic T. Chong

Current, near-term quantum devices have shown great progress in recent years culminating with a demonstration of quantum supremacy. In the medium-term, however, quantum machines will need to transition to greater reliability through error correction, likely through promising techniques such as surface codes which are well suited for near-term devices with limited qubit connectivity. We discover quantum

Updated: 2020-09-08
• arXiv.cs.AR Pub Date : 2020-09-03
Zhe Lin; Sharad Sinha; Hao Liang; Liang Feng; Wei Zhang

Modern multicore systems are migrating from homogeneous systems to heterogeneous systems with accelerator-based computing in order to overcome the barriers of performance and power walls. In this trend, FPGA-based accelerators are becoming increasingly attractive, due to their excellent flexibility and low design cost. In this paper, we propose the architectural support for efficient interfacing between

Updated: 2020-09-05
• arXiv.cs.AR Pub Date : 2020-09-03
Zhe Lin; Wei Zhang; Sharad Sinha

Fine-grained runtime power management techniques could be promising solutions for power reduction. Therefore, it is essential to establish accurate power monitoring schemes to obtain dynamic power variation in a short period (i.e., tens or hundreds of clock cycles). In this paper, we leverage a decision-tree-based power modeling approach to establish fine-grained hardware power monitoring on FPGA platforms

Updated: 2020-09-05
• arXiv.cs.AR Pub Date : 2020-09-03
Zhe Lin; Sharad Sinha; Wei Zhang

As field-programmable gate arrays become prevalent in critical application domains, their power consumption is of high concern. In this paper, we present and evaluate a power monitoring scheme capable of accurately estimating the runtime dynamic power of FPGAs in a fine-grained timescale, in order to support emerging power management techniques. In particular, we describe a novel and specialized ensemble

Updated: 2020-09-05
• arXiv.cs.AR Pub Date : 2020-09-03
Duy Thanh Nguyen; Hyun Kim; Hyuk-Jae Lee

Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to a low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different

Updated: 2020-09-05
• arXiv.cs.AR Pub Date : 2020-09-02
Paolo Mantovani; Davide Giri; Giuseppe Di Guglielmo; Luca Piccolboni; Joseph Zuckerman; Emilio G. Cota; Michele Petracca; Christian Pilato; Luca P. Carloni

ESP is an open-source research platform for heterogeneous SoC design. The platform combines a modular tile-based architecture with a variety of application-oriented flows for the design and optimization of accelerators. The ESP architecture is highly scalable and strikes a balance between regularity and specialization. The companion methodology raises the level of abstraction to system-level design

Updated: 2020-09-03
• arXiv.cs.AR Pub Date : 2020-09-02
Debjyoti Bhattacharjee; Anupam Chattopadhyay; Srijit Dutta; Ronny Ronen; Shahar Kvatinsky

Data-intensive applications are poised to benefit directly from processing-in-memory platforms, such as memristive Memory Processing Units, which allow leveraging data locality and performing stateful logic operations. Developing design automation flows for such platforms is a challenging and highly relevant research problem. In this work, we investigate the problem of minimizing delay under arbitrary

Updated: 2020-09-03
• arXiv.cs.AR Pub Date : 2020-09-02
Zhe Lin; Jieru Zhao; Sharad Sinha; Wei Zhang

High-level synthesis (HLS) enables designers to customize hardware designs efficiently. However, it is still challenging to foresee the correlation between power consumption and HLS-based applications at an early design stage. To overcome this problem, we introduce HL-Pow, a power modeling framework for FPGA HLS based on state-of-the-art machine learning techniques. HL-Pow incorporates an automated

Updated: 2020-09-03
• arXiv.cs.AR Pub Date : 2020-09-02
Zhihui Zhang; Jingwen Leng; Lingxiao Ma; Youshan Miao; Chao Li; Minyi Guo

Graph neural networks (GNNs) represent an emerging line of deep learning models that operate on graph structures. They are becoming more and more popular due to the high accuracy they achieve in many graph-related tasks. However, GNNs are not as well understood in the system and architecture community as their counterparts such as multi-layer perceptrons and convolutional neural networks. This work tries to introduce

Updated: 2020-09-03
• arXiv.cs.AR Pub Date : 2020-09-01
Mostafa Mahmoud; Isak Edo; Ali Hadi Zadeh; Omar Mohamed Awad; Gennady Pekhimenko; Jorge Albericio; Andreas Moshovos

TensorDash is a hardware-level technique for enabling data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost, sparse input operand interconnect comprising an 8-input multiplexer per multiplier

Updated: 2020-09-03
Contents have been reproduced by permission of the publishers.
