Current journal: arXiv - CS - Hardware Architecture
  • ReversiSpec: Reversible Coherence Protocol for Defending Transient Attacks
    arXiv.cs.AR Pub Date : 2020-06-30
    You Wu; Xuehai Qian

    The recent works such as InvisiSpec, SafeSpec, and CleanupSpec, among others, provided promising solutions to defend against speculation-induced (transient) attacks. However, they introduce delay either when a speculative load becomes safe in the redo approach or when it is squashed in the undo approach. We argue that this is due to the lack of fundamental mechanisms for reversing the effects of speculation

  • Firmware Insider: Bluetooth Randomness is Mostly Random
    arXiv.cs.AR Pub Date : 2020-06-30
    Jörn Tillmanns; Jiska Classen; Felix Rohrbach; Matthias Hollick

    Bluetooth chips must include a Random Number Generator (RNG). This RNG is used internally within cryptographic primitives but also exposed to the operating system for chip-external applications. In general, it is a black box with security-critical authentication and encryption mechanisms depending on it. In this paper, we evaluate the quality of RNGs in various Broadcom and Cypress Bluetooth chips

  • SeMPE: Secure Multi Path Execution Architecture for Removing Conditional Branch Side Channels
    arXiv.cs.AR Pub Date : 2020-06-29
    Andrea Mondelli; Paul Gazzillo; Yan Solihin

    One of the most prevalent sources of side-channel vulnerabilities is the secret-dependent behavior of conditional branches (SDBCB). The state-of-the-art solution relies on Constant-Time Expressions, which require high programming effort and incur high performance overheads. In this paper, we propose SeMPE, an approach that relies on architecture support to eliminate SDBCB without requiring much programming
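
The Constant-Time Expressions baseline mentioned above can be illustrated with a branchless select, which replaces a secret-dependent branch by bitwise arithmetic that executes identically for either value of the secret. A minimal Python sketch (the function name and word width are our own, not from the paper):

```python
def ct_select(secret_bit: int, a: int, b: int, width: int = 32) -> int:
    """Branchless select: returns a if secret_bit == 1, else b.

    mask is all-ones when the secret bit is set and all-zeros
    otherwise, so the same operations run for either secret value.
    Illustrative sketch; names are not from the paper.
    """
    ones = (1 << width) - 1
    mask = (-secret_bit) & ones          # 0xFFFF... if bit set, else 0
    return (a & mask) | (b & ~mask & ones)
```

Both outcomes execute the same operation sequence; only the mask value depends on the secret, which is what removes the branch side channel.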

  • An Imitation Learning Approach for Cache Replacement
    arXiv.cs.AR Pub Date : 2020-06-29
    Evan Zheran Liu; Milad Hashemi; Kevin Swersky; Parthasarathy Ranganathan; Junwhan Ahn

    Program execution speed critically depends on increasing cache hits, as cache hits are orders of magnitude faster than misses. To increase cache hits, we focus on the problem of cache replacement: choosing which cache line to evict upon inserting a new line. This is challenging because it requires planning far ahead and currently there is no known practical solution. As a result, current replacement
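
The replacement decision described above is classically formalized by Belady's clairvoyant policy, which evicts the line whose next use lies farthest in the future; imitation-learning approaches of this kind use such an oracle as the expert to imitate. A minimal Python sketch of the oracle (illustrative only, not the paper's learned policy):

```python
def belady_evict(cache, trace, t):
    """Belady's clairvoyant victim choice: among cached lines, pick the
    one whose next use lies farthest ahead (or that is never reused).
    Illustrative sketch; assumes the full future trace is known."""
    def next_use(line):
        for i in range(t, len(trace)):
            if trace[i] == line:
                return i
        return float("inf")  # never reused: ideal victim
    return max(cache, key=next_use)

def simulate(trace, capacity):
    """Count hits when the cache is managed by the oracle above."""
    cache, hits = set(), 0
    for t, line in enumerate(trace):
        if line in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.remove(belady_evict(cache, trace, t + 1))
            cache.add(line)
    return hits
```

In practice the future trace is unknown, which is exactly why a learned predictor that imitates this oracle is attractive.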

  • DRACO: Co-Optimizing Hardware Utilization, and Performance of DNNs on Systolic Accelerator
    arXiv.cs.AR Pub Date : 2020-06-26
    Nandan Kumar Jha; Shreyas Ravishankar; Sparsh Mittal; Arvind Kaushik; Dipan Mandal; Mahesh Chandra

    The number of processing elements (PEs) in a fixed-sized systolic accelerator is well matched for large and compute-bound DNNs; whereas, memory-bound DNNs suffer from PE underutilization and fail to achieve peak performance and energy efficiency. To mitigate this, specialized dataflow and/or micro-architectural techniques have been proposed. However, due to the longer development cycle and the rapid

  • A Fast Finite Field Multiplier for SIKE
    arXiv.cs.AR Pub Date : 2020-06-25
    Yeonsoo Jeon; Dongsuk Jeon

    Various post-quantum cryptography algorithms have been recently proposed. Supersingular isogeny Diffie-Hellman key exchange (SIKE) is one of the most promising candidates due to its small key size. However, the SIKE scheme requires numerous finite field multiplications for its isogeny computation, and hence suffers from a slow encryption and decryption process. In this paper, we propose a fast finite

  • Arnold: an eFPGA-Augmented RISC-V SoC for Flexible and Low-Power IoT End-Nodes
    arXiv.cs.AR Pub Date : 2020-06-25
    Pasquale Davide Schiavone; Davide Rossi; Alfio Di Mauro; Frank Gurkaynak; Timothy Saxe; Mao Wang; Ket Chong Yap; Luca Benini

    A wide range of Internet of Things (IoT) applications require powerful, energy-efficient and flexible end-nodes to acquire data from multiple sources, process and distill the sensed data through near-sensor data analytics algorithms, and transmit it wirelessly. This work presents Arnold: a 0.5 V to 0.8 V, 46.83 uW/MHz, 600 MOPS fully programmable RISC-V Microcontroller unit (MCU) fabricated in 22 nm

  • On the Difficulty of Designing Processor Arrays for Deep Neural Networks
    arXiv.cs.AR Pub Date : 2020-06-24
    Kevin Stehle; Günther Schindler; Holger Fröning

    Systolic arrays are a promising computing concept that is particularly in line with CMOS technology trends and the linear algebra operations found in the processing of artificial neural networks. The recent success of such deep learning methods in a wide set of applications has led to a variety of models which, albeit conceptually similar as they are based on convolutions and fully-connected layers, in detail show

  • Block-matching in FPGA
    arXiv.cs.AR Pub Date : 2020-06-24
    Rafael Pizarro Solar; Michal Pleskowicz

    Block-matching and 3D filtering (BM3D) is an image denoising algorithm that works in two similar steps. Both of these steps need to perform grouping by block-matching. We implement the block-matching in an FPGA, leveraging its ability to perform parallel computations. Our goal is to enable other researchers to use our solution in the future for real-time video denoising in video cameras that use FPGAs
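
The grouping step can be stated concretely: for a reference block, scan a search window and rank candidate blocks by their sum of squared differences; every candidate distance is independent, which is what makes the computation attractive for FPGA parallelization. A pure-Python sketch (names and parameters are our own):

```python
def ssd(img, y0, x0, y1, x1, bs):
    """Sum of squared differences between two bs x bs blocks."""
    return sum((img[y0 + dy][x0 + dx] - img[y1 + dy][x1 + dx]) ** 2
               for dy in range(bs) for dx in range(bs))

def match_blocks(img, ry, rx, bs, search, k):
    """Return the k candidate block positions most similar to the
    reference block at (ry, rx), scanning a +/- search window.
    Each candidate distance is independent, so an FPGA can compute
    them in parallel. Sketch only; not the paper's implementation."""
    h, w = len(img), len(img[0])
    cands = []
    for y in range(max(0, ry - search), min(h - bs, ry + search) + 1):
        for x in range(max(0, rx - search), min(w - bs, rx + search) + 1):
            cands.append((ssd(img, ry, rx, y, x, bs), (y, x)))
    cands.sort()
    return [pos for _, pos in cands[:k]]
```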

  • On Mitigating Random and Adversarial Bit Errors
    arXiv.cs.AR Pub Date : 2020-06-24
    David Stutz; Nandhini Chandramoorthy; Matthias Hein; Bernt Schiele

    The design of deep neural network (DNN) accelerators, i.e., specialized hardware for inference, has received considerable attention in past years due to savings in cost, area, and energy compared to mainstream hardware. We consider the problem of random and adversarial bit errors in quantized DNN weights stored on accelerator memory. Random bit errors arise when optimizing accelerators for energy efficiency

  • Fetch-Directed Instruction Prefetching Revisited
    arXiv.cs.AR Pub Date : 2020-06-24
    Truls Asheim; Rakesh Kumar; Boris Grot

    Prior work has observed that fetch-directed prefetching (FDIP) is highly effective at covering instruction cache misses. The key to FDIP's effectiveness is having a sufficiently large BTB to accommodate the application's branch working set. In this work, we introduce several optimizations that significantly extend the reach of the BTB within the available storage budget. Our optimizations target nearly

  • SCARE: Side Channel Attack on In-Memory Computing for Reverse Engineering
    arXiv.cs.AR Pub Date : 2020-06-23
    Sina Sayyah Ensan; Karthikeyan Nagarajan; Mohammad Nasim Imtiaz Khan; Swaroop Ghosh

    In-memory computing (IMC) architectures provide a much-needed solution to the energy-efficiency barriers posed by von Neumann computing, which stem from the movement of data between the processor and the memory. Functions implemented in such in-memory architectures are often proprietary and constitute confidential Intellectual Property. Our studies indicate that IMCs implemented using RRAM are susceptible to Side Channel

  • Optimizing Placement of Heap Memory Objects in Energy-Constrained Hybrid Memory Systems
    arXiv.cs.AR Pub Date : 2020-06-22
    Taeuk Kim; Safdar Jamil; Joongeon Park; Youngjae Kim

    Main memory (DRAM) significantly impacts the power and energy utilization of the overall server system. Non-Volatile Memory (NVM) devices, such as Phase Change Memory and Spin-Transfer Torque RAM, are suitable candidates for main memory to reduce energy consumption. But unlike DRAM, NVM access latencies are higher, and NVM writes are more energy-intensive than DRAM write operations. Thus

  • Compiler Directed Speculative Intermittent Computation
    arXiv.cs.AR Pub Date : 2020-06-20
    Jongouk Choi; Qingrui Liu; Changhee Jung

    This paper presents CoSpec, a new architecture/compiler co-design scheme that works for commodity in-order processors used in energy-harvesting systems. To achieve crash consistency without requiring unconventional architectural support, CoSpec leverages speculation assuming that power failure is not going to occur and thus holds all committed stores in a store buffer (SB), as if they were speculative

  • fault: A Python Embedded Domain-Specific Language For Metaprogramming Portable Hardware Verification Components
    arXiv.cs.AR Pub Date : 2020-06-20
    Lenny Truong; Steven Herbst; Rajsekhar Setaluri; Makai Mann; Ross Daly; Keyi Zhang; Caleb Donovick; Daniel Stanley; Mark Horowitz; Clark Barrett; Pat Hanrahan

    While hardware generators have drastically improved design productivity, they have introduced new challenges for the task of verification. To effectively cover the functionality of a sophisticated generator, verification engineers require tools that provide the flexibility of metaprogramming. However, flexibility alone is not enough; components must also be portable in order to encourage the proliferation

  • Design of a Near-Ideal Fault-Tolerant Routing Algorithm for Network-on-Chip-Based Multicores
    arXiv.cs.AR Pub Date : 2020-06-19
    Costas Iordanou; Vassos Soteriou; Konstantinos Aisopos

    With relentless CMOS technology downsizing, Networks-on-Chip (NoCs) are inescapably experiencing escalating susceptibility to wearout and reduced reliability. While faults in processors and memories may be masked via redundancy, or mitigated via techniques such as task migration, NoCs are especially vulnerable to hardware faults, as a single link breakdown may cause inter-tile communication to halt

  • ZnG: Architecting GPU Multi-Processors with New Flash for Scalable Data Analysis
    arXiv.cs.AR Pub Date : 2020-06-16
    Jie Zhang; Myoungsoo Jung

    We propose ZnG, a new GPU-SSD integrated architecture, which can maximize the memory capacity in a GPU and address the performance penalties imposed by an SSD. Specifically, ZnG replaces all GPU-internal DRAM with an ultra-low-latency SSD to maximize the GPU memory capacity. ZnG further removes the performance bottleneck of the SSD by replacing its flash channels with a high-throughput flash network and integrating

  • Addressing Variability in Reuse Prediction for Last-Level Caches
    arXiv.cs.AR Pub Date : 2020-06-15
    Priyank Faldu

    The Last-Level Cache (LLC) represents the bulk of a modern CPU's transistor budget and is essential for application performance, as the LLC enables fast access to data in contrast to the much slower main memory. However, applications with large working sets often exhibit streaming and/or thrashing access patterns at the LLC. As a result, a large fraction of the LLC capacity is occupied by dead blocks

  • Architecture Support for FPGA Multi-tenancy in the Cloud
    arXiv.cs.AR Pub Date : 2020-06-14
    Joel Mandebi Mbongue; Alex Shuping; Pankaj Bhowmik; Christophe Bobda

    Cloud deployments now increasingly provision FPGA accelerators as part of virtual instances. While FPGAs are still essentially single-tenant, the growing demand for hardware acceleration will inevitably lead to the need for methods and architectures supporting FPGA multi-tenancy. In this paper, we propose an architecture supporting space-sharing of FPGA devices among multiple tenants in the cloud.

  • A Unified Learning Platform for Dynamic Frequency Scaling in Pipelined Processors
    arXiv.cs.AR Pub Date : 2020-06-12
    Arash Fouman Ajirlou; Inna Partin-Vaisband

    A machine learning (ML) design framework is proposed for dynamically adjusting clock frequency based on propagation delay of individual instructions. A Random Forest model is trained to classify propagation delays in real-time, utilizing current operation type, current operands, and computation history as ML features. The trained model is implemented in Verilog as an additional pipeline stage within
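
The scheme can be pictured as a per-instruction classifier feeding a clock-period selector. The hand-written rule below is only a stand-in for the trained Random Forest; the class boundaries and periods are assumptions for illustration, not values from the paper:

```python
# Assumed short/long delay classes and their clock periods (illustrative).
PERIODS_NS = {0: 0.6, 1: 1.0}

def delay_class(opcode: str, a: int, b: int) -> int:
    """Stand-in for the trained classifier: map (operation type,
    operand values) to a propagation-delay class. Here, wide-operand
    multiplies are flagged as the slow class because they exercise
    long carry chains; the real model is a learned Random Forest."""
    if opcode == "mul" and max(a.bit_length(), b.bit_length()) > 16:
        return 1
    return 0

def cycle_time(instructions) -> float:
    """Total time if each instruction's cycle runs at the clock period
    chosen for its predicted delay class."""
    return sum(PERIODS_NS[delay_class(op, a, b)] for op, a, b in instructions)
```

Compared with always clocking at the worst-case period (1.0 ns here), instructions classified as fast complete in less time, which is the source of the speedup the framework targets.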

  • STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators
    arXiv.cs.AR Pub Date : 2020-06-10
    Francisco Muñoz-Martínez; José L. Abellán; Manuel E. Acacio; Tushar Krishna

    The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appealing

  • Unified Characterization Platform for Emerging NVM Technology: Neural Network Application Benchmarking Using off-the-shelf NVM Chips
    arXiv.cs.AR Pub Date : 2020-06-10
    Supriya Chakraborty; Abhishek Gupta; Manan Suri

    In this paper, we present a unified FPGA-based electrical test-bench for characterizing different emerging Non-Volatile Memory (NVM) chips. In particular, we present detailed electrical characterization and benchmarking of multiple commercially available, off-the-shelf NVM chips, viz.: MRAM, FeRAM, CBRAM, and ReRAM. We investigate important NVM parameters such as: (i) current consumption patterns, (ii)

  • A GPU Register File using Static Data Compression
    arXiv.cs.AR Pub Date : 2020-06-10
    Alexandra Angerd; Erik Sintorn; Per Stenström

    GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. This paper introduces a new register file organization for efficient register-packing of narrow integer and floating-point operands, designed to leverage advances in static analysis

  • Improving Dependability of Neuromorphic Computing With Non-Volatile Memory
    arXiv.cs.AR Pub Date : 2020-06-10
    Shihao Song; Anup Das; Nagarajan Kandasamy

    As process technology continues to scale aggressively, circuit aging in a neuromorphic hardware due to negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB) is becoming a critical reliability issue and is expected to proliferate when using non-volatile memory (NVM) for synaptic storage. This is because an NVM requires high voltage and current to access its synaptic

  • FP-Stereo: Hardware-Efficient Stereo Vision for Embedded Applications
    arXiv.cs.AR Pub Date : 2020-06-05
    Jieru Zhao; Tingyuan Liang; Liang Feng; Wenchao Ding; Sharad Sinha; Wei Zhang; Shaojie Shen

    Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source

  • Counting Cards: Exploiting Weight and Variance Distributions for Robust Compute In-Memory
    arXiv.cs.AR Pub Date : 2020-06-04
    Brian Crafton; Samuel Spetalnick; Arijit Raychowdhury

    Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data-intensive applications. It has found widespread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or
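
The crossbar computation at the heart of CIM is an analog matrix-vector product: weights are stored as cell conductances, input voltages drive the rows, and each bitline sums its cell currents by Kirchhoff's current law. An idealized Python sketch that omits the device variance the paper's statistics address:

```python
def crossbar_mvm(G, v):
    """Ideal eNVM crossbar matrix-vector multiply: column (bitline)
    current I_j = sum_i G[i][j] * v[i]. Device-to-device conductance
    variance, which the paper exploits weight statistics to tolerate,
    is omitted here. Illustrative sketch only."""
    rows, cols = len(G), len(G[0])
    return [sum(G[i][j] * v[i] for i in range(rows)) for j in range(cols)]
```

A whole matrix-vector product thus completes in one analog step, which is why CIM minimizes data transport: the weights never leave the array.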

  • Operation Merging for Hardware Implementations of Fast Polar Decoders
    arXiv.cs.AR Pub Date : 2020-06-03
    Furkan Ercan; Thibaud Tonnellier; Carlo Condo; Warren J. Gross

    Polar codes are a class of linear block codes that provably achieve channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario in 5th-generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works

  • Hardware Security in Spin-Based Computing-In-Memory: Analysis, Exploits, and Mitigation Techniques
    arXiv.cs.AR Pub Date : 2020-06-02
    Xueyan Wang; Jianlei Yang; Yinglin Zhao; Xiaotao Jia; Gang Qu; Weisheng Zhao

    Computing-in-memory (CIM) is proposed to alleviate the processor-memory data-transfer bottleneck in traditional von Neumann architectures, and spintronics-based magnetic memory has shown great promise for implementing the CIM paradigm. Since hardware security has become one of the major concerns in circuit design, this paper, for the first time, investigates spin-based computing-in-memory (SpinCIM)

  • Exceeding Conservative Limits: A Consolidated Analysis on Modern Hardware Margins
    arXiv.cs.AR Pub Date : 2020-06-01
    George Papadimitriou; Athanasios Chatzidimitriou; Dimitris Gizopoulos; Vijay Janapa Reddi; Jingwen Leng; Behzad Salami; Osman S. Unsal; Adrian Cristal Kestelman

    Modern large-scale computing systems (data centers, supercomputers, cloud and edge setups and high-end cyber-physical systems) employ heterogeneous architectures that consist of multicore CPUs, general-purpose many-core GPUs, and programmable FPGAs. The effective utilization of these architectures poses several challenges, among which a primary one is power consumption. Voltage reduction is one of

  • How to extend the Single-Processor Paradigm to the Explicitly Many-Processor Approach
    arXiv.cs.AR Pub Date : 2020-05-31
    János Végh

    The computing paradigm invented for processing a small amount of data on a single segregated processor cannot meet the challenges posed by present-day computing demands. The paper proposes a new computing paradigm (extending the old one to use several processors explicitly) and discusses some questions of its possible implementation. Some advantages of the implemented approach, illustrated with the

  • CLARINET: A RISC-V Based Framework for Posit Arithmetic Empiricism
    arXiv.cs.AR Pub Date : 2020-05-30
    Riya Jain; Niraj Sharma; Farhad Merchant; Sachin Patkar; Rainer Leupers

    Many engineering and scientific applications require high precision arithmetic. IEEE 754-2008 compliant (floating-point) arithmetic is the de facto standard for performing these computations. Recently, posit arithmetic has been proposed as a drop-in replacement for floating-point arithmetic. The posit data representation and arithmetic offer several absolute advantages over the floating-point format

  • A Unified Hardware Architecture for Convolutions and Deconvolutions in CNN
    arXiv.cs.AR Pub Date : 2020-05-29
    Lin Bai; Yecheng Lyu; Xinming Huang

    In this paper, a scalable neural network hardware architecture for image segmentation is proposed. By sharing the same computing resources, both convolution and deconvolution operations are handled by the same processing element array. In addition, access to on-chip and off-chip memories is optimized to alleviate the burden introduced by partial sums. As an example, SegNet-Basic has been implemented using

  • Dynamic Merge Point Prediction
    arXiv.cs.AR Pub Date : 2020-05-29
    Stephen Pruett; Yale Patt

    Despite decades of research, conditional branch mispredictions still pose a significant problem for performance. Moreover, limit studies on infinite size predictors show that many of the remaining branches are impossible to predict by current strategies. Our work focuses on mitigating performance loss in the face of impossible to predict branches. This paper presents a dynamic merge point predictor

  • Revisiting RowHammer: An Experimental Analysis of Modern DRAM Devices and Mitigation Techniques
    arXiv.cs.AR Pub Date : 2020-05-27
    Jeremie S. Kim; Minesh Patel; A. Giray Yaglikci; Hasan Hassan; Roknoddin Azizi; Lois Orosa; Onur Mutlu

    In order to shed more light on how RowHammer affects modern and future devices at the circuit-level, we first present an experimental characterization of RowHammer on 1580 DRAM chips (408x DDR3, 652x DDR4, and 520x LPDDR4) from 300 DRAM modules (60x DDR3, 110x DDR4, and 130x LPDDR4) with RowHammer protection mechanisms disabled, spanning multiple different technology nodes from across each of the three

  • CLR-DRAM: A Low-Cost DRAM Architecture Enabling Dynamic Capacity-Latency Trade-Off
    arXiv.cs.AR Pub Date : 2020-05-26
    Haocong Luo; Taha Shahroodi; Hasan Hassan; Minesh Patel; Abdullah Giray Yaglikci; Lois Orosa; Jisung Park; Onur Mutlu

    DRAM is the prevalent main memory technology, but its long access latency can limit the performance of many workloads. Although prior works provide DRAM designs that reduce DRAM access latency, their reduced storage capacities hinder the performance of workloads that need large memory capacity. Because the capacity-latency trade-off is fixed at design time, previous works cannot achieve maximum performance

  • Accelerate Cycle-Level Full-System Simulation of Multi-Core RISC-V Systems with Binary Translation
    arXiv.cs.AR Pub Date : 2020-05-22
    Xuan Guo; Robert Mullins

    It has always been difficult to balance the accuracy and performance of instruction set simulators (ISSs). RTL simulators or systems such as gem5 are used to execute programs in a cycle-accurate manner but are often prohibitively slow. In contrast, functional simulators such as QEMU can run large benchmarks to completion in a reasonable time, yet capture few performance metrics and fail to model complex interactions between multiple

  • Stack up your chips: Betting on 3D integration to augment Moore's Law scaling
    arXiv.cs.AR Pub Date : 2020-05-21
    Saurabh Sinha; Xiaoqing Xu; Mudit Bhargava; Shidhartha Das; Brian Cline; Greg Yeric

    3D integration, i.e., stacking of integrated circuit layers using parallel or sequential processing is gaining rapid industry adoption with the slowdown of Moore's law scaling. 3D stacking promises potential gains in performance, power and cost but the actual magnitude of gains varies depending on end-application, technology choices and design. In this talk, we will discuss some key challenges associated

  • Memory-Aware Denial-of-Service Attacks on Shared Cache in Multicore Real-Time Systems
    arXiv.cs.AR Pub Date : 2020-05-21
    Michael Bechtel; Heechul Yun

    In this paper, we identify that memory performance plays a crucial role in the feasibility and effectiveness of denial-of-service attacks on shared caches. Based on this insight, we introduce new cache DoS attacks, which can be mounted from user space and can cause extreme WCET impacts on cross-core victims (even if the shared cache is partitioned) by taking advantage of the platform's

  • A Way Around UMIP and Descriptor-Table Exiting via TSX-based Side-Channel Attack
    arXiv.cs.AR Pub Date : 2020-05-20
    Mohammad Sina Karvandi; Saleh Khalaj Monfared; Mohammad Sina Kiarostami; Dara Rahmati; Saeid Gorgin

    Nowadays, operating systems include numerous protection mechanisms that prevent or limit user-mode applications from accessing the kernel's internal information. This is regularly carried out by software-based defenses such as Address Space Layout Randomization (ASLR) and Kernel ASLR (KASLR). They play pronounced roles when the security of sandboxed applications such as Web browsers is considered. Armed with

  • The Virtual Block Interface: A Flexible Alternative to the Conventional Virtual Memory Framework
    arXiv.cs.AR Pub Date : 2020-05-19
    Nastaran Hajinazar; Pratyush Patel; Minesh Patel; Konstantinos Kanellopoulos; Saugata Ghose; Rachata Ausavarungnirun; Geraldo Francisco de Oliveira Jr.; Jonathan Appavoo; Vivek Seshadri; Onur Mutlu

    Computers continue to diversify with respect to system designs, emerging memory technologies, and application memory demands. Unfortunately, continually adapting the conventional virtual memory framework to each possible system configuration is challenging, and often results in performance loss or requires non-trivial workarounds. To address these challenges, we propose a new virtual memory framework

  • In-memory Implementation of On-chip Trainable and Scalable ANN for AI/ML Applications
    arXiv.cs.AR Pub Date : 2020-05-19
    Abhash Kumar; Jawar Singh; Sai Manohar Beeraka; Bharat Gupta

    Traditional von Neumann architecture-based processors become inefficient in terms of energy and throughput as they involve separate processing and memory units, a separation also known as the memory wall. The memory wall problem is further exacerbated when massive parallelism and frequent data movement are required between processing and memory units for real-time implementation of artificial neural network

  • Energy-Efficient On-Chip Networks through Profiled Hybrid Switching
    arXiv.cs.AR Pub Date : 2020-05-18
    Yuan He; Jinyu Jiao; Thang Cao; Masaaki Kondo

    Virtual channel flow control is the de facto choice for modern networks-on-chip to allow better utilization of the link bandwidth through buffering and packet switching, which are also the sources of large power footprint and long per-hop latency. On the other hand, bandwidth can be plentiful for parallel workloads under virtual channel flow control. Thus, dated but simpler flow controls such as circuit

  • A Lightweight Isolation Mechanism for Secure Branch Predictors
    arXiv.cs.AR Pub Date : 2020-05-17
    Lutan Zhao; Peinan Li; Rui Hou; Jiazhen Li; Michael C. Huang; Lixin Zhang; Xuehai Qian; Dan Meng

    Recently exposed vulnerabilities reveal the necessity to improve the security of branch predictors. Branch predictors record history about the execution of different programs, and such information from different processes is stored in the same structure and is thus accessible across processes. This leaves attackers with opportunities for malicious training and malicious perception. Instead of flush-based

  • Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference
    arXiv.cs.AR Pub Date : 2020-05-16
    Zhi-Gang Liu; Paul N. Whatmough; Matthew Mattina

    Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs), with very efficient local data movement, well suited to accelerating GEMM, and widely deployed in industry. In this work, we describe two significant improvements
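
A systolic array computes GEMM by streaming skewed operands through the PE grid while partial sums stay put. The cycle-stepped Python sketch below models a scalar output-stationary array; the paper's tensor-PE variant replaces each scalar MAC with a small dense tile, but the dataflow idea is the same:

```python
def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array:
    A streams in from the left (row-skewed), B from the top
    (column-skewed), and PE (i, j) accumulates one c[i][j] with a
    multiply-accumulate per cycle. Illustrative sketch only."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k + n + m - 2):          # enough cycles to drain the array
        for i in range(n):
            for j in range(m):
                kk = t - i - j              # skewed arrival time at PE (i, j)
                if 0 <= kk < k:
                    C[i][j] += A[i][kk] * B[kk][j]
    return C
```

Because each operand moves only between neighboring PEs, data movement stays local, which is the property that makes systolic arrays energy-efficient for GEMM.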

  • SysScale: Exploiting Multi-domain Dynamic Voltage and Frequency Scaling for Energy Efficient Mobile Processors
    arXiv.cs.AR Pub Date : 2020-05-15
    Jawad Haj-Yahya; Mohammed Alser; Jeremie Kim; A. Giray Yaglıkçı; Nandita Vijaykumar; Efraim Rotem; Onur Mutlu

    There are three domains in a modern thermally-constrained mobile system-on-chip (SoC): compute, IO, and memory. We observe that a modern SoC typically allocates a fixed power budget, corresponding to worst-case performance demands, to the IO and memory domains even if they are underutilized. The resulting unfair allocation of the power budget across domains can cause two major issues: 1) the IO and

  • ChewBaccaNN: A Flexible 223 TOPS/W BNN Accelerator
    arXiv.cs.AR Pub Date : 2020-05-12
    Renzo Andri; Geethan Karunaratne; Lukas Cavigelli; Luca Benini

    Binary Neural Networks enable smart IoT devices, as they significantly reduce the required memory footprint and computational complexity while retaining high network performance and flexibility. This paper presents ChewBaccaNN, a 0.7 mm$^2$ binary CNN accelerator designed in GlobalFoundries 22 nm technology. By exploiting efficient data re-use, data buffering, latch-based memories, and voltage

  • Accelerating Deep Neuroevolution on Distributed FPGAs for Reinforcement Learning Problems
    arXiv.cs.AR Pub Date : 2020-05-10
    Alexis Asseman; Nicolas Antoine; Ahmet S. Ozcan

    Reinforcement learning, augmented by the representational power of deep neural networks, has shown promising results on high-dimensional problems such as game playing and robotic control. However, the sequential nature of these problems poses a fundamental challenge for computational efficiency. Recently, alternative approaches such as evolutionary strategies and deep neuroevolution demonstrated competitive

  • Power and Accuracy of Multi-Layer Perceptrons (MLPs) under Reduced-voltage FPGA BRAMs Operation
    arXiv.cs.AR Pub Date : 2020-05-10
    Behzad Salami; Osman Unsal; Adrian Cristal

    In this paper, we exploit the aggressive supply voltage underscaling technique in Block RAMs (BRAMs) of Field Programmable Gate Arrays (FPGAs) to improve the energy efficiency of Multi-Layer Perceptrons (MLPs). Additionally, we evaluate and improve the resilience of this accelerator. Through experiments on several representative FPGA fabrics, we observe that until a minimum safe voltage level, i.e

  • Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories
    arXiv.cs.AR Pub Date : 2020-05-10
    Shihao Song; Anup Das; Nagarajan Kandasamy

    Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that its NVM access latency is much higher than its DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic

  • Improving Phase Change Memory Performance with Data Content Aware Access
    arXiv.cs.AR Pub Date : 2020-05-10
    Shihao Song; Anup Das; Onur Mutlu; Nagarajan Kandasamy

    A prominent characteristic of write operation in Phase-Change Memory (PCM) is that its latency and energy are sensitive to the data to be written as well as the content that is overwritten. We observe that overwriting unknown memory content can incur significantly higher latency and energy compared to overwriting known all-zeros or all-ones content. This is because all-zeros or all-ones content is
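
The data-dependent cost can be made concrete: only bits that actually change need a SET (0 to 1) or RESET (1 to 0) pulse, so overwriting known all-zeros or all-ones content requires programming only the differing bits, while overwriting unknown content may require touching every bit. A simple counting sketch (our own cost model, not the paper's exact one):

```python
def pcm_write_ops(old: int, new: int, width: int = 64):
    """Count SET (0 -> 1) and RESET (1 -> 0) programming operations
    needed to overwrite `old` with `new` in a PCM line of `width` bits.
    When the overwritten content is known (e.g. all-zeros), only the
    bits that differ need programming, which is the effect the
    abstract describes. Illustrative cost sketch."""
    mask = (1 << width) - 1
    sets = bin(~old & new & mask).count("1")    # bits going 0 -> 1
    resets = bin(old & ~new & mask).count("1")  # bits going 1 -> 0
    return sets, resets
```

For example, writing 0b1010 over known all-zeros needs only two SETs and no RESETs, whereas over unknown content the controller must budget for the worst case.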

  • Benchmarking High Bandwidth Memory on FPGAs
    arXiv.cs.AR Pub Date : 2020-05-09
    Zeke Wang; Hongjing Huang; Jie Zhang; Gustavo Alonso

    FPGAs are starting to be enhanced with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity to deal with application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual

  • Optimizing Temporal Convolutional Network inference on FPGA-based accelerators
    arXiv.cs.AR Pub Date : 2020-05-07
    Marco Carreras; Gianfranco Deriu; Luigi Raffo; Luca Benini; Paolo Meloni

    Convolutional Neural Networks are extensively used in a wide range of applications, commonly including computer vision tasks like image and video classification, recognition, and segmentation. Recent research results demonstrate that multilayer (deep) networks involving mono-dimensional convolutions and dilation can be effectively used in time-series and sequence classification and segmentation, as

  • A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug
    arXiv.cs.AR Pub Date : 2020-05-06
    Yuting Cao; Hao Zheng; Sandip Ray; Jin Yang

    Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on collaborative insights of the architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct

  • Comparing quaternary and binary multipliers
    arXiv.cs.AR Pub Date : 2020-05-06
    Daniel Etiemble

    We compare the implementation of an 8x8-bit multiplier with two different implementations of a 4x4-digit quaternary multiplier. Interfacing this binary multiplier with quaternary-to-binary decoders and binary-to-quaternary encoders leads to a 4x4 multiplier that outperforms the best direct implementation of a 4x4 quaternary multiplier. The far greater complexity of the 1-digit multipliers and 1-digit
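
The decoder/multiplier/encoder arrangement can be sketched functionally: each quaternary digit decodes to 2 bits, the values are multiplied with an ordinary binary multiplier, and the product is re-encoded as quaternary digits. The Python below checks the arithmetic only; the paper's comparison concerns circuit-level cost, which this sketch does not model:

```python
def quat_to_bits(digits):
    """Decode quaternary digits (most significant digit first) to a
    binary integer: each digit maps to 2 bits."""
    value = 0
    for d in digits:
        assert 0 <= d <= 3
        value = (value << 2) | d
    return value

def bits_to_quat(value, ndigits):
    """Encode a binary integer back to quaternary digits (MSD first)."""
    return [(value >> (2 * i)) & 3 for i in reversed(range(ndigits))]

def quat_mul(a_digits, b_digits):
    """4-digit x 4-digit quaternary multiply routed through a plain
    binary multiplier, mirroring the decoder/multiplier/encoder
    arrangement the abstract describes. Functional sketch only."""
    product = quat_to_bits(a_digits) * quat_to_bits(b_digits)
    return bits_to_quat(product, len(a_digits) + len(b_digits))
```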

  • Computing-in-Memory for Performance and Energy Efficient Homomorphic Encryption
    arXiv.cs.AR Pub Date : 2020-05-05
    Dayane Reis; Jonathan Takeshita; Taeho Jung; Michael Niemier; Xiaobo Sharon Hu

    Homomorphic encryption (HE) allows direct computations on encrypted data. Despite numerous research efforts, the practicality of HE schemes remains to be demonstrated. In this regard, the enormous size of ciphertexts involved in HE computations degrades computational efficiency. Near-memory Processing (NMP) and Computing-in-memory (CiM) - paradigms where computation is done within the memory boundaries

  • LiteX: an open-source SoC builder and library based on Migen Python DSL
    arXiv.cs.AR Pub Date : 2020-05-05
    Florent Kermarrec; Sébastien Bourdeauducq; Jean-Christophe Le Lann; Hannah Badier

    LiteX is a GitHub-hosted SoC builder, IP library, and set of utilities that can be used to create SoCs and full FPGA designs. Besides being open-source and BSD licensed, its originality lies in the fact that its IP components are entirely described using the Migen Python internal DSL, which simplifies its design in depth. LiteX already supports various softcore CPUs and essential peripherals, with no dependencies

  • Best implementations of quaternary adders
    arXiv.cs.AR Pub Date : 2020-05-05
    Daniel Etiemble

    The implementation of a quaternary 1-digit adder composed of a 2-bit binary adder, quaternary to binary decoders and binary to quaternary encoders is compared with several recent implementations of quaternary adders. This simple implementation outperforms all other implementations using only one power supply. It is equivalent to the best other implementation using three power supplies. The best quaternary

  • Testing Compilers for Programmable Switches Through Switch Hardware Simulation
    arXiv.cs.AR Pub Date : 2020-05-05
    Michael D. Wong; Aatish Varma; Anirudh Sivaraman

    Programmable switches have emerged as powerful and flexible alternatives to fixed function forwarding devices. But because of the unique hardware constraints of network switches, the design and implementation of compilers targeting these devices is tedious and error prone. Despite the important role that compilers play in software development, there is a dearth of tools for testing compilers within

  • Lupulus: A Flexible Hardware Accelerator for Neural Networks
    arXiv.cs.AR Pub Date : 2020-05-03
    Andreas Toftegaard Kristensen; Robert Giterman; Alexios Balatsoukas-Stimming; Andreas Burg

    Neural networks have become indispensable for a wide range of applications, but they suffer from high computational and memory requirements, requiring optimizations from the algorithmic description of the network down to the hardware implementation. Moreover, the high rate of innovation in machine learning makes it important that hardware implementations provide a high level of programmability to support

  • TIMELY: Pushing Data Movements and Interfaces in PIM Accelerators Towards Local and in Time Domain
    arXiv.cs.AR Pub Date : 2020-05-03
    Weitao Li; Pengfei Xu; Yang Zhao; Haitong Li; Yuan Xie; Yingyan Lin

    Resistive-random-access-memory (ReRAM) based processing-in-memory (R$^2$PIM) accelerators show promise in bridging the gap between Internet of Things devices' constrained resources and Convolutional/Deep Neural Networks' (CNNs/DNNs') prohibitive energy cost. Specifically, R$^2$PIM accelerators enhance energy efficiency by eliminating the cost of weight movements and improving the computational density

Contents have been reproduced by permission of the publishers.