Current journal: arXiv - CS - Hardware Architecture
  • Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators
    arXiv.cs.AR Pub Date : 2020-11-30
    Benjamin Y. Cho; Jeageun Jung; Mattan Erez

    DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference tasks are memory bandwidth bound, contrary to common

    Updated: 2020-12-02
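    Whether a GEMM is compute- or bandwidth-bound can be checked with a roofline-style comparison of arithmetic intensity against machine balance. The Python sketch below works the arithmetic through for a small-batch MLP layer; the layer shape and machine numbers are illustrative assumptions, not figures from the paper.

```python
# Roofline-style check: is a small-batch MLP GEMM compute- or bandwidth-bound?
# Layer shape and machine parameters are illustrative assumptions only.

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for C[m,n] = A[m,k] @ B[k,n], each operand touched once."""
    flops = 2 * m * n * k                               # one multiply + one add per MAC
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # A, B read; C written
    return flops / traffic

# Small inference batch (m = 8 queries) against a large FC weight matrix.
ai = gemm_arithmetic_intensity(m=8, k=1024, n=4096)

# Hypothetical server: 100 TFLOP/s peak compute, 400 GB/s DRAM bandwidth.
machine_balance = 100e12 / 400e9                        # FLOP/byte needed to be compute-bound

print(f"arithmetic intensity: {ai:.1f} FLOP/B vs machine balance: {machine_balance:.1f} FLOP/B")
print("bandwidth-bound" if ai < machine_balance else "compute-bound")
```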
  • HeM3D: Heterogeneous Manycore Architecture Based on Monolithic 3D Vertical Integration
    arXiv.cs.AR Pub Date : 2020-11-30
    Aqeeb Iqbal Arka; Biresh Kumar Joardar; Ryan Gary Kim; Dae Hyun Kim; Janardhan Rao Doppa; Partha Pratim Pande

    Heterogeneous manycore architectures are the key to efficiently executing compute- and data-intensive applications. Through-silicon-via (TSV)-based 3D manycore systems are a promising solution in this direction, as they enable integration of disparate computing cores on a single system. However, the achievable performance of conventional TSV-based 3D systems is ultimately bottlenecked

    Updated: 2020-12-02
  • Aging-Aware Request Scheduling for Non-Volatile Main Memory
    arXiv.cs.AR Pub Date : 2020-11-30
    Shihao Song; Anup Das; Onur Mutlu; Nagarajan Kandasamy

    Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory

    Updated: 2020-12-02
  • Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package
    arXiv.cs.AR Pub Date : 2020-11-30
    Robert Guirado; Hyoukjun Kwon; Sergi Abadal; Eduard Alarcón; Tushar Krishna

    Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration over

    Updated: 2020-12-01
  • XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes
    arXiv.cs.AR Pub Date : 2020-11-29
    Angelo Garofalo; Giuseppe Tagliavini; Francesco Conti; Luca Benini; Davide Rossi

    This work introduces lightweight extensions to the RISC-V ISA to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we are able to show near-linear speedup with respect to higher precision integer computation on the key kernels for QNN computation. Also, we propose a custom

    Updated: 2020-12-01
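    To make the nibble (4-bit) SIMD idea concrete, the sketch below packs four signed 4-bit values per 16-bit word and emulates a packed sum-of-products in plain Python. The packing layout and helper names are assumptions for illustration, not the actual XpulpNN instruction semantics.

```python
# Emulate a 4-way "nibble" (4-bit) SIMD sum-of-products step in software.
# Packing layout and function names are illustrative, not the XpulpNN ISA.

def pack_nibbles(vals):
    """Pack four signed 4-bit values (each in -8..7) into one 16-bit word."""
    word = 0
    for i, v in enumerate(vals):
        assert -8 <= v <= 7
        word |= (v & 0xF) << (4 * i)
    return word

def unpack_nibbles(word):
    """Recover the four signed 4-bit lanes from a 16-bit word."""
    lanes = []
    for i in range(4):
        nib = (word >> (4 * i)) & 0xF
        lanes.append(nib - 16 if nib >= 8 else nib)   # sign-extend
    return lanes

def sdotp4(acc, wa, wb):
    """Sum of the four lane-wise products, accumulated into a wider register."""
    return acc + sum(a * b for a, b in zip(unpack_nibbles(wa), unpack_nibbles(wb)))

a = pack_nibbles([1, -2, 3, -4])
b = pack_nibbles([5, 6, -7, 7])
print(sdotp4(0, a, b))   # 1*5 + (-2)*6 + 3*(-7) + (-4)*7 = -56
```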
  • EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP
    arXiv.cs.AR Pub Date : 2020-11-28
    Thierry Tambe; Coleman Hooper; Lillian Pentecost; En-Yu Yang; Marco Donato; Victor Sanh; Alexander M. Rush; David Brooks; Gu-Yeon Wei

    Transformer-based language models such as BERT provide significant accuracy improvement to a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth and principled algorithm and hardware design methodology to achieve

    Updated: 2020-12-01
  • Design Methodologies for Reliable and Energy-efficient PCM Systems
    arXiv.cs.AR Pub Date : 2020-11-27
    Shihao Song; Anup Das

    Phase-change memory (PCM) is a scalable and low latency non-volatile memory (NVM) technology that has been proposed to serve as storage class memory (SCM), providing low access latency similar to DRAM and often approaching or exceeding the capacity of SSD. The multilevel property of PCM also enables its adoption in neuromorphic systems to build high-density synaptic storage. We investigate and describe

    Updated: 2020-12-01
  • Compiling Spiking Neural Networks to Mitigate Neuromorphic Hardware Constraints
    arXiv.cs.AR Pub Date : 2020-11-27
    Adarsha Balaji; Anup Das

    Spiking Neural Networks (SNNs) are efficient computation models to perform spatio-temporal pattern recognition on resource- and power-constrained platforms. SNNs executed on neuromorphic hardware can further reduce energy consumption of these platforms. With increasing model size and complexity, mapping SNN-based applications to tile-based neuromorphic hardware is becoming increasingly challenging

    Updated: 2020-12-01
  • Net2: A Graph Attention Network Method Customized for Pre-Placement Net Length Estimation
    arXiv.cs.AR Pub Date : 2020-11-27
    Zhiyao Xie; Rongjian Liang; Xiaoqing Xu; Jiang Hu; Yixiao Duan; Yiran Chen

    Net length is a key proxy metric for optimizing timing and power across various stages of a standard digital design flow. However, the bulk of net length information is not available until cell placement, and hence it is a significant challenge to explicitly consider net length optimization in design stages prior to placement, such as logic synthesis. This work addresses this challenge by proposing

    Updated: 2020-12-01
  • PowerNet: Transferable Dynamic IR Drop Estimation via Maximum Convolutional Neural Network
    arXiv.cs.AR Pub Date : 2020-11-26
    Zhiyao Xie; Haoxing Ren; Brucek Khailany; Ye Sheng; Santosh Santosh; Jiang Hu; Yiran Chen

    IR drop is a fundamental constraint required by almost all chip designs. However, its evaluation usually takes a long time that hinders mitigation techniques for fixing its violations. In this work, we develop a fast dynamic IR drop estimation technique, named PowerNet, based on a convolutional neural network (CNN). It can handle both vector-based and vectorless IR analyses. Moreover, the proposed

    Updated: 2020-12-01
  • FIST: A Feature-Importance Sampling and Tree-Based Method for Automatic Design Flow Parameter Tuning
    arXiv.cs.AR Pub Date : 2020-11-26
    Zhiyao Xie; Guan-Qi Fang; Yu-Hung Huang; Haoxing Ren; Yanqing Zhang; Brucek Khailany; Shao-Yun Fang; Jiang Hu; Yiran Chen; Erick Carvajal Barboza

    Design flow parameters are of utmost importance to chip design quality and require a painfully long time to evaluate their effects. In reality, flow parameter tuning is usually performed manually based on designers' experience in an ad hoc manner. In this work, we introduce a machine learning-based automatic parameter tuning methodology that aims to find the best design quality with a limited number

    Updated: 2020-12-01
  • True-data Testbed for 5G/B5G Intelligent Network
    arXiv.cs.AR Pub Date : 2020-11-26
    Yongming Huang; Shengheng Liu; Cheng Zhang; Xiaohu You; Hequan Wu

    Future beyond fifth-generation (B5G) and sixth-generation (6G) mobile communications will shift from facilitating interpersonal communications to supporting Internet of Everything (IoE), where intelligent communications with full integration of big data and artificial intelligence (AI) will play an important role in improving network efficiency and providing high-quality service. As a rapidly evolving

    Updated: 2020-12-01
  • Ax-BxP: Approximate Blocked Computation for Precision-Reconfigurable Deep Neural Network Acceleration
    arXiv.cs.AR Pub Date : 2020-11-25
    Reena Elangovan; Shubham Jain; Anand Raghunathan

    Precision scaling has emerged as a popular technique to optimize the compute and storage requirements of Deep Neural Networks (DNNs). Efforts toward creating ultra-low-precision (sub-8-bit) DNNs suggest that the minimum precision required to achieve a given network-level accuracy varies considerably across networks, and even across layers within a network, requiring support for variable precision in

    Updated: 2020-12-01
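    One generic way to support variable precision is to decompose a wide multiplication into narrower bit blocks and optionally drop the least-significant block products. The sketch below shows this for an 8-bit multiply split into 4-bit blocks; it illustrates the general idea of blocked, approximable computation, not the specific Ax-BxP scheme.

```python
# Decompose an 8-bit x 8-bit unsigned multiply into 4-bit block products.
# Keeping all four blocks is exact; dropping low-significance blocks gives an
# approximation. Generic illustration only, not the Ax-BxP scheme itself.

def blocked_mul(a, b, keep_blocks=4):
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    # (block product, significance weight), most significant first
    parts = [(a_hi * b_hi, 1 << 8),
             (a_hi * b_lo, 1 << 4),
             (a_lo * b_hi, 1 << 4),
             (a_lo * b_lo, 1 << 0)]
    return sum(p * w for p, w in parts[:keep_blocks])

a, b = 183, 77
print(a * b)                             # exact:        14091
print(blocked_mul(a, b))                 # all 4 blocks: 14091
print(blocked_mul(a, b, keep_blocks=3))  # approximate:  14000
```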
  • Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks
    arXiv.cs.AR Pub Date : 2020-11-25
    Nick Iliev; Amit Ranjan Trivedi

    We present a novel low latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units for storing the pretrained weights. Micro-architectural details for

    Updated: 2020-11-27
  • AccSS3D: Accelerator for Spatially Sparse 3D DNNs
    arXiv.cs.AR Pub Date : 2020-11-25
    Om Ji Omer; Prashant Laddha; Gurpreet S Kalsi; Anirud Thyagharajan; Kamlesh R Pillai; Abhimanyu Kulkarni; Anbang Yao; Yurong Chen; Sreenivas Subramoney

    Semantic understanding and completion of real world scenes is a foundational primitive of 3D Visual perception widely used in high-level applications such as robotics, medical imaging, autonomous driving and navigation. Due to the curse of dimensionality, compute and memory requirements for 3D scene understanding grow in cubic complexity with voxel resolution, posing a huge impediment to realizing

    Updated: 2020-11-27
  • Automated Floorplanning for Partially Reconfigurable Designs on Heterogeneous FPGAs
    arXiv.cs.AR Pub Date : 2020-11-23
    Pingakshya Goswami; Dinesh Bhatia

    The floorplanning problem has been extensively explored for homogeneous FPGAs. Most modern FPGAs consist of heterogeneous resources in the form of configurable logic blocks, DSP blocks, BRAMs and more. Very little work has been done for heterogeneous FPGAs. In addition, features like partial reconfigurability allow on-the-fly changes to the executable design that can result in enhanced performance and

    Updated: 2020-11-25
  • Proximu$: Efficiently Scaling DNN Inference in Multi-core CPUs through Near-Cache Compute
    arXiv.cs.AR Pub Date : 2020-11-23
    Anant V. Nori; Rahul Bera; Shankar Balachandran; Joydeep Rakshit; Om J. Omer; Avishaii Abuhatzera; Kuttanna Belliappa; Sreenivas Subramoney

    Deep Neural Network (DNN) inference is emerging as the fundamental bedrock for a multitude of utilities and services. CPUs continue to scale up their raw compute capabilities for DNN inference along with mature high performance libraries to extract optimal performance. While general purpose CPUs offer unique attractive advantages for DNN inference at both datacenter and edge, they have primarily evolved

    Updated: 2020-11-25
  • Leveraging Architectural Support of Three Page Sizes with Trident
    arXiv.cs.AR Pub Date : 2020-11-24
    Venkat Sri Sai Ram; Ashish Panwar; Arkaprava Basu

    Large pages are commonly deployed to reduce address translation overheads for big-memory workloads. Modern x86-64 processors from Intel and AMD support two large page sizes -- 1GB and 2MB. However, previous works on large pages have primarily focused on 2MB pages, partly due to lack of substantial evidence on the profitability of 1GB pages to real-world applications. We argue that in fact, inadequate

    Updated: 2020-11-25
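    The appeal of 1GB pages comes down to TLB-reach arithmetic: the same number of TLB entries covers vastly more memory with larger pages. The sketch below makes this concrete with illustrative entry counts (not figures from the paper).

```python
# TLB reach = number of entries x page size. Entry counts are illustrative
# assumptions, not measurements from the paper.

KB, MB, GB = 2**10, 2**20, 2**30

def tlb_reach_gib(entries, page_bytes):
    return entries * page_bytes / GB

for label, entries, size in [("4KB", 1536, 4 * KB),
                             ("2MB", 1536, 2 * MB),
                             ("1GB",   16, 1 * GB)]:
    print(f"{label} pages, {entries:4d} entries -> reach {tlb_reach_gib(entries, size):8.3f} GiB")
```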
  • Benchmarking Inference Performance of Deep Learning Models on Analog Devices
    arXiv.cs.AR Pub Date : 2020-11-24
    Omobayode Fagbohungbe; Lijun Qian

    Deep learning models implemented in analog hardware are promising for computation- and energy-constrained systems such as edge computing devices. However, the analog nature of the device and the many associated noise sources will cause changes to the value of the weights in the trained deep learning models deployed on such devices. In this study, systematic evaluation of the inference performance of trained

    Updated: 2020-11-25
  • RVCoreP-32IC: A high-performance RISC-V soft processor with an efficient fetch unit supporting the compressed instructions
    arXiv.cs.AR Pub Date : 2020-11-23
    Takuto Kanamori; Hiromu Miyazaki; Kenji Kise

    In this paper, we propose a high-performance RISC-V soft processor with an efficient fetch unit supporting the compressed instructions, targeting FPGAs. The compressed instruction extension in RISC-V can reduce program size by about 25%, but it requires complicated logic in the instruction fetch unit and has a significant impact on performance. We propose an instruction fetch unit that supports

    Updated: 2020-11-25
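    The complication for the fetch unit comes from variable instruction length: in RISC-V, an instruction whose two least-significant bits are not 0b11 is a 16-bit compressed instruction, so instruction boundaries can fall on any halfword. The sketch below applies that standard length rule to a byte stream; it illustrates the boundary-detection problem only, not the proposed fetch unit.

```python
# Walk a byte stream and split it into 16-bit (compressed) and 32-bit RISC-V
# instructions, using the standard length rule: if the two lowest bits of the
# first halfword are not 0b11, the instruction is 16 bits wide.

def split_instructions(code: bytes):
    insns, pc = [], 0
    while pc + 2 <= len(code):
        first_half = int.from_bytes(code[pc:pc + 2], "little")
        if first_half & 0b11 != 0b11:           # compressed (RVC) encoding
            insns.append((pc, 2, first_half))
            pc += 2
        else:                                    # standard 32-bit encoding
            word = int.from_bytes(code[pc:pc + 4], "little")
            insns.append((pc, 4, word))
            pc += 4
    return insns

# Example stream: c.nop (0x0001), addi x1,x0,1 (0x00100093), c.nop again.
stream = bytes.fromhex("0100") + bytes.fromhex("93001000") + bytes.fromhex("0100")
for pc, size, raw in split_instructions(stream):
    print(f"pc={pc:#04x} size={size} raw={raw:#010x}")
```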
  • Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads
    arXiv.cs.AR Pub Date : 2020-11-22
    Bahar Asgari; Ramyad Hadidi; Joshua Dierberger; Charlotte Steinichen; Hyesoon Kim

    Sparse matrices are the key ingredients of several application domains, from scientific computation to machine learning. The primary challenge with sparse matrices has been efficiently storing and transferring data, for which many sparse formats have been proposed to significantly eliminate zero entries. Such formats, essentially designed to optimize memory footprint, may not be as successful in performing

    Updated: 2020-11-25
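    For context, the sketch below stores a small matrix in the widely used Compressed Sparse Row (CSR) format and runs a sparse matrix-vector product over it. It is a generic illustration of the kind of compression format whose performance implications the paper characterizes, not code from the paper.

```python
# Compressed Sparse Row (CSR): keep only nonzeros, plus column indices and
# per-row offsets. Generic illustration of a sparse compression format.

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect access into x
        y.append(acc)
    return y

A = [[5, 0, 0, 1],
     [0, 0, 2, 0],
     [0, 3, 0, 4]]
vals, cols, ptrs = to_csr(A)
print(vals, cols, ptrs)                          # [5, 1, 2, 3, 4] [0, 3, 2, 1, 3] [0, 2, 3, 5]
print(csr_spmv(vals, cols, ptrs, [1, 1, 1, 1]))  # [6, 2, 7]
```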
  • Third ArchEdge Workshop: Exploring the Design Space of Efficient Deep Neural Networks
    arXiv.cs.AR Pub Date : 2020-11-22
    Fuxun Yu; Dimitrios Stamoulis; Di Wang; Dimitrios Lymberopoulos; Xiang Chen

    This paper gives an overview of our ongoing work on the design space exploration of efficient deep neural networks (DNNs). Specifically, we cover two aspects: (1) static architecture design efficiency and (2) dynamic model execution efficiency. For static architecture design, different from existing end-to-end hardware modeling assumptions, we conduct full-stack profiling at the GPU core level to identify

    Updated: 2020-11-25
  • AZP: Automatic Specialization for Zero Values in Gaming Applications
    arXiv.cs.AR Pub Date : 2020-11-20
    Mark W. Stephenson; Ram Rangan

    Recent research has shown that dynamic zeros in shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable's value, either

    Updated: 2020-11-23
  • Experiences from Large-Scale Model Checking: Verification of a Vehicle Control System
    arXiv.cs.AR Pub Date : 2020-11-20
    Jonas Fritzsch; Tobias Schmid; Stefan Wagner

    In the age of autonomously driving vehicles, functionality and complexity of embedded systems are increasing tremendously. Safety aspects become more important and require such systems to operate with the highest possible level of fault tolerance. Simulation and systematic testing techniques have reached their limits in this regard. Here, formal verification as a long established technique can be an

    Updated: 2020-11-23
  • SIMF: Single-Instruction Multiple-Flush Mechanism for Processor Temporal Isolation
    arXiv.cs.AR Pub Date : 2020-11-20
    Tuo Li; Bradley Hopkins; Sri Parameswaran

    Microarchitectural timing attacks are a type of information leakage attack, which exploit the time-shared microarchitectural components, such as caches, translation look-aside buffers (TLBs), branch prediction unit (BPU), and speculative execution, in modern processors to leak critical information from a victim process or thread. To mitigate such attacks, the mechanism for flushing the on-core state

    Updated: 2020-11-23
  • Hardware Implementation of Fano Decoder for PAC Codes
    arXiv.cs.AR Pub Date : 2020-11-19
    Amir Mozammel

    This paper proposes a hardware implementation architecture for Fano decoding of polarization-adjusted convolutional (PAC) codes. This architecture maintains a trade-off between the error-correction performance and throughput of the decoder by setting a strict limit on its search complexity. The paper presents analyses of the complexity, combinational delay, and latency of the proposed architecture

    Updated: 2020-11-21
  • ArSMART: An Improved SMART NoC Design Supporting Arbitrary-Turn Transmission
    arXiv.cs.AR Pub Date : 2020-11-18
    Hui Chen; Peng Chen; Jun Zhou; Duong H. K. Luan; Weichen Liu

    SMART NoC, which transmits unconflicted flits to distant processing elements (PEs) in one cycle through the express bypass, is a high-performance NoC design proposed recently. However, if contention occurs, flits with low priority would not only be buffered but also could not fully utilize the bypass. Although there exist several routing algorithms that decrease contention by routing around busy routers and

    Updated: 2020-11-19
  • A Survey of System Architectures and Techniques for FPGA Virtualization
    arXiv.cs.AR Pub Date : 2020-11-18
    Masudul Hassan Quraishi; Erfan Bank Tavakoli; Fengbo Ren

    FPGA accelerators are gaining increasing attention in both cloud and edge computing because of their hardware flexibility, high computational throughput, and low power consumption. However, the design flow of FPGAs often requires specific knowledge of the underlying hardware, which hinders the wide adoption of FPGAs by application developers. Therefore, the virtualization of FPGAs becomes extremely

    Updated: 2020-11-19
  • Distributed Injection-Locking in Analog Ising Machines to Solve Combinatorial Optimizations
    arXiv.cs.AR Pub Date : 2020-11-18
    M. Ali Vosoughi

    The oscillator-based Ising machine (OIM) is a network of coupled CMOS oscillators that solves combinatorial optimization problems. In this paper, the distribution of the injection-locking oscillations throughout the circuit is proposed to accelerate the phase-locking of the OIM. The implications of the proposed technique are theoretically investigated and verified by extensive simulations in EDA tools

    Updated: 2020-11-19
  • Automatic Microprocessor Performance Bug Detection
    arXiv.cs.AR Pub Date : 2020-11-17
    Erick Carvajal Barboza; Sara Jacob; Mahesh Ketkar; Michael Kishinevsky; Paul Gratz; Jiang Hu

    Processor design validation and debug is a difficult and complex task, which consumes the lion's share of the design process. Design bugs that affect processor performance rather than its functionality are especially difficult to catch, particularly in new microarchitectures. This is because, unlike functional bugs, the correct processor performance of new microarchitectures on complex, long-running

    Updated: 2020-11-18
  • Revising the classic computing paradigm and its technological implementations
    arXiv.cs.AR Pub Date : 2020-11-16
    János Végh

    Today's computing is said to be based on the classic paradigm proposed by von Neumann three-quarters of a century ago. However, that paradigm was justified (for the timing relations of) vacuum tubes only. Technological development invalidated the classic paradigm (but not the model!) and led to catastrophic performance losses in computing systems, from the gate level to large networks, including

    Updated: 2020-11-18
  • Optimizing Graph Processing and Preprocessing with Hardware Assisted Propagation Blocking
    arXiv.cs.AR Pub Date : 2020-11-17
    Vignesh Balaji; Brandon Lucia

    Extensive prior research has focused on alleviating the characteristic poor cache locality of graph analytics workloads. However, graph pre-processing tasks remain relatively unexplored. In many important scenarios, graph pre-processing tasks can be as expensive as the downstream graph analytics kernel. We observe that Propagation Blocking (PB), a software optimization designed for SpMV kernels, generalizes

    Updated: 2020-11-18
  • AXES: Approximation Manager for Emerging Memory Architectures
    arXiv.cs.AR Pub Date : 2020-11-17
    Biswadip Maity; Bryan Donyanavard; Anmol Surhonne; Amir Rahmani; Andreas Herkersdorf; Nikil Dutt

    Memory approximation techniques are commonly limited in scope, targeting individual levels of the memory hierarchy. Existing approximation techniques for a full memory hierarchy determine optimal configurations at design-time provided a goal and application. Such policies are rigid: they cannot adapt to unknown workloads and must be redesigned for different memory configurations and technologies. We

    Updated: 2020-11-18
  • Indirection Stream Semantic Register Architecture for Efficient Sparse-Dense Linear Algebra
    arXiv.cs.AR Pub Date : 2020-11-16
    Paul Scheffler; Florian Zaruba; Fabian Schuiki; Torsten Hoefler; Luca Benini

    Sparse-dense linear algebra is crucial in many domains, but challenging to handle efficiently on CPUs, GPUs, and accelerators alike; multiplications with sparse formats like CSR and CSF require indirect memory lookups. In this work, we enhance a memory-streaming RISC-V ISA extension to accelerate sparse-dense products through streaming indirection. We present efficient dot, matrix-vector, and matrix-matrix

    Updated: 2020-11-17
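    The access pattern being accelerated is an indirect gather driven by a sparse operand's index stream. The sketch below shows that pattern as a plain-Python sparse-dense dot product; it illustrates the indirection itself, not the ISA extension.

```python
# Sparse-dense dot product: the sparse operand supplies (index, value) streams,
# and each index triggers an indirect load from the dense vector. This is the
# access pattern that streaming indirection targets in hardware.

def sparse_dense_dot(sp_idx, sp_val, dense):
    acc = 0.0
    for i, v in zip(sp_idx, sp_val):   # index/value streams of the sparse vector
        acc += v * dense[i]            # indirect lookup dense[sp_idx[k]]
    return acc

sp_idx = [0, 3, 7]          # nonzero positions (e.g., from a CSR/CSF fiber)
sp_val = [2.0, -1.5, 4.0]   # corresponding nonzero values
dense  = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.5]
print(sparse_dense_dot(sp_idx, sp_val, dense))   # 2*1 + (-1.5)*2 + 4*0.5 = 1.0
```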
  • Memory-Efficient Dataflow Inference for Deep CNNs on FPGA
    arXiv.cs.AR Pub Date : 2020-11-14
    Lucian Petrica; Tobias Alonso; Mairin Kroes; Nicholas Fraser; Sorin Cotofana; Michaela Blott

    Custom dataflow Convolutional Neural Network (CNN) inference accelerators on FPGA are tailored to a specific CNN topology and store parameters in On-Chip Memory (OCM), resulting in high energy efficiency and low inference latency. However, in these accelerators the shapes of parameter memories are dictated by throughput constraints and do not map well to the underlying OCM, which becomes an implementation

    Updated: 2020-11-17
  • Tiny-CFA: A Minimalistic Approach for Control-Flow Attestation Using Verified Proofs of Execution
    arXiv.cs.AR Pub Date : 2020-11-14
    Ivan De Oliveira Nunes; Sashidhar Jakkamsetti; Gene Tsudik

    The design of tiny trust anchors has received significant attention over the past decade, to secure low-end MCUs that cannot afford expensive security mechanisms. In particular, hardware/software (hybrid) co-designs offer low hardware cost, while retaining similar security guarantees as (more expensive) hardware-based techniques. Hybrid trust anchors support security services, such as remote attestation

    Updated: 2020-11-17
  • Channel Tiling for Improved Performance and Accuracy of Optical Neural Network Accelerators
    arXiv.cs.AR Pub Date : 2020-11-14
    Shurui Li; Mario Miscuglio; Volker J. Sorger; Puneet Gupta

    Low latency, high throughput inference on Convolutional Neural Networks (CNNs) remains a challenge, especially for applications requiring large input or large kernel sizes. 4F optics provides a solution to accelerate CNNs by converting convolutions into Fourier-domain point-wise multiplications that are computationally 'free' in the optical domain. However, existing 4F CNN systems suffer from the all-positive

    Updated: 2020-11-17
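    The identity a 4F system exploits is the convolution theorem: a (circular) convolution in the spatial domain equals a point-wise multiplication in the Fourier domain. The numpy sketch below checks that identity numerically for a small 2D example; it illustrates the math only, not the optical hardware.

```python
# Convolution theorem: conv(x, k) == IFFT( FFT(x) * FFT(k) ) for circular
# convolution. A 4F optical setup performs the transforms and the point-wise
# product in the optical domain; here we only verify the math with numpy.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # input feature map
k = rng.standard_normal((8, 8))      # kernel (zero-padded to the same size in practice)

# Direct circular convolution, computed naively.
n = x.shape[0]
direct = np.zeros_like(x)
for i in range(n):
    for j in range(n):
        for a in range(n):
            for b in range(n):
                direct[i, j] += x[a, b] * k[(i - a) % n, (j - b) % n]

# Fourier-domain point-wise multiplication.
fourier = np.real(np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k)))

print(np.allclose(direct, fourier))   # True
```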
  • Customizing Trusted AI Accelerators for Efficient Privacy-Preserving Machine Learning
    arXiv.cs.AR Pub Date : 2020-11-12
    Peichen Xie; Xuanle Ren; Guangyu Sun

    The use of trusted hardware has become a promising solution to enable privacy-preserving machine learning. In particular, users can upload their private data and models to a hardware-enforced trusted execution environment (e.g. an enclave in Intel SGX-enabled CPUs) and run machine learning tasks in it with confidentiality and integrity guaranteed. To improve performance, AI accelerators have been widely

    Updated: 2020-11-13
  • Understanding Training Efficiency of Deep Learning Recommendation Models at Scale
    arXiv.cs.AR Pub Date : 2020-11-11
    Bilge Acun; Matthew Murphy; Xiaodong Wang; Jade Nie; Carole-Jean Wu; Kim Hazelwood

    The use of GPUs has proliferated for machine learning workflows and is now considered mainstream for many deep learning models. Meanwhile, when training state-of-the-art personal recommendation models, which consume the highest number of compute cycles at our large-scale datacenters, the use of GPUs came with various challenges due to having both compute-intensive and memory-intensive components. GPU

    Updated: 2020-11-12
  • Coherence Traffic in Manycore Processors with Opaque Distributed Directories
    arXiv.cs.AR Pub Date : 2020-11-10
    Steve Kommrusch; Marcos Horro; Louis-Noël Pouchet; Gabriel Rodríguez; Juan Touriño

    Manycore processors feature a high number of general-purpose cores designed to work in a multithreaded fashion. Recent manycore processors are kept coherent using scalable distributed directories. A paramount example is the Intel Mesh interconnect, which consists of a network-on-chip interconnecting "tiles", each of which contains computation cores, local caches, and coherence masters. The distributed

    Updated: 2020-11-12
  • von Neumann's missing "Second Draft": what it should contain
    arXiv.cs.AR Pub Date : 2020-11-09
    János Végh

    Computing science is based on a computing paradigm that is no longer valid for today's technological conditions. The reason is that transmission time, even inside the processor chip but especially between the components of the system, is not negligible anymore. The paper introduces a quantitative measure for dispersion, which is vital for both computing performance and energy consumption, and

    Updated: 2020-11-12
  • ARENA: Asynchronous Reconfigurable Accelerator Ring to Enable Data-Centric Parallel Computing
    arXiv.cs.AR Pub Date : 2020-11-10
    Cheng Tan; Chenhao Xie; Andres Marquez; Antonino Tumeo; Kevin Barker; Ang Li

    The next generation of HPC and data centers is likely to be reconfigurable and data-centric due to the trend of hardware specialization and the emergence of data-driven applications. In this paper, we propose ARENA -- an asynchronous reconfigurable accelerator ring architecture as a potential scenario for how future HPC and data centers will look. Despite using the coarse-grained reconfigurable

    Updated: 2020-11-12
  • Mapping Stencils on Coarse-grained Reconfigurable Spatial Architecture
    arXiv.cs.AR Pub Date : 2020-11-06
    Jesmin Jahan Tithi; Fabrizio Petrini; Hongbo Rong; Andrei Valentin; Carl Ebeling

    Stencils represent a class of computational patterns where an output grid point depends on a fixed shape of neighboring points in an input grid. Stencil computations are prevalent in scientific applications and engage a significant portion of supercomputing resources. Therefore, it has always been important to optimize stencil programs for the best performance. A rich body of research has focused on

    Updated: 2020-11-12
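    As a concrete reference for what a stencil is, the sketch below applies a 5-point Jacobi-style stencil to a small 2D grid. It is a textbook example, not one of the stencils mapped in the paper.

```python
# 5-point stencil: each interior output point is a fixed combination of its
# north/south/east/west neighbours. Textbook example, not from the paper.

def jacobi_step(grid):
    n = len(grid)
    out = [row[:] for row in grid]                 # boundary values are kept
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j] +
                                grid[i][j - 1] + grid[i][j + 1])
    return out

# Example: 6x6 grid with a hot (1.0) top edge and cold (0.0) interior.
g = [[1.0] * 6] + [[0.0] * 6 for _ in range(5)]
for _ in range(10):
    g = jacobi_step(g)
print([round(v, 3) for v in g[1]])   # the row next to the hot boundary warms up
```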
  • FPGA-based Hybrid Memory Emulation System
    arXiv.cs.AR Pub Date : 2020-11-09
    Fei Wen; Mian Qin; Paul V. Gratz; A. L. Narasimha Reddy

    Hybrid memory systems, comprised of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of applications. Emerging NVM technologies, such as phase-change memories (PCM), memristor, and 3D XPoint, have higher capacity density, minimal static power consumption and lower cost per GB. However, NVM has longer access latency and limited write endurance as opposed

    Updated: 2020-11-12
  • Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination
    arXiv.cs.AR Pub Date : 2020-11-08
    Fuxun Yu; Zirui Xu; Tong Shen; Dimitrios Stamoulis; Longfei Shangguan; Di Wang; Rishi Madhok; Chunshui Zhao; Xin Li; Nikolaos Karianakis; Dimitrios Lymberopoulos; Ang Li; ChenChen Liu; Yiran Chen; Xiang Chen

    Despite the superb performance of State-Of-The-Art (SOTA) DNNs, the increasing computational cost makes them very challenging to meet real-time latency and accuracy requirements. Although DNN runtime latency is dictated by model property (e.g., architecture, operations), hardware property (e.g., utilization, throughput), and more importantly, the effective mapping between these two, many existing approaches

    Updated: 2020-11-12
  • EHAP-ORAM: Efficient Hardware-Assisted Persistent ORAM System for Non-volatile Memory
    arXiv.cs.AR Pub Date : 2020-11-07
    Gang Liu; Kenli Li; Zheng Xiao; Rujia Wang

    An Oblivious RAM (ORAM)-protected access pattern is essential for secure NVM. In the ORAM system, data and PosMap metadata are mapped in pairs to perform secure access. Therefore, we focus on the problem of crash consistency in the ORAM system. Unfortunately, using traditional software-based support for ORAM system crash consistency is not only expensive, it can also lead to information leaks. At present

    Updated: 2020-11-12
  • Graphene-based Wireless Agile Interconnects for Massive Heterogeneous Multi-chip Processors
    arXiv.cs.AR Pub Date : 2020-11-08
    Sergi Abadal; Robert Guirado; Hamidreza Taghvaee; Akshay Jain; Elana Pereira de Santana; Peter Haring Bolívar; Mohamed Saeed; Renato Negra; Zhenxing Wang; Kun-Ta Wang; Max C. Lemme; Joshua Klein; Marina Zapater; Alexandre Levisse; David Atienza; Davide Rossi; Francesco Conti; Martino Dazzi; Geethan Karunaratne; Irem Boybat; Abu Sebastian

    The main design principles in computer architecture have recently shifted from a monolithic scaling-driven approach to the development of heterogeneous architectures that tightly co-integrate multiple specialized processor and memory chiplets. In such data-hungry multi-chip architectures, current Networks-in-Package (NiPs) may not be enough to cater to their heterogeneous and fast-changing communication

    Updated: 2020-11-12
  • Runtime Performances Benchmark for Knowledge Graph Embedding Methods
    arXiv.cs.AR Pub Date : 2020-11-05
    Angelica Sofia Valeriani

    This paper focuses on characterizing the runtime performance of state-of-the-art implementations of KGE algorithms, in terms of memory footprint and execution time. Despite the rapidly growing interest in KGE methods, so far little attention has been devoted to their comparison and evaluation; in particular, previous work mainly focused on performance in terms of accuracy in

    Updated: 2020-11-12
  • Strawberry Detection Using a Heterogeneous Multi-Processor Platform
    arXiv.cs.AR Pub Date : 2020-11-07
    Samuel Brandenburg; Pedro Machado; Nikesh Lama; T. M. McGinnity

    Over the last few years, the number of precision farming projects has increased specifically in harvesting robots and many of which have made continued progress from identifying crops to grasping the desired fruit or vegetable. One of the most common issues found in precision farming projects is that successful application is heavily dependent not just on identifying the fruit but also on ensuring

    Updated: 2020-11-12
  • ReFloat: Low-Cost Floating-Point Processing in ReRAM
    arXiv.cs.AR Pub Date : 2020-11-06
    Linghao Song; Fan Chen; Xuehai Qian; Hai Li; Yiran Chen

    We propose ReFloat, a principled approach for low-cost floating-point processing in ReRAM. Exponent offsets relative to a base are stored in a flexible and fine-grained floating-point number representation. The key motivation is that, while the number of exponent bits must be reduced due to the exponential relation to the computation latency and hardware cost, the convergence still requires sufficient

    Updated: 2020-11-09
  • Predict and Write: Using K-Means Clustering to Extend the Lifetime of NVM Storage
    arXiv.cs.AR Pub Date : 2020-11-04
    Saeed Kargar; Heiner Litz; Faisal Nawab

    Non-volatile memory (NVM) technologies suffer from limited write endurance. To address this challenge, we propose Predict and Write (PNW), a K/V-store that uses a clustering-based machine learning approach to extend the lifetime of NVMs. PNW decreases the number of bit flips for PUT/UPDATE operations by determining the best memory location an updated value should be written to. PNW leverages the indirection

    Updated: 2020-11-06
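    The objective PNW optimizes, fewer bit flips per write, can be illustrated by counting the Hamming distance between the new value and the stale contents of each candidate location. The slot-selection sketch below shows only that objective; the paper's actual contribution is using K-means clustering to predict good locations.

```python
# Writing a value over the candidate location whose stale contents differ in the
# fewest bits minimizes cell wear. Simplified illustration of the objective only;
# the paper's approach clusters values with K-means to choose locations.

def bit_flips(old: int, new: int) -> int:
    return bin(old ^ new).count("1")       # Hamming distance

def pick_slot(candidate_contents, new_value):
    costs = [bit_flips(old, new_value) for old in candidate_contents]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs

slots = [0x00FF, 0x1234, 0xFFFF]           # stale contents of three free slots
new_value = 0x12F4
best, costs = pick_slot(slots, new_value)
print(costs, "-> write to slot", best)     # slot 1 needs the fewest flips
```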
  • Chasing Carbon: The Elusive Environmental Footprint of Computing
    arXiv.cs.AR Pub Date : 2020-10-28
    Udit Gupta; Young Geun Kim; Sylvia Lee; Jordan Tse; Hsien-Hsin S. Lee; Gu-Yeon Wei; David Brooks; Carole-Jean Wu

    Given recent algorithm, software, and hardware innovation, computing has enabled a plethora of new applications. As computing becomes increasingly ubiquitous, however, so does its environmental impact. This paper brings the issue to the attention of computer-systems researchers. Our analysis, built on industry-reported characterization, quantifies the environmental effects of computing in terms of

    Updated: 2020-11-06
  • An Empirical-cum-Statistical Approach to Power-Performance Characterization of Concurrent GPU Kernels
    arXiv.cs.AR Pub Date : 2020-11-04
    Nilanjan Goswami; Amer Qouneh; Chao Li; Tao Li

    Growing deployment of power and energy efficient throughput accelerators (GPU) in data centers demands enhancement of power-performance co-optimization capabilities of GPUs. Realization of exascale computing using accelerators requires further improvements in power efficiency. With hardwired kernel concurrency enablement in accelerators, inter- and intra-workload simultaneous kernels computation predicts

    Updated: 2020-11-05
  • Booster: An Accelerator for Gradient Boosting Decision Trees
    arXiv.cs.AR Pub Date : 2020-11-03
    Mingxuan He; T. N. Vijaykumar; Mithuna Thottethodi

    We propose Booster, a novel accelerator for gradient boosting trees based on the unique characteristics of gradient boosting models. We observe that the dominant steps of gradient boosting training (accounting for 90-98% of training time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., accumulate and compare values in the structures). Unfortunately, existing

    Updated: 2020-11-05
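    The "accumulate and compare" operations on small-footprint structures are typified by histogram-based split finding in gradient-boosted tree training. The sketch below shows that inner loop in plain Python as a generic illustration, not Booster's hardware design.

```python
# Histogram-based split search: accumulate per-bin gradient statistics, then
# compare candidate thresholds. Generic illustration, not Booster's design.

def best_split(binned_feature, gradients, n_bins):
    # Accumulate gradient sums and counts per feature bin (small arrays).
    grad_hist = [0.0] * n_bins
    cnt_hist = [0] * n_bins
    for b, g in zip(binned_feature, gradients):
        grad_hist[b] += g
        cnt_hist[b] += 1

    total_grad, total_cnt = sum(grad_hist), sum(cnt_hist)
    best_bin, best_score = None, float("-inf")
    left_grad, left_cnt = 0.0, 0
    for b in range(n_bins - 1):                      # compare candidate thresholds
        left_grad += grad_hist[b]
        left_cnt += cnt_hist[b]
        right_grad, right_cnt = total_grad - left_grad, total_cnt - left_cnt
        if left_cnt == 0 or right_cnt == 0:
            continue
        score = left_grad**2 / left_cnt + right_grad**2 / right_cnt
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin, best_score

bins  = [0, 1, 3, 3, 2, 0, 1, 3]
grads = [0.5, -1.2, 2.0, 1.8, -0.3, 0.4, -1.0, 2.2]
print(best_split(bins, grads, n_bins=4))
```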
  • CUTIE: Beyond PetaOp/s/W Ternary DNN Inference Acceleration with Better-than-Binary Energy Efficiency
    arXiv.cs.AR Pub Date : 2020-11-03
    Moritz Scherer; Georg Rutishauser; Lukas Cavigelli; Luca Benini

    We present a 3.1 POp/s/W fully digital hardware accelerator for ternary neural networks. CUTIE, the Completely Unrolled Ternary Inference Engine, focuses on minimizing non-computational energy and switching activity so that dynamic power spent on storing (locally or globally) intermediate results is minimized. This is achieved by 1) a data path architecture completely unrolled in the feature map and

    Updated: 2020-11-04
  • SIMDive: Approximate SIMD Soft Multiplier-Divider for FPGAs with Tunable Accuracy
    arXiv.cs.AR Pub Date : 2020-11-02
    Zahra Ebrahimi; Salim Ullah; Akash Kumar

    The ever-increasing quest for data-level parallelism and variable precision in ubiquitous multimedia and Deep Neural Network (DNN) applications has motivated the use of Single Instruction, Multiple Data (SIMD) architectures. To alleviate energy as their main resource constraint, approximate computing has re-emerged, albeit mainly specialized for their Application-Specific Integrated Circuit (ASIC) implementations

    Updated: 2020-11-03
  • On the Impact of Partial Sums on Interconnect Bandwidth and Memory Accesses in a DNN Accelerator
    arXiv.cs.AR Pub Date : 2020-11-02
    Mahesh Chandra

    Dedicated accelerators are being designed to address the huge resource requirements of deep neural network (DNN) applications. The power, performance and area (PPA) constraints limit the number of MACs available in these accelerators. The convolution layers, which require a huge number of MACs, are often partitioned into multiple iterative sub-tasks. This puts huge pressure on the available system resources

    Updated: 2020-11-03
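    When a convolution layer is partitioned along input channels, every intermediate sub-task must spill partial sums that a later sub-task reads back to accumulate, which is the extra traffic the paper analyzes. The sketch below estimates that traffic under a simple illustrative model, not the paper's.

```python
# Estimate partial-sum traffic when a conv layer is split into iterative
# sub-tasks along the input-channel dimension. Illustrative model only.

def partial_sum_traffic_bytes(out_h, out_w, out_c, splits, psum_bytes=4):
    out_points = out_h * out_w * out_c
    # Each of the (splits - 1) intermediate passes writes an output-sized tensor
    # of partial sums, and the following pass reads it back for accumulation.
    return out_points * psum_bytes * 2 * (splits - 1)

for splits in (1, 2, 4, 8):
    mib = partial_sum_traffic_bytes(56, 56, 256, splits) / 2**20
    print(f"{splits} sub-tasks -> {mib:6.1f} MiB of extra partial-sum traffic")
```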
  • Addressing Resiliency of In-Memory Floating Point Computation
    arXiv.cs.AR Pub Date : 2020-11-01
    Sina Sayyah Ensan; Swaroop Ghosh; Seyedhamidreza Motaman; Derek Weast

    In-memory computing (IMC) can eliminate the data movement between processor and memory which is a barrier to the energy-efficiency and performance in Von-Neumann computing. Resistive RAM (RRAM) is one of the promising devices for IMC applications (e.g. integer and Floating Point (FP) operations and random logic implementation) due to low power consumption, fast operation, and small footprint in crossbar

    Updated: 2020-11-03
  • Mitigating Write Disturbance Errors of Phase-Change Memory as In-Module Approach
    arXiv.cs.AR Pub Date : 2020-11-01
    Hyokeun Lee; Seungyong Lee; Moonsoo Kim; Hyun Kim; Hyuk-Jae Lee

    With the growing demand for technology scaling and storage capacity in server systems to support high-performance computing, phase-change memory (PCM) has garnered attention as the next-generation non-volatile memory to satisfy these requirements. However, write disturbance error (WDE) appears as a serious reliability problem preventing PCM from general commercialization. WDE occurs on the neighboring

    Updated: 2020-11-03
  • RANC: Reconfigurable Architecture for Neuromorphic Computing
    arXiv.cs.AR Pub Date : 2020-11-01
    Joshua Mack; Ruben Purdy; Kris Rockowitz; Michael Inouye; Edward Richter; Spencer Valancius; Nirmal Kumbhare; Md Sahil Hassan; Kaitlin Fair; John Mixter; Ali Akoglu

    Neuromorphic architectures have been introduced as platforms for energy efficient spiking neural network execution. The massive parallelism offered by these architectures has also triggered interest from non-machine learning application domains. In order to lift the barriers to entry for hardware designers and application developers we present RANC: a Reconfigurable Architecture for Neuromorphic Computing

    Updated: 2020-11-03
Contents have been reproduced by permission of the publishers.