Current journal: ACM Transactions on Architecture and Code Optimization
  • PolyDL: Polyhedral Optimizations for Creation of High-performance DL Primitives
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2021-01-07
    Sanket Tavarageri; Alexander Heinecke; Sasikanth Avancha; Bharat Kaul; Gagandeep Goyal; Ramakrishna Upadrasta

    Deep Neural Networks (DNNs) have revolutionized many aspects of our lives. The use of DNNs is becoming ubiquitous, including in software for image recognition, speech recognition, speech synthesis, and language translation, to name a few. The training of DNN architectures, however, is computationally expensive. Once the model is created, its use in the intended application, the inference task, is computationally

    Updated: 2021-01-08
  • On the Anatomy of Predictive Models for Accelerating GPU Convolution Kernels and Beyond
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2021-01-07
    Paolo Sylos Labini; Marco Cianfriglia; Damiano Perri; Osvaldo Gervasi; Grigori Fursin; Anton Lokhmotov; Cedric Nugteren; Bruno Carpentieri; Fabiana Zollo; Flavio Vella

    Efficient HPC libraries often expose multiple tunable parameters, algorithmic implementations, or a combination of them, to provide optimized routines. The optimal parameters and algorithmic choices may depend on input properties such as the shapes of the matrices involved in the operation. Traditionally, these parameters are manually tuned or set by auto-tuners. In emerging applications such as deep

    Updated: 2021-01-08
  • Refresh Triggered Computation: Improving the Energy Efficiency of Convolutional Neural Network Accelerators
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Syed M. A. H. Jafri; Hasan Hassan; Ahmed Hemani; Onur Mutlu

    To employ a Convolutional Neural Network (CNN) in an energy-constrained embedded system, it is critical for the CNN implementation to be highly energy efficient. Many recent studies propose CNN accelerator architectures with custom computation units that try to improve the energy efficiency and performance of CNNs by minimizing data transfers from DRAM-based main memory. However, in these architectures

    Updated: 2020-12-30
  • Performance-Energy Trade-off in Modern CMPs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Solomon Abera; M. Balakrishnan; Anshul Kumar

    Chip multiprocessors (CMPs) are ubiquitous in all computing systems, ranging from high-end servers to mobile devices. In these systems, energy consumption is a critical design constraint, as it constitutes the most significant operating cost for computing clouds. Similarly, longer battery life continues to be an essential user concern in mobile devices. To optimize power consumption, modern

    Updated: 2020-12-30
  • Bayesian Optimization for Efficient Accelerator Synthesis
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Atefeh Mehrabi; Aninda Manocha; Benjamin C. Lee; Daniel J. Sorin

    Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabrics instantiate accelerators while avoiding fabrication costs for custom circuits. We further reduce design effort with statistical learning. We build

    Updated: 2020-12-30
  • Irregular Register Allocation for Translation of Test-pattern Programs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Minsu Kim; Jeong-Keun Park; Soo-Mook Moon

    Test-pattern programs are for testing DRAM memory chips. They run on a special embedded system called automated test equipment (ATE). Each ATE manufacturer provides its own programming language, which is mostly low level, thus accessing the registers in the ATE directly. The register structure of each ATE is quite different and highly irregular. Since DRAM chipmakers are often equipped with diverse

    Updated: 2020-12-30
  • Efficient Nearest-Neighbor Data Sharing in GPUs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Negin Nematollahi; Mohammad Sadrosadati; Hajar Falahati; Marzieh Barkhordar; Mario Paulo Drumond; Hamid Sarbazi-Azad; Babak Falsafi

    Stencil codes (a.k.a. nearest-neighbor computations) are widely used in image processing, machine learning, and scientific applications. Stencil codes incur nearest-neighbor data exchange because the value of each point in the structured grid is calculated as a function of its value and the values of a subset of its nearest-neighbor points. When running on Graphics Processing Units (GPUs), stencil codes

    Updated: 2020-12-30
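
    The entry above centers on stencil (nearest-neighbor) computations. The NumPy sketch below shows a minimal 5-point Jacobi stencil to make the access pattern concrete; the grid size and iteration count are illustrative and not taken from the paper.

```python
import numpy as np

def jacobi_5pt(grid, iters=10):
    """5-point Jacobi stencil: each interior point is replaced by the
    average of itself and its four nearest neighbors."""
    g = grid.astype(np.float64).copy()
    for _ in range(iters):
        new = g.copy()
        new[1:-1, 1:-1] = 0.2 * (g[1:-1, 1:-1] +   # center
                                 g[:-2, 1:-1] +    # north neighbor
                                 g[2:, 1:-1] +     # south neighbor
                                 g[1:-1, :-2] +    # west neighbor
                                 g[1:-1, 2:])      # east neighbor
        g = new
    return g

if __name__ == "__main__":
    grid = np.random.rand(64, 64)
    print(jacobi_5pt(grid).sum())
```
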
  • A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Lorenz Braun; Sotirios Nikas; Chen Song; Vincent Heuveline; Holger Fröning

    Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only hardware-independent features. This model is built based on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU, and SHOC. Evaluation

    Updated: 2020-12-30
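
    The abstract above describes building a random-forest model over hardware-independent kernel features. The sketch below shows only that general approach; the synthetic features and runtimes are assumptions, not the paper's feature set or data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic, hardware-independent kernel features (illustrative only),
# e.g. instruction counts, memory ops, and launched threads per kernel.
rng = np.random.default_rng(0)
X = rng.uniform(1, 1e6, size=(189, 4))                # 189 kernels, 4 features
y = 1e-6 * X[:, 0] + 5e-6 * X[:, 1] + rng.normal(0, 0.1, 189)  # fake runtimes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
print("R^2 on held-out kernels:", model.score(X_te, y_te))
```
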
  • A Distributed Hardware Monitoring System for Runtime Verification on Multi-Tile MPSoCs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Marcel Mettler; Daniel Mueller-Gritschneder; Ulf Schlichtmann

    Exhaustive verification techniques do not scale with the complexity of today’s multi-tile Multi-processor Systems-on-chip (MPSoCs). Hence, runtime verification (RV) has emerged as a complementary method, which verifies the correct behavior of applications executed on the MPSoC during runtime. In this article, we propose a decentralized monitoring architecture for large-scale multi-tile MPSoCs. In order

    Updated: 2020-12-30
  • Exploiting Parallelism Opportunities with Deep Learning Frameworks
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Yu Emma Wang; Carole-Jean Wu; Xiaodong Wang; Kim Hazelwood; David Brooks

    State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible machine learning programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in feature-rich frameworks, however, involves a non-trivial amount of performance profiling efforts and often relies on domain-specific

    Updated: 2020-12-30
  • SGXL: Security and Performance for Enclaves Using Large Pages
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Sujay Yadalam; Vinod Ganapathy; Arkaprava Basu

    Intel’s SGX architecture offers clients of public cloud computing platforms the ability to create hardware-protected enclaves whose contents are protected from privileged system software. However, SGX relies on system software for enclave memory management. In a sequence of recent papers, researchers have demonstrated that this reliance allows a malicious OS/hypervisor to snoop on the page addresses

    Updated: 2020-12-30
  • Leveraging Value Equality Prediction for Value Speculation
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Kleovoulos Kalaitzidis; André Seznec

    Value Prediction (VP) has recently been gaining interest in the research community, since prior work has established practical solutions for its implementation that provide meaningful performance gains. A constant challenge of contemporary context-based value predictors is to sufficiently capture value redundancy and exploit the predictable execution paths. To do so, modern context-based VP techniques

    Updated: 2020-12-30
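
    As background for the entry above, the sketch below implements a textbook last-value predictor, the simplest way to exploit the value redundancy the abstract mentions; it is not the context-based design proposed in the paper, and the trace is a made-up example.

```python
class LastValuePredictor:
    """Minimal last-value predictor: predicts that an instruction (keyed by
    its PC) will produce the same value it produced last time."""

    def __init__(self):
        self.table = {}          # PC -> last observed result
        self.hits = 0
        self.lookups = 0

    def predict(self, pc):
        self.lookups += 1
        return self.table.get(pc)          # None means "no prediction"

    def train(self, pc, actual):
        if self.table.get(pc) == actual:   # prediction matched the outcome
            self.hits += 1
        self.table[pc] = actual

# Toy instruction trace of (pc, produced value); redundant values are common.
trace = [(0x40, 0), (0x44, 1), (0x40, 0), (0x44, 2), (0x40, 0)]
vp = LastValuePredictor()
for pc, value in trace:
    vp.predict(pc)
    vp.train(pc, value)
print(f"correct predictions: {vp.hits}/{vp.lookups}")
```
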
  • SPX64: A Scratchpad Memory for General-purpose Microprocessors
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-29
    Abhishek Singh; Shail Dave; Pantea Zardoshti; Robert Brotzman; Chao Zhang; Xiaochen Guo; Aviral Shrivastava; Gang Tan; Michael Spear

    General-purpose computing systems employ memory hierarchies to provide the appearance of a single large, fast, coherent memory. In special-purpose CPUs, programmers manually manage distinct, non-coherent scratchpad memories. In this article, we combine these mechanisms by adding a virtually addressed, set-associative scratchpad to a general purpose CPU. Our scratchpad exists alongside a traditional

    Updated: 2020-12-30
  • IR2VEC: LLVM IR Based Scalable Program Embeddings
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-18
    S. VenkataKeerthy; Rohit Aggarwal; Shalini Jain; Maunendra Sankar Desarkar; Ramakrishna Upadrasta; Y. N. Srikant

    We propose IR2VEC, a Concise and Scalable encoding infrastructure to represent programs as a distributed embedding in continuous space. This distributed embedding is obtained by combining representation learning methods with flow information to capture the syntax as well as the semantics of the input programs. As our infrastructure is based on the Intermediate Representation (IR) of the source code

    Updated: 2020-12-22
  • LLOV: A Fast Static Data-Race Checker for OpenMP Programs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-12-18
    Utpal Bora; Santanu Das; Pankaj Kukreja; Saurabh Joshi; Ramakrishna Upadrasta; Sanjay Rajopadhye

    In the era of Exascale computing, writing efficient parallel programs is indispensable, and, at the same time, writing sound parallel programs is very difficult. Specifying parallelism with frameworks such as OpenMP is relatively easy, but data races in these programs are an important source of bugs. In this article, we propose LLOV, a fast, lightweight, language agnostic, and static data race checker

    Updated: 2020-12-22
  • GEVO: GPU Code Optimization Using Evolutionary Computation
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-23
    Jhe-Yu Liou; Xiaodong Wang; Stephanie Forrest; Carole-Jean Wu

    GPUs are a key enabler of the revolution in machine learning and high-performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the GPU’s computation power. GEVO (Gpu

    Updated: 2020-11-27
  • FastPath_MP: Low Overhead & Energy-efficient FPGA-based Storage Multi-paths
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-23
    Athanasios Stratikopoulos; Christos Kotselidis; John Goodacre; Mikel Luján

    In this article, we present FastPath_MP, a novel low-overhead and energy-efficient storage multi-path architecture that leverages FPGAs to operate transparently to the main processor and improve the performance and energy efficiency of accessing storage devices. We prototyped FastPath_MP on both Arm-FPGA Zynq 7000 SoC and Zynq UltraScale+ MPSoC and evaluated its performance against standard microbenchmarks

    Updated: 2020-11-27
  • NNBench-X: A Benchmarking Methodology for Neural Network Accelerator Designs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Xinfeng Xie; Xing Hu; Peng Gu; Shuangchen Li; Yu Ji; Yuan Xie

    The tremendous impact of deep learning algorithms over a wide range of application domains has encouraged a surge of neural network (NN) accelerator research. Facilitating the NN accelerator design calls for guidance from an evolving benchmark suite that incorporates emerging NN models. Nevertheless, existing NN benchmarks are not suitable for guiding NN accelerator designs. These benchmarks are either

    Updated: 2020-11-12
  • A Black-box Monitoring Approach to Measure Microservices Runtime Performance
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Rolando Brondolin; Marco D. Santambrogio

    Microservices changed cloud computing by moving the applications’ complexity from one monolithic executable to thousands of network interactions between small components. Given the increasing deployment sizes, the architectural exploitation challenges, and the impact on data-centers’ power consumption, we need to efficiently track this complexity. Within this article, we propose a black-box monitoring

    Updated: 2020-11-12
  • On Architectural Support for Instruction Set Randomization
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    George Christou; Giorgos Vasiliadis; Vassilis Papaefstathiou; Antonis Papadogiannakis; Sotiris Ioannidis

    Instruction Set Randomization (ISR) is able to protect against remote code injection attacks by randomizing the instruction set of each process. Thereby, even if an attacker succeeds to inject code, it will fail to execute on the randomized processor. The majority of existing ISR implementations are based on emulators and binary instrumentation tools that unfortunately: (i) incur significant runtime

    Updated: 2020-11-12
  • A RISC-V Simulator and Benchmark Suite for Designing and Evaluating Vector Architectures
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Cristóbal Ramírez; César Alejandro Hernández; Oscar Palomar; Osman Unsal; Marco Antonio Ramírez; Adrián Cristal

    Vector architectures lack tools for research. Consider the gem5 simulator, which is possibly the leading platform for computer-system architecture research. Unfortunately, gem5 does not have an available distribution that includes a flexible and customizable vector architecture model. In consequence, researchers have to develop their own simulation platform to test their ideas, which consume much research

    Updated: 2020-11-12
  • SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Sam (Likun) Xi; Yuan Yao; Kshitij Bhardwaj; Paul Whatmough; Gu-Yeon Wei; David Brooks

    In recent years, there have been tremendous advances in hardware acceleration of deep neural networks. However, most of the research has focused on optimizing accelerator microarchitecture for higher performance and energy efficiency on a per-layer basis. We find that for overall single-batch inference latency, the accelerator may only make up 25–40%, with the rest spent on data movement and in the

    Updated: 2020-11-12
  • MemSZ: Squeezing Memory Traffic with Lossy Compression
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Albin Eldstål-Ahrens; Ioannis Sourdis

    This article describes Memory Squeeze (MemSZ), a new approach for lossy general-purpose memory compression. MemSZ introduces a low latency, parallel design of the Squeeze (SZ) algorithm offering aggressive compression ratios, up to 16:1 in our implementation. Our compressor is placed between the memory controller and the cache hierarchy of a processor to reduce the memory traffic of applications that

    Updated: 2020-11-12
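
    The SZ algorithm referenced above follows a predict-and-quantize scheme. The 1D sketch below illustrates that idea under an error bound; it is a simplification and does not model the parallel hardware compressor the article proposes.

```python
import numpy as np

def sz_like_compress(data, error_bound=1e-3):
    """Predict each value from the previously reconstructed value and store a
    quantized residual code, so reconstruction error stays within +/- error_bound."""
    codes = np.empty(len(data), dtype=np.int64)
    recon = np.empty(len(data), dtype=np.float64)
    prev = 0.0
    for i, x in enumerate(data):
        pred = prev                                   # simple last-value predictor
        code = int(round((x - pred) / (2 * error_bound)))
        codes[i] = code
        recon[i] = pred + code * 2 * error_bound
        prev = recon[i]
    return codes, recon

data = np.cumsum(np.random.default_rng(0).normal(0, 0.01, 1000))
codes, recon = sz_like_compress(data, 1e-3)
print("max error:", np.abs(recon - data).max())       # bounded by 1e-3
print("distinct codes:", len(np.unique(codes)))       # few codes -> compressible
```
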
  • Design and Evaluation of an Ultra Low-power Human-quality Speech Recognition System
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-11-10
    Dennis Pinto; Jose-María Arnau; Antonio González

    Automatic Speech Recognition (ASR) has experienced a dramatic evolution since the pioneering development of Bell Labs' single-digit recognizer more than 50 years ago. Current ASR systems have taken advantage of the tremendous improvements in AI during the past decade by incorporating Deep Neural Networks into the system and pushing their accuracy to levels comparable to that of humans. This article describes

    Updated: 2020-11-12
  • SHASTA: Synergic HW-SW Architecture for Spatio-temporal Approximation
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Gokul Subramanian Ravi; Joshua San Miguel; Mikko Lipasti

    A key requirement for efficient general purpose approximate computing is an amalgamation of flexible hardware design and intelligent application tuning, which together can leverage the appropriate amount of approximation that the applications engender and reap the best efficiency gains from them. To achieve this, we have identified three important features to build better general-purpose cross-layer

    Updated: 2020-09-30
  • Effective Loop Fusion in Polyhedral Compilation Using Fusion Conflict Graphs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Aravind Acharya; Uday Bondhugula; Albert Cohen

    Polyhedral auto-transformation frameworks are known to find efficient loop transformations that maximize locality and parallelism and minimize synchronization. While complex loop transformations are routinely modeled in these frameworks, they tend to rely on ad hoc heuristics for loop fusion. Although there exist multiple loop fusion models with cost functions to maximize locality and parallelism,

    Updated: 2020-09-30
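
    For readers unfamiliar with loop fusion, the sketch below contrasts an unfused and a fused version of two element-wise loops (written in Python for readability); the paper's contribution is the fusion cost model inside a polyhedral compiler, not this transformation itself.

```python
def unfused(a):
    # Two separate loops: b is written in the first loop and re-read in the
    # second, so each element of b makes an extra round trip through memory.
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(n):
        b[i] = 2.0 * a[i]
    for i in range(n):
        c[i] = b[i] + 1.0
    return c

def fused(a):
    # Fused loop: the producer and consumer of b[i] share one iteration,
    # improving locality and leaving a single loop to parallelize.
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(n):
        b[i] = 2.0 * a[i]
        c[i] = b[i] + 1.0
    return c

assert unfused([1.0, 2.0]) == fused([1.0, 2.0])
```
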
  • ECOTLB: Eventually Consistent TLBs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Steffen Maass; Mohan Kumar Kumar; Taesoo Kim; Tushar Krishna; Abhishek Bhattacharjee

    We propose ecoTLB—software-based eventual translation lookaside buffer (TLB) coherence—which eliminates the overhead of the synchronous TLB shootdown mechanism in operating systems that use address space identifiers (ASIDs). With an eventual TLB coherence, ecoTLB improves the performance of free and page swap operations by removing the inter-processor interrupt (IPI) overheads incurred to invalidate

    Updated: 2020-09-30
  • DisGCo: A Compiler for Distributed Graph Analytics
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Anchu Rajendran; V. Krishna Nandivada

    Graph algorithms are widely used in various applications. Their programmability and performance have garnered a lot of interest among researchers. Being able to run these graph analytics programs on distributed systems is an important requirement. Green-Marl is a popular Domain Specific Language (DSL) for coding graph algorithms and is known for its simplicity. However, the existing Green-Marl

    Updated: 2020-09-30
  • AsynGraph: Maximizing Data Parallelism for Efficient Iterative Graph Processing on GPUs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Yu Zhang; Xiaofei Liao; Lin Gu; Hai Jin; Kan Hu; Haikun Liu; Bingsheng He

    Recently, GPU-accelerated systems have been proposed to handle iterative graph algorithms. However, in iterative graph processing, the parallelism of the GPU is still underutilized by existing GPU-based solutions. In fact, because of the power-law property of natural graphs, the paths between a small set of important vertices (e.g., high-degree vertices) play a more important role in iterative graph

    Updated: 2020-09-30
  • OD-SGD: One-Step Delay Stochastic Gradient Descent for Distributed Training
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-09-30
    Yemao Xu; Dezun Dong; Yawei Zhao; Weixia Xu; Xiangke Liao

    The training of modern deep learning neural networks calls for large amounts of computation, which is often provided by GPUs or other specific accelerators. To scale out to achieve faster training speed, two update algorithms are mainly applied in the distributed training process, i.e., the Synchronous SGD algorithm (SSGD) and Asynchronous SGD algorithm (ASGD). SSGD obtains good convergence point while

    Updated: 2020-09-30
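
    The abstract above contrasts synchronous and asynchronous SGD. The sketch below shows the synchronous baseline (all workers' gradients are averaged before each update) on a toy least-squares problem; OD-SGD's one-step-delay mechanism itself is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
true_w = rng.normal(size=8)
y = X @ true_w + rng.normal(0, 0.01, 512)

num_workers, lr = 4, 0.1
shards = np.array_split(np.arange(512), num_workers)   # data partition per worker
w = np.zeros(8)

for step in range(200):
    # Each worker computes the gradient of a least-squares loss on its shard.
    grads = []
    for shard in shards:
        Xi, yi = X[shard], y[shard]
        grads.append(2 * Xi.T @ (Xi @ w - yi) / len(shard))
    # Synchronous barrier: average all gradients, then update the model once.
    w -= lr * np.mean(grads, axis=0)

print("parameter error:", np.linalg.norm(w - true_w))
```
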
  • Editorial: A Message from the Editor-in-Chief
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-03
    Dave Kaeli

    No abstract available.

    Updated: 2020-08-18
  • Zeroploit: Exploiting Zero Valued Operands in Interactive Gaming Applications
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-03
    Ram Rangan; Mark W. Stephenson; Aditya Ukarande; Shyam Murthy; Virat Agarwal; Marc Blackstein

    In this article, we first characterize register operand value locality in shader programs of modern gaming applications and observe that there is a high likelihood of one of the register operands of several multiply, logical-and, and similar operations being zero, dynamically. We provide intuition, examples, and a quantitative characterization for how zeros originate dynamically in these programs.

    Updated: 2020-08-18
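
    The observation behind the entry above is that one operand of many shader multiplies is dynamically zero. The toy sketch below profiles such a case and short-circuits the dependent work when the operand is zero; the function names and the 60% zero rate are illustrative assumptions, not measurements from the paper.

```python
import random

def expensive_brdf(albedo):
    return albedo ** 2.2                  # stand-in for costly dependent work

def shade_pixel(light, albedo):
    # In expressions like result = light * albedo, the 'light' operand is
    # dynamically zero for many pixels, so the dependent work can be skipped.
    if light == 0.0:
        return 0.0                        # short-circuit on the zero operand
    return light * expensive_brdf(albedo)

random.seed(0)
lights = [0.0 if random.random() < 0.6 else random.random() for _ in range(10_000)]
zero_fraction = sum(l == 0.0 for l in lights) / len(lights)
print(f"multiply operand is zero for {zero_fraction:.0%} of pixels")
print(sum(shade_pixel(l, 0.8) for l in lights))
```
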
  • GPU Fast Convolution via the Overlap-and-Save Method in Shared Memory
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-03
    Karel Adámek; Sofia Dimoudi; Mike Giles; Wesley Armour

    We present an implementation of the overlap-and-save method, a method for the convolution of very long signals with short response functions, which is tailored to GPUs. We have implemented several FFT algorithms (using the CUDA programming language), which exploit GPU shared memory, allowing for GPU accelerated convolution. We compare our implementation with an implementation of the overlap-and-save

    Updated: 2020-08-18
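
    The overlap-and-save method named above is a standard blocked FFT convolution. The CPU-side NumPy sketch below implements it and checks the result against a direct convolution; it omits the GPU shared-memory FFTs that are the article's actual contribution.

```python
import numpy as np

def overlap_save(x, h, fft_len=256):
    """Convolve a long signal x with a short filter h via overlap-and-save:
    process overlapping blocks of fft_len samples in the frequency domain and
    keep only the fft_len - len(h) + 1 alias-free samples of each block."""
    m = len(h)
    step = fft_len - (m - 1)                       # valid output samples per block
    H = np.fft.rfft(h, fft_len)
    padded = np.concatenate([np.zeros(m - 1), x])  # zero history for first block
    out = []
    for start in range(0, len(x) + m - 1, step):
        block = padded[start:start + fft_len]
        block = np.pad(block, (0, fft_len - len(block)))
        y = np.fft.irfft(np.fft.rfft(block) * H, fft_len)
        out.append(y[m - 1:])                      # drop the m-1 aliased samples
    return np.concatenate(out)[:len(x) + m - 1]

x = np.random.default_rng(0).normal(size=10_000)
h = np.hanning(129)
print(np.max(np.abs(overlap_save(x, h) - np.convolve(x, h))))   # ~1e-12
```
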
  • FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-17
    Arnab Das; Sriram Krishnamoorthy; Ian Briggs; Ganesh Gopalakrishnan; Ramakrishna Tipireddy

    We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can

    Updated: 2020-08-18
  • Cooperative Software-hardware Acceleration of K-means on a Tightly Coupled CPU-FPGA System
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-17
    Tarek S. Abdelrahman

    We consider software-hardware acceleration of K-means clustering on the Intel Xeon+FPGA platform. We design a pipelined accelerator for K-means and combine it with CPU threads to assess performance benefits of (1) acceleration when data are only accessed from system memory and (2) cooperative CPU-FPGA acceleration. Our evaluation shows that the accelerator is up to 12.7×/2.4× faster than a single CPU

    Updated: 2020-08-18
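
    As a reference point for the entry above, the sketch below is a plain NumPy implementation of Lloyd's K-means, the software baseline such an accelerator targets; the data set and k are illustrative.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: alternate between assigning each point to its
    nearest centroid and recomputing centroids as cluster means."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid -> nearest-centroid labels.
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for c in range(k):
            members = points[labels == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    return centroids, labels

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(200, 2)) for c in (0.0, 3.0, 6.0)])
centroids, labels = kmeans(data, k=3)
print(np.round(centroids, 2))
```
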
  • Securing Branch Predictors with Two-Level Encryption
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-03
    Jaekyu Lee; Yasuo Ishii; Dam Sunwoo

    Modern processors rely on various speculative mechanisms to meet performance demand. Branch predictors are one of the most important micro-architecture components to deliver performance. However, they have been under heavy scrutiny because of recent side-channel attacks. Branch predictors are indexed using the PC and recent branch histories. An adversary can manipulate these parameters to access and

    Updated: 2020-08-18
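
    The abstract above notes that branch predictors are indexed by the PC and recent branch history. The sketch below shows a gshare-style index combined with a secret, re-randomizable key, only to illustrate keyed indexing; it is not the two-level encryption scheme the paper proposes, and all sizes are illustrative.

```python
import secrets

TABLE_BITS = 12
MASK = (1 << TABLE_BITS) - 1

class KeyedGshare:
    """gshare-style predictor whose index (PC XOR global history) is further
    XOR-ed with a secret key that can be periodically re-randomized."""

    def __init__(self):
        self.key = secrets.randbits(TABLE_BITS)
        self.history = 0
        self.counters = [0] * (1 << TABLE_BITS)   # 2-bit saturating counters

    def rekey(self):
        # Re-randomizing the key invalidates any aliasing an attacker learned.
        self.key = secrets.randbits(TABLE_BITS)

    def index(self, pc):
        return ((pc ^ self.history) & MASK) ^ self.key

    def predict(self, pc):
        return self.counters[self.index(pc)] >= 2   # True means "taken"

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & MASK

bp = KeyedGshare()
for _ in range(100):
    bp.update(0x400123, taken=True)
print(bp.predict(0x400123))   # True after warm-up
```
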
  • EchoBay: Design and Optimization of Echo State Networks under Memory and Time Constraints
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-17
    L. Cerina; M. D. Santambrogio; G. Franco; C. Gallicchio; A. Micheli

    The increase in computational power of embedded devices and the latency demands of novel applications have brought a paradigm shift in how and where the computation is performed. Although AI inference is slowly moving from the cloud to end-devices with limited resources, time-centric recurrent networks like Long Short-Term Memory remain too complex to be transferred on embedded devices without extreme simplifications

    Updated: 2020-08-18
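
    For context on the entry above, the sketch below is a minimal echo state network: a fixed random reservoir driven by the input, with only a ridge-regression readout trained. The reservoir size, spectral radius, and task are illustrative choices, not EchoBay's tuned configuration.

```python
import numpy as np

def esn_readout(u, y_target, n_reservoir=200, rho=0.9, ridge=1e-6, seed=0):
    """Minimal echo state network: the reservoir weights are fixed and random;
    only the linear readout is trained with ridge regression."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, size=n_reservoir)
    W = rng.uniform(-0.5, 0.5, size=(n_reservoir, n_reservoir))
    W *= rho / np.abs(np.linalg.eigvals(W)).max()     # set spectral radius
    states = np.zeros((len(u), n_reservoir))
    x = np.zeros(n_reservoir)
    for t, ut in enumerate(u):
        x = np.tanh(W_in * ut + W @ x)                # reservoir state update
        states[t] = x
    # Ridge-regression readout: W_out = (S^T S + lambda I)^-1 S^T y
    A = states.T @ states + ridge * np.eye(n_reservoir)
    W_out = np.linalg.solve(A, states.T @ y_target)
    return states @ W_out

t = np.linspace(0, 20 * np.pi, 2000)
u, target = np.sin(t), np.sin(t + 0.5)                # predict a shifted sine
pred = esn_readout(u, target)
print("train MSE:", np.mean((pred - target) ** 2))
```
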
  • Schedule Synthesis for Halide Pipelines on GPUs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-03
    Savvas Sioutas; Sander Stuijk; Twan Basten; Henk Corporaal; Lou Somers

    The Halide DSL and compiler have enabled high-performance code generation for image processing pipelines targeting heterogeneous architectures through the separation of algorithmic description and optimization schedule. However, automatic schedule generation is currently only possible for multi-core CPU architectures. As a result, expert knowledge is still required when optimizing for platforms with

    Updated: 2020-08-18
  • Inter-kernel Reuse-aware Thread Block Scheduling
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-08-17
    Muhammad Huzaifa; Johnathan Alsop; Abdulrahman Mahmoud; Giordano Salvador; Matthew D. Sinclair; Sarita V. Adve

    As GPUs have become more programmable, their performance and energy benefits have made them increasingly popular. However, while GPU compute units continue to improve in performance, on-chip memories lag behind and data accesses are becoming increasingly expensive in performance and energy. Emerging GPU coherence protocols can mitigate this bottleneck by exploiting data reuse in GPU caches across kernel

    Updated: 2020-08-18
  • ArmorAll
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Charu Kalra; Fritz Previlon; Norm Rubin; David Kaeli

    The vulnerability of GPUs to soft errors has become a first-class design concern as they are increasingly being used in accuracy-sensitive and safety-critical domains. Existing solutions used to enhance the reliability of GPUs come with significant overhead in terms of area, power, and/or performance. In this article, we propose ArmorAll, a light-weight, adaptive, selective, and portable software solution

    Updated: 2020-05-29
  • Dynamic Precision Autotuning with TAFFO
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Stefano Cherubin; Daniele Cattaneo; Michele Chiari; Giovanni Agosta

    Many classes of applications, both in the embedded and high performance domains, can trade off the accuracy of the computed results for computation performance. One way to achieve such a trade-off is precision tuning—that is, to modify the data types used for the computation by reducing the bit width, or by changing the representation from floating point to fixed point. We present a methodology for

    Updated: 2020-05-29
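
    The abstract above mentions converting computations from floating point to fixed point. The sketch below shows that representation change in isolation (values stored as scaled integers with a chosen number of fractional bits); the analyses TAFFO uses to pick bit widths are not modeled, and the 16-bit fractional part is an illustrative choice.

```python
def to_fixed(x, frac_bits):
    """Represent a real value as a signed integer with frac_bits fractional bits."""
    return int(round(x * (1 << frac_bits)))

def fixed_mul(a, b, frac_bits):
    """Multiply two fixed-point values; the raw product carries 2*frac_bits
    fractional bits, so shift right to renormalize."""
    return (a * b) >> frac_bits

def to_float(x, frac_bits):
    return x / (1 << frac_bits)

FRAC = 16                                   # 16 fractional bits (illustrative)
a, b = 3.14159, 2.71828
fa, fb = to_fixed(a, FRAC), to_fixed(b, FRAC)
approx = to_float(fixed_mul(fa, fb, FRAC), FRAC)
print(approx, a * b, abs(approx - a * b))   # error shrinks as FRAC grows
```
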
  • Runtime Design Space Exploration and Mapping of DCNNs for the Ultra-Low-Power Orlando SoC
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Ahmet Erdem; Cristina Silvano; Thomas Boesch; Andrea Carlo Ornstein; Surinder-Pal Singh; Giuseppe Desoli

    Recent trends in deep convolutional neural networks (DCNNs) impose hardware accelerators as a viable solution for computer vision and speech recognition. The Orlando SoC architecture from STMicroelectronics targets exactly this class of problems by integrating hardware-accelerated convolutional blocks together with DSPs and on-chip memory resources to enable energy-efficient designs of DCNNs. The main

    Updated: 2020-05-29
  • Reliability Analysis for Unreliable FSM Computations
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Amir Hossein Nodehi Sabet; Junqiao Qiu; Zhijia Zhao; Sriram Krishnamoorthy

    Finite State Machines (FSMs) are fundamental in both hardware design and software development. However, the reliability of FSM computations remains poorly understood. Existing reliability analyses are mainly designed for generic computations and are unaware of the special error tolerance characteristics in FSM computations. This work introduces RelyFSM, a state-level reliability analysis framework

    Updated: 2020-05-29
  • Network Interface Architecture for Remote Indirect Memory Access (RIMA) in Datacenters
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Jiachen Xue; T. N. Vijaykumar; Mithuna Thottethodi

    Remote Direct Memory Access (RDMA) fabrics such as InfiniBand and Converged Ethernet report latency shorter by a factor of 50 than TCP. As such, RDMA is a potential replacement for TCP in datacenters (DCs) running low-latency applications, such as Web search and memcached. InfiniBand’s Shared Receive Queues (SRQs), which use two-sided send/recv verbs (i.e., channel semantics), reduce the amount of

    Updated: 2020-05-29
  • A Conflict-free Scheduler for High-performance Graph Processing on Multi-pipeline FPGAs
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Qinggang Wang; Long Zheng; Jieshan Zhao; Xiaofei Liao; Hai Jin; Jingling Xue

    FPGA-based graph processing accelerators are nowadays equipped with multiple pipelines for hardware acceleration of graph computations. However, their multi-pipeline efficiency can suffer greatly from the considerable overheads caused by the read/write conflicts in their on-chip BRAM from different pipelines, leading to significant performance degradation and poor scalability. In this article, we investigate

    Updated: 2020-05-29
  • SIMT-X
    ACM Trans. Archit. Code Optim. (IF 1.309) Pub Date : 2020-05-29
    Anita Tino; Caroline Collange; André Seznec

    This work introduces Single Instruction Multi-Thread Express (SIMT-X), a general-purpose Central Processing Unit (CPU) microarchitecture that enables Graphics Processing Units (GPUs)-style SIMT execution across multiple threads of the same program for high throughput, while retaining the latency benefits of out-of-order execution, and the programming convenience of homogeneous multi-thread processors

    Updated: 2020-05-29
Contents have been reproduced by permission of the publishers.