-
Taiyi: A high-performance CKKS accelerator for Practical Fully Homomorphic Encryption arXiv.cs.AR Pub Date : 2024-03-15 Shengyu Fan, Xianglong Deng, Zhuoyu Tian, Zhicheng Hu, Liang Chang, Rui Hou, Dan Meng, Mingzhe Zhang
Fully Homomorphic Encryption (FHE), a novel cryptographic theory enabling computation directly on ciphertext data, offers significant security benefits but is hampered by substantial performance overhead. In recent years, a series of accelerator designs have significantly enhanced the performance of FHE applications, bringing them closer to real-world applicability. However, these accelerators face
-
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory arXiv.cs.AR Pub Date : 2024-03-14 Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, Gwangsun Kim
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache needs to be carefully designed to address the latency and BW
-
Analytical Heterogeneous Die-to-Die 3D Placement with Macros arXiv.cs.AR Pub Date : 2024-03-14 Yuxuan Zhao, Peiyu Liao, Siting Liu, Jiaxi Jiang, Yibo Lin, Bei Yu
This paper presents an innovative approach to 3D mixed-size placement in heterogeneous face-to-face (F2F) bonded 3D ICs. We propose an analytical framework that utilizes a dedicated density model and a bistratal wirelength model, effectively handling macros and standard cells in a 3D solution space. A novel 3D preconditioner is developed to resolve the topological and physical gap between macros and
-
FlexNN: A Dataflow-aware Flexible Deep Learning Accelerator for Energy-Efficient Edge Devices arXiv.cs.AR Pub Date : 2024-03-14 Arnab Raha, Deepak A. Mathaikutty, Soumendu K. Ghosh, Shamik Kundu
This paper introduces FlexNN, a Flexible Neural Network accelerator, which adopts agile design principles to enable versatile dataflows, enhancing energy efficiency. Unlike conventional convolutional neural network accelerator architectures that adhere to fixed dataflows (such as input, weight, output, or row stationary) for transferring activations and weights between storage and compute units, our
-
Wet TinyML: Chemical Neural Network Using Gene Regulation and Cell Plasticity arXiv.cs.AR Pub Date : 2024-03-13 Samitha Somathilaka, Adrian Ratwatte, Sasitharan Balasubramaniam, Mehmet Can Vuran, Witawas Srisa-an, Pietro Liò
In our earlier work, we introduced the concept of Gene Regulatory Neural Network (GRNN), which utilizes natural neural network-like structures inherent in biological cells to perform computing tasks using chemical inputs. We define this form of chemical-based neural network as Wet TinyML. The GRNN structures are based on the gene regulatory network and have weights associated with each link based on
-
Learning-driven Physically-aware Large-scale Circuit Gate Sizing arXiv.cs.AR Pub Date : 2024-03-13 Yuyang Ye, Peng Xu, Lizheng Ren, Tinghuan Chen, Hao Yan, Bei Yu, Longxing Shi
Gate sizing plays an important role in timing optimization after physical design. Existing machine learning-based gate sizing works cannot optimize timing on multiple timing paths simultaneously and neglect the physical constraint on layouts. They cause sub-optimal sizing solutions and low-efficiency issues when compared with commercial gate sizing tools. In this work, we propose a learning-driven
-
Improving Memory Dependence Prediction with Static Analysis arXiv.cs.AR Pub Date : 2024-03-12 Luke Panayi, Rohan Gandhi, Jim Whittaker, Vassilios Chouliaras, Martin Berger, Paul Kelly
This paper explores the potential of communicating information gained by static analysis from compilers to Out-of-Order (OoO) machines, focusing on the memory dependence predictor (MDP). The MDP enables loads to issue without all in-flight store addresses being known, with minimal memory order violations. We use LLVM to find loads with no dependencies and label them via their opcode. These labelled
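The issue condition sketched above can be modeled as a toy check: a load may leave the queue once static analysis has labelled it dependence-free, or once every older in-flight store address is resolved and disjoint. This is an illustrative model only; the paper's actual LLVM pass and MDP hardware are more involved.

```python
# Toy model of the load-issue decision. Hypothetical sketch, not the
# paper's implementation: loads that static analysis proved independent
# carry a "no_dep" label and may issue even while earlier store
# addresses are still unresolved.

def may_issue(load_addr, no_dep_label, inflight_stores):
    """inflight_stores: older store addresses; None = not yet resolved."""
    if no_dep_label:
        return True   # compiler proved independence: issue immediately
    # conservative: wait until every older store address is known and disjoint
    return all(s is not None and s != load_addr for s in inflight_stores)

# A labelled load bypasses two unresolved stores; an unlabelled one waits.
print(may_issue(0x100, True,  [None, None]))    # True
print(may_issue(0x100, False, [None, 0x200]))   # False
print(may_issue(0x100, False, [0x300, 0x200]))  # True
```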
-
Performance Analysis of Matrix Multiplication for Deep Learning on the Edge arXiv.cs.AR Pub Date : 2024-03-12 Cristian Ramírez, Adrián Castelló, Héctor Martínez, Enrique S. Quintana-Ortí
The devices designed for the Internet-of-Things encompass a large variety of distinct processor architectures, forming a highly heterogeneous zoo. In order to tackle this, we employ a simulator to estimate the performance of the matrix-matrix multiplication (GEMM) kernel on processors designed to operate at the edge. Our simulator adheres to the modern implementations of GEMM, advocated by GotoBLAS2
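The GotoBLAS2-style GEMM that such a simulator models can be sketched as a cache-blocked loop nest. Packing and the register-level micro-kernel are elided here, and the tiny block sizes are illustrative rather than tuned.

```python
# Minimal GotoBLAS2-style blocked GEMM sketch (cache-blocking loops only;
# operand packing and the micro-kernel are elided).
def gemm_blocked(A, B, C, mc=2, kc=2, nc=2):
    m, k = len(A), len(A[0])
    n = len(B[0])
    for jc in range(0, n, nc):            # loop over column panels of B/C
        for pc in range(0, k, kc):        # loop over row panels of B
            for ic in range(0, m, mc):    # loop over row panels of A
                # "macro-kernel": blocks sized to fit in the cache levels
                for i in range(ic, min(ic + mc, m)):
                    for j in range(jc, min(jc + nc, n)):
                        acc = C[i][j]
                        for p in range(pc, min(pc + kc, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm_blocked(A, B, [[0, 0], [0, 0]]))  # [[19, 22], [43, 50]]
```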
-
The Dawn of AI-Native EDA: Promises and Challenges of Large Circuit Models arXiv.cs.AR Pub Date : 2024-03-12 Lei Chen (Huawei Noah's Ark Lab), Yiqi Chen (Peking University), Zhufei Chu (Ningbo University), Wenji Fang (Hong Kong University of Science and Technology), Tsung-Yi Ho (The Chinese University of Hong Kong), Yu Huang (Huawei HiSilicon), Sadaf Khan (The Chinese University of Hong Kong), Min Li (Huawei Noah's Ark Lab), Xingquan Li (Peng Cheng Laboratory), Yun Liang (Peking University), Yibo Lin (Peking University), Jinwei Liu (The Chinese
Within the Electronic Design Automation (EDA) domain, AI-driven solutions have emerged as formidable tools, yet they typically augment rather than redefine existing methodologies. These solutions often repurpose deep learning models from other domains, such as vision, text, and graph analytics, applying them to circuit design without tailoring to the unique complexities of electronic circuits. Such
-
TCAM-SSD: A Framework for Search-Based Computing in Solid-State Drives arXiv.cs.AR Pub Date : 2024-03-11 Ryan Wong, Nikita Kim, Kevin Higgs, Sapan Agarwal, Engin Ipek, Saugata Ghose, Ben Feinberg
As the amount of data produced in society continues to grow at an exponential rate, modern applications are incurring significant performance and energy penalties due to high data movement between the CPU and memory/storage. While processing in main memory can alleviate these penalties, it is becoming increasingly difficult to keep large datasets entirely in main memory. This has led to a recent push
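The search primitive behind this approach can be illustrated with the ternary-match semantics of a TCAM, where stored patterns include don't-care bits. This models the match semantics only; the point of TCAM-SSD is performing all comparisons in parallel inside the drive.

```python
# Toy ternary CAM lookup: each stored entry is a pattern over '0', '1',
# 'X' (don't-care). A real TCAM compares every entry in parallel; here we
# model only what "match" means.
def tcam_search(entries, key):
    def matches(pattern, key):
        return all(p in ('X', k) for p, k in zip(pattern, key))
    return [i for i, pat in enumerate(entries) if matches(pat, key)]

table = ["10XX", "1X01", "0000"]
print(tcam_search(table, "1001"))  # [0, 1]
```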
-
Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System arXiv.cs.AR Pub Date : 2024-03-11 Hongsun Jang, Jaeyong Song, Jaewon Jung, Jaeyoung Park, Youngsok Kim, Jinho Lee
The recent huge advance of Large Language Models (LLMs) is mainly driven by the increase in the number of parameters. This has led to substantial memory capacity requirements, necessitating the use of dozens of GPUs just to meet the capacity. One popular solution to this is storage-offloaded training, which uses host memory and storage as an extended memory hierarchy. However, this obviously comes
-
I/O Transit Caching for PMem-based Block Device arXiv.cs.AR Pub Date : 2024-03-10 Qing Xu, Qisheng Jiang, Chundong Wang
Byte-addressable non-volatile memory (NVM) sitting on the memory bus is employed to provide persistent memory (PMem) in general-purpose computing systems and embedded systems for data storage. Researchers develop software drivers such as the block translation table (BTT) to build block devices on PMem, so programmers can keep using the mature and reliable conventional storage stack while expecting high performance
-
HDReason: Algorithm-Hardware Codesign for Hyperdimensional Knowledge Graph Reasoning arXiv.cs.AR Pub Date : 2024-03-09 Hanning Chen, Yang Ni, Ali Zakeri, Zhuowen Zou, Sanggeon Yun, Fei Wen, Behnam Khaleghi, Narayan Srinivasa, Hugo Latapie, Mohsen Imani
In recent times, a plethora of hardware accelerators have been put forth for graph learning applications such as vertex classification and graph classification. However, previous works have paid little attention to Knowledge Graph Completion (KGC), a task that is well-known for its significantly higher algorithm complexity. The state-of-the-art KGC solutions based on graph convolution neural network
-
Quantum-HPC Framework with multi-GPU-Enabled Hybrid Quantum-Classical Workflow: Applications in Quantum Simulations arXiv.cs.AR Pub Date : 2024-03-09 Kuan-Cheng Chen, Xiaoren Li, Xiaotian Xu, Yun-Yuan Wang, Chen-Yu Liu
Achieving high-performance computation on quantum systems presents a formidable challenge that necessitates bridging the capabilities between quantum hardware and classical computing resources. This study introduces an innovative distribution-aware Quantum-Classical-Quantum (QCQ) architecture, which integrates cutting-edge quantum software frameworks with high-performance classical computing resources
-
A 28.6 mJ/iter Stable Diffusion Processor for Text-to-Image Generation with Patch Similarity-based Sparsity Augmentation and Text-based Mixed-Precision arXiv.cs.AR Pub Date : 2024-03-08 Jiwon Choi, Wooyoung Jo, Seongyon Hong, Beomseok Kwon, Wonhoon Park, Hoi-Jun Yoo
This paper presents an energy-efficient stable diffusion processor for text-to-image generation. While stable diffusion has attracted attention for its high-quality image synthesis results, its inherent characteristics hinder its deployment on mobile platforms. The proposed processor achieves high throughput and energy efficiency with three key features as solutions: 1) Patch similarity-based sparsity augmentation
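The first feature can be illustrated with a toy gating rule: between consecutive denoising iterations, patches whose content barely changed are skipped and their cached results reused. The threshold and patch layout below are hypothetical, not the processor's actual policy.

```python
# Hypothetical patch-similarity gate: recompute only patches whose mean
# absolute change since the previous denoising iteration exceeds tau.
def patches_to_recompute(prev_patches, curr_patches, tau=0.05):
    stale = []
    for i, (p, c) in enumerate(zip(prev_patches, curr_patches)):
        diff = sum(abs(a - b) for a, b in zip(p, c)) / len(p)
        if diff > tau:
            stale.append(i)   # this patch changed enough to recompute
    return stale

prev = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]
curr = [[0.1, 0.2], [0.9, 0.1], [0.9, 0.81]]
print(patches_to_recompute(prev, curr))  # [1]: only one patch changed much
```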
-
PUMA: Efficient and Low-Cost Memory Allocation and Alignment Support for Processing-Using-Memory Architectures arXiv.cs.AR Pub Date : 2024-03-07 Geraldo F. Oliveira, Emanuele G. Esposito, Juan Gómez-Luna, Onur Mutlu
Processing-using-DRAM (PUD) architectures impose a restrictive data layout and alignment for their operands, where source and destination operands (i) must reside in the same DRAM subarray (i.e., a group of DRAM rows sharing the same row buffer and row decoder) and (ii) are aligned to the boundaries of a DRAM row. However, standard memory allocation routines (i.e., malloc, posix_memalign, and huge
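The two constraints can be sketched with a trivial row-granular allocator: allocations are rounded up to whole DRAM rows, and a PUD operation is legal only if source and destination rows share a subarray. The geometry numbers are illustrative.

```python
# Sketch of the PUD allocation constraint: operands must be row-aligned
# and fall in the same subarray. Sizes below are illustrative only.
ROW_BYTES = 8192           # one DRAM row
ROWS_PER_SUBARRAY = 512

def alloc_rows(free_row, nbytes):
    """Return (first_row, next_free_row) for a row-aligned allocation."""
    nrows = -(-nbytes // ROW_BYTES)     # ceil division: whole rows only
    return free_row, free_row + nrows

def same_subarray(row_a, row_b):
    return row_a // ROWS_PER_SUBARRAY == row_b // ROWS_PER_SUBARRAY

src, nxt = alloc_rows(0, 10000)   # 10 kB -> 2 whole rows
dst, nxt = alloc_rows(nxt, 10000)
print(same_subarray(src, dst))    # True: both operands in subarray 0
```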
-
A methodology to automatically optimize dynamic memory managers applying grammatical evolution arXiv.cs.AR Pub Date : 2024-03-07 José L. Risco-Martín, J. Manuel Colmenar, J. Ignacio Hidalgo, Juan Lanchares, Josefa Díaz
Modern consumer devices must execute multimedia applications that exhibit high resource utilization. In order to efficiently execute these applications, the dynamic memory subsystem needs to be optimized. This complex task can be tackled in two complementary ways: optimizing the application source code or designing custom dynamic memory management mechanisms. Currently, the first approach has been
-
Parendi: Thousand-Way Parallel RTL Simulation arXiv.cs.AR Pub Date : 2024-03-07 Mahyar Emami, Thomas Bourgeat, James Larus
Hardware development relies on simulations, particularly cycle-accurate RTL (Register Transfer Level) simulations, which consume significant time. As single-processor performance grows only slowly, conventional, single-threaded RTL simulation is becoming less practical for increasingly complex chips and systems. A solution is parallel RTL simulation, where ideally, simulators could run on thousands
-
CAMASim: A Comprehensive Simulation Framework for Content-Addressable Memory based Accelerators arXiv.cs.AR Pub Date : 2024-03-06 Mengyuan Li, Shiyi Liu, Mohammad Mehdi Sharifi, X. Sharon Hu
Content addressable memory (CAM) stands out as an efficient hardware solution for memory-intensive search operations by supporting parallel computation in memory. However, developing a CAM-based accelerator architecture that achieves acceptable accuracy, while minimizing hardware cost and catering to both exact and approximate search, still presents a significant challenge, especially when considering
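The two search modes such a framework must cover can be modeled functionally: exact match returns identical entries, while approximate match returns entries within a Hamming-distance threshold. This is a semantic sketch, not a hardware model.

```python
# Toy CAM model covering both search modes: exact match and approximate
# (threshold Hamming-distance) match over fixed-width binary words.
def cam_search(entries, query, mode="exact", threshold=0):
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    if mode == "exact":
        return [i for i, e in enumerate(entries) if e == query]
    return [i for i, e in enumerate(entries) if hamming(e, query) <= threshold]

words = ["1010", "1110", "0001"]
print(cam_search(words, "1010"))                              # [0]
print(cam_search(words, "1011", mode="approx", threshold=2))  # [0, 1, 2]
```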
-
Efficient FIR filtering with Bit Layer Multiply Accumulator arXiv.cs.AR Pub Date : 2024-03-03 Vincenzo Liguori
Bit Layer Multiplier Accumulator (BLMAC) is an efficient method to perform dot products without multiplications that exploits the bit-level sparsity of the weights. A total of 1,980,000 low-pass, high-pass, band-pass and band-stop type I FIR filters were generated by systematically sweeping through the cut-off frequencies and by varying the number of taps from 55 to 255. After their coefficients were quantized
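The multiplication-free idea can be sketched as follows: the dot product is evaluated one weight bit layer at a time, so the only operations are additions and shifts, and the work scales with the number of set weight bits. This is an illustrative rendering of the principle, not the BLMAC datapath.

```python
# BLMAC-style dot product: process the weights one bit layer at a time.
# Each layer sums the activations whose weight has that bit set; a shift
# replaces the multiply.
def blmac_dot(weights, activations, wbits=8):
    acc = 0
    for b in range(wbits):                 # one "bit layer" per pass
        layer = sum(a for w, a in zip(weights, activations) if (w >> b) & 1)
        acc += layer << b
    return acc

w, x = [3, 0, 5], [10, 20, 30]
print(blmac_dot(w, x))   # 180, same as sum(wi * xi)
```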
-
Performance evaluation of acceleration of convolutional layers on OpenEdgeCGRA arXiv.cs.AR Pub Date : 2024-03-02 Nicolò Carpentieri, Juan Sapriza, Davide Schiavone, Daniele Jahier Pagliari, David Atienza, Maurizio Martina, Alessio Burrello
Recently, efficiently deploying deep learning solutions on the edge has received increasing attention. New platforms are emerging to support the increasing demand for flexibility and high performance. In this work, we explore the efficient mapping of convolutional layers on an open-hardware, low-power Coarse-Grain Reconfigurable Array (CGRA), namely OpenEdgeCGRA. We explore both direct implementations
-
NeuPIMs: A NPU-PIM Heterogeneous Acceleration for Batched Inference of Large Language Model arXiv.cs.AR Pub Date : 2024-03-01 Guseul Heo, Sangyeop Lee, Jaehong Cho, Hyunmin Choi, Sanghyeon Lee, Hyungkyu Ham, Gwangsun Kim, Divya Mahajan, Jongse Park
Modern transformer-based Large Language Models (LLMs) are constructed with a series of decoder blocks. Each block comprises three key components: (1) QKV generation, (2) multi-head attention, and (3) feed-forward networks. In batched processing, QKV generation and feed-forward networks involve compute-intensive matrix-matrix multiplications (GEMM), while multi-head attention requires bandwidth-heavy
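The contrast between the two kinds of kernels can be made concrete with a back-of-envelope arithmetic-intensity (FLOPs per byte) calculation for one batched decode step: QKV/FFN GEMMs reuse their weights across the batch, while each request's attention streams its own KV cache with no cross-request reuse. The sizes below are illustrative, not from the paper.

```python
# Arithmetic intensity for one decode step, FP16 operands (2 bytes each).
def gemm_intensity(batch, d_in, d_out, bytes_per=2):
    flops = 2 * batch * d_in * d_out
    traffic = bytes_per * (batch * d_in + d_in * d_out + batch * d_out)
    return flops / traffic

def attention_intensity(seq_len, d_head, bytes_per=2):
    # one query against a per-request KV cache: no reuse across the batch
    flops = 2 * 2 * seq_len * d_head             # QK^T and AV
    traffic = bytes_per * 2 * seq_len * d_head   # stream K and V once
    return flops / traffic

print(gemm_intensity(64, 4096, 4096))   # high (~62): weights reused over batch
print(attention_intensity(2048, 128))   # low (1.0): bandwidth-bound
```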
-
FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators arXiv.cs.AR Pub Date : 2024-03-01 Xinyi Li, Ang Li, Bo Fang, Katarzyna Swirydowicz, Ignacio Laguna, Ganesh Gopalakrishnan
NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance. Unfortunately, their numerical behaviors are not publicly documented, including the number of extra precision bits maintained, the accumulation order of addition, and predictable subnormal number handling during computations
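A feature-targeted probe in this spirit can be run on any platform by crafting inputs whose behavior differs between a fused multiply-add (one rounding) and an unfused one (two roundings). Here float32 is emulated in software with `struct`; on real Tensor/Matrix Cores the same inputs would be fed to the hardware MMA instruction.

```python
# Detect fused vs. unfused multiply-add: the product (1+2**-12)**2
# = 1 + 2**-11 + 2**-24 loses its low bit when rounded to float32, so the
# two behaviors give different final answers.
import struct

def f32(x):   # round a Python float (binary64) to float32
    return struct.unpack('f', struct.pack('f', x))[0]

a = f32(1 + 2**-12)
b = f32(1 + 2**-12)
c = f32(-(1 + 2**-11))

unfused = f32(f32(a * b) + c)   # product rounded before the add
fused   = f32(a * b + c)        # exact product here, single final rounding

print(unfused)   # 0.0: the 2**-24 term was rounded away
print(fused)     # 2**-24, about 5.96e-08: the term survives fusion
```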
-
Attacking Delay-based PUFs with Minimal Adversary Model arXiv.cs.AR Pub Date : 2024-03-01 Hongming Fei, Owen Millwood, Prosanta Gope, Jack Miskelly, Biplab Sikdar
Physically Unclonable Functions (PUFs) provide a streamlined solution for lightweight device authentication. Delay-based Arbiter PUFs, with their ease of implementation and vast challenge space, have received significant attention; however, they are not immune to modelling attacks that exploit correlations between their inputs and outputs. Research is therefore polarized between developing modelling-resistant
-
OzMAC: An Energy-Efficient Sparsity-Exploiting Multiply-Accumulate-Unit Design for DL Inference arXiv.cs.AR Pub Date : 2024-02-29 Harideep Nair, Prabhu Vellaisamy, Tsung-Han Lin, Perry Wang, Shawn Blanton, John Paul Shen
General Matrix Multiply (GEMM) hardware, employing large arrays of multiply-accumulate (MAC) units, performs the bulk of the computation in deep learning (DL). Recent trends have established 8-bit integer (INT8) as the most widely used precision for DL inference. This paper proposes a novel MAC design capable of dynamically exploiting bit sparsity (i.e., the number of `0' bits within a binary value) in input
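Bit-sparsity exploitation can be sketched with a shift-add MAC that skips the zero bits of its input, so the number of add cycles equals the input's popcount rather than the full INT8 width. The cycle accounting is illustrative, not the proposed design.

```python
# Shift-add MAC that only spends a cycle on nonzero input bits.
def sparse_mac(x, w, acc=0):
    cycles = 0
    bit = 0
    while x:
        if x & 1:                  # only set bits cost a shift-add
            acc += w << bit
            cycles += 1
        x >>= 1
        bit += 1
    return acc, cycles

acc, cycles = sparse_mac(0b01000001, 3)   # x = 65 has two set bits
print(acc)      # 195 = 65 * 3
print(cycles)   # 2 add cycles instead of 8
```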
-
MIMDRAM: An End-to-End Processing-Using-DRAM System for High-Throughput, Energy-Efficient and Programmer-Transparent Multiple-Instruction Multiple-Data Processing arXiv.cs.AR Pub Date : 2024-02-29 Geraldo F. Oliveira, Ataberk Olgun, Abdullah Giray Yağlıkçı, F. Nisa Bostancı, Juan Gómez-Luna, Saugata Ghose, Onur Mutlu
Processing-using-DRAM (PUD) is a processing-in-memory (PIM) approach that uses a DRAM array's massive internal parallelism to execute very-wide data-parallel operations, in a single-instruction multiple-data (SIMD) fashion. However, DRAM rows' large and rigid granularity limits the effectiveness and applicability of PUD in three ways. First, since applications have varying degrees of SIMD parallelism
-
Functionally-Complete Boolean Logic in Real DRAM Chips: Experimental Characterization and Analysis arXiv.cs.AR Pub Date : 2024-02-28 Ismail Emir Yuksel, Yahya Can Tugrul, Ataberk Olgun, F. Nisa Bostanci, A. Giray Yaglikci, Geraldo F. Oliveira, Haocong Luo, Juan Gómez-Luna, Mohammad Sadrosadati, Onur Mutlu
Processing-using-DRAM (PuD) is an emerging paradigm that leverages the analog operational properties of DRAM circuitry to enable massively parallel in-DRAM computation. PuD has the potential to significantly reduce or eliminate costly data movement between processing elements and main memory. Prior works experimentally demonstrate three-input MAJ (i.e., MAJ3) and two-input AND and OR operations in
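Why MAJ3 plus AND/OR matters can be shown with a small truth-table argument: fixing one MAJ3 input to a constant row yields AND or OR, and together with a NOT (e.g. a complementary row copy) this set is functionally complete. A minimal demonstration:

```python
# MAJ3 with a constant third input degenerates to AND (constant 0) or OR
# (constant 1); adding NOT makes the set functionally complete, shown here
# by building XOR and checking all input combinations.
def maj3(a, b, c):
    return int(a + b + c >= 2)

AND = lambda a, b: maj3(a, b, 0)   # constant-0 row as third operand
OR  = lambda a, b: maj3(a, b, 1)   # constant-1 row as third operand
NOT = lambda a: 1 - a              # e.g. via a complementary row copy
XOR = lambda a, b: AND(OR(a, b), NOT(AND(a, b)))

for a in (0, 1):
    for b in (0, 1):
        assert AND(a, b) == a & b and OR(a, b) == a | b and XOR(a, b) == a ^ b
print("MAJ3 + NOT reproduces AND/OR/XOR on all inputs")
```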
-
CoMeT: Count-Min-Sketch-based Row Tracking to Mitigate RowHammer at Low Cost arXiv.cs.AR Pub Date : 2024-02-29 F. Nisa Bostanci, Ismail Emir Yuksel, Ataberk Olgun, Konstantinos Kanellopoulos, Yahya Can Tugrul, A. Giray Yaglikci, Mohammad Sadrosadati, Onur Mutlu
We propose a new RowHammer mitigation mechanism, CoMeT, that prevents RowHammer bitflips with low area, performance, and energy costs in DRAM-based systems at very low RowHammer thresholds. The key idea of CoMeT is to use low-cost and scalable hash-based counters to track DRAM row activations. CoMeT uses the Count-Min Sketch technique that maps each DRAM row to a group of counters, as uniquely as possible
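The Count-Min Sketch idea can be sketched directly: each row hashes into one counter per array, and the activation-count estimate is the minimum over its counters, so the sketch can over-count under collisions but never under-counts an aggressor. Hash choice and sizes below are illustrative, not CoMeT's configuration.

```python
# Count-Min-Sketch row-activation tracking: D hash arrays of W counters.
import hashlib

W, D = 16, 4   # counters per array, number of arrays (illustrative)

def idx(row, d):
    h = hashlib.blake2b(f"{d}:{row}".encode(), digest_size=4).digest()
    return int.from_bytes(h, "big") % W

counters = [[0] * W for _ in range(D)]

def activate(row):
    for d in range(D):
        counters[d][idx(row, d)] += 1

def estimate(row):   # min over the row's counters: never under-counts
    return min(counters[d][idx(row, d)] for d in range(D))

for _ in range(500):
    activate(0x1A)             # an aggressor row hammered 500 times
activate(0x2B)                 # another row touched once

print(estimate(0x1A) >= 500)   # True: the aggressor is never missed
print(estimate(0x2B) >= 1)     # True (may over-count via collisions)
```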
-
Spatial Variation-Aware Read Disturbance Defenses: Experimental Analysis of Real DRAM Chips and Implications on Future Solutions arXiv.cs.AR Pub Date : 2024-02-28 Abdullah Giray Yağlıkçı, Yahya Can Tuğrul, Geraldo F. Oliveira, İsmail Emir Yüksel, Ataberk Olgun, Haocong Luo, Onur Mutlu
Read disturbance in modern DRAM chips is a widespread phenomenon and is reliably used for breaking memory isolation, a fundamental building block of robust systems. RowHammer and RowPress are two examples of read disturbance in DRAM where repeatedly accessing (hammering) or keeping active (pressing) a memory location induces bitflips in other memory locations. Unfortunately, shrinking technology
-
Energy-Aware Heterogeneous Federated Learning via Approximate Systolic DNN Accelerators arXiv.cs.AR Pub Date : 2024-02-28 Kilian Pfeiffer, Konstantinos Balaskas, Kostas Siozios, Jörg Henkel
In Federated Learning (FL), devices that participate in the training usually have heterogeneous resources, i.e., energy availability. In current deployments of FL, devices that do not fulfill certain hardware requirements are often dropped from the collaborative training. However, dropping devices in FL can degrade training accuracy and introduce bias or unfairness. Several works have tackled this problem
-
PIMSYN: Synthesizing Processing-in-memory CNN Accelerators arXiv.cs.AR Pub Date : 2024-02-28 Wanqian Li, Xiaotian Sun, Xinyu Wang, Lei Wang, Yinhe Han, Xiaoming Chen
Processing-in-memory architectures have been regarded as a promising solution for CNN acceleration. Existing PIM accelerator designs rely heavily on the experience of experts and require significant manual design overhead. Manual design cannot effectively optimize and explore architecture implementations. In this work, we develop an automatic framework PIMSYN for synthesizing PIM-based CNN accelerators
-
PIMSIM-NN: An ISA-based Simulation Framework for Processing-in-Memory Accelerators arXiv.cs.AR Pub Date : 2024-02-28 Xinyu Wang, Xiaotian Sun, Yinhe Han, Xiaoming Chen
Processing-in-memory (PIM) has shown extraordinary potential in accelerating neural networks. To evaluate the performance of PIM accelerators, we present an ISA-based simulation framework including a dedicated ISA targeting neural networks running on PIM architectures, a compiler, and a cycle-accurate configurable simulator. Compared with prior works, this work decouples software algorithms and hardware
-
SSRESF: Sensitivity-aware Single-particle Radiation Effects Simulation Framework in SoC Platforms based on SVM Algorithm arXiv.cs.AR Pub Date : 2024-02-27 Meng Liu (Faculty of Information Technology, School of Microelectronics, Beijing University of Technology, Beijing, China), Shuai Li (Faculty of Information Technology, School of Microelectronics, Beijing University of Technology, Beijing, China), Fei Xiao (Faculty of Information Technology, School of Microelectronics, Beijing University of Technology, Beijing, China), Ruijie Wang (Faculty of Information Technology
The ever-expanding scale of integrated circuits has brought about a significant rise in the design risks associated with radiation-resistant integrated circuit chips. Traditional single-particle experimental methods, with their iterative design approach, are increasingly ill-suited for the challenges posed by large-scale integrated circuits. In response, this article introduces a novel sensitivity-aware
-
GraphMatch: Subgraph Query Processing on FPGAs arXiv.cs.AR Pub Date : 2024-02-27 Jonas Dann, Tobias Götz, Daniel Ritter, Jana Giceva, Holger Fröning
Efficiently finding subgraph embeddings in large graphs is crucial for many application areas like biology and social network analysis. Set intersections are the predominant and most challenging aspect of current join-based subgraph query processing systems for CPUs. Previous work has shown the viability of utilizing FPGAs for acceleration of graph and join processing. In this work, we propose GraphMatch
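The core kernel named here, intersecting sorted adjacency lists, can be sketched with the standard merge-based algorithm; this is the software baseline for the operation GraphMatch accelerates on FPGAs.

```python
# Merge-based intersection of two sorted adjacency lists, the dominant
# primitive in join-based subgraph query processing.
def intersect_sorted(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Candidates completing a triangle (u, v, w): neighbors of both u and v.
print(intersect_sorted([1, 3, 5, 8, 9], [2, 3, 4, 8]))  # [3, 8]
```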
-
Trimma: Trimming Metadata Storage and Latency for Hybrid Memory Systems arXiv.cs.AR Pub Date : 2024-02-26 Yiwei Li, Boyu Tian, Mingyu Gao
Hybrid main memory systems combine both performance and capacity advantages from heterogeneous memory technologies. With larger capacities, higher associativities, and finer granularities, hybrid memory systems currently exhibit significant metadata storage and lookup overheads for flexibly remapping data blocks between the two memory tiers. To alleviate the inefficiencies of existing designs, we propose
-
A New Secure Memory System for Efficient Data Protection and Access Pattern Obfuscation arXiv.cs.AR Pub Date : 2024-02-24 Haoran Geng, Yuezhi Che, Aaron Dingler, Michael Niemier, Xiaobo Sharon Hu
As the reliance on secure memory environments permeates across applications, memory encryption is used to ensure memory security. However, most effective encryption schemes, such as the widely used AES-CTR, inherently introduce extra overheads, including those associated with counter storage and version number integrity checks. Moreover, encryption only protects data content, and it does not fully
-
Prime+Retouch: When Cache is Locked and Leaked arXiv.cs.AR Pub Date : 2024-02-23 Jaehyuk Lee, Fan Sang, Taesoo Kim
Caches on modern commodity CPUs have become one of the major sources of side-channel leakage and have been abused as a new attack vector. To thwart cache-based side-channel attacks, two types of countermeasures have been proposed: detection-based ones that limit the amount of microarchitectural traces an attacker can leave, and cache prefetching-and-locking techniques that claim to prevent such
-
Thermal-Aware Floorplanner for 3D IC, including TSVs, Liquid Microchannels and Thermal Domains Optimization arXiv.cs.AR Pub Date : 2024-02-22 David Cuesta, José L. Risco-Martín, José L. Ayala, J. Ignacio Hidalgo
3D stacked technology has emerged as an effective mechanism to overcome physical limits and communication delays found in 2D integration. However, 3D technology also presents several drawbacks that prevent its smooth application. Two of the major concerns are heat reduction and power density distribution. In our work, we propose a novel 3D thermal-aware floorplanner that includes: (1) an effective
-
ModSRAM: Algorithm-Hardware Co-Design for Large Number Modular Multiplication in SRAM arXiv.cs.AR Pub Date : 2024-02-21 Jonathan Ku, Junyao Zhang, Haoxuan Shan, Saichand Samudrala, Jiawen Wu, Qilin Zheng, Ziru Li, JV Rajendran, Yiran Chen
Elliptic curve cryptography (ECC) is widely used in security applications such as public key cryptography (PKC) and zero-knowledge proofs (ZKP). ECC is composed of modular arithmetic, where modular multiplication takes most of the processing time. Computational complexity and memory constraints of ECC limit the performance. Therefore, hardware acceleration on ECC is an active field of research. Processing-in-memory
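The bottleneck operation can be sketched with interleaved shift-and-add modular multiplication, whose bit-serial add/shift/conditional-subtract steps are the kind of primitive amenable to in-memory execution. This is an illustrative algorithm, not ModSRAM's exact datapath.

```python
# Interleaved shift-and-add modular multiplication (assumes a, b < m).
def modmul(a, b, m, nbits=None):
    nbits = nbits or m.bit_length()
    r = 0
    for i in reversed(range(nbits)):   # scan multiplier bits, MSB first
        r <<= 1
        if (b >> i) & 1:
            r += a
        if r >= m:                     # at most two conditional subtracts
            r -= m
        if r >= m:
            r -= m
    return r

p = 2**255 - 19                        # e.g. the Curve25519 prime
x, y = 0xDEADBEEF, 0xC0FFEE
print(modmul(x, y, p) == (x * y) % p)  # True
```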
-
Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR arXiv.cs.AR Pub Date : 2024-02-21 Iulian Brumar, Rodrigo Rocha, Alex Bernat, Devashree Tripathy, David Brooks, Gu-Yeon Wei
Designing accelerators for resource- and power-constrained applications is a daunting task. High-level Synthesis (HLS) addresses these constraints through resource sharing, an optimization at the HLS binding stage that maps multiple operations to the same functional unit. However, resource sharing is often limited to reusing instructions within a basic block. Instead of searching globally for the best
-
Benchmarking and Dissecting the Nvidia Hopper GPU Architecture arXiv.cs.AR Pub Date : 2024-02-21 Weile Luo, Ruibo Fan, Zeyu Li, Dayou Du, Qiang Wang, Xiaowen Chu
Graphics processing units (GPUs) are continually evolving to cater to the computational demands of contemporary general-purpose workloads, particularly those driven by artificial intelligence (AI) utilizing deep learning techniques. A substantial body of studies has been dedicated to dissecting the microarchitectural metrics characterizing diverse GPU generations, which helps researchers understand
-
Identifying Unnecessary 3D Gaussians using Clustering for Fast Rendering of 3D Gaussian Splatting arXiv.cs.AR Pub Date : 2024-02-21 Joongho Jo, Hyeongwon Kim, Jongsun Park
3D Gaussian splatting (3D-GS) is a new rendering approach that outperforms the neural radiance field (NeRF) in terms of both speed and image quality. 3D-GS represents 3D scenes by utilizing millions of 3D Gaussians and projects these Gaussians onto the 2D image plane for rendering. However, during the rendering process, a substantial number of unnecessary 3D Gaussians exist for the current view direction
-
Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters arXiv.cs.AR Pub Date : 2024-02-20 Sergio Mazzola, Samuel Riedel, Luca Benini
Systolic arrays and shared L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the former excel with regular dataflow at the cost of rigid architectures and complex programming models, the latter are versatile and easy to program but require explicit data flow management and synchronization. This work aims at enabling
-
SAT-based Exact Modulo Scheduling Mapping for Resource-Constrained CGRAs arXiv.cs.AR Pub Date : 2024-02-20 Cristian Tirelli, Juan Sapriza, Rubén Rodríguez Álvarez, Lorenzo Ferretti, Benoît Denkinger, Giovanni Ansaloni, José Miranda Calero, David Atienza, Laura Pozzi
Coarse-Grain Reconfigurable Arrays (CGRAs) represent emerging low-power architectures designed to accelerate Compute-Intensive Loops (CILs). The effectiveness of CGRAs in providing acceleration relies on the quality of mapping: how efficiently the CIL is compiled onto the platform. State of the Art (SoA) compilation techniques utilize modulo scheduling to minimize the Iteration Interval (II) and use
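The quantity being minimized has a standard lower bound that any modulo schedule must respect: II >= max(ResMII, RecMII), where ResMII comes from resource pressure and RecMII from loop-carried recurrences. A worked example with toy numbers:

```python
# Lower bound on the Iteration Interval of a modulo schedule.
from math import ceil

def res_mii(op_count, num_fus):
    return ceil(op_count / num_fus)        # resource pressure bound

def rec_mii(cycles):
    # cycles: (total latency, total dependence distance) per recurrence
    return max(ceil(lat / dist) for lat, dist in cycles)

ops, fus = 10, 4                           # 10 ops on 4 functional units
loop_carried = [(3, 1), (5, 2)]            # two recurrences in the loop
print(max(res_mii(ops, fus), rec_mii(loop_carried)))  # 3
```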
-
DDC: A Vision for a Disaggregated Datacenter arXiv.cs.AR Pub Date : 2024-02-20 Mohammad Ewais, Paul Chow
Datacenters of today have maintained the same architecture for decades using the server as the primary building block. However, this traditional approach suffers from under-utilization of its resources, often caused by over-allocating these resources when deploying applications to accommodate worst-case scenarios. Specifically, servers can quickly drain their over-allocated memory resources while their
-
A System Development Kit for Big Data Applications on FPGA-based Clusters: The EVEREST Approach arXiv.cs.AR Pub Date : 2024-02-20 Christian Pilato, Subhadeep Banik, Jakub Beranek, Fabien Brocheton, Jeronimo Castrillon, Riccardo Cevasco, Radim Cmar, Serena Curzel, Fabrizio Ferrandi, Karl F. A. Friebel, Antonella Galizia, Matteo Grasso, Paulo Silva, Jan Martinovic, Gianluca Palermo, Michele Paolino, Andrea Parodi, Antonio Parodi, Fabio Pintus, Raphael Polig, David Poulet, Francesco Regazzoni, Burkhard Ringlein, Roberto Rocco, Katerina
Modern big data workflows are characterized by computationally intensive kernels. The simulated results are often combined with knowledge extracted from AI models to ultimately support decision-making. These energy-hungry workflows are increasingly executed in data centers with energy-efficient hardware accelerators; FPGAs are well-suited for this task due to their inherent parallelism. We present
-
Factor Machine: Mixed-signal Architecture for Fine-Grained Graph-Based Computing arXiv.cs.AR Pub Date : 2024-02-19 Piotr Dudek
This paper proposes the design and implementation strategy of a novel computing architecture, the Factor Machine. The work is a step towards a general-purpose parallel system operating in a non-sequential manner, exploiting processing/memory co-integration and replacing the traditional Turing/von Neumann model of a computer system with a framework based on "factorised computation". This architecture
-
Towards Joint Optimization for DNN Architecture and Configuration for Compute-In-Memory Hardware arXiv.cs.AR Pub Date : 2024-02-19 Souvik Kundu, Anthony Sarah, Vinay Joshi, Om J Omer, Sreenivas Subramoney
With the recent growth in demand for large-scale deep neural networks, compute-in-memory (CiM) has come up as a prominent solution to alleviate bandwidth and on-chip interconnect bottlenecks that constrain von Neumann architectures. However, the construction of CiM hardware poses a challenge as any specific memory hierarchy in terms of cache sizes and memory bandwidth at different interfaces may not
-
Stochastic Nonlinear Dynamical Modelling of SRAM Bitcells in Retention Mode arXiv.cs.AR Pub Date : 2024-02-18 Léopold Van Brandt, Denis Flandre, Jean-Charles Delvenne
SRAM bitcells in retention mode behave as autonomous stochastic nonlinear dynamical systems. From observation of variability-aware transient noise simulations, we provide a unidimensional model, fully characterizable by conventional deterministic SPICE simulations, insightfully explaining the mechanism of intrinsic noise-induced bit flips. The proposed model is exploited to, first, explain the reported
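The class of model described here can be illustrated with a generic one-dimensional double-well system: deterministic drift -dV/dx with V(x) = -x^2/2 + x^4/4 (stable wells at x = +/-1 encoding the stored bit) plus white noise, integrated with Euler-Maruyama. All parameters are illustrative, not fitted to any SPICE data.

```python
# Double-well sketch of a bitcell in retention: noise can kick the state
# over the barrier at x = 0, which is a bit flip.
import random

def simulate(x0, noise, steps=20000, dt=1e-3, seed=1):
    rng = random.Random(seed)
    x, flips, state = x0, 0, x0 > 0
    for _ in range(steps):
        x += (x - x**3) * dt + noise * dt**0.5 * rng.gauss(0, 1)
        if (x > 0) != state:            # crossed the barrier: a bit flip
            state, flips = (x > 0), flips + 1
    return x, flips

x, flips = simulate(0.1, noise=0.0)     # noiseless: settles into a well
print(round(x, 3), flips)               # 1.0 0
x, flips = simulate(0.9, noise=0.8)     # strong noise: flips become likely
print("flips with noise:", flips)
```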
-
Variability-Aware Noise-Induced Dynamic Instability of Ultra-Low-Voltage SRAM Bitcells arXiv.cs.AR Pub Date : 2024-02-18 Léopold Van Brandt, Jean-Charles Delvenne, Denis Flandre
Stability of ultra-low-voltage SRAM bitcells in retention mode is threatened by two types of uncertainty: process variability and intrinsic noise. While variability dominates the failure probability, noise-induced bit flips in weakened bitcells lead to dynamic instability. We study both effects jointly in a unified SPICE simulation framework. Starting from a synthetic representation of process variations
-
SCARF: Securing Chips with a Robust Framework against Fabrication-time Hardware Trojans arXiv.cs.AR Pub Date : 2024-02-19 Mohammad Eslami, Tara Ghasempouri, Samuel Pagliarini
The globalization of the semiconductor industry has introduced security challenges to Integrated Circuits (ICs), particularly those related to the threat of Hardware Trojans (HTs) - malicious logic that can be introduced during IC fabrication. While significant efforts are directed towards verifying the correctness and reliability of ICs, their security is often overlooked. In this paper, we propose
-
NestedSGX: Bootstrapping Trust to Enclaves within Confidential VMs arXiv.cs.AR Pub Date : 2024-02-18 Wenhao Wang, Linke Song, Benshan Mei, Shuang Liu, Shijun Zhao, Shoumeng Yan, XiaoFeng Wang, Dan Meng, Rui Hou
Integrity is critical for maintaining system security, as it ensures that only genuine software is loaded onto a machine. Although confidential virtual machines (CVMs) function within isolated environments separate from the host, it is important to recognize that users still encounter challenges in maintaining control over the integrity of the code running within the trusted execution environments
-
Error Checking for Sparse Systolic Tensor Arrays arXiv.cs.AR Pub Date : 2024-02-16 Christodoulos Peltekis, Dionysios Filippas, Giorgos Dimitrakopoulos
Structured sparsity is an efficient way to prune the complexity of modern Machine Learning (ML) applications and to simplify the handling of sparse data in hardware. In such cases, the acceleration of structured-sparse ML models is handled by sparse systolic tensor arrays. The increasing prevalence of ML in safety-critical systems requires enhancing the sparse tensor arrays with online error detection
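As a concrete illustration of the structured sparsity this abstract refers to, the widely used 2:4 pattern keeps the two largest-magnitude weights in every group of four, which is what makes the data layout regular enough for systolic tensor arrays. A minimal plain-Python sketch (not the paper's error-checking scheme):

```python
def prune_2_of_4(weights):
    """Enforce 2:4 structured sparsity: within every group of four
    consecutive weights, keep the two largest magnitudes and zero
    the rest. Length must be a multiple of 4."""
    assert len(weights) % 4 == 0
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # indices of the two largest-magnitude entries in this group
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

pruned = prune_2_of_4([0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 1.0])
# every group of four now contains exactly two zeros
```

The fixed 50% density per group is what lets hardware store only the surviving values plus small per-group index metadata.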
-
A Novel Computing Paradigm for MobileNetV3 using Memristor arXiv.cs.AR Pub Date : 2024-02-16 Jiale Li, Longyu Ma, Chiu-Wing Sham, Chong Fu
The advancement in the field of machine learning is inextricably linked with the concurrent progress in domain-specific hardware accelerators such as GPUs and TPUs. However, the rapidly growing computational demands necessitated by larger models and increased data have become a primary bottleneck in further advancing machine learning, especially in mobile and edge devices. Currently, the neuromorphic
-
LFOC+: A Fair OS-level Cache-Clustering Policy for Commodity Multicore Systems arXiv.cs.AR Pub Date : 2024-02-12 Juan Carlos Saez, Fernando Castro, Graziano Fanizzi, Manuel Prieto-Matias
Commodity multicore systems are increasingly adopting hardware support that enables the system software to partition the last-level cache (LLC). This support makes it possible for the operating system (OS) or the Virtual Machine Monitor (VMM) to mitigate shared-resource contention effects on multicores by assigning different co-running applications to various cache partitions. Recently, cache-clustering
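To make the partitioning idea concrete: hardware such as Intel CAT exposes the LLC as a set of ways that the OS can distribute among applications. The hypothetical greedy allocator below (not LFOC+'s actual policy) splits ways in proportion to a per-application cache-sensitivity score:

```python
def assign_ways(apps, total_ways=11):
    """Sketch of an LLC way allocator: each app gets one guaranteed
    way, and the remaining ways are split in proportion to a
    cache-sensitivity score. `apps` maps app name -> score."""
    total = sum(apps.values())
    base = {name: 1 for name in apps}          # minimum of one way each
    spare = total_ways - len(apps)
    exact = {name: spare * s / total for name, s in apps.items()}
    for name in base:
        base[name] += int(exact[name])         # integer part of the fair share
    left = total_ways - sum(base.values())
    # hand leftover ways to the largest fractional remainders
    for name in sorted(apps, key=lambda n: exact[n] - int(exact[n]), reverse=True)[:left]:
        base[name] += 1
    return base

# app names and scores are made up for illustration
alloc = assign_ways({"streamcluster": 0.9, "milc": 0.5, "gcc": 0.1})
```

A fairness-oriented policy like the one in the paper would instead pick allocations that equalize per-application slowdown, but the mechanism it drives (per-partition way masks) is the same.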
-
LFOC: A Lightweight Fairness-Oriented Cache Clustering Policy for Commodity Multicores arXiv.cs.AR Pub Date : 2024-02-12 Adrián García-García, Juan Carlos Sáez, Fernando Castro, Manuel Prieto-Matías
Multicore processors constitute the main architecture choice for modern computing systems in different market segments. Despite their benefits, the contention that naturally appears when multiple applications compete for the use of shared resources among cores, such as the last-level cache (LLC), may lead to substantial performance degradation. This may have a negative impact on key system aspects
-
PULSE: Parametric Hardware Units for Low-power Sparsity-Aware Convolution Engine arXiv.cs.AR Pub Date : 2024-02-09 Ilkin Aliyev, Tosiron Adegbija
Spiking Neural Networks (SNNs) have become popular for their more bio-realistic behavior than Artificial Neural Networks (ANNs). However, effectively leveraging the intrinsic, unstructured sparsity of SNNs in hardware is challenging, especially due to the variability in sparsity across network layers. This variability depends on several factors, including the input dataset, encoding scheme, and neuron
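The per-layer sparsity variability the abstract mentions is easy to picture: a layer's activity is a binary spike raster, and its sparsity is simply the fraction of zeros, which can differ sharply between layers. A minimal sketch:

```python
def layer_sparsity(spikes):
    """Fraction of zero entries in a layer's spike raster,
    given as a list of per-timestep binary activation lists."""
    flat = [s for step in spikes for s in step]
    return flat.count(0) / len(flat)

# toy raster: 4 neurons over 3 timesteps; 10 of 12 entries are zero
raster = [[0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
sparsity = layer_sparsity(raster)
```

A sparsity-aware engine sized for one layer's typical firing rate can be badly over- or under-provisioned for another, which is the motivation for parametric hardware units.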
-
Algorithm-hardware co-design for Energy-Efficient A/D conversion in ReRAM-based accelerators arXiv.cs.AR Pub Date : 2024-02-09 Chenguang Zhang, Zhihang Yuan, Xingchen Li, Guangyu Sun
Deep neural networks are widely deployed in many fields. Thanks to the in-situ computation (processing-in-memory) capability of the Resistive Random Access Memory (ReRAM) crossbar, ReRAM-based accelerators show potential for accelerating DNNs with low power and high performance. However, despite this power advantage, such accelerators suffer from the high power consumption of peripheral circuits
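The peripheral-circuit cost the abstract alludes to is dominated by the analog-to-digital converters on the bitlines. A toy model of one crossbar read, with illustrative numbers: each column's bitline current is an analog dot product of input voltages and cell conductances, and a b-bit ADC quantizes it, so ADC resolution directly trades accuracy against power:

```python
def crossbar_mvm(conductances, voltages, adc_bits=4, full_scale=4.0):
    """Toy ReRAM crossbar: each column's bitline current is the
    analog dot product of input voltages and cell conductances;
    a b-bit ADC then quantizes it to the range [0, full_scale)."""
    levels = 2 ** adc_bits
    step = full_scale / levels
    outs = []
    for col in conductances:  # one ADC conversion per column
        current = sum(v * g for v, g in zip(voltages, col))
        code = min(levels - 1, max(0, int(current / step)))  # clamp to ADC range
        outs.append(code * step)
    return outs

# two columns, three rows; values chosen to be exactly representable
outs = crossbar_mvm([[0.5, 1.0, 0.25], [1.0, 1.0, 1.0]], [1.0, 0.5, 1.0])
```

Halving `adc_bits` roughly quarters ADC energy in typical designs but doubles the quantization step, which is the trade-off an algorithm-hardware co-design must navigate.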
-
ARMAN: A Reconfigurable Monolithic 3D Accelerator Architecture for Convolutional Neural Networks arXiv.cs.AR Pub Date : 2024-02-06 Ali Sedaghatgoo, Amir M. Hajisadeghi, Mahmoud Momtazpour, Nader Bagherzadeh
The Convolutional Neural Network (CNN) has emerged as a powerful and versatile tool for artificial intelligence (AI) applications. Conventional computing architectures face challenges in meeting the demanding processing requirements of compute-intensive CNN applications, as they suffer from limited throughput and low utilization. To this end, specialized accelerators have been developed to speed up
-
HEAM: Hashed Embedding Acceleration using Processing-In-Memory arXiv.cs.AR Pub Date : 2024-02-06 Youngsuk Kim, Hyuk-Jae Lee, Chae Eun Rhee
In today's data centers, personalized recommendation systems face challenges such as the need for large memory capacity and high bandwidth, especially when performing embedding operations. Previous approaches have relied on DIMM-based near-memory processing techniques or introduced 3D-stacked DRAM to address memory-bound issues and expand memory bandwidth. However, these solutions fall short when dealing
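For context on the "hashed embedding" in the title: the generic hashing trick folds a huge item-id space into small tables indexed by independent hash functions, trading a little accuracy for a large capacity reduction. A sketch of that generic idea (not necessarily HEAM's exact scheme):

```python
def hashed_embedding(item_id, table_a, table_b):
    """Hashing-trick embedding lookup: a huge item-id space is
    folded into two small tables via independent hash functions,
    and the two rows are combined element-wise."""
    row_a = table_a[item_id % len(table_a)]
    # 2654435761 is the classic Knuth multiplicative-hash constant
    row_b = table_b[(item_id * 2654435761) % len(table_b)]
    return [a + b for a, b in zip(row_a, row_b)]

# tiny illustrative tables standing in for multi-GB embedding tables
table_a = [[0.1, 0.2], [0.3, 0.4]]
table_b = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
vec = hashed_embedding(10**9, table_a, table_b)
```

Because each lookup touches a couple of scattered small rows, the access pattern stays memory-bound, which is why near-memory and PIM designs target it.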