• arXiv.cs.AR Pub Date : 2019-01-27
Di Gao; Dayane Reis; Xiaobo Sharon Hu; Cheng Zhuo

Computing-in-Memory (CiM) architectures aim to reduce costly data transfers by performing arithmetic and logic operations in memory and hence relieve the pressure due to the memory wall. However, determining whether a given workload can really benefit from CiM, which memory hierarchy and what device technology should be adopted by a CiM architecture requires in-depth study that is not only time consuming but also demands significant expertise in architectures and compilers. This paper presents an energy evaluation framework, Eva-CiM, for systems based on CiM architectures. Eva-CiM encompasses a multi-level (from device to architecture) comprehensive tool chain by leveraging existing modeling and simulation tools such as GEM5, McPAT [2] and DESTINY [3]. To support high-confidence prediction, rapid design space exploration and ease of use, Eva-CiM introduces several novel modeling/analysis approaches including models for capturing memory access and dependency-aware ISA traces, and for quantifying interactions between the host CPU and CiM modules. Eva-CiM can readily produce energy estimates of the entire system for a given program, a processor architecture, and the CiM array and technology specifications. Eva-CiM is validated by comparing with DESTINY [3] and [4], and enables findings including practical contributions from CiM-supported accesses, CiM-sensitive benchmarking as well as the pros and cons of increased memory size for CiM. Eva-CiM also enables exploration over different configurations and device technologies, showing 1.3-6.0X energy improvement for SRAM and 2.0-7.9X for FeFET-RAM, respectively.

Updated: 2020-01-16
• arXiv.cs.AR Pub Date : 2020-01-13
Paul Whatmough; Marco Donato; Glenn Ko; David Brooks; Gu-Yeon Wei

The current trend for domain-specific architectures (DSAs) has led to renewed interest in research test chips to demonstrate new specialized hardware. Tape-outs also offer huge pedagogical value garnered from real hands-on exposure to the whole system stack. However, successful tape-outs demand hard-earned experience, and the design process is time consuming and fraught with challenges. Therefore, custom chips have remained the preserve of a small number of research groups, typically focused on circuit design research. This paper describes the CHIPKIT framework. We describe a reusable SoC subsystem which provides basic IO, an on-chip programmable host, memory and peripherals. This subsystem can be readily extended with new IP blocks to generate custom test chips. We also present an agile RTL development flow, including a code generation tool called VGEN. Finally, we outline best practices for full-chip validation across the entire design cycle.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-13
Yann Kurzo; Andreas Toftegaard Kristensen; Andreas Burg; Alexios Balatsoukas-Stimming

In-band full-duplex systems can transmit and receive information simultaneously on the same frequency band. However, due to the strong self-interference caused by the transmitter to its own receiver, the use of non-linear digital self-interference cancellation is essential. In this work, we describe a hardware architecture for a neural network-based non-linear self-interference (SI) canceller and we compare it with our own hardware implementation of a conventional polynomial based SI canceller. In particular, we present implementation results for a shallow and a deep neural network SI canceller as well as for a polynomial SI canceller. Our results show that the deep neural network canceller achieves a hardware efficiency of up to $312.8$ Msamples/s/mm$^2$ and an energy efficiency of up to $0.9$ nJ/sample, which is $2.1\times$ and $2\times$ better than the polynomial SI canceller, respectively. These results show that NN-based methods applied to communications are not only useful from a performance perspective, but can also be a very effective means to reduce the implementation complexity.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-14
Jesus Rodriguez Sanchez; Ove Edfors; Fredrik Rusek; Liang Liu

The Large Intelligent Surface (LIS) concept has emerged recently as a new paradigm for wireless communication, remote sensing and positioning. Despite its potential, there are many challenges from an implementation point of view, with the interconnection data rate and computational complexity being the most relevant. Distributed processing techniques and hierarchical architectures are expected to play a vital role in addressing them. In this paper we perform algorithm-architecture codesign and analyze the hardware requirements and architecture trade-offs for a discrete LIS to perform uplink detection. By doing this, we expect to give concrete case studies and guidelines for efficient implementation of LIS systems.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-14
Chuteng Zhou; Prad Kadambi; Matthew Mattina; Paul N. Whatmough

The success of deep learning has brought forth a wave of interest in computer hardware design to better meet the high demands of neural network inference. In particular, analog computing hardware has been heavily motivated specifically for accelerating neural networks, based on either electronic, optical or photonic devices, which may well achieve lower power consumption than conventional digital electronics. However, these proposed analog accelerators suffer from the intrinsic noise generated by their physical components, which makes it challenging to achieve high accuracy on deep neural networks. Hence, for successful deployment on analog accelerators, it is essential to be able to train deep neural networks to be robust to random continuous noise in the network weights, which is a somewhat new challenge in machine learning. In this paper, we advance the understanding of noisy neural networks. We outline how a noisy neural network has reduced learning capacity as a result of loss of mutual information between its input and output. To combat this, we propose using knowledge distillation combined with noise injection during training to achieve more noise robust networks, which is demonstrated experimentally across different networks and datasets, including ImageNet. Our method achieves models with as much as two times greater noise tolerance compared with the previous best attempts, which is a significant step towards making analog hardware practical for deep learning.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2019-12-11
Jan Moritz Joseph; Lennart Bamberg; Imad Hajjar; Anna Drewes; Behnam Razi Perjikolaei; Alberto García-Ortiz; Thilo Pionteck

We introduce ratatoskr, an open-source framework for in-depth power, performance and area (PPA) analysis in NoCs for 3D-integrated and heterogeneous System-on-Chips (SoCs). It covers all layers of abstraction by providing a NoC hardware implementation on RT level, a NoC simulator on cycle-accurate level and an application model on transaction level. By this comprehensive approach, ratatoskr can provide the following specific PPA analyses: Dynamic power of links can be measured within 2.4% accuracy of bit-level simulations while maintaining cycle-accurate simulation speed. Router power is determined from RT level synthesis combined with cycle-accurate simulations. The performance of the whole NoC can be measured both via cycle-accurate and RT level simulations. The performance of individual routers is obtained from RT level including gate-level verification. The NoC area is calculated from RT level. Despite these manifold features, ratatoskr offers easy two-step user interaction: first, a single point of entry allows design parameters to be set and, second, PPA reports are generated automatically. For both the input and the output, different levels of abstraction can be chosen for high-level rapid network analysis or low-level improvement of architectural details. The synthesized NoC model reduces total router power by up to 32% and router area by 3% in comparison to a conventional standard router. As a forward-thinking and unique feature not found in other NoC PPA-measurement tools, ratatoskr supports heterogeneous 3D integration, which is one of the most promising integration paradigms for upcoming SoCs. Thereby, ratatoskr lays the groundwork to design their communication architectures.

Updated: 2020-01-15
• arXiv.cs.AR Pub Date : 2020-01-11
Yongjune Kim; Yoocharn Jeon; Cyril Guyot; Yuval Cassuto

Magnetic random-access memory (MRAM) is a promising memory technology due to its high density, non-volatility, and high endurance. However, achieving high memory fidelity incurs significant write-energy costs, which should be reduced for large-scale deployment of MRAMs. In this paper, we formulate an optimization problem for maximizing the memory fidelity given energy constraints, and propose a biconvex optimization approach to solve it. The basic idea is to allocate non-uniform write pulses depending on the importance of each bit position. The fidelity measure we consider is minimum mean squared error (MSE), for which we propose an iterative water-filling algorithm. Although the iterative algorithm does not guarantee global optimality, we can choose a proper starting point that decreases the MSE exponentially and guarantees fast convergence. For an 8-bit accessed word, the proposed algorithm reduces the MSE by a factor of 21.

Updated: 2020-01-14
• arXiv.cs.AR Pub Date : 2020-01-13
Sai Aparna Aketi; Smriti Gupta; Huimei Cheng; Joycee Mekie; Peter A. Beerel

The risk of soft errors due to radiation continues to be a significant challenge for engineers trying to build systems that can handle harsh environments. Building systems that are Radiation Hardened by Design (RHBD) is the preferred approach, but existing techniques are expensive in terms of performance, power, and/or area. This paper introduces a novel soft-error resilient asynchronous bundled-data design template, SERAD, which uses a combination of temporal and spatial redundancy to mitigate Single Event Transients (SETs) and upsets (SEUs). SERAD uses Error Detecting Logic (EDL) to detect SETs at the inputs of sequential elements and correct them via re-sampling. Because SERAD only pays the delay penalty in the presence of an SET, which rarely occurs, its average performance is comparable to the baseline synchronous design. We tested the SERAD design using a combination of SPICE and Verilog simulations and evaluated its impact on area, frequency, and power on an open-core MIPS-like processor using an NCSU 45nm cell library. Our post-synthesis results show that the SERAD design consumes less than half of the area of Triple Modular Redundancy (TMR), exhibits significantly less performance degradation than Glitch Filtering (GF), and consumes no more total power than the baseline unhardened design.

Updated: 2020-01-14
• arXiv.cs.AR Pub Date : 2019-08-19
Karthik Ganesan; Srinivasa Shashank Nuthakki

Existing techniques to ensure functional correctness and hardware trust during pre-silicon verification face severe limitations. In this work, we systematically leverage two key ideas: 1) Symbolic Quick Error Detection (Symbolic QED or SQED), a recent bug detection and localization technique using Bounded Model Checking (BMC); and 2) Symbolic starting states, to present a method that: i) Effectively detects both "difficult" logic bugs and Hardware Trojans, even with long activation sequences where traditional BMC techniques fail; and ii) Does not need skilled manual guidance for writing testbenches, writing design-specific assertions, or debugging spurious counter-examples. Using open-source RISC-V cores, we demonstrate the following: 1. Quick (<5 minutes for an in-order scalar core and <2.5 hours for an out-of-order superscalar core) detection of 100% of hundreds of logic bug and hardware Trojan scenarios from commercial chips and research literature, and 97.9% of "extremal" bugs (randomly-generated bugs requiring ~100,000 activation instructions taken from random test programs). 2. Quick (~1 minute) detection of several previously unknown bugs in open-source RISC-V designs.

Updated: 2020-01-09
• arXiv.cs.AR Pub Date : 2020-01-03

Stochastic unary computing provides low-area circuits. However, the required area-consuming stochastic number generators (SNGs) in these circuits can diminish their overall gain in area, particularly if several SNGs are required. We propose area-efficient SNGs by sharing the permuted output of one linear feedback shift register (LFSR) among several SNGs. With no hardware overhead, the proposed architecture generates stochastic bit streams with minimum stochastic computing correlation (SCC). Compared to the circular shifting approach presented in prior work, our approach produces stochastic bit streams with 67% less average SCC when a 10-bit LFSR is shared between two SNGs. To generalize our approach, we propose an algorithm to find a set of m permutations (n>m>2) with minimum pairwise SCC, for an n-bit LFSR. The search space for finding permutations with exact minimum SCC grows rapidly as n increases, and for n>9 it is intractable to run a search algorithm using accurately calculated pairwise SCC values. We propose a similarity function that can be used in the proposed search algorithm to quickly find a set of permutations with SCC values close to the minimum one. We evaluated our approach for several applications. The results show that, compared to prior work, it achieves lower MSE with the same (or even lower) area. Additionally, based on simulation results, we show that replacing the comparator component of an SNG circuit with a weighted binary generator can reduce SCC.
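
To make the idea of sharing one LFSR between two SNGs concrete, the following is a minimal sketch and not the authors' implementation. It assumes an 8-bit Galois LFSR with feedback mask 0xB8, a comparator-based SNG, a bit-reversal permutation chosen purely for illustration, and the commonly used definition of SCC; the function names are hypothetical.

```python
def lfsr8_states(seed=1):
    """Yield one full period (255 states) of an 8-bit Galois LFSR.

    The mask 0xB8 corresponds to x^8 + x^6 + x^5 + x^4 + 1 (an assumed choice).
    """
    state = seed
    for _ in range(255):
        yield state
        lsb = state & 1
        state >>= 1
        if lsb:
            state ^= 0xB8


def permute_bits(value, perm):
    """Return a word whose bit i is bit perm[i] of `value`."""
    return sum(((value >> src) & 1) << dst for dst, src in enumerate(perm))


def sng(randoms, x):
    """Comparator-based SNG: emit 1 whenever the random word is below x."""
    return [1 if r < x else 0 for r in randoms]


def scc(a, b):
    """Stochastic computing correlation of two equal-length bit streams."""
    n = len(a)
    p1, p2 = sum(a) / n, sum(b) / n
    p12 = sum(u & v for u, v in zip(a, b)) / n
    delta = p12 - p1 * p2
    if delta > 0:
        denom = min(p1, p2) - p1 * p2
    else:
        denom = p1 * p2 - max(p1 + p2 - 1, 0)
    return delta / denom if denom else 0.0


# Illustrative permutation only: reverse the bit order of the shared LFSR word.
REVERSED = [7, 6, 5, 4, 3, 2, 1, 0]

rand_a = list(lfsr8_states())                          # first SNG: raw LFSR output
rand_b = [permute_bits(r, REVERSED) for r in rand_a]   # second SNG: permuted output

stream_a = sng(rand_a, 100)
stream_b = sng(rand_b, 180)
shared_scc = scc(stream_a, stream_b)    # correlation with the permuted source
unshared_scc = scc(stream_a, sng(rand_a, 180))  # same source for both streams
```

Comparing `shared_scc` with `unshared_scc` shows how much a given permutation decorrelates the two streams; a search over permutations, as the abstract describes, would repeat this measurement for many candidate permutations.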

Updated: 2020-01-08
• arXiv.cs.AR Pub Date : 2020-01-06
Mantas Mikaitis

General algorithms and a hardware accelerator for performing stochastic rounding (SR) are presented. The main goal is to augment the ARM M4F based multi-core processor SpiNNaker 2 with a more flexible rounding functionality than is available in the ARM processor itself. The motivation for adding such an accelerator in hardware is based on our previous results showing improvements in numerical accuracy of ODE solvers in fixed-point arithmetic with SR, compared to standard round-to-nearest or bit truncation rounding modes. Furthermore, performing SR purely in software can be expensive, due to the requirement of a pseudo-random number generator (PRNG), multiple masking and shifting instructions and an addition operation. Also, saturation of the rounded values is included, since rounding is usually followed by saturation, which is especially important in fixed-point arithmetic due to a narrow dynamic range of representable values. The main intended use of the accelerator is to round fixed-point multiplier outputs, which are returned unrounded by the ARM processor in a wider fixed-point format than the arguments.
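
The software cost mentioned above (a PRNG draw, an addition and a shift, followed by saturation) can be illustrated with a minimal sketch. This is a common textbook formulation of stochastic rounding for signed fixed-point integers, not the accelerator's actual algorithm; the function name, the `out_bits` parameter and the use of Python's `random.Random` as the PRNG are assumptions made for illustration.

```python
import random


def stochastic_round(value, drop_bits, rng, out_bits=32):
    """Drop `drop_bits` low-order bits of a signed fixed-point integer.

    The result rounds up with probability equal to the discarded fraction,
    then saturates to a signed `out_bits`-wide range.
    """
    if drop_bits > 0:
        # Uniform random integer in [0, 2**drop_bits).
        r = rng.getrandbits(drop_bits)
        # Adding r carries into the kept bits with probability frac / 2**drop_bits;
        # Python's >> on ints is an arithmetic (floor) shift.
        value = (value + r) >> drop_bits
    lo = -(1 << (out_bits - 1))
    hi = (1 << (out_bits - 1)) - 1
    return max(lo, min(hi, value))


rng = random.Random(0)
# 3.25 in a format with 8 fractional bits, rounded to an integer many times.
samples = [stochastic_round((3 << 8) + 64, 8, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)  # close to 3.25 on average
```

Each individual result is either 3 or 4, but the average over many roundings approaches the unrounded value, which is the property that distinguishes SR from round-to-nearest and truncation.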

Updated: 2020-01-07
• arXiv.cs.AR Pub Date : 2019-07-11
Ming Ling; Jiancong Ge; Guangmin Wang

To mitigate the performance gap between the CPU and main memory, multi-level cache architectures are widely used in modern processors. Therefore, modeling the behaviors of the downstream caches becomes a critical part of the processor performance evaluation in the early stage of Design Space Exploration (DSE). In this paper, we propose a fast and accurate L2 cache reuse distance histogram model, which can be used to predict the behaviors of the multi-level cache architectures where the L1 cache uses the LRU replacement policy and the L2 cache uses LRU/Random replacement policies. We use the profiled L1 reuse distance histogram and two newly proposed metrics, namely the RST table and the Hit-RDH, that describe more detailed information of the software traces as the inputs. For a given L1 cache configuration, the profiling results can be reused for different configurations of the L2 cache. The output of our model is the L2 cache reuse distance histogram, based on which the L2 cache miss rates can be evaluated. We compare the L2 cache miss rates with the results from gem5 cycle-accurate simulations of 15 benchmarks chosen from SPEC CPU 2006 and 9 benchmarks from SPEC CPU 2017. The average absolute error is less than 5%, while the evaluation time for each L2 configuration can be sped up almost 30X for four L2 cache candidates.
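
As background for how a reuse distance histogram yields a miss rate, here is a minimal single-level sketch. It assumes a fully associative LRU cache and the textbook stack-distance definition; it does not reproduce the paper's L2 model, the RST table or the Hit-RDH, and the function names are hypothetical.

```python
from collections import Counter

COLD = float("inf")  # reuse distance assigned to first-time accesses


def reuse_distance_histogram(trace):
    """Count, per access, the distinct blocks touched since its previous use."""
    stack = []        # LRU stack, most recently used block last
    hist = Counter()
    for block in trace:
        if block in stack:
            idx = stack.index(block)
            distance = len(stack) - 1 - idx
            stack.pop(idx)
        else:
            distance = COLD
        hist[distance] += 1
        stack.append(block)
    return hist


def lru_miss_rate(hist, capacity):
    """An access misses in a fully associative LRU cache of `capacity` blocks
    exactly when its reuse distance is at least `capacity`."""
    total = sum(hist.values())
    misses = sum(count for dist, count in hist.items() if dist >= capacity)
    return misses / total


trace = ["a", "b", "c", "a", "b", "d", "a"]
hist = reuse_distance_histogram(trace)
rate_3 = lru_miss_rate(hist, 3)   # 4 cold misses out of 7 accesses
rate_2 = lru_miss_rate(hist, 2)   # every access misses
```

Because the histogram is computed once and then queried for any capacity, a single profiling pass can be reused across cache sizes, which is the same reuse argument the abstract makes for L2 configurations.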

Updated: 2020-01-07
• arXiv.cs.AR Pub Date : 2020-01-03
Karthikeyan Nagarajan; Asmit De; Mohammad Nasim Imtiaz Khan; Swaroop Ghosh

In this paper, we investigate the advanced circuit features such as wordline- (WL) underdrive (prevents retention failure) and overdrive (assists write) employed in the peripherals of Dynamic RAM (DRAM) memories from a security perspective. In an ideal environment, these features ensure fast and reliable read and write operations. However, an adversary can re-purpose them by inserting Trojans to deliver malicious payloads such as fault injections, Denial-of-Service (DoS), and information leakage attacks when activated by the adversary. Simulation results indicate that wordline voltage can be increased to cause retention failure and thereby launch a DoS attack in DRAM memory. Furthermore, two wordlines or bitlines can be shorted to leak information or inject faults by exploiting the DRAM's refresh operation. We demonstrate an information leakage system exploit by implementing TrappeD on RocketChip SoC.

Updated: 2020-01-06
Contents have been reproduced by permission of the publishers.