Addressing Unreliability in Emerging Devices and Non-von Neumann Architectures Using Coded Computing
Proceedings of the IEEE (IF 20.6), Pub Date: 2020-08-01, DOI: 10.1109/jproc.2020.2986362
Sanghamitra Dutta, Haewon Jeong, Yaoqing Yang, Viveck Cadambe, Tze Meng Low, Pulkit Grover

Computing systems are evolving rapidly. At the device level, emerging devices are beginning to compete with traditional CMOS systems. At the architecture level, novel architectures are successfully avoiding the communication bottleneck that is a central feature, and a central limitation, of the von Neumann architecture. Furthermore, such systems are increasingly plagued by unreliability. This unreliability arises at the device or gate level in emerging devices, and can percolate up to the processor or system level if left unchecked. The goal of this article is to survey recent advances in reliable computing using unreliable elements, with an eye on nonsilicon and non-von Neumann architectures. We first observe that instead of aiming for generic computing problems, the community could use “dwarfs of modern computing,” first noted in the high-performance computing (HPC) community, as a starting point. These computing problems are the basic building blocks of almost all scientific computing, machine learning, and data analytics today. Next, we survey the state of the art in “coded computing,” which is an emerging area that advances on classical algorithm-based fault-tolerance (ABFT) and brings a fundamental information-theoretic perspective. By weaving error-correcting codes into a computing algorithm, coded computing provides dramatic improvements on solutions, and obtains novel fundamental limits, for problems that have been open for more than 30 years. We introduce existing and novel coded computing techniques in the context of “coded dwarfs,” where a specific dwarf’s computation is made resilient by applying coding. We discuss how, for the same redundancy, “coded dwarfs” are significantly more resilient compared to classical techniques such as replication. Furthermore, by examining a widely popular computation task, training large neural networks, we demonstrate how coded dwarfs can be applied to address this fundamentally nonlinear problem. Finally, we discuss practical challenges and future directions in implementing coded computing techniques on emerging and existing nonsilicon and/or non-von Neumann architectures.
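
To make the comparison with replication concrete, here is a minimal sketch of one well-known coded computing idea applied to matrix-vector multiplication, one of the linear-algebra building blocks the abstract alludes to. It is an illustration under stated assumptions, not the construction from the article: it assumes an erasure model (a worker either returns a correct result or nothing), a matrix with an even number of rows, and three hypothetical workers. The function names `matvec`, `encode`, and `decode` are invented for this example.

```python
def matvec(rows, x):
    """Plain matrix-vector product on lists of numbers."""
    return [sum(a * b for a, b in zip(row, x)) for row in rows]


def encode(A):
    """Split A into two row blocks A1, A2 and add one parity block A1 + A2.

    Assumes len(A) is even. Each of the three blocks goes to one worker.
    """
    if len(A) % 2 != 0:
        raise ValueError("this sketch assumes an even number of rows")
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]


def decode(results):
    """Recover A @ x from any two of the three worker outputs.

    `results` maps a worker index (0, 1, or 2) to that worker's output.
    """
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]  # A2 x = (A1 + A2) x - A1 x
    else:
        y2 = results[1]
        y1 = [p - b for p, b in zip(results[2], y2)]  # A1 x = (A1 + A2) x - A2 x
    return y1 + y2


# Example data chosen for illustration only.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [1, -1]
outputs = {i: matvec(block, x) for i, block in enumerate(encode(A))}

# Losing any single worker still yields the full product.
for lost in range(3):
    survivors = {i: v for i, v in outputs.items() if i != lost}
    assert decode(survivors) == matvec(A, x)
```

With the same three workers, a replication scheme could duplicate only one of the two blocks, so losing the worker that holds the unreplicated block would leave the product unrecoverable. The coded version tolerates the loss of any one worker at the same 1.5x redundancy, which is the kind of gain the abstract attributes to coding over replication.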

Updated: 2020-08-01