HyCA: A Hybrid Computing Architecture for Fault Tolerant Deep Learning
arXiv - CS - Hardware Architecture, Pub Date: 2021-06-09, DOI: arxiv-2106.04772
Dawen Xu, Qianlong Wang, Cheng Liu, Cheng Chu, Ying Wang, Huawei Li, Xiaowei Li, Kwang-Ting Cheng

Hardware faults in the regular 2-D computing array of a typical deep learning accelerator (DLA) can cause dramatic loss of prediction accuracy. To avoid excessive hardware overhead, prior redundancy designs typically assign each homogeneous redundant processing element (PE) to mitigate faulty PEs within a limited region of the 2-D computing array rather than across the entire array. However, they fail to recover the computing array when the number of faulty PEs in any region exceeds the number of redundant PEs in that region. This mismatch problem worsens as the fault injection rate rises and when the faults are unevenly distributed. To address the problem, we propose a hybrid computing architecture (HyCA) for fault-tolerant DLAs. It has a set of dot-production processing units (DPPUs) that recompute all the operations mapped to the faulty PEs, regardless of where the faulty PEs are located. According to our experiments, HyCA shows significantly higher reliability, scalability, and performance with less chip area penalty than conventional redundancy approaches. Moreover, by taking advantage of the flexible recomputing, HyCA can also be used to scan the entire 2-D computing array and detect faulty PEs effectively at runtime.
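
The mismatch problem described in the abstract can be illustrated with a small toy model. The sketch below is not the authors' implementation: the function names, the row-wise region mapping, the one-spare-per-row setting, and the idea of a fixed `pool_capacity` are all assumptions made for illustration. It only contrasts a recovery check that is bound to regions with one that is independent of fault location.

```python
from collections import Counter

def region_redundancy_recovers(faulty_pes, region_of, spares_per_region):
    """Region-bound redundancy: recovery succeeds only if no region has
    more faulty PEs than it has spare PEs (hypothetical toy model)."""
    faults_per_region = Counter(region_of(pe) for pe in set(faulty_pes))
    return all(n <= spares_per_region for n in faults_per_region.values())

def shared_pool_recovers(faulty_pes, pool_capacity):
    """Location-independent recomputing: recovery succeeds if the total
    number of faulty PEs fits in the shared pool, wherever they are.
    pool_capacity is an abstraction, not a parameter from the paper."""
    return len(set(faulty_pes)) <= pool_capacity

def row_region(pe):
    """Assumed region mapping: each row of the array is one region."""
    row, _col = pe
    return row

# Two fault patterns on a hypothetical 4x4 array, PEs given as (row, col).
clustered = [(0, 0), (0, 3)]  # both faults fall in row 0
spread = [(0, 0), (2, 3)]     # faults fall in different rows

clustered_region_ok = region_redundancy_recovers(clustered, row_region, 1)
spread_region_ok = region_redundancy_recovers(spread, row_region, 1)
clustered_pool_ok = shared_pool_recovers(clustered, 4)
spread_pool_ok = shared_pool_recovers(spread, 4)
```

With one spare per row, the clustered pattern is unrecoverable under the region-bound check even though only two PEs are faulty, while the shared-pool check accepts both patterns. How many faulty PEs a real set of DPPUs can cover is a design parameter of the architecture that this sketch does not model.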

Updated: 2021-06-10