Novel lockstep-based fault mitigation approach for SoCs with roll-back and roll-forward recovery
Microelectronics Reliability ( IF 1.6 ) Pub Date : 2021-08-05 , DOI: 10.1016/j.microrel.2021.114297
Server Kasap 1 , Eduardo Weber Wächter 2 , Xiaojun Zhai 3 , Shoaib Ehsan 3 , Klaus D. McDonald-Maier 3

All-Programmable System-on-Chips (APSoCs) are a compelling option for deploying applications in radiation environments thanks to their high computing performance and power efficiency. Despite these advantages, APSoCs are sensitive to radiation like any other electronic device. Processors embedded in APSoCs therefore have to be adequately hardened against ionizing radiation to make them a viable design choice for harsh environments. This paper proposes a novel lockstep-based approach to harden the dual-core ARM Cortex-A9 processor in the Xilinx Zynq-7000 APSoC against radiation-induced soft errors by coupling it with a MicroBlaze TMR subsystem in the programmable logic (PL) layer of the Zynq. The proposed technique combines checkpointing with roll-back and roll-forward mechanisms at the software level (i.e. software redundancy) and processor replication and checker circuits at the hardware level (i.e. hardware redundancy). Results of fault injection experiments show that the proposed approach achieves high levels of protection against soft errors, mitigating around 98% of bit-flips injected into the register files of both ARM cores while keeping timing performance overhead as low as 25% if block and application sizes are adjusted appropriately. Furthermore, adding the roll-forward recovery operation alongside the roll-back operation improves the Mean Workload between Failures (MWBF) of the system by up to ≈19%, depending on the nature of the running application: when a fault occurs, the application can proceed faster if it is handled with roll-forward rather than roll-back. Thus, relatively more data can be processed before the next error occurs in the system.
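
The difference between roll-back and roll-forward recovery can be illustrated with a small software simulation. This is a minimal sketch and not the paper's implementation: the two replicas are plain Python function calls rather than ARM cores, and the way the fault-free replica is identified (a stored parity bit, standing in for whatever arbitration the real system performs) is an assumption made purely for illustration. The names `parity`, `run_block`, `flip_bit`, and `lockstep_run` are hypothetical.

```python
def parity(value):
    """Return the parity (0 or 1) of the set bits in a non-negative integer."""
    return bin(value).count("1") % 2


def run_block(state, block):
    """Process one block of data starting from a checkpointed state."""
    acc = state["acc"] + sum(block)
    return {"acc": acc, "parity": parity(acc)}


def flip_bit(state, bit):
    """Model a soft error: flip one bit of the value, leave stored parity stale."""
    return {"acc": state["acc"] ^ (1 << bit), "parity": state["parity"]}


def lockstep_run(blocks, faults=None, roll_forward=True):
    """Run every block on two replicas and compare their states after each block.

    faults maps a block index to (replica index, bit position); each fault
    is injected once. Returns (final result, block executions per replica).
    """
    pending = dict(faults or {})
    checkpoint = {"acc": 0, "parity": 0}
    executions = 0
    i = 0
    while i < len(blocks):
        replicas = [run_block(checkpoint, blocks[i]) for _ in range(2)]
        executions += 1
        if i in pending:
            victim, bit = pending.pop(i)
            replicas[victim] = flip_bit(replicas[victim], bit)

        if replicas[0] == replicas[1]:
            # States agree: commit a new checkpoint and move on.
            checkpoint = replicas[0]
            i += 1
            continue

        # Mismatch detected. Assumption: a parity check tells us which
        # replica still holds a consistent state.
        good = [r for r in replicas if parity(r["acc"]) == r["parity"]]
        if roll_forward and len(good) == 1:
            # Roll-forward: adopt the fault-free state and continue.
            checkpoint = good[0]
            i += 1
        # Otherwise roll back: keep the old checkpoint and re-execute block i.
    return checkpoint["acc"], executions


blocks = [[1, 2], [3, 4], [5, 6]]
fault = {1: (0, 3)}  # flip bit 3 in replica 0 after block 1
print(lockstep_run(blocks, fault, roll_forward=True))   # (21, 3)
print(lockstep_run(blocks, fault, roll_forward=False))  # (21, 4)
```

Both recovery modes reach the same correct result, but roll-back re-executes the faulty block while roll-forward does not. In this toy model that saved re-execution is the only source of the speed-up; the ≈19% MWBF improvement reported in the abstract comes from the authors' measurements, not from this sketch.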




Updated: 2021-08-05