System management recovery in NoC-based many-core systems
Analog Integrated Circuits and Signal Processing (IF 1.4), Pub Date: 2020-03-12, DOI: 10.1007/s10470-020-01631-y
Vinicius Fochi, Luciano L. Caimi, Marcelo H. da Silva, Fernando Gehm Moraes

Abstract

The reduction of technology nodes has enabled the emergence of NoC-based many-cores with dozens to hundreds of processing elements (PEs). Despite the processing power offered by a large number of processors and the communication flexibility provided by NoCs, the many-core resources must be managed to ensure scalability. Executing the management tasks requires a PE reserved exclusively for that purpose; such processors are named manager PEs (MPEs). A centralized approach would induce a significant load on the MPE in large-scale systems, and a permanent fault in the MPE would compromise the entire system. A distributed approach with hierarchically organized MPEs, the organization adopted in this work, reduces the management load, and a fault in an MPE compromises only the PEs managed by the faulty MPE. The literature presents several fault-tolerant proposals targeting the NoC or the processors. However, there is a significant gap in fault-tolerant methods at the system level, i.e., fault-tolerant techniques for the MPEs. The goal of this paper is to present a recovery method for when an MPE becomes faulty, and to propose a protocol to migrate the management software safely to a new PE. If no processor is available to receive the kernel that was executing on the faulty processor, the method adopts task migration to release one. The proposal is transparent to the applications running in the many-core, with an execution-time overhead varying between 1.5 and 1.65 ms during the management and task migration.
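
The recovery flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol: the `PE` class, the `recover_manager` function, and the "least-loaded PE" selection policy are all hypothetical, chosen only to show the two cases the abstract names (a free PE takes over the management role directly; otherwise task migration releases a PE first).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PE:
    """Hypothetical model of one processing element in a cluster."""
    pe_id: int
    tasks: list = field(default_factory=list)  # names of user tasks mapped here
    faulty: bool = False
    is_manager: bool = False


def recover_manager(cluster: list, failed: PE) -> Optional[PE]:
    """Choose a replacement for a faulty MPE inside its own cluster.

    Assumed policy (not from the paper): take the least-loaded healthy PE;
    if it still runs tasks, migrate them to its peers so it becomes free.
    Returns the new manager, or None when recovery is impossible.
    """
    failed.faulty = True
    failed.is_manager = False
    healthy = [pe for pe in cluster if not pe.faulty and not pe.is_manager]
    if not healthy:
        return None
    # An idle PE has zero tasks, so it is picked first and needs no migration.
    target = min(healthy, key=lambda pe: len(pe.tasks))
    peers = [pe for pe in healthy if pe is not target]
    if target.tasks and not peers:
        return None  # no PE left to receive the migrated tasks
    while target.tasks:
        # Task migration: move each task to the currently least-loaded peer.
        dest = min(peers, key=lambda pe: len(pe.tasks))
        dest.tasks.append(target.tasks.pop())
    target.is_manager = True  # stands in for reloading the management kernel
    return target
```

In this toy model, a cluster such as `[PE(0, is_manager=True), PE(1, ["a"]), PE(2, ["b", "c"])]` with PE 0 failing would first move task `a` from PE 1 to PE 2 and then promote PE 1. Timing, message exchange over the NoC, and the hierarchical MPE organization are deliberately left out.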




Updated: 2020-03-12