System management recovery in NoC-based many-core systems

Fochi, Vinicius; Caimi, Luciano L.; Silva, Marcelo H. da; Moraes, Fernando Gehm

doi:10.1007/s10470-020-01631-y

Published: 12 March 2020

Analog Integrated Circuits and Signal Processing, Volume 106, pages 85–98 (2021)

Vinicius Fochi¹,
Luciano L. Caimi²,
Marcelo H. da Silva¹ &
…
Fernando Gehm Moraes ORCID: orcid.org/0000-0001-6126-6847¹

Abstract

The reduction of technology nodes enabled the emergence of NoC-based many-cores with dozens to hundreds of processing elements (PEs). Despite the processing power offered by a large number of processors and the communication flexibility provided by NoCs, the many-core resources must be managed to ensure scalability. Executing the management tasks requires PEs reserved exclusively for these actions, named manager PEs (MPEs). A centralized approach would impose a significant load on the MPE in large-scale systems, and a permanent fault in the MPE would compromise the entire system. The distributed approach adopted in this work, with MPEs organized hierarchically, reduces the management load and confines the effect of a faulty MPE to the PEs it manages. The literature presents several fault-tolerant proposals targeting the NoC or the processors. However, there is a significant gap in fault-tolerant methods at the system level, i.e., fault-tolerant techniques for the MPEs. The goal of this paper is to present a recovery method for when an MPE becomes faulty and to propose a protocol that safely migrates the management software to a new PE. If no processor is available to receive the kernel that was executing on the faulty processor, the method uses task migration to release one. The proposal is transparent to the applications running on the many-core, with an execution time overhead between 1.5 and 1.65 ms during the management and task migration.
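The recovery decision summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol: the `PE` class, the per-PE `slots` capacity, the least-loaded victim choice, and the `recover_manager` function are all hypothetical and assumed only for illustration. The sketch shows the two cases the abstract names: an idle PE receives the management kernel directly, otherwise tasks are migrated away from one PE to release it.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    """Hypothetical processing element; `slots` is an assumed task capacity."""
    pe_id: int
    slots: int = 2
    tasks: list = field(default_factory=list)
    faulty: bool = False

    def spare(self):
        return self.slots - len(self.tasks)


def recover_manager(pes, faulty_mpe_id):
    """Return (new_mpe_id, migrations) for a cluster whose MPE failed.

    `migrations` is a list of (task, source_pe, target_pe) tuples.
    Returns (None, []) when no PE can be released.
    """
    candidates = [p for p in pes if p.pe_id != faulty_mpe_id and not p.faulty]

    # Case 1: an idle PE can receive the management kernel directly.
    for p in candidates:
        if not p.tasks:
            return p.pe_id, []

    # Case 2: no idle PE, so release one by migrating its tasks elsewhere.
    # Assumed policy: try the least-loaded PE first.
    for victim in sorted(candidates, key=lambda p: len(p.tasks)):
        others = [p for p in candidates if p is not victim]
        if sum(p.spare() for p in others) < len(victim.tasks):
            continue  # not enough spare slots to empty this PE
        migrations = []
        for task in list(victim.tasks):
            target = max(others, key=lambda p: p.spare())
            victim.tasks.remove(task)
            target.tasks.append(task)
            migrations.append((task, victim.pe_id, target.pe_id))
        return victim.pe_id, migrations

    return None, []
```

For example, with a failed manager `PE(0)` and two worker PEs each running one task, `recover_manager` moves the task of one worker to the other and returns the emptied PE as the new manager. A real system would also have to restore the management state on the new PE, which this sketch does not model.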

Author information

Authors and Affiliations

School of Technology, PUCRS, Av. Ipiranga 6681, Porto Alegre, 90619-900, Brazil
Vinicius Fochi, Marcelo H. da Silva & Fernando Gehm Moraes
UFFS, Av. Fernando Machado 108E, Chapecó, 89802-112, Brazil
Luciano L. Caimi

Corresponding author

Correspondence to Fernando Gehm Moraes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author Vinicius Fochi is financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. Author Fernando Gehm Moraes is supported by FAPERGS (17/2551-196-1) and CNPq (302531/2016-5), Brazilian funding agencies.

About this article

Cite this article

Fochi, V., Caimi, L.L., Silva, M.H.d. et al. System management recovery in NoC-based many-core systems. Analog Integr Circ Sig Process 106, 85–98 (2021). https://doi.org/10.1007/s10470-020-01631-y

Received: 31 July 2019
Revised: 03 January 2020
Accepted: 06 March 2020
Published: 12 March 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s10470-020-01631-y
