System management recovery in NoC-based many-core systems

Analog Integrated Circuits and Signal Processing

Abstract

Technology node scaling has enabled NoC-based many-cores with dozens to hundreds of processing elements (PEs). Despite the processing power offered by the large number of processors and the communication flexibility provided by NoCs, the many-core resources must be managed to ensure scalability. Executing the management tasks requires a PE reserved exclusively for them; such processors are called manager PEs (MPEs). In large-scale systems, a centralized approach would impose a significant load on the MPE, and a permanent fault in the MPE would compromise the entire system. The distributed approach adopted in this work, with hierarchically organized MPEs, reduces the management load and confines a fault in an MPE to the PEs managed by the faulty MPE. The literature presents several fault-tolerant proposals targeting the NoC or the processors. However, there is a significant gap in fault-tolerant methods at the system level, i.e., techniques targeting the MPEs. The goal of this paper is to present a recovery method for when an MPE becomes faulty and to propose a protocol that safely migrates the management software to a new PE. If no processor is available to receive the kernel that was executing on the faulty processor, the method uses task migration to release one. The proposal is transparent to the applications running on the many-core, with an execution-time overhead between 1.5 and 1.65 ms during the management and task migration.
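The recovery decision described in the abstract (use a free PE if one exists, otherwise release a PE through task migration) can be illustrated with a minimal sketch. It is not the protocol from the paper: the `PE` class, the `capacity` limit, the `recover_manager` function, and the least-loaded selection policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PE:
    pe_id: int
    faulty: bool = False
    tasks: list = field(default_factory=list)  # tasks mapped on this PE
    capacity: int = 2  # assumed maximum number of tasks per PE


def recover_manager(cluster, faulty_mpe_id):
    """Choose a PE to host the management kernel after an MPE fault.

    Returns (new_mpe_id, migrations); migrations is a list of
    (task, source_pe_id, destination_pe_id) tuples.
    Returns (None, []) when no PE can be released.
    """
    healthy = [pe for pe in cluster
               if not pe.faulty and pe.pe_id != faulty_mpe_id]

    # Case 1: an idle PE exists and can receive the kernel directly.
    for pe in healthy:
        if not pe.tasks:
            return pe.pe_id, []

    # Case 2: no idle PE. Try to release one, least-loaded first
    # (hypothetical policy), by migrating its tasks to the others.
    for victim in sorted(healthy, key=lambda pe: len(pe.tasks)):
        others = [pe for pe in healthy if pe is not victim]
        free_slots = sum(pe.capacity - len(pe.tasks) for pe in others)
        if free_slots < len(victim.tasks):
            continue  # not enough room elsewhere for this PE's tasks
        migrations = []
        for task in list(victim.tasks):
            dest = next(pe for pe in others if len(pe.tasks) < pe.capacity)
            victim.tasks.remove(task)
            dest.tasks.append(task)
            migrations.append((task, victim.pe_id, dest.pe_id))
        return victim.pe_id, migrations

    return None, []
```

For example, with a faulty MPE at PE 0 and two busy PEs that each hold one task, the sketch frees PE 1 by moving its task to PE 2 and then selects PE 1 as the new manager. A real system would also have to transfer the management state and notify the affected PEs, which this sketch leaves out.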

Author information

Corresponding author

Correspondence to Fernando Gehm Moraes.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author Vinicius Fochi is financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001. Author Fernando Gehm Moraes is supported by FAPERGS (17/2551-196-1) and CNPq (302531/2016-5), Brazilian funding agencies.

About this article

Cite this article

Fochi, V., Caimi, L.L., Silva, M.H.d. et al. System management recovery in NoC-based many-core systems. Analog Integr Circ Sig Process 106, 85–98 (2021). https://doi.org/10.1007/s10470-020-01631-y

  • DOI: https://doi.org/10.1007/s10470-020-01631-y
