Abstract
In addition to wear-out failures and external particle strikes, the voltage scaling used to reduce power consumption in multiprocessors makes caches more vulnerable to cell failures. Because voltage reduction is indispensable for prolonging the battery life of handheld devices, fault tolerance techniques are essential to ensure fault-free execution at near-threshold voltage. Several fault tolerance techniques have been proposed, and remapping-based techniques have been found effective in single-core systems. This work proposes an analytical model for remapping-based fault tolerance techniques to evaluate their effectiveness in multicore systems. Two metrics, Expected Miss Ratio in Multicore (EMRMC) and Expected Latency Ratio in Multicore (ELRMC), are introduced to characterize the behavior of remapping-based techniques. EMRMC and ELRMC are defined as functions of the probability of cell failure (Pfail), block size, number of cores, and number of threads. The system is simulated in Multi2Sim 5.0, a multicore CPU-GPU simulator. The metric values for different configuration parameters (probability of cell failure, number of cores, number of blocks, block size, and number of threads) are analysed to frame system configuration guidelines that deliver better performance under remapping-based fault tolerance. It is observed that EMRMC is proportional to Pfail and block size, inversely proportional to the number of cores and threads, and unaffected by the number of blocks. In contrast, ELRMC is inversely proportional to Pfail and block size and proportional to the number of cores and threads. It is also observed that the ELRMC is independent of the number of cores and blocks. EMRMC is best minimized for Pfail ≤ 1e-4, block size ≤ 64 bytes, number of cores ≥ 4, and number of threads ≥ 2. ELRMC, on the other hand, is best observed for Pfail ≤ 1e-4, block size ≥ 64 bytes, number of cores ≥ 4 and number of threads 2.
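To give some intuition for why both Pfail and block size matter, the following is a minimal sketch of how the probability that a cache block contains at least one faulty cell grows with these two parameters. It is not the EMRMC or ELRMC model of this work. It assumes independent cell failures and counts only the data bits of a block; the function name `block_failure_probability` is hypothetical.

```python
def block_failure_probability(p_fail: float, block_size_bytes: int) -> float:
    """Probability that a block has at least one faulty cell.

    Assumptions (not taken from the paper): cell failures are independent,
    and only the data bits of the block are counted (no tag or status bits).
    """
    bits = block_size_bytes * 8
    # A block is fault-free only if every one of its cells is fault-free.
    return 1.0 - (1.0 - p_fail) ** bits


if __name__ == "__main__":
    # Sweep a few illustrative values of Pfail and block size.
    for p_fail in (1e-5, 1e-4, 1e-3):
        for block_size in (32, 64, 128):
            p_block = block_failure_probability(p_fail, block_size)
            print(f"Pfail={p_fail:.0e}  block={block_size:>3} B  "
                  f"P(block faulty)={p_block:.4f}")
```

Under these assumptions, the fraction of faulty blocks rises with both Pfail and block size, which is consistent with the reported trend that EMRMC grows with these two parameters.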
Additional information
Responsible Editor: C.-W. Wu
About this article
Cite this article
Choudhury, A., Sikdar, B.K. Modeling Remapping Based Fault Tolerance Techniques for Chip Multiprocessor Cache with Design Space Exploration. J Electron Test 36, 59–73 (2020). https://doi.org/10.1007/s10836-019-05852-6
DOI: https://doi.org/10.1007/s10836-019-05852-6