Modeling Remapping Based Fault Tolerance Techniques for Chip Multiprocessor Cache with Design Space Exploration,Journal of Electronic Testing


Journal of Electronic Testing (IF 1.1), Pub Date: 2020-02-01, DOI: 10.1007/s10836-019-05852-6
Avishek Choudhury, Biplab K. Sikdar

In addition to wear-out failures and external particle interventions, voltage scaling, which is used to reduce power consumption in multiprocessors, makes caches more vulnerable to cell failures. Because voltage reduction is indispensable for prolonging the battery life of handheld devices, fault tolerance techniques are extremely important for ensuring fault-free execution at near-threshold voltage. Several fault tolerance techniques have been proposed, and remapping based techniques have been found effective in single-core systems. This work proposes an analytical model for remapping based fault tolerance techniques to evaluate the effectiveness of such schemes in multicore systems. Two metrics, Expected Miss Ratio in Multicore (EMRMC) and Expected Latency Ratio in Multicore (ELRMC), are introduced to characterize the behavior of remapping based techniques. EMRMC and ELRMC are defined as functions of the probability of cell failure (P_fail), block size, number of cores and number of threads. The system is simulated in Multi2sim 5.0, a multicore CPU-GPU simulator. The values of the metrics are analysed for different configuration parameters (probability of cell failure, number of cores, number of blocks, block size and number of threads) to frame guidelines for configuring a system that performs better under remapping based fault tolerance. It is observed that EMRMC is proportional to P_fail and block size, inversely proportional to the number of cores and threads, and unaffected by the number of blocks. In contrast, ELRMC is inversely proportional to P_fail and block size and proportional to the number of cores and threads. It is also observed that ELRMC is independent of the number of cores and blocks. EMRMC is best minimized for P_fail ≤ 1e-4, block size ≤ 64 bytes, number of cores ≥ 4 and number of threads ≥ 2. On the other hand, ELRMC is best observed for P_fail ≤ 1e-4, block size ≥ 64 bytes, number of cores ≥ 4 and number of threads 2.
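
The abstract does not give the formulas for EMRMC or ELRMC, so the sketch below does not reproduce the paper's model. It only illustrates, under two assumptions that are not taken from the source (cell failures are independent, and a block counts as faulty when any one of its cells fails), why a larger block size and a larger P_fail both increase the share of blocks a remapping scheme has to handle. The function name `block_fault_probability` is hypothetical.

```python
def block_fault_probability(p_fail: float, block_size_bytes: int) -> float:
    """Probability that a cache block holds at least one faulty cell.

    Assumptions (not from the paper): each bit cell fails independently
    with probability p_fail, and one faulty cell makes the block faulty.
    """
    bits = 8 * block_size_bytes
    # A block is fault free only if every one of its cells is fault free.
    return 1.0 - (1.0 - p_fail) ** bits


if __name__ == "__main__":
    # Sweep values around the thresholds quoted in the abstract.
    for p_fail in (1e-5, 1e-4, 1e-3):
        for size in (32, 64, 128):
            prob = block_fault_probability(p_fail, size)
            print(f"P_fail={p_fail:g}  block={size:>3} B  P(block faulty)={prob:.4f}")
```

Under these assumptions the block fault probability grows with both parameters, which agrees in direction with the reported trend for EMRMC. It says nothing about the dependence on cores and threads, which requires the paper's full model.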


Updated: 2020-02-01
