Silent Data Corruptions at Scale
arXiv - CS - Distributed, Parallel, and Cluster Computing Pub Date : 2021-02-22 , DOI: arxiv-2102.11245
Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar

Silent Data Corruption (SDC) can have a negative impact on large-scale infrastructure services. SDCs are not captured by error reporting mechanisms within a Central Processing Unit (CPU) and hence are not traceable at the hardware level. However, the data corruptions propagate across the stack and manifest as application-level problems. These types of errors can result in data loss and can require months of debug engineering time. In this paper, we describe common defect types observed in silicon manufacturing that lead to SDCs. We discuss a real-world example of silent data corruption within a datacenter application. Using a case study, we present the debug flow followed to root-cause and triage faulty instructions within a CPU, as an illustration of how to debug this class of errors. We provide a high-level overview of the mitigations to reduce the risk of silent data corruptions within a large production fleet. In our large-scale infrastructure, we have run a vast library of silent error test scenarios across hundreds of thousands of machines in our fleet. This has resulted in the detection of hundreds of CPUs exhibiting these errors, showing that SDCs are a systemic issue across generations. We have monitored SDCs for a period longer than 18 months. Based on this experience, we determine that reducing silent data corruptions requires not only hardware resiliency and production detection mechanisms, but also robust fault-tolerant software architectures.

Updated: 2021-02-23