Large-scale memory failure prediction using mcelog-based Data Mining and Machine Learning
arXiv - CS - Performance, Pub Date: 2021-04-24, DOI: arxiv-2105.04547
Chengdong Yao

In data centers, unexpected downtime caused by memory failures can degrade the stability of individual servers and even of the entire information technology infrastructure, which harms the business. Accurately predicting memory failures in advance has therefore become one of the most important research problems for data centers. Predicting memory failures in a production system, however, requires solving technical problems such as heavy data noise and extreme imbalance between positive and negative samples, while also ensuring the long-term stability of the algorithm. This paper compares and summarizes some commonly used techniques and the improvements they can bring. The single model we propose placed in the top 15 of the 2nd Alibaba Cloud AIOps Competition, held as part of the 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Updated: 2021-05-11