Clustering error messages produced by distributed computing infrastructure during the processing of high energy physics data, International Journal of Modern Physics A


Clustering error messages produced by distributed computing infrastructure during the processing of high energy physics data
International Journal of Modern Physics A (IF 1.4), Pub Date: 2021-04-20, DOI: 10.1142/s0217751x21500706
Maria Grigorieva (1, 2), Dmitry Grin (3)


Large-scale distributed computing infrastructures ensure the operation and maintenance of scientific experiments at the LHC: more than 160 computing centers around the world execute tens of millions of computing jobs per day. ATLAS, the largest experiment at the LHC, creates an enormous flow of data that has to be recorded and analyzed by a complex, heterogeneous, distributed computing environment. Statistically, about 10–12% of computing jobs end in failure: network faults, service failures, authorization failures, and other error conditions trigger error messages that provide detailed information about the issue and can be used for diagnosis and proactive fault handling. However, this analysis is complicated by the sheer scale of the textual log data, and the difficulty is often compounded by the lack of a well-defined structure: human experts have to interpret the detected messages and create parsing rules manually, which is time-consuming and cannot identify previously unknown error conditions without further human intervention. This paper describes a pipeline of methods for the unsupervised clustering of multi-source error messages. The pipeline is data-driven, based on machine learning algorithms, and executed fully automatically, categorizing error messages according to textual patterns and meaning.
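To make the idea of grouping messages by textual pattern concrete, here is a minimal Python sketch. It is not the authors' pipeline: the abstract does not specify the algorithms, so this example swaps in simple regex-based template masking for the machine learning steps. The masking rules, the function names `to_template` and `cluster_messages`, and the sample messages are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical masking rules (assumptions, not taken from the paper).
# Each rule replaces a variable part of a message with a placeholder so
# that messages sharing a textual pattern collapse to one template.
# Order matters: paths and hex values are masked before plain numbers.
MASKS = [
    (re.compile(r"(?:/[\w.\-]+)+"), "<PATH>"),
    (re.compile(r"\b0x[0-9a-f]+\b"), "<HEX>"),
    (re.compile(r"\d+"), "<NUM>"),
]


def to_template(message):
    """Lowercase a message and mask its variable parts."""
    template = message.strip().lower()
    for pattern, placeholder in MASKS:
        template = pattern.sub(placeholder, template)
    return template


def cluster_messages(messages):
    """Group messages by template; returns {template: [messages]}."""
    clusters = defaultdict(list)
    for message in messages:
        clusters[to_template(message)].append(message)
    return dict(clusters)


# Made-up sample messages, for illustration only.
messages = [
    "Connection timed out after 30 seconds to host 10.0.0.5",
    "Connection timed out after 45 seconds to host 10.0.0.17",
    "File /data/run42/out.root not found",
    "File /tmp/job7/log.txt not found",
]

clusters = cluster_messages(messages)
for template, members in clusters.items():
    print(len(members), template)
```

Exact template matching only groups messages that differ in the masked fields. Grouping by meaning, as the abstract describes, would need a similarity-based method on top of or instead of this step.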


Updated: 2021-04-20
