An automated fault detection system for communication networks and distributed systems
Applied Intelligence (IF 3.4), Pub Date: 2021-01-08, DOI: 10.1007/s10489-020-02026-2
Sinh Van Nguyen, Ha Manh Tran

Automating fault detection in communication networks and distributed systems is challenging and usually requires both supporting tools and the expertise of system operators. Automated event monitoring and correlation systems produce event data that is forwarded to system operators, who analyze error events and create fault reports. Machine learning methods help not only to analyze event data more precisely but also to forecast possible error events by learning from existing faults. This study introduces an automated fault detection system that assists system operators in detecting and forecasting faults. The system exploits bug knowledge resources from various online repositories, together with log events and status parameters from the monitored system, and applies bug analysis and event filtering methods to evaluate events and forecast faults. It contains a fault data model to collect bug reports, a feature and semantic filtering method to correlate log events, and machine learning methods to evaluate the severity, priority and relation of log events and to forecast forthcoming critical faults of the monitored system. We have evaluated a prototype implementation of the proposed system on a high performance computing cluster and provide an analysis with lessons learned.




Updated: 2021-01-08