Abstract
Automating fault detection in communication networks and distributed systems is challenging and usually requires both supporting tools and the expertise of system operators. Automated event monitoring and correlation systems produce event data that is forwarded to system operators, who analyze error events and create fault reports. Machine learning methods help not only to analyze event data more precisely but also to forecast possible error events by learning from existing faults. This study introduces an automated fault detection system that assists system operators in detecting and forecasting faults. The system exploits bug knowledge resources from various online repositories, together with log events and status parameters from the monitored system, and applies bug analysis and event filtering methods to evaluate events and forecast faults. It comprises a fault data model for collecting bug reports, a feature and semantic filtering method for correlating log events, and machine learning methods for evaluating the severity, priority and relation of log events and for forecasting forthcoming critical faults of the monitored system. We have evaluated a prototype implementation of the proposed system on a high-performance computing cluster and provide an analysis with lessons learned.
Acknowledgements
This research activity is funded by Vietnam National University in Ho Chi Minh City (VNU-HCM) under the grant number B2017-28-01.
Cite this article
Van Nguyen, S., Tran, H.M. An automated fault detection system for communication networks and distributed systems. Appl Intell 51, 5405–5419 (2021). https://doi.org/10.1007/s10489-020-02026-2
DOI: https://doi.org/10.1007/s10489-020-02026-2