An automated fault detection system for communication networks and distributed systems

Applied Intelligence

Abstract

Automating fault detection in communication networks and distributed systems is challenging and usually requires supporting tools and the expertise of system operators. Automated event monitoring and correlation systems produce event data that is forwarded to system operators, who analyze error events and create fault reports. Machine learning methods help not only to analyze event data more precisely but also to forecast possible error events by learning from existing faults. This study introduces an automated fault detection system that assists system operators in detecting and forecasting faults. The system exploits bug knowledge resources from various online repositories together with log events and status parameters from the monitored system, and applies bug analysis and event filtering methods to evaluate events and forecast faults. It comprises a fault data model for collecting bug reports, a feature and semantic filtering method for correlating log events, and machine learning methods for evaluating the severity, priority and relation of log events and for forecasting forthcoming critical faults of the monitored system. We have evaluated a prototype implementation of the proposed system on a high-performance computing cluster and provide an analysis with lessons learned.
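
To make the idea of filtering correlated log events more concrete, the following is a minimal sketch, not the method used in the article. It assumes a hypothetical `LogEvent` record, and it uses token-set Jaccard similarity within a time window as a simple stand-in for the feature and semantic filtering mentioned in the abstract. The field names, the 60-second window and the 0.6 threshold are all illustrative assumptions.

```python
from dataclasses import dataclass
import re


@dataclass(frozen=True)
class LogEvent:
    # Hypothetical record layout; the article's fault data model is not shown here.
    timestamp: float  # seconds since an arbitrary start
    source: str       # node or component that emitted the event
    message: str      # free-text log message


def tokens(message: str) -> frozenset:
    """Lower-case alphabetic tokens; digits are dropped so counters and IDs are ignored."""
    return frozenset(re.findall(r"[a-z]+", message.lower()))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_events(events, window=60.0, threshold=0.6):
    """Keep one representative per burst of similar events.

    An event is dropped when an already kept event from the same source,
    at most `window` seconds earlier, has token similarity >= `threshold`.
    """
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        redundant = any(
            event.source == k.source
            and event.timestamp - k.timestamp <= window
            and jaccard(tokens(event.message), tokens(k.message)) >= threshold
            for k in kept
        )
        if not redundant:
            kept.append(event)
    return kept


# Illustrative, made-up events (not data from the article).
events = [
    LogEvent(0.0, "node01", "disk sda1 read error at sector 1024"),
    LogEvent(5.0, "node01", "disk sda1 read error at sector 2048"),
    LogEvent(7.0, "node02", "link eth0 down"),
    LogEvent(200.0, "node01", "disk sda1 read error at sector 4096"),
]
filtered = filter_events(events)
for e in filtered:
    print(e.timestamp, e.source, e.message)
```

In this sketch the second event is suppressed as a near-duplicate of the first, while the last event is kept because it falls outside the time window. A real deployment would likely feed the filtered events into a classifier for severity and priority, which is outside the scope of this example.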

Notes

  1. https://github.com/hivaids2512/mitm-thesis


Acknowledgements

This research activity is funded by Vietnam National University in Ho Chi Minh City (VNU-HCM) under the grant number B2017-28-01.

Author information

Corresponding author

Correspondence to Ha Manh Tran.

About this article

Cite this article

Van Nguyen, S., Tran, H.M. An automated fault detection system for communication networks and distributed systems. Appl Intell 51, 5405–5419 (2021). https://doi.org/10.1007/s10489-020-02026-2

  • DOI: https://doi.org/10.1007/s10489-020-02026-2
