research-article

Predicting Performance Anomalies in Software Systems at Run-time

Published: 23 April 2021

Abstract

High performance is a critical factor in achieving and maintaining the success of a software system. Performance anomalies are performance degradation issues (e.g., slowdowns in system response times) that software systems exhibit at run-time. They can severely harm users’ satisfaction. Prior studies propose different approaches to detect anomalies by analyzing execution logs and resource utilization metrics after the anomalies have happened. However, these detection approaches cannot predict the anomalies ahead of time; this limitation causes an inevitable delay in taking corrective actions to prevent performance anomalies from happening. We propose an approach that can predict performance anomalies in software systems and raise anomaly warnings in advance. Our approach uses a Long Short-Term Memory (LSTM) neural network to capture the normal behaviors of a software system. Then, our approach predicts performance anomalies by identifying the early deviations from the captured normal system behaviors. We conduct extensive experiments to evaluate our approach using two real-world software systems (i.e., Elasticsearch and Hadoop). We compare the performance of our approach with two baselines. The first baseline is a state-of-the-art approach called Unsupervised Behavior Learning. The second baseline predicts performance anomalies by checking if the resource utilization exceeds pre-defined thresholds. Our results show that our approach can predict various performance anomalies with high precision (i.e., 97–100%) and recall (i.e., 80–100%), while the baselines achieve 25–97% precision and 93–100% recall. For a range of performance anomalies, our approach can achieve sufficient lead times that vary from 20 to 1,403 s (i.e., 23.4 min). We also demonstrate the ability of our approach to predict the performance anomalies that are caused by real-world performance bugs.
For predicting performance anomalies that are caused by real-world performance bugs, our approach achieves 95–100% precision and 87–100% recall, while the baselines achieve 49–83% precision and 100% recall. The obtained results show that our approach outperforms the existing anomaly prediction approaches and is able to predict performance anomalies in real-world systems.
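
As a rough illustration of the deviation-based idea summarized above, the following sketch learns an error threshold from anomaly-free data, raises a warning at the first sample that deviates from the forecast, and computes lead time, precision, and recall. It is not the authors' implementation: a simple moving-average forecaster stands in for the LSTM, and every name, the window size, the factor `k`, and the sample values are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def fit_error_threshold(normal, window=3, k=3.0):
    """Learn a deviation threshold from anomaly-free samples.

    Assumption: a moving average replaces the LSTM forecaster.
    """
    errors = [abs(normal[i] - mean(normal[i - window:i]))
              for i in range(window, len(normal))]
    return mean(errors) + k * stdev(errors)

def first_warning(series, threshold, window=3):
    """Return the index of the first sample that deviates from the
    forecast by more than the threshold, or None if there is none."""
    for i in range(window, len(series)):
        forecast = mean(series[i - window:i])
        if abs(series[i] - forecast) > threshold:
            return i
    return None

def lead_time(warning_index, anomaly_index, interval_s):
    """Seconds between the warning and the actual anomaly."""
    return (anomaly_index - warning_index) * interval_s

def precision_recall(tp, fp, fn):
    """Precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical memory-utilization samples (percent), one per 10 s.
normal = [50, 51, 49, 50, 52, 50, 49, 51, 50, 50]
leaking = [50, 51, 50, 49, 50, 58, 66, 75, 90, 99]

threshold = fit_error_threshold(normal)
warning = first_warning(leaking, threshold)
print("warning at sample", warning)
# Assume the anomaly itself is observed at the last sample (index 9).
print("lead time (s):", lead_time(warning, 9, interval_s=10))
```

A fixed-threshold baseline would instead compare each raw sample against a constant limit, which typically reacts only once utilization is already high; comparing against a forecast is one way to flag the deviation earlier.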


    • Published in

      ACM Transactions on Software Engineering and Methodology  Volume 30, Issue 3
      Continuous Special Section: AI and SE
      July 2021
      600 pages
      ISSN:1049-331X
      EISSN:1557-7392
      DOI:10.1145/3450566
      • Editor: Mauro Pezzè

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 April 2021
      • Revised: 1 November 2020
      • Accepted: 1 November 2020
      • Received: 1 March 2020


      Qualifiers

      • research-article
      • Research
      • Refereed
