research-article

Predicting Performance Anomalies in Software Systems at Run-time

Authors:
Guoliang Zhao, School of Computing, Queen’s University, Canada (ORCID: 0000-0003-0152-5100)
Safwat Hassan, Department of Engineering, Thompson Rivers University, Canada
Ying Zou, Department of Electrical and Computer Engineering, Canada
Derek Truong, IBM, Canada
Toby Corbin, IBM, United Kingdom

ACM Transactions on Software Engineering and Methodology, Volume 30, Issue 3, Article No. 33, pp. 1–33. https://doi.org/10.1145/3440757

Published: 23 April 2021

Abstract

High performance is critical to achieving and maintaining the success of a software system. Performance anomalies are performance degradation issues (e.g., slowdowns in system response times) that software systems exhibit at run-time, and they can severely harm user satisfaction. Prior studies propose approaches that detect anomalies by analyzing execution logs and resource utilization metrics after the anomalies have happened. However, these detection approaches cannot predict anomalies ahead of time; this limitation causes an inevitable delay in taking corrective actions to prevent performance anomalies from happening. We propose an approach that predicts performance anomalies in software systems and raises anomaly warnings in advance. Our approach uses a Long Short-Term Memory (LSTM) neural network to capture the normal behaviors of a software system. It then predicts performance anomalies by identifying early deviations from the captured normal behaviors. We conduct extensive experiments to evaluate our approach on two real-world software systems (i.e., Elasticsearch and Hadoop) and compare it with two baselines. The first is a state-of-the-art approach called Unsupervised Behavior Learning. The second predicts performance anomalies by checking whether resource utilization exceeds pre-defined thresholds. Our results show that our approach can predict various performance anomalies with high precision (i.e., 97–100%) and recall (i.e., 80–100%), while the baselines achieve 25–97% precision and 93–100% recall. For a range of performance anomalies, our approach achieves sufficient lead times that vary from 20 to 1,403 s (i.e., 23.4 min). We also demonstrate that our approach can predict performance anomalies caused by real-world performance bugs: it achieves 95–100% precision and 87–100% recall, while the baselines achieve 49–83% precision and 100% recall. The results show that our approach outperforms the existing anomaly prediction approaches and is able to predict performance anomalies in real-world systems.
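
The idea of warning on early deviations from normal behavior, and the way lead time, precision, and recall are measured, can be illustrated with a minimal sketch. This is not the authors' implementation: a rolling-mean forecast stands in for the LSTM model, and the function names, the 10 s sampling interval, the window size, the thresholds, and the sample trace are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev


def threshold_warnings(series, limit):
    """Threshold-style baseline: warn at every sample above a fixed limit."""
    return [i for i, value in enumerate(series) if value > limit]


def deviation_warnings(series, window, k):
    """Warn when a sample strays from a forecast of normal behavior.

    Assumption: the forecast is the mean of the previous `window` samples,
    a simple stand-in for a learned model such as an LSTM.
    """
    warnings = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        forecast = mean(history)
        spread = stdev(history) or 1e-9  # avoid a zero-width tolerance band
        if abs(series[i] - forecast) > k * spread:
            warnings.append(i)
    return warnings


def lead_time(first_warning, anomaly_onset, interval_s):
    """Seconds between the first warning and the anomaly onset."""
    return (anomaly_onset - first_warning) * interval_s


def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from counts of predicted anomalies."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical CPU utilization (%) sampled every 10 s; the anomaly
# (e.g., a response-time violation) is assumed to start at index 13.
cpu = [40, 42, 41, 39, 40, 41, 40, 42, 55, 63, 72, 81, 90, 96]
onset = 13

early = deviation_warnings(cpu, window=5, k=3)
late = threshold_warnings(cpu, limit=85)
print("deviation lead time (s):", lead_time(early[0], onset, 10))
print("threshold lead time (s):", lead_time(late[0], onset, 10))
```

In this toy trace the deviation rule fires during the ramp-up, before the fixed threshold is crossed, which is the intuition behind predicting from early deviations. Lead times on a real system depend on the workload, the monitored metrics, and the model used.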


Index Terms

1. Software and its engineering
  1. Software creation and management
    1. Software post-development issues

Published in
ACM Transactions on Software Engineering and Methodology Volume 30, Issue 3
Continuous Special Section: AI and SE
July 2021
600 pages
ISSN:1049-331X
EISSN:1557-7392
DOI:10.1145/3450566
Editor:
Mauro Pezzè
Università della Svizzera italiana and Università di Milano-Bicocca, Switzerland
Issue’s Table of Contents
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 23 April 2021
- Revised: 1 November 2020
- Accepted: 1 November 2020
- Received: 1 March 2020
Author Tags
LSTM neural network
Performance anomaly prediction
software systems
Qualifiers
- research-article
- Research
- Refereed