Abstract
High performance is critical to achieving and maintaining the success of a software system. Performance anomalies are performance degradation issues (e.g., slowdowns in system response times) that software systems exhibit at run-time. Performance anomalies can severely harm users' satisfaction. Prior studies propose different approaches to detect anomalies by analyzing execution logs and resource utilization metrics after the anomalies have happened. However, these detection approaches cannot predict anomalies ahead of time; this limitation causes an inevitable delay in taking corrective actions to prevent performance anomalies from happening. We propose an approach that can predict performance anomalies in software systems and raise anomaly warnings in advance. Our approach uses a Long Short-Term Memory (LSTM) neural network to capture the normal behaviors of a software system. Then, our approach predicts performance anomalies by identifying early deviations from the captured normal system behaviors. We conduct extensive experiments to evaluate our approach using two real-world software systems (i.e., Elasticsearch and Hadoop). We compare the performance of our approach with two baselines. The first baseline is a state-of-the-art approach called Unsupervised Behavior Learning. The second baseline predicts performance anomalies by checking whether the resource utilization exceeds pre-defined thresholds. Our results show that our approach can predict various performance anomalies with high precision (i.e., 97–100%) and recall (i.e., 80–100%), while the baselines achieve 25–97% precision and 93–100% recall. For a range of performance anomalies, our approach can achieve sufficient lead times that vary from 20 to 1,403 s (i.e., 23.4 min). We also demonstrate the ability of our approach to predict the performance anomalies that are caused by real-world performance bugs.
When predicting performance anomalies that are caused by real-world performance bugs, our approach achieves 95–100% precision and 87–100% recall, while the baselines achieve 49–83% precision and 100% recall. These results show that our approach outperforms the existing anomaly prediction approaches and is able to predict performance anomalies in real-world systems.
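The idea of flagging early deviations from learned normal behavior, and of comparing it against a fixed-threshold baseline, can be illustrated with a minimal sketch. This is not the implementation evaluated in the paper: a moving-average forecaster stands in for the LSTM so that the example has no dependencies, and every function name, the window size, the factor `k`, and the sample utilization values are assumptions made only for illustration.

```python
from statistics import mean, stdev


def moving_average_forecast(history, window=5):
    """Stand-in predictor (NOT an LSTM): mean of the last `window` values."""
    return mean(history[-window:])


def deviation_warnings(train, stream, window=5, k=3.0):
    """Indices in `stream` whose forecast error exceeds a limit learned from normal data.

    The limit is mean + k * stdev of the forecast errors on `train`;
    both `window` and `k` are illustrative assumptions.
    """
    errors = [
        abs(train[i] - moving_average_forecast(train[:i], window))
        for i in range(window, len(train))
    ]
    limit = mean(errors) + k * stdev(errors)

    history = list(train)
    warnings = []
    for i, value in enumerate(stream):
        if abs(value - moving_average_forecast(history, window)) > limit:
            warnings.append(i)
        history.append(value)  # observed values are appended, anomalous or not
    return warnings


def threshold_warnings(stream, threshold):
    """Threshold-style baseline: warn only when utilization exceeds a fixed limit."""
    return [i for i, value in enumerate(stream) if value > threshold]


def precision_recall(predicted, actual):
    """Precision and recall of predicted warning indices against labeled ones."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical CPU utilization samples (percent): a normal period, then a slow ramp.
train = [50, 52, 51, 49, 50, 51, 50, 52, 49, 51, 50, 50]
stream = [51, 50, 55, 60, 66, 73, 81, 90]

early = deviation_warnings(train, stream)
late = threshold_warnings(stream, threshold=85)
lead_samples = late[0] - early[0]  # samples gained over the threshold baseline
```

On this made-up trace, the deviation check raises its first warning several samples before the utilization crosses the fixed threshold, which is the kind of lead time the abstract refers to. Swapping the stand-in forecaster for a trained sequence model would leave the surrounding logic unchanged.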
Index Terms
- Predicting Performance Anomalies in Software Systems at Run-time