Detecting anomalies in microservices with execution trace comparison
Future Generation Computer Systems (IF 6.2), Pub Date: 2020-11-04, DOI: 10.1016/j.future.2020.10.040
Lun Meng, Feng Ji, Yao Sun, Tao Wang

More and more developers and companies have adopted the microservice architecture. Detecting anomalies and locating their root causes are important for improving the reliability of microservices. Current approaches typically monitor the metrics of physical resources and rely on manually set alarm rules. However, they often require domain knowledge to detect anomalies, and they cannot accurately locate the faulty microservices that cause them. To address these issues, we propose an anomaly detection approach for microservice applications that compares execution traces. First, we use dynamic instrumentation to collect execution traces across microservices and describe each trace as a call tree. Second, we calculate the anomaly degree of a trace with tree edit distance to detect structural anomalies, and analyze the differences between traces to locate the components causing them. Third, to detect response time anomalies, we use principal component analysis to locate the suspicious component calls that cause response time fluctuations. Finally, we evaluate the approach with Bench4Q, a TPC-W-based benchmark, and Social Network, a typical microservice-based application. The results demonstrate that the approach achieves 81%–97% precision and 75%–99% recall in detecting anomalies caused by injected CPU, network, memory and service faults.
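
The structural-anomaly step can be illustrated with a small sketch. The abstract does not give the paper's exact algorithm, cost model, or normalization, so everything below is an assumption made for illustration: call trees are nested `(label, children)` tuples, every insert, delete, and relabel costs 1, and the "anomaly degree" is the edit distance divided by the total node count of both trees. The service names (`gateway`, `auth`, `orders`, `db`) are hypothetical.

```python
from functools import lru_cache


def node(label, *children):
    """Build a call-tree node as a hashable (label, children) tuple."""
    return (label, tuple(children))


def size(forest):
    """Count all nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests.

    Uses the standard rightmost-root recursion; fine for small trees,
    not tuned for large traces.
    """
    if not f:
        return size(g)  # insert every remaining node of g
    if not g:
        return size(f)  # delete every remaining node of f
    v_label, v_children = f[-1]
    w_label, w_children = g[-1]
    # Delete the rightmost root of f; its children move up into the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert the rightmost root of g (symmetric case).
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match the two roots, relabeling if the labels differ.
    match = (
        forest_distance(v_children, w_children)
        + forest_distance(f[:-1], g[:-1])
        + (0 if v_label == w_label else 1)
    )
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))


def anomaly_degree(trace, baseline):
    """Assumed normalization: distance over total node count, in [0, 1]."""
    total = size((trace,)) + size((baseline,))
    return tree_edit_distance(trace, baseline) / total


# Hypothetical baseline trace and a trace where the `db` call is missing.
baseline = node("gateway", node("auth"), node("orders", node("db")))
observed = node("gateway", node("auth"), node("orders"))

print(tree_edit_distance(observed, baseline))
print(anomaly_degree(observed, baseline))
```

Under these assumptions, a trace identical to the baseline scores 0, and the score grows as calls are missing, added, or replaced. The nodes involved in the edits are natural candidates for the components to inspect, which is the intuition behind comparing traces to locate faults.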




Updated: 2020-11-17