DiagNet: towards a generic, Internet-scale root cause analysis solution

arXiv - CS - Artificial Intelligence. Pub Date: 2020-04-07, DOI: arxiv-2004.03343
Loïck Bonniot (WIDE), Christoph Neumann, François Taïani (WIDE)

Diagnosing problems in Internet-scale services remains particularly difficult and costly for both content providers and ISPs. Because the Internet is decentralized, the cause of such problems might lie anywhere between an end-user's device and the service datacenters. Further, the set of possible problems and causes is not known in advance, making it impossible in practice to train a classifier with all combinations of problems, causes and locations. In this paper, we explore how different machine learning techniques can be used for Internet-scale root cause analysis using measurements taken from end-user devices. We show how to build generic models that (i) are agnostic to the underlying network topology, (ii) do not require the full set of possible causes to be defined during training, and (iii) can be quickly adapted to diagnose new services. Our solution, DiagNet, adapts concepts from image processing research to handle network and system metrics. We evaluate DiagNet on a multi-cloud deployment of online services with injected faults, using automated browsers to emulate clients. We demonstrate promising root cause analysis capabilities, with a recall of 73.9%, including for causes introduced only at inference time.
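
The abstract reports recall but does not say how it is computed. As a rough illustration only, the sketch below uses one common definition: the fraction of faulty samples whose true root cause was the one diagnosed, both overall and per cause. The function names and the cause labels are hypothetical and invented for this example; they are not taken from the paper or its dataset.

```python
from collections import Counter


def overall_recall(true_causes, predicted_causes):
    """Fraction of faulty samples whose true cause was correctly diagnosed."""
    hits = sum(t == p for t, p in zip(true_causes, predicted_causes))
    return hits / len(true_causes)


def recall_per_cause(true_causes, predicted_causes):
    """Map each cause to (correct diagnoses of it) / (actual occurrences of it)."""
    actual = Counter(true_causes)
    correct = Counter(t for t, p in zip(true_causes, predicted_causes) if t == p)
    return {cause: correct[cause] / count for cause, count in actual.items()}


# Hypothetical labels, made up for illustration (not data from the paper).
truth = ["cpu_stress", "link_latency", "link_latency", "packet_loss"]
preds = ["cpu_stress", "link_latency", "packet_loss", "packet_loss"]

print(overall_recall(truth, preds))    # 0.75
print(recall_per_cause(truth, preds))  # link_latency is found in 1 of 2 cases
```

Breaking recall down per cause makes it possible to check separately how well a model handles causes that were absent from training, which is the setting the abstract highlights.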

Updated: 2020-04-08
