当前位置: X-MOL 学术arXiv.cs.SE › 论文详情
DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services
arXiv - CS - Software Engineering Pub Date : 2019-10-11 , DOI: arxiv-1910.05339
Chetan Bansal; Sundararajan Renganathan; Ashima Asudani; Olivier Midy; Mathru Janakiraman

Large scale cloud services use Key Performance Indicators (KPIs) for tracking and monitoring performance. They usually have Service Level Objectives (SLOs) baked into the customer agreements which are tied to these KPIs. Dependency failures, code bugs, infrastructure failures, and other problems can cause performance regressions. It is critical to minimize the time and manual effort in diagnosing and triaging such issues to reduce customer impact. Large volumes of logs and mixed type of attributes (categorical, continuous) makes any automated or manual diagnosing non-trivial. In this paper, we present the design, implementation and experience from building and deploying DeCaf, a system for automated diagnosis and triaging of KPI issues using service logs. It uses machine learning along with pattern mining to help service owners automatically root cause and triage performance issues. We present the learnings and results from case studies on two large scale cloud services in Microsoft where DeCaf successfully diagnosed 10 known and 31 unknown issues. DeCaf also automatically triages the identified issues by leveraging historical data. Our key insights are that for any such diagnosis tool to be effective in practice, it should a) scale to large volumes of service logs and attributes, b) support different types of KPIs and ranking functions, c) be integrated into the DevOps processes.
更新日期:2020-01-04

 

全部期刊列表>>
化学/材料学中国作者研究精选
Springer Nature 2019高下载量文章和章节
《科学报告》最新环境科学研究
ACS材料视界
自然科研论文编辑服务
中南大学国家杰青杨华明
剑桥大学-
中国科学院大学化学科学学院
材料化学和生物传感方向博士后招聘
课题组网站
X-MOL
北京大学分子工程苏南研究院
华东师范大学分子机器及功能材料
中山大学化学工程与技术学院
试剂库存
天合科研
down
wechat
bug