An Improved KNN-Based Efficient Log Anomaly Detection Method with Automatically Labeled Samples
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2021-04-21, DOI: 10.1145/3441448
Shi Ying 1 , Bingming Wang 1 , Lu Wang 2 , Qingshan Li 2 , Yishi Zhao 3 , Jianga Shang 3 , Hao Huang 1 , Guoli Cheng 1 , Zhe Yang 1 , Jiangyi Geng 1

Logs that record abnormal system states (anomaly logs) can be regarded as outliers, and the k-Nearest Neighbor (kNN) algorithm achieves relatively high accuracy among outlier detection methods. We therefore use the kNN algorithm to detect anomalies in log data. However, applying kNN to this task raises several problems, three of which are: excessive vector dimensionality makes kNN inefficient, unlabeled log data cannot support kNN, and the imbalance in the number of log samples distorts kNN's classification decisions. To solve these three problems, we propose an efficient log anomaly detection method based on an improved kNN algorithm with an automatically labeled sample set. The method first introduces a log parsing approach based on N-grams and frequent pattern mining (FPM), which reduces the dimension of the log vectors produced with Term Frequency-Inverse Document Frequency (TF-IDF) encoding. We then use clustering and self-training to obtain a labeled log sample set from historical logs automatically. Finally, we improve the kNN algorithm with an average weighting technique, which raises its accuracy on imbalanced samples. The method is validated on six log datasets of different types.
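
The abstract does not spell out how the average weighting works, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that "average weighting" means scoring each class by the mean similarity of its members among the k nearest neighbors, instead of counting votes. The function names, the toy log lines, and the whitespace tokenizer are all hypothetical, and the N-gram/FPM parsing and self-training steps are omitted.

```python
import math
from collections import Counter, defaultdict


def fit_idf(docs):
    """Smoothed inverse document frequency for whitespace-separated tokens."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc.lower().split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def vectorize(doc, idf):
    """Sparse TF-IDF vector as a dict; tokens unseen during fitting are dropped."""
    tf = Counter(tok for tok in doc.lower().split() if tok in idf)
    total = sum(tf.values()) or 1
    return {tok: count / total * idf[tok] for tok, count in tf.items()}


def cosine_similarity(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest(query, samples, labels, k):
    """Return the k (similarity, label) pairs most similar to the query."""
    scored = [(cosine_similarity(query, s), lab) for s, lab in zip(samples, labels)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def knn_majority(query, samples, labels, k=3):
    """Plain kNN: the label with the most votes among the k neighbors wins."""
    votes = Counter(lab for _, lab in nearest(query, samples, labels, k))
    return votes.most_common(1)[0][0]


def knn_average_weighted(query, samples, labels, k=3):
    """Assumed variant: the label with the highest mean neighbor similarity wins."""
    by_label = defaultdict(list)
    for sim, lab in nearest(query, samples, labels, k):
        by_label[lab].append(sim)
    return max(by_label, key=lambda lab: sum(by_label[lab]) / len(by_label[lab]))


# Hypothetical, imbalanced sample set: four normal logs, one anomaly log.
logs = [
    "connection established to host",
    "connection closed by host",
    "user login succeeded",
    "user logout succeeded",
    "disk failure detected on device",
]
labels = ["normal"] * 4 + ["anomaly"]

idf = fit_idf(logs)
samples = [vectorize(log, idf) for log in logs]
query = vectorize("disk failure detected on device sda", idf)

print(knn_majority(query, samples, labels, k=3))
print(knn_average_weighted(query, samples, labels, k=3))
```

With k=3 on this toy set, the single anomaly sample is the closest neighbor but is outvoted by two unrelated normal logs, so the majority vote returns "normal". Averaging similarity per class returns "anomaly", which illustrates why some form of per-class weighting helps on imbalanced samples.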

Updated: 2021-04-21