Efficient Outlier Detection in Text Corpus Using Rare Frequency and Ranking
ACM Transactions on Knowledge Discovery from Data (IF 3.6), Pub Date: 2020-10-03, DOI: 10.1145/3399712
Wathsala Anupama Mohotti, Richi Nayak

Outlier detection in text data collections has become significant because of the need to find anomalies in a myriad of text data sources. High feature dimensionality, together with the large size of these document collections, calls for outlier detection methods that are both accurate and efficient. Traditional outlier detection methods face several challenges when dealing with text data, including data sparseness, distance concentration, and the presence of a large number of sub-groups. In this article, we propose to address these issues by developing novel concepts such as representing documents by rare document frequency, finding ranking-based neighborhoods for similarity computation, and identifying sub-dense local neighborhoods in high dimensions. To improve the proposed primary method based on rare document frequency, we present several novel ensemble approaches that use the ranking concept to reduce false identifications while finding a higher number of true outliers. Extensive empirical analysis shows that the proposed method and its ensemble variations improve the quality of outlier detection in document repositories and scale well compared with the relevant benchmark methods.
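
As a rough illustration of the idea of scoring documents by rare document frequency, the following sketch ranks documents by the share of their distinct terms that occur in very few documents. This is not the authors' algorithm: the function names `rare_term_scores` and `rank_outliers`, the whitespace tokenizer, and the `max_df` cutoff are all assumptions made for this example, and the abstract does not specify how the paper defines or combines these quantities.

```python
from collections import Counter


def rare_term_scores(docs, max_df=1):
    """Return, per document, the fraction of its distinct terms that are rare.

    A term counts as rare when it appears in at most `max_df` documents.
    Both the cutoff and the scoring rule are illustrative assumptions,
    not the definition used in the paper.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [set(doc.lower().split()) for doc in docs]

    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in tokenized for term in terms)

    scores = []
    for terms in tokenized:
        if not terms:
            scores.append(0.0)  # an empty document has no rare terms
            continue
        rare = sum(1 for term in terms if df[term] <= max_df)
        scores.append(rare / len(terms))
    return scores


def rank_outliers(docs, top_k=1, max_df=1):
    """Return the indices of the `top_k` documents with the highest scores."""
    scores = rare_term_scores(docs, max_df=max_df)
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


if __name__ == "__main__":
    corpus = [
        "stock market prices rise",
        "stock market prices fall",
        "market prices rise again",
        "stock prices fall sharply",
        "penguins huddle near antarctic ice",
    ]
    print(rank_outliers(corpus, top_k=1))
```

In this toy corpus the last document shares no vocabulary with the others, so it receives the highest score. A ranking-based neighborhood or an ensemble step, as described in the abstract, would be layered on top of such a score; those parts are not sketched here.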

Updated: 2020-10-03