Unsupervised modeling anomaly detection in discussion forums posts using global vectors for text representation


Natural Language Engineering (IF 2.3), Pub Date: 2020-03-04, DOI: 10.1017/s1351324920000066
Paweł Cichosz

Anomaly detection can be seen as an unsupervised learning task in which a predictive model created on historical data is used to detect outlying instances in new data. This work addresses a possibly promising but relatively uncommon application of anomaly detection to text data. Two English-language and one Polish-language Internet discussion forums devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana, serve as text sources that are both realistic and possibly interesting in their own right, due to potential associations with drug-related crime. The utility of two different vector text representations is examined: the simple bag-of-words representation and the more refined Global Vectors (GloVe) representation, which is an example of the increasingly popular word embedding approach. Both are combined with two unsupervised anomaly detection methods: one based on one-class support vector machines (SVM) and one based on dissimilarity to k-medoids clusters. The GloVe representation is found to be clearly more useful for anomaly detection, permitting better detection quality and mitigating the curse-of-dimensionality issues that affect text clustering. The cluster dissimilarity approach combined with this representation outperforms one-class SVM with respect to detection quality and appears to be a more promising approach to anomaly detection in text data.
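The cluster dissimilarity idea summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation and rests on several assumptions: the two-dimensional word vectors in `TOY_VECTORS` are invented stand-ins for pretrained GloVe embeddings, a post is represented by the average of its word vectors, cosine dissimilarity is the distance measure, and k-medoids uses a simple alternating update with the first k points as initial medoids. All function names are hypothetical.

```python
import math

# Hypothetical 2-D word vectors standing in for pretrained GloVe embeddings
# (real GloVe vectors typically have 50 to 300 dimensions).
TOY_VECTORS = {
    "plant": (1.0, 0.1), "seed": (0.9, 0.2), "grow": (0.8, 0.0),
    "soil": (0.9, 0.1), "light": (0.7, 0.2),
    "price": (0.1, 1.0), "sell": (0.0, 0.9), "ship": (0.2, 0.8),
}


def embed(post):
    """Represent a post as the average of its known word vectors (an assumption)."""
    vectors = [TOY_VECTORS[w] for w in post.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("post contains no known words")
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def dissimilarity(a, b):
    """Cosine dissimilarity: 0 for the same direction, up to 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def k_medoids(points, k, max_iter=20):
    """Alternating k-medoids; the first k points serve as initial medoids."""
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: dissimilarity(p, points[m]))
            clusters[nearest].append(i)
        # Update step: in each cluster, pick the member with the lowest
        # total dissimilarity to the other members.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # keep the old medoid if its cluster is empty
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dissimilarity(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return [points[m] for m in medoids]


def anomaly_score(post, medoids):
    """Score a new post by its dissimilarity to the nearest cluster medoid."""
    vector = embed(post)
    return min(dissimilarity(vector, m) for m in medoids)


# "Historical" posts used to build the model (invented examples).
historical_posts = [
    "plant seed grow", "seed soil grow", "plant soil light", "grow light plant",
]
medoids = k_medoids([embed(p) for p in historical_posts], k=2)

for post in ("seed soil light", "price sell ship"):
    print(post, round(anomaly_score(post, medoids), 3))
```

In this toy setup, a post whose vocabulary resembles the historical posts lies close to a medoid and receives a low score, while a post dominated by unrelated vocabulary receives a high one. A threshold on the score would then turn it into an anomaly decision; choosing that threshold is outside the scope of this sketch.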


Updated: 2020-03-04
