Fake opinion detection: how similar are crowdsourced datasets to real data?
Language Resources and Evaluation (IF 1.7), Pub Date: 2020-03-28, DOI: 10.1007/s10579-020-09486-5
Tommaso Fornaciari, Leticia Cagnina, Paolo Rosso, Massimo Poesio

Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because it is normally impossible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this way are generally successful at discriminating between ‘genuine’ online reviews and the crowdsourced deceptive reviews. It has been argued that the deceptive reviews obtained via crowdsourcing are very different from real fake reviews, but the claim has never been properly tested. In this paper, we compare (false) crowdsourced reviews with a set of ‘real’ fake reviews published online. We evaluate their degree of similarity and their usefulness in training models for the detection of untrustworthy reviews. We find that the deceptive reviews collected via crowdsourcing are significantly different from the fake reviews published online. In the case of the artificially produced deceptive texts, their domain similarity with the target texts turns out to affect the models’ performance much more than their untruthfulness does. This suggests that models trained on crowdsourced datasets for opinion spam detection may not transfer to the real task of detecting deceptive reviews. As an alternative way to create large datasets for the fake review detection task, we propose methods based on the probabilistic annotation of unlabeled texts, relying on meta-information generally available on e-commerce sites. Such methods are independent of the content of the reviews and make it possible to train reliable models for the detection of fake reviews.
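The abstract does not spell out how the probabilistic annotation works. The sketch below is a minimal, hypothetical illustration (not the authors' method) of how content-independent metadata cues of the kind e-commerce sites typically expose could be combined into a soft label. All field names, cues, and weights here are invented for this example.

from dataclasses import dataclass

@dataclass
class ReviewMeta:
    """Metadata of the kind e-commerce sites expose (all fields hypothetical)."""
    verified_purchase: bool      # whether the platform confirmed the transaction
    reviewer_review_count: int   # total reviews posted by this account
    account_age_days: int        # age of the reviewer's account
    rating_deviation: float      # |review rating - product's mean rating|, in stars

def fake_probability(meta: ReviewMeta) -> float:
    """Combine metadata cues into a soft (probabilistic) label in [0, 1].

    Each cue contributes a heuristic 'suspicion' score; the noisy-OR
    combination treats the cues as independent evidence of deception.
    The weights are illustrative, not taken from the paper.
    """
    cues = [
        0.0 if meta.verified_purchase else 0.4,           # unverified purchases are riskier
        0.3 if meta.reviewer_review_count <= 1 else 0.0,  # throwaway accounts
        0.3 if meta.account_age_days < 30 else 0.0,       # very new accounts
        min(meta.rating_deviation / 4.0, 1.0) * 0.2,      # extreme ratings vs. consensus
    ]
    p_genuine = 1.0
    for c in cues:
        p_genuine *= 1.0 - c
    return 1.0 - p_genuine  # noisy-OR: probability the review is fake

if __name__ == "__main__":
    suspicious = ReviewMeta(verified_purchase=False, reviewer_review_count=1,
                            account_age_days=5, rating_deviation=3.0)
    ordinary = ReviewMeta(verified_purchase=True, reviewer_review_count=42,
                          account_age_days=900, rating_deviation=0.5)
    print(f"suspicious review: p(fake) = {fake_probability(suspicious):.2f}")
    print(f"ordinary review:   p(fake) = {fake_probability(ordinary):.2f}")

Soft labels produced this way could then be used to weight training examples for a review classifier, which is what makes the approach independent of review content.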


