Resampling imbalanced data to detect fake reviews using machine learning classifiers and textual-based features, Multimedia Tools and Applications


Multimedia Tools and Applications (IF 3.0), Pub Date: 2021-01-13, DOI: 10.1007/s11042-020-10299-5
Gregorius Satia Budhi, Raymond Chiong, Zuli Wang

Fraudulent online sellers often collude with reviewers to garner fake reviews for their products. This practice undermines buyers' trust in product reviews and potentially reduces the effectiveness of online markets. Being able to detect fake reviews accurately is therefore critical. In this study, we investigate several preprocessing and textual-based feature extraction methods along with machine learning classifiers, including single and ensemble models, to build a fake review detection system. Because fake reviews are far fewer than genuine reviews in product review data, we examine the results for each class in detail in addition to the overall results. Our preliminary analysis shows that, owing to the imbalanced data, the accuracies of the two classes differ sharply (e.g., 1.3% for the fake review class and 99.7% for the genuine review class), even though the overall accuracy looks promising (around 89.7%). We propose two dynamic random sampling techniques that can be applied with textual-based feature extraction methods to solve this class imbalance problem. Our results indicate that both sampling techniques can improve the accuracy of the fake review class: for balanced datasets, it rises to a maximum of 84.5% with random under-sampling and 75.6% with random over-sampling. However, the accuracy for genuine reviews decreases to 75% and 58.8% for random under-sampling and over-sampling, respectively. We also find that, for smaller datasets, the Adaptive Boosting ensemble model outperforms the single classifiers, whereas for larger datasets the performance improvement from ensemble models is insignificant compared with the best results obtained by single classifiers.
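The two ideas the abstract relies on, per-class accuracy and random resampling, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how their "dynamic" sampling techniques work, so the code below shows plain random under-sampling and over-sampling only. The function names `per_class_accuracy` and `random_resample`, the 10/90 class split, and the toy review strings are all assumptions made for illustration. Per-class accuracy is read here as the fraction of each class's reviews that are classified correctly.

```python
import random
from collections import Counter, defaultdict


def per_class_accuracy(y_true, y_pred):
    """Return {label: fraction of that label's samples predicted correctly}."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


def random_resample(samples, labels, mode="under", seed=0):
    """Balance classes by plain random under- or over-sampling.

    mode="under": randomly keep only as many samples per class as the
                  smallest class has.
    mode="over":  randomly duplicate samples (with replacement) until every
                  class is as large as the largest class.
    """
    if mode not in ("under", "over"):
        raise ValueError("mode must be 'under' or 'over'")
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    sizes = [len(group) for group in by_class.values()]
    target = min(sizes) if mode == "under" else max(sizes)

    balanced = []
    for label, group in by_class.items():
        if len(group) >= target:
            # Keep a random subset (or everything, if already at target).
            chosen = rng.sample(group, target)
        else:
            # Keep all originals and add random duplicates up to target.
            chosen = group + rng.choices(group, k=target - len(group))
        balanced.extend((sample, label) for sample in chosen)

    rng.shuffle(balanced)
    return [s for s, _ in balanced], [l for _, l in balanced]


# Toy data with a made-up 10/90 split between fake and genuine reviews.
reviews = [f"review {i}" for i in range(100)]
labels = ["fake"] * 10 + ["genuine"] * 90

# A degenerate classifier that always answers "genuine" is right 90% of the
# time overall, yet it never detects a single fake review.
always_genuine = ["genuine"] * len(labels)
print(per_class_accuracy(labels, always_genuine))

under_x, under_y = random_resample(reviews, labels, mode="under")
over_x, over_y = random_resample(reviews, labels, mode="over")
print(Counter(under_y), Counter(over_y))
```

In practice, resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution. Under-sampling discards genuine reviews and over-sampling repeats fake ones, which is one plausible reason for the trade-off the abstract reports between the two class accuracies.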


Updated: 2021-01-13
