Detection of spam reviews through a hierarchical attention architecture with N-gram CNN and Bi-LSTM
Information Systems (IF 3.7), Pub Date: 2021-07-29, DOI: 10.1016/j.is.2021.101865
Yuxin Liu 1, Li Wang 2, Tengfei Shi 2, Jinyan Li 3

Spam reviews mislead consumers' decisions and may seriously undermine fair trading in online markets. Existing methods for detecting spam reviews mainly focus on designing features from linguistic and psychological clues, but they hardly reveal the underlying semantics. Recent work applies deep learning to capture semantic features, but these models neither extract multi-granularity information from the text structure nor consider the mutual influence among sentences. We propose a hierarchical attention network in which distinct attention mechanisms are used at its two layers to capture important, comprehensive, and multi-granularity semantic information. At the first layer, we use an N-gram CNN to extract the multi-granularity semantics of sentences. At the second layer, we combine a convolution structure with a Bi-LSTM to extract important and comprehensive semantics across the document. Extensive experiments on public datasets show that our model outperforms state-of-the-art baselines, raising the F1 score to 89.3% in the mixed-domain setting (an absolute improvement of 4.8 points), 92.8% in the Doctor domain (9.9 points), 86.1% in the Hotel domain (2.4 points), and 84.7% in the cross-domain setting (10.4 points).
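
To make the first-layer idea concrete, the following is a minimal sketch of how filters of several widths can read a sentence at different n-gram granularities and be pooled with attention weights. It is not the authors' implementation: the filter widths (1 to 3), the random weights, the helper names, and the choice to let each feature act as its own attention score are all illustrative assumptions, and the second-layer convolution plus Bi-LSTM is not covered.

```python
import math
import random

def ngram_conv(embeddings, weights, bias, n):
    """Slide one filter of width n over a sentence (a list of word vectors).

    Returns one ReLU activation per n-gram window.
    """
    outputs = []
    for start in range(len(embeddings) - n + 1):
        # Flatten the n consecutive word vectors into one window vector.
        window = [x for vec in embeddings[start:start + n] for x in vec]
        activation = sum(w * x for w, x in zip(weights, window)) + bias
        outputs.append(max(0.0, activation))
    return outputs

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(features):
    """Attention-weighted sum of features.

    Assumption: each feature doubles as its own attention score,
    a simplification of a learned attention layer.
    """
    weights = softmax(features)
    return sum(w * f for w, f in zip(weights, features))

def sentence_vector(embeddings, filters):
    """One pooled feature per n-gram width; filters maps n -> (weights, bias)."""
    return [attention_pool(ngram_conv(embeddings, w, b, n))
            for n, (w, b) in sorted(filters.items())]

# Hypothetical toy input: a 6-word sentence with 4-dimensional embeddings.
random.seed(0)
dim, length = 4, 6
sentence = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]
filters = {n: ([random.uniform(-1, 1) for _ in range(n * dim)], 0.0)
           for n in (1, 2, 3)}

vec = sentence_vector(sentence, filters)
print(len(vec))  # 3, one feature per filter width
```

Each filter width contributes one feature here; a real model would use many filters per width and learn the weights and attention parameters by training.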




Updated: 2021-07-29