An unsupervised approach to detect review spam using duplicates of images, videos and Chinese texts

https://doi.org/10.1016/j.csl.2020.101186

Abstract

Intuitively, image- or video-based recommendations seem to be more reliable than those containing plain text, and these types of recommendations have recently become widely encouraged and commonly seen across opinion sharing platforms. Considering their potential for manipulation, graphs (e.g., images and videos) are more vulnerable to spam than scripts. However, most state-of-the-art solutions for opinion spam detection are exclusively devoted to natural language parsing, and less work has been done concerning photos or videos. After investigating the top two business-to-customer websites, i.e., JD.com and TMALL.com, we propose an unsupervised approach to label suspected spam based on different types of duplication across images, videos and Chinese texts. Experiments verified the effectiveness of this approach and obtained several conclusions: 1) the situation of image spam is more severe than that of video and text spam; 2) for manipulation, borrowing something from a marketing page is less attractive than stealing from other reviewers; 3) in addition to using identical texts, spammers also use fictitious rare incidents to influence customers; and 4) overlapping duplications of images, videos and texts are common.

Introduction

People tend to believe things that are supported by photos. Due to their intuitiveness, graphs (including images and videos) have long been encouraged by e-business and opinion-sharing websites. Recently, they have become commonly seen on such sites. For instance, over 8% of the product reviews on TMALL.com (one of the two largest business-to-consumer (B2C) retailers in China, owned by Alibaba; referred to as TMALL in this paper) have images or videos. On average, each review on JD.com (the other of the two largest B2C retailers in China; referred to as JD in the following) has 1.05 photos and 0.5 videos. Moreover, graphic experiences are often promoted by headlines or recommended with a higher level of priority. Similar to the ways in which text is adopted, people share pictures to guide others, to express their intelligence, to connect with people or just to earn platform credits/coupons (Dellarocas and Narayan, 2006; Hennig-Thurau et al., 2004; Hu et al., 2011; Zhu and Zhang, 2010). Likewise, images can also be used to lie with the objective of misleading potential customers; this is a new type of opinion spam that is not electronic word-of-mouth. Even worse, since graphs always seem to be more reliable than plain texts, the expected probability that people will be fooled by this type of spam is greater than that of text manipulation. Over the last decade, opinion spam has drawn a considerable amount of attention, and there have been substantial achievements on the topic (Deborah and Baron, 1988; Dellarocas and Narayan, 2006; Heydari et al., 2015; Jindal and Liu, 2007, 2008; Jindal et al., 2010; Li et al., 2019; Lim, 2010, 2010; Liu and Pang, 2018; Mayzlin, 2006; Mukherjee et al., 2013; Ott et al., 2012, 2011; Paul Rayson, 2001; Savage et al., 2015; Somayeh, 2013; Xie et al., 2012; Zhang et al., 2019; Zhang et al., 2018); nevertheless, most of this work was based on natural language processing and thus does not apply to graphic experiences.
An investigation of the reviews hosted by JD and TMALL shows that both graph- and text-oriented duplication are common and that reviews are duplicated in multiple ways. For instance, spammers tend to borrow photos from introduction pages, copy and paste videos from other posts and/or refer to a specific scenario in their texts. To further unveil this situation, in this paper, we propose an approach that can address the duplication of texts, images and videos simultaneously. The top two challenges are recognizing the different kinds of duplication and labeling spam, especially compound spam.

To our knowledge, this is the first time that images or videos have been fully addressed in the context of review spam detection. Although fields like image forensics or video faking have covered the graph tampering problem for years with plentiful techniques, the manipulation in review systems is different, and we cannot directly adopt these solutions, for two reasons: 1) for the sake of efficiency, spammers opt to steal and post someone else's images or videos without manipulating pixels or editing frames; and 2) spammers prefer to borrow marketing pictures from the item's webpage, which are carefully designed by sellers and always have backgrounds in pure white (0xFFFFFF) or black (0x000000). Specifically, the contribution of this paper is threefold: 1) we focus on both texts and graphs to uncover review spam; 2) we introduce reasonable criteria by which to detect different types of duplication; and 3) we find some interesting phenomena.
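
The observation about marketing pictures suggests a simple check, sketched below for illustration only. This excerpt does not say that the authors use such a test; the function names, the choice to inspect only the outer frame of the image, and the 0.95 threshold are all assumptions. The image is represented as a row-major grid of RGB tuples so that the sketch needs no external library.

```python
# Illustrative sketch (not the paper's method): estimate how much of an
# image's outer frame is pure white (0xFFFFFF) or pure black (0x000000).
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def border_pixels(pixels):
    """Yield the pixels on the outer frame of a row-major grid of RGB tuples."""
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for y in range(height):
        for x in range(width):
            if y in (0, height - 1) or x in (0, width - 1):
                yield tuple(pixels[y][x])


def pure_background_ratio(pixels):
    """Return the larger of the white share and the black share of the frame."""
    border = list(border_pixels(pixels))
    if not border:
        return 0.0
    counts = {WHITE: 0, BLACK: 0}
    for pixel in border:
        if pixel in counts:
            counts[pixel] += 1
    return max(counts.values()) / len(border)


def looks_like_marketing_image(pixels, threshold=0.95):
    """Hypothetical rule: flag images whose frame is almost entirely one pure color."""
    return pure_background_ratio(pixels) >= threshold
```

A frame-only test is cheap, but it would miss marketing pictures that have been cropped and would flag ordinary photos taken against a plain backdrop, so it could at most serve as one weak signal among several.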

The remainder of this paper is organized as follows. First, we survey state-of-the-art studies in Section 2, and then we introduce our proposal in Section 3. After that, we conduct and discuss some experiments in Section 4. Finally, we conclude this paper in Section 5.

Section snippets

Related work

The problem of review spam has attracted considerable attention in the past decade. A great number of studies have been conducted to detect spammed reviews, spammers or spammer groups. Here, we group and survey related work under two topics: text and graph.

Proposal

Based on previous studies (Hennig-Thurau et al., 2004; Jindal and Liu, 2007, 2008), we adopt duplication as the criterion by which to recognize opinion spam. Specifically, six kinds of duplication are considered across image-, video- and text-based reviews (see Table 2). Based on previous investigations, we propose a lightweight approach.
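
To make the idea of duplication as a spam criterion concrete, the following minimal sketch groups reviews that share identical texts, images or videos. It is an illustration under stated assumptions, not the approach of this paper: the six kinds of duplication in Table 2 are not reproduced in this excerpt, and the review record layout (`id`, `text`, `images`, `videos`), the helper names and the use of exact SHA-256 matching are all hypothetical choices.

```python
import hashlib
import unicodedata
from collections import defaultdict


def text_key(text):
    """Hash a review text after NFKC normalization and whitespace removal."""
    normalized = "".join(unicodedata.normalize("NFKC", text).split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def media_key(payload):
    """Hash the raw bytes of an image or video file (exact copies only)."""
    return hashlib.sha256(payload).hexdigest()


def find_duplicates(reviews):
    """Map (kind, hash) to the sorted ids of reviews sharing that content.

    Each review is assumed to be a dict with an 'id', an optional 'text',
    and optional lists of bytes under 'images' and 'videos'.
    """
    seen = defaultdict(set)
    for review in reviews:
        text = review.get("text", "")
        if text.strip():
            seen[("text", text_key(text))].add(review["id"])
        for image in review.get("images", []):
            seen[("image", media_key(image))].add(review["id"])
        for video in review.get("videos", []):
            seen[("video", media_key(video))].add(review["id"])
    # Only content that appears in more than one review counts as duplicated.
    return {key: sorted(ids) for key, ids in seen.items() if len(ids) > 1}
```

Exact hashing only catches byte-identical files and texts that differ at most in whitespace or character width; re-encoded images or lightly edited texts would need a similarity measure instead. A review that appears under more than one kind in the result would correspond to the overlapping duplication mentioned in the abstract.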

Dataset

For the dataset, we choose to crawl data from JD and TMALL, following their corresponding data policies. In China, JD and TMALL are the top two B2C websites. According to a recent report published by iiMedia Research, 83.8% of the retailing market was shared by them during the first half of 2018 (iiMedia, 2018). Additionally, we choose these companies because of their latent advanced regulations regarding review spam. To fully examine reviews hosted by them, we conduct a few investigations.

Conclusions

To address the problem posed by the lack of serious consideration of graphic-based reviews in the field of review spam detection, we conduct a comprehensive study that covers six types of duplication pertaining to images, videos and texts. Through the datasets crawled from JD and TMALL, we verified the feasibility of our approach and arrived at some interesting conclusions: 1) graphic spam is as severe as text spam; 2) the replication of photos from other posts is more prevalent among

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China, under grants 61802247 and 61801285. We thank the anonymous reviewers for their constructive comments.

References (55)

  • Anderson, E., & Simester, D. (2013). Deceptive reviews: the influential...
  • Bonomi, M., Pasquini, C., & Boato, G. (2020). Dynamic texture analysis for detecting fake faces in video sequences....
  • J.K. Burgoon et al. Detecting deception through linguistic analysis.
  • G. Cao et al. (2014). Contrast enhancement-based forensics in digital images. IEEE Trans. Inf. Forensics Secur.
  • R. Cristin et al. (2018). Illumination-based texture descriptor and fruitfly support vector neural network for image forgery detection in face images. IET Image Process.
  • F. Deborah et al. (1988). Ambiguity and rationality. J. Behav. Decis. Mak.
  • Dellarocas, C., & Narayan, R. (2006). What motivates consumers to review a product online? A study of the...
  • H. Farid (2009). Exposing digital forgeries from JPEG ghosts. IEEE Trans. Inf. Forensics Secur.
  • Fxsjy. Retrieved from...
  • D. Güera et al. Deepfake video detection using recurrent neural networks.
  • Y. Guo et al. (2018). Fake colorized image detection. IEEE Trans. Inf. Forensics Secur.
  • V. Holub et al. (2015). Low-complexity features for JPEG steganalysis using undecimated DCT. IEEE Trans. Inf. Forensics Secur.
  • M. Huh et al. Fighting fake news: image splice detection via learned self-consistency.
  • China Retail Industry Market Research and Business Investment Decision Report (2018).
  • N. Jindal et al. Analyzing and detecting review spam.
  • N. Jindal et al. Opinion spam and analysis.
  • N. Jindal et al. Finding unusual review patterns using unexpected rules.