Elsevier

Pattern Recognition Letters

Volume 152, December 2021, Pages 333-339

Experts perception-based system to detect misinformation in health websites

https://doi.org/10.1016/j.patrec.2021.11.008

Highlights

  • A novel system to detect misinformation in health-related websites is proposed.

  • Website reliability is calculated based on the text and visual design elements.

  • A BERT-based model is used to classify the text.

  • Evaluations from health-expert users are gathered and used to fit the system.

  • Several experiments on a real dataset validate the system.

Abstract

Misinformation is a recurring problem that has grown significantly in recent years due to the rapid development of the Internet. This development has driven the emergence of websites whose content is shared without control. This is even more dangerous in the health domain, given its specific nature and the increasing number of users searching for health-related information on the Internet. For these reasons, this information should be handled with special attention. In this paper, a novel system to detect misinformation in websites related to the health domain is presented. The proposed system uses text mining techniques and visual design features to estimate the trustworthiness of a website. It has been trained using human experts’ knowledge in the selected domain and their visual perception of the website design. Promising results have been obtained during the evaluation in the experimental stage.

Introduction

The amount of information that can be consulted through the Internet has increased over the years. An extremely high proportion of this information is provided by unreliable sources and it is not verified by an auditor. These issues can lead to the dissemination of misleading information (also called misinformation) and its fraudulent use.

In recent years, fraudulent use of misinformation has been exploited in many areas. Political processes such as the 2016 presidential election in the USA [1] or the perception of climate change [2] are some instances of such uses. These fraudulent uses also appear in the health domain. In this case, the use of misinformation is less conspicuous, but it is equally damaging for users. Moreover, the number of patients who use the Internet to obtain health information is growing [3]. Cases such as the conspiracy theories about the COVID-19 pandemic [4] or the spread of misinformation related to miracle diets [5] bear out this point. Although the truthfulness of this information can be easily verified by an expert, non-expert users can be easily deceived, and the large amount of health information makes manual verification impossible. All these issues motivate the development of automatic systems that label websites as trusted or not trusted according to specific features extracted from them.

This paper proposes a novel system to classify health-related websites. It uses the textual content and the visual design features perceived by users to achieve the task. These features have been selected following the literature of the domain [6] and the opinion of previously trained experts. The system is based on two different Machine Learning (ML) models: a model based on Multilingual Bidirectional Encoder Representations from Transformers (M-BERT) [7] that provides a probability of truthfulness for the website using its textual content, and a binary classifier that classifies the website using the estimate produced by the M-BERT-based classifier together with the visual design features.
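The two-stage structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: `text_truthfulness` is a hypothetical keyword heuristic standing in for the M-BERT classifier, the second stage is a small hand-written logistic regression chosen only to keep the example dependency-free, and the three visual features (institutional logo, advertisements, physical address) as well as the toy training data are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def text_truthfulness(text: str) -> float:
    """Hypothetical stand-in for the M-BERT text model.

    Returns a score in (0, 1]; a crude keyword count is used here
    only so that the sketch runs without any trained model.
    """
    suspicious = ("miracle", "cure-all", "secret")
    hits = sum(word in text.lower() for word in suspicious)
    return 1.0 / (1.0 + hits)


def build_features(text: str, has_logo: bool, has_ads: bool,
                   has_address: bool) -> List[float]:
    """Stage 1 output (text score) concatenated with visual design flags."""
    return [text_truthfulness(text), float(has_logo),
            float(has_ads), float(has_address)]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def score(weights: Sequence[float], x: Sequence[float]) -> float:
    """Estimated probability that the website is reliable (last weight is the bias)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return sigmoid(z)


def train_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                   lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Stage 2 stand-in: logistic regression fitted by stochastic gradient descent."""
    weights = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = score(weights, xi) - yi  # gradient of the log-loss w.r.t. the logit
            for j, xj in enumerate(xi):
                weights[j] -= lr * err * xj
            weights[-1] -= lr * err
    return weights


def predict_reliable(weights: Sequence[float], x: Sequence[float]) -> bool:
    return score(weights, x) >= 0.5


# Toy, invented examples labelled 1 (reliable) or 0 (unreliable).
X = [
    build_features("Vaccination schedule recommended by the national health service.", True, False, True),
    build_features("Clinical guidance on managing type 2 diabetes.", True, False, True),
    build_features("This miracle diet is a secret cure-all!", False, True, False),
    build_features("Secret miracle pills melt fat overnight.", False, True, False),
]
y = [1, 1, 0, 0]
weights = train_logistic(X, y)

new_site = build_features("A secret miracle remedy doctors hide.", False, True, False)
print("reliable" if predict_reliable(weights, new_site) else "unreliable")
```

The point of the sketch is the data flow: the text model contributes a single number, which is then treated as one more feature alongside the visual design flags by the final binary classifier.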

This work is part of the Swarm Agent-Based Environment For Reputation in MEDicine (SABERMED) project funded by the Spanish Ministry of Economy and Competitiveness. The main goal of SABERMED is to find a solution to the problem posed today by non-trusted digital content in the health domain. This project involves the collaboration of public institutions such as the Rey Juan Carlos University of Madrid and the Instituto de Investigacion Biomedica de Salamanca (IBSAL) and private institutions such as MedLab Media Group (MMG).

This paper presents the following contributions:

  • a system able to detect misinformation in health-related websites automatically.

  • a study of the differences between expert and non-expert users’ perception of health-related websites.

  • an evaluation of the relative importance of visual design features in the perceived reliability of these websites.

The proposed system is based on evaluations gathered from expert users in the health domain. These experts evaluated the degree of reliability of a set of websites, drawing on their knowledge and their visual perception of each website. The evaluations of experts were compared with the evaluations of non-experts to determine the main differences between them. The evaluations of the expert users are used to train and validate the ML models of the proposed system. Then, the performance of the system is compared with the performance of the non-expert users in the website evaluation task.

The rest of the paper is structured as follows. Section 2 introduces the foundations of the system. The proposed system is presented in Section 3. Then, the data selection, processing and labeling are presented in Section 4. The experiments that illustrate the viability of the proposal are addressed in Section 5. Finally, Section 6 concludes this article and proposes future work.

Section snippets

Background

This section provides an overview of the foundations of the proposal. The related literature is organized as follows. First, the problem of misinformation on the Internet is addressed. Then, content-based misinformation detection is detailed, with emphasis on the techniques that have been developed. Finally, visual design and credibility are considered to illustrate how the perception of information by individuals affects their opinion.

Experts perception-based system

This section introduces the proposed system. Its main objective is to classify a website as reliable or unreliable according to the knowledge provided by experts in the health domain. The system uses several sources of information to obtain the features of a website. The main feature to be considered is the textual content. Furthermore, visual features such as the presence of an institutional logo, advertisements, and information about the physical address of the owner of the website are

Data acquisition

This section describes the data used in the proposal. These data have been gathered by a framework specifically designed for the SABERMED project. The aim of this three-year project is to provide a tool capable of assessing the reputation of digital content on the Internet, detecting fraudulent content by applying Data Science techniques. This project has the collaboration of health experts from IBSAL, one of the partners of this project. It is assumed that these experts have comprehensive

Experiments

The experts perception-based system proposed in Section 3 has been trained and tested using the dataset presented in Section 4. Three experiments are proposed. First, the health experts’ knowledge is validated by comparing their evaluations with those obtained from non-expert users. Next, the complete system is evaluated by using a Cross-Validation (CV) procedure and different binary classifiers. Finally, some tests are conducted to evaluate the relative importance of the features used in the

Conclusions and future work

This paper presents a novel system for detecting misinformation in websites related to the health domain. It implements two ML models that consider both textual content and visual features perceived by users. The first ML model is an M-BERT classifier that estimates the reliability of a website based on textual information. The second ML model is a Gaussian Process that uses the output of the M-BERT classifier and the visual design features to produce the final estimation.
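As a reading aid, the second stage can be written in the generic textbook form of a Gaussian Process binary classifier. This is not necessarily the exact formulation used by the authors; the symbols $\mathbf{x}$, $p_{\text{text}}$, $v_i$, $m$, $k$ and $\sigma$ are notation introduced here for illustration.

```latex
\mathbf{x} = \bigl[\, p_{\text{text}},\; v_1, \dots, v_m \,\bigr], \qquad
f \sim \mathcal{GP}\bigl(0,\; k(\mathbf{x}, \mathbf{x}')\bigr), \qquad
P(y = 1 \mid \mathbf{x}) = \sigma\bigl(f(\mathbf{x})\bigr)
```

Here $p_{\text{text}}$ is the output of the M-BERT classifier, $v_1, \dots, v_m$ are the visual design features, $k$ is a covariance (kernel) function, $\sigma$ is a squashing function such as the logistic sigmoid, and $y = 1$ denotes a website labelled as reliable.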

A dataset for training

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

Research supported by grants from the Spanish Ministry of Economy and Competitiveness under the Retos-Colaboración program: SABERMED (Ref: RTC-2017-6253-1), Retos-Investigación program: MODAS-IN (Ref: RTI-2018-094269-B-I00); and donation of the Titan V GPU by NVIDIA Corporation.

References (30)

  • M. McMullan

    Patients using the internet to obtain health information: how this affects the patient–health professional relationship

    Patient Educ. Couns.

    (2006)
  • S. Lee et al.

    The effects of usability and web design attributes on user preference for e-commerce web sites

    Comput. Ind.

    (2010)
  • S. Kogan, T. J. Moskowitz, M. Niessner, Fake news: evidence from financial markets, Available at SSRN 3237763...
  • K.M.d. Treen et al.

    Online misinformation about climate change

    Wiley Interdiscip. Rev. Clim. Change

    (2020)
  • A. Mian et al.

    Coronavirus: the spread of misinformation

    BMC Med.

    (2020)
  • P. Williams

    Combating nutrition misinformation

    Proceedings of the Conference of New Zealand Dietetic Association Inc., Auckland, New Zealand, September 2001, No. 6

    (2001)
  • D. Robins et al.

    Consumer health information on the web: the relationship of visual design and perceptions of credibility

    J. Am. Soc. Inf. Sci. Technol.

    (2010)
  • J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: pre-training of deep bidirectional transformers for language...
  • A.J. Flanagin et al.

    Perceptions of internet information credibility

    J. Mass Commun. Q.

    (2000)
  • E. Zhuravskaya et al.

    Political effects of the internet and social media

    Annu. Rev. Econ.

    (2020)
  • A. Krishna et al.

    Misinformation about health: a review of health communication and misinformation scholarship

    Am. Behav. Sci.

    (2019)
  • M. Polak

    The misinformation effect in financial markets: an emerging issue in behavioural finance

    e-Finanse Financ. Internet Q.

    (2012)
  • M.D. Molina et al.

    “Fake news” is not simply false information: a concept explication and taxonomy of online content

    Am. Behav. Sci.

    (2019)
  • C. Escoffery et al.

    Internet use for health information among college students

    J. Am. Coll. Health

    (2005)
  • R. Hirasawa et al.

    Quality and accuracy of internet information concerning a healthy diet

    Int. J. Food Sci. Nutr.

    (2013)