1 Introduction

The COVID-19 pandemic has led to an eruption of mathematical modelling efforts, which many governments have used to guide their policy decisions for its containment. Though infectious disease models are not unfamiliar to policy makers, this is the first time that they have been used on such a global scale and with such urgency. In 2001, they were used most prominently in the UK to guide policy decisions related to an outbreak of foot and mouth disease (Woolhouse 2003). More recently, during the 2014 Ebola and the 2016 Zika epidemics, models were used to forecast the course of the epidemic, allocate resources and evaluate the potential impact of interventions (Chretien et al. 2015; Keegan et al. 2017).

What is striking in the COVID-19 pandemic is the parallel increase in the use of social media as an integral part of many people’s lives. Use in July 2020 was 10.5% higher than in the same month in 2019 (DataReportal 2020a) as people in many countries were urged to “stay home, stay safe”. Even after the initial easing of lockdowns and other social restrictions later in 2020, these acquired digital habits appeared to have become part of the “new normal” (DataReportal 2020b).

Hans Kluge (2020), WHO Regional Director for Europe, has argued that ‘Behavioural insights are valuable to inform the planning of appropriate pandemic response measures.’ In this regard, social media data offer a novel wealth of potentially useful information on public awareness, opinions, attitudes and beliefs, relevant to intentions and eventual behaviour. Given the major role of mathematical modelling in guiding health policy decisions in the COVID-19 pandemic at a national level, data from social media are thus, at least in principle, useful sources to help refine these models. These models, though undoubtedly valuable, have been criticized for being too narrowly epidemiological and for lacking input from social science data (Rhodes et al. 2020). Social media platforms provide one potential source of such social data.

COVID-19 is the sixth pandemic since the influenza pandemic of 1918. More than five new diseases emerge in human populations every year, each with the potential to spread and become pandemic (IPBES 2020). Combined with our increasing use of, and dependence on, technology, this means that social media data offer an opportunity to enhance the capability of these models.

The use of social media data in compartmental infectious disease models was introduced by Sooknanan and Comissiong (2020). We extend this analysis to discuss in greater detail the challenges and opportunities inherent in the use of social media data in infectious disease modelling.

2 Behavioural Insights Obtained from Social Media Data

The most common forms of the compartmental models used to model infectious disease epidemics are variations of the basic susceptible-exposed-infected-removed (SEIR) or susceptible-infected-removed (SIR) models (Mohamadou et al. 2020). Their very simplicity—in line with Einstein’s maxim that everything should be made as simple as possible, but not simpler—means that they may be easily modified as information about disease transmission is updated.
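As a point of reference for the modifications discussed later, the basic SEIR structure can be sketched in a few lines of Python. This is a minimal illustration only: the function name `simulate_seir`, the forward Euler scheme, and all parameter values are assumptions chosen for readability, and are not taken from any of the studies cited here.

```python
def simulate_seir(beta, sigma, gamma, s0, e0, i0, r0, days, dt=0.1):
    """Integrate a basic SEIR model with forward Euler steps.

    beta: transmission rate, sigma: rate of leaving the exposed class,
    gamma: removal rate. Compartments are fractions of the population.
    Returns a list of (t, S, E, I, R) tuples, one per time step.
    """
    s, e, i, r = s0, e0, i0, r0
    steps = int(round(days / dt))
    trajectory = [(0.0, s, e, i, r)]
    for step in range(1, steps + 1):
        new_exposed = beta * s * i * dt      # S -> E
        new_infectious = sigma * e * dt      # E -> I
        new_removed = gamma * i * dt         # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        trajectory.append((step * dt, s, e, i, r))
    return trajectory


# Illustrative parameter values only (not fitted to any data set).
trajectory = simulate_seir(beta=0.5, sigma=0.2, gamma=0.1,
                           s0=0.999, e0=0.0, i0=0.001, r0=0.0, days=200)
peak_time, _, _, peak_infected, _ = max(trajectory, key=lambda row: row[3])
print(f"Infected fraction peaks at about {peak_infected:.3f} on day {peak_time:.0f}")
```

Dropping the exposed class (moving individuals directly from S to I) gives the corresponding SIR model; behavioural effects of the kind discussed in this paper typically enter through `beta` or through additional compartments.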

During an emerging disease outbreak, in the absence of straightforward treatments or widely available vaccines, governments are likely to be entirely dependent for long periods on the use of non-pharmaceutical interventions such as isolation, physical distancing, personal protective equipment (for example, face coverings) and hand hygiene, to stem transmission. These practices all involve changes in behaviour, whether voluntary or mandated, which a sufficiently large proportion of the population must undertake consistently for them to be truly effective. Even with a vaccine, the conspicuous online presence of the anti-vaccination movement and of vaccine hesitancy has raised concerns about the likely levels of take-up of vaccination during an outbreak. These intentions and behaviours could potentially have a large impact on the future course of an outbreak and should be included in modelling efforts (Bae et al. 2021).

Models coupling behaviour–disease dynamics have been on the increase (Funk et al. 2010; Pananos et al. 2017). However, data are needed to “robustly estimate, and possibly predict, behavioural parameters” (Manfredi and D’Onofrio 2013). Behaviour is related to “attitudes, belief systems, opinions and awareness of a disease” (Funk et al. 2010). What modellers need, continuously and in a timely manner, is access to measures of prevailing attitudes and perceptions that indicate likely behaviour.

The participatory, informal and spontaneous nature of social media such as Twitter or Sina Weibo allows for almost real-time access to the uncensored feelings and opinions of users, their risk perceptions, intentions and their preventive behaviours—something that researchers would ordinarily have to collect painstakingly through surveys and/or focus groups.

Twitter, which is increasingly being recognized as an important source of such information, has been mined to study public opinion and sentiments regarding disease outbreaks such as influenza and COVID-19 (Signorini et al. 2011; Boon-Itt and Skunkan 2020), as well as public opinion on vaccination (Tavoschi et al. 2020) and mask-wearing (He et al. 2021).

Data from social media indicative of trends in these behaviours or intentions may be used in models with suitable care. For example, researchers using Twitter found that 10% of all Tweets in the United States related to mask wearing opposed the practice (He et al. 2021). This information may then be used when modifying basic disease transmission models to reflect the prevailing sentiment (and the related, inferred pattern of public behaviour), thereby increasing the accuracy of estimates of transmission dynamics.

However, some caution is advised when using these insights, since there may be a discrepancy between intentions and actual behaviour (Sheeran and Webb 2016), whereby social media sentiment which may be considered to reflect intent may differ from eventual behaviour offline (Smith et al. 2020; Social Media Research Group 2016). A comparison of the insights obtained from social media with subsequent behavioural data may then determine how these two sources relate to each other for eventual incorporation into the modelling process.

3 Incorporating Social Media Data and Its Characteristic Features into Models

While social media use allows for the indirect observation of trends in population behaviour, these media also interact with their users to influence their behaviour, for example, by increasing people’s awareness of non-pharmaceutical, treatment and preventive practices of others with whom they have a relationship and/or who are influential with them. Though behavioural change has been recognized as an important factor in reducing transmission, “the complex interplay of changing epidemiology, media attention, pandemic control measures, risk perception, and public health behaviour” presents significant challenges to modellers (Betsch et al. 2020).

Traditionally, the effects of media have been included in compartmental models in two ways. One approach is to add compartments representing aware and unaware subpopulations. Transitions between unaware and aware compartments are assumed to take place at constant rates with aware people having an assumed lower risk of infection (Agaba et al. 2017). This “awareness” may be explicitly included within a media compartment M(t) whose growth rate has generally been assumed to be proportional to the number of infected individuals (Greenhalgh et al. 2015; Misra et al. 2011). However, Kumar et al. (2021), while investigating the effects of social media on an influenza epidemic and the COVID-19 pandemic, considered a media compartment M(t) consisting of the daily normalized number of Tweets over a specified period of time. Using their model for COVID-19 as an example, these data consisted of daily Tweets with health-related keywords such as “corona”, “coronavirus”, “covid”, “quarantine” from March 22, 2020 to July 20, 2020. Interaction of susceptibles with this compartment resulted in “aware” individuals who were influenced positively by the Tweets and therefore reduced their social contacts and, in turn, their risk of infection.
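The awareness approach can be illustrated with a small simulation. The sketch below is an assumption-laden toy and does not reproduce any of the cited models: susceptibles are split into unaware and aware groups, movement into the aware group is driven by a media compartment M(t) that grows in proportion to the number infected and decays at an assumed rate `delta`, and aware individuals face a transmission rate reduced by a factor `rho`. All names and parameter values are hypothetical.

```python
def simulate_awareness(beta, gamma, rho, lam, mu, delta, i0, days, dt=0.1):
    """Toy SIR model with unaware/aware susceptibles and a media compartment M.

    rho   : relative infection risk of aware individuals (0 < rho < 1)
    lam   : rate at which media exposure turns unaware into aware susceptibles
    mu    : growth of M per unit of infection (proportional to I)
    delta : decay rate of M (an assumption of this sketch)
    """
    su, sa, i, r, m = 1.0 - i0, 0.0, i0, 0.0, 0.0
    peak = i
    for _ in range(int(round(days / dt))):
        became_aware = lam * su * m * dt
        infected_unaware = beta * su * i * dt
        infected_aware = rho * beta * sa * i * dt
        removed = gamma * i * dt
        media_change = (mu * i - delta * m) * dt
        su -= became_aware + infected_unaware
        sa += became_aware - infected_aware
        i += infected_unaware + infected_aware - removed
        r += removed
        m += media_change
        peak = max(peak, i)
    return {"S_unaware": su, "S_aware": sa, "I": i, "R": r, "M": m, "peak_I": peak}


# Illustrative values only; lam=0 switches the media effect off.
common = dict(beta=0.5, gamma=0.1, rho=0.3, mu=1.0, delta=0.2, i0=0.001, days=300)
without_media = simulate_awareness(lam=0.0, **common)
with_media = simulate_awareness(lam=2.0, **common)
print(f"Peak infected without media effect: {without_media['peak_I']:.3f}")
print(f"Peak infected with media effect:    {with_media['peak_I']:.3f}")
```

In a data-driven variant along the lines described for Kumar et al. (2021), the simulated M(t) would be replaced by an observed series such as the daily normalized number of Tweets.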

The second way to include the effects of media use is as a reduction in transmission, with a contact rate of the form \(\beta S f(I)\), where \(f(I)\) represents a decreasing function of the number of infected, such as \(e^{-mI}\) (Cui et al. 2008). In order to understand the relationship between the spread of H1N1 influenza and the number of Tweets, Huo et al. (2020) combined these two approaches by using a disease transmission rate reduced by \(e^{-\alpha T}\) and an additional compartment T(t) representing the number of Tweets about the H1N1 epidemic at time t. Unknown parameter values were estimated by data fitting to the percentage of Tweets that self-reported influenza and the officially recorded cases in England between May and December 2009.
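To make the combined approach concrete, one possible minimal system is written out below. It is an illustrative sketch only and is not claimed to be the system analysed by Huo et al. (2020): the SIR backbone, the removal rate \(\gamma\), the Tweet generation rate \(\mu\) and the Tweet decay rate \(\tau\) are assumptions introduced here.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta e^{-\alpha T} S I, \\
\frac{dI}{dt} &= \beta e^{-\alpha T} S I - \gamma I, \\
\frac{dR}{dt} &= \gamma I, \\
\frac{dT}{dt} &= \mu I - \tau T.
\end{aligned}
```

Setting \(\alpha = 0\) recovers the standard SIR model, while replacing \(e^{-\alpha T}\) by \(f(I) = e^{-mI}\) and dropping the last equation gives a reduced-transmission form of the kind attributed above to Cui et al. (2008).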

These two contrasting approaches to modelling take advantage of the assumed beneficial effects on behaviour of social media. At the same time, social media are awash with conspiracy theories, disinformation and fake news which are likely to have a detrimental effect on control strategies (Towers et al. 2015; Chandrasekaran et al. 2017; Kumar et al. 2020). An investigation into the extent to which misinformation or unverifiable information about the COVID-19 pandemic is spread on Twitter showed that misinformation accounted for 25% of Tweets (Kouzy et al. 2020). This “anti-information” may be incorporated into models by further refining Tweets (and hence the media compartments) into posts representing positive and negative information in terms of their likely relative effect on transmission dynamics (Huo and Zhang 2016) in order to modify the contact rate.

While these models highlight the importance of media, they neglect the unique features associated with social media use. Though social and traditional media are similar in that both serve to impart information, the interactive, immersive nature of social media may precipitate and capture emergent behaviour (via intentions) that does not occur in engagement with traditional mass media such as television and newspapers. Social media, with their AI-based recommendations and ease of use, encourage like-minded individuals to seek each other’s virtual and, sometimes, face-to-face company (Spohr 2017). Social learning theory (Bandura 1978) suggests that a person’s behaviour is affected by observing the behaviour of others. Consequently, users may reinforce and amplify each other’s opinions and attitudes (both positive and negative) on any course of action to form “online echo chambers” (Burki 2019) unrelated to their consumption of conventional media messages.

This confirmation bias effect has only recently been incorporated in the modelling process in the shape of SIR-Opinion models by coupling disease dynamics with opinion dynamics (Tyson et al. 2020). Here, the susceptible compartment is divided into four subpopulations (with different infection rates) characterized by the strength of their attitudes either positive or negative towards, for example, non-pharmaceutical interventions. The effects of echo chambers and amplification are captured by the interactions between these subpopulations via “influence functions” which can reinforce or weaken the strength of their opinions and attitudes, thus allowing for movement of individuals between these compartments as their opinions and attitudes change.

Though Tyson et al. (2020) acknowledged that “empirically measured influence functions” were not yet available to them, they used four different forms of these influence functions defined as linear, saturating, fixed-order saturating, and reverse-order saturating functions (all dependent on the number of infected people) to regulate these interactions. These influence functions were chosen so that as the disease prevalence increases, the influence of those with positive attitudes towards non-pharmaceutical interventions increases and that of those with negative attitudes decreases. While this model included salient features of social media, i.e. the effects of echo chambers and amplification, future models may also need to take account of the activities of “super-spreaders” of opinion. These are the small number of social media influencers with a wide reach who have the potential to alter the behaviour of large numbers of other people, either positively or negatively in terms of the impact on pandemic control. Though usually associated with commercial marketing and/or celebrity culture, their influence has been recognized by public health officials who have attempted to co-opt them to reinforce official public health messages (Bolat 2020; Archer et al. 2020). Modellers may need to do the same, especially when trying to estimate behavioural responses in specific population sub-groups.
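The qualitative requirement placed on these influence functions (rising influence of pro-intervention opinion and falling influence of anti-intervention opinion as prevalence grows) can be illustrated with two simple candidate shapes. The forms below are hypothetical stand-ins written for this illustration; they are not the functions defined by Tyson et al. (2020), and the complementarity of the two weights is likewise an assumption.

```python
def linear_influence(prevalence, slope):
    """Hypothetical linear form: influence rises in proportion to prevalence, capped at 1."""
    return min(1.0, slope * prevalence)


def saturating_influence(prevalence, half_saturation):
    """Hypothetical saturating form: rises quickly at low prevalence, then levels off below 1."""
    return prevalence / (prevalence + half_saturation)


def opinion_weights(prevalence, influence):
    """Return (pro_npi_weight, anti_npi_weight) for a given prevalence.

    The anti-NPI weight is taken as the complement of the pro-NPI weight,
    so one rises as the other falls (an assumption made for this sketch).
    """
    pro = influence(prevalence)
    return pro, 1.0 - pro


for prevalence in (0.01, 0.05, 0.20):
    pro, anti = opinion_weights(prevalence, lambda p: saturating_influence(p, 0.05))
    print(f"prevalence={prevalence:.2f}  pro-NPI={pro:.2f}  anti-NPI={anti:.2f}")
```

Empirically measured influence functions, once available, could be substituted for these placeholder shapes without changing the surrounding model structure.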

4 Social Media-based Surveillance—Early Detection and Monitoring

Researchers have recognized the potential of social media data as an alternative data source for tracking public health trends for a number of years. With each post or conversation, users leave a digital trace which may be mined to identify attitudes and opinions (sentiment analysis) and, indirectly, health status through comments that users make about their health. Not only can posts be accessed for content, but they may be linked to the demographic data and geographic location of the account holder through geo-tagging.

Typically, researchers access posts or “likes” (Mackey et al. 2020; Gittelman et al. 2015) made over a period of time and then identify which of these contain health information. A combination of human and computational approaches, including keyword filtering, crowdsourcing, computational algorithms and machine learning, may then be used to process and filter the data. For example, in a recent study to characterize the self-reporting of symptoms, experiences with testing, and mentions of recovery related to COVID-19 via social media, 4,492,954 Tweets were captured using the public streaming Twitter application programming interface (API) over the period March 3-20, 2020 (Mackey et al. 2020). These included terms in the English language such as “covid19,” “corona,” and “coronavirus”. After further processing to identify relevant topic clusters with keywords such as “diagnosed,” “pneumonia,” “fever,” “test,” “testing kit,” “sharing,” “symptoms,” “isolating,” “cough,” “ER” (emergency room), and “emergency room,” and to remove duplicate Tweets, this was refined to 3,465 Tweets. This led to the observation that though many posters reported symptoms they thought related to COVID-19, they were unable to get tested to confirm their concerns.
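The keyword filtering and de-duplication steps in such a pipeline can be sketched as follows. This is a deliberately simplified stand-in and not the procedure of Mackey et al. (2020), which identified topic clusters computationally; the term lists are abbreviated from the keywords quoted above, and the function names and sample posts are invented for illustration.

```python
import re

# Abbreviated term lists based on the keywords quoted in the text.
BROAD_TERMS = ("covid19", "corona", "coronavirus")
TOPIC_TERMS = ("diagnosed", "pneumonia", "fever", "test", "testing kit",
               "symptoms", "isolating", "cough", "emergency room")


def build_pattern(terms):
    """Compile a whole-word pattern so that 'test' does not match 'latest'."""
    return re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")


BROAD_PATTERN = build_pattern(BROAD_TERMS)
TOPIC_PATTERN = build_pattern(TOPIC_TERMS)


def normalise(text):
    """Lower-case, drop URLs and collapse whitespace so near-identical posts match."""
    text = re.sub(r"https?://\S+", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def filter_posts(posts):
    """Keep the first copy of each post that mentions both a broad and a topic term."""
    seen = set()
    kept = []
    for post in posts:
        text = normalise(post)
        if text in seen:
            continue  # duplicate after normalisation
        seen.add(text)
        if BROAD_PATTERN.search(text) and TOPIC_PATTERN.search(text):
            kept.append(post)
    return kept


# Invented sample posts, for illustration only.
posts = [
    "I think I have coronavirus: fever and cough for 3 days, can't get a test",
    "I think I have coronavirus: fever and cough for 3 days, can't get a test  https://example.com/x",
    "Coronavirus latest: markets fall again",
    "Bad cough today, probably just a cold",
]
relevant = filter_posts(posts)
print(f"{len(relevant)} of {len(posts)} posts retained")
```

Simple keyword rules of this kind are cheap but crude; in practice they are usually only a first pass before human review or machine learning classification.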

The relative frequency of mention of keywords associated with a disease has been found to be strongly correlated with the subsequent number of doctor visits and later reports of the number of people infected (Jordan et al. 2018; Aramaki et al. 2011; Marques-Toledo et al. 2017; Huo 2020). Recent studies (Gharavi et al. 2020; Li et al. 2020) have found a comparable pattern during the COVID-19 pandemic. In China, internet searches and social media data (keywords and sentiments) have been shown to be strongly correlated with daily incidence and exhibit an online peak 10 to 14 days before the peak of daily incidence from official data (Li et al. 2020). Similar research using Twitter showed a lag of 5 to 19 days between social media reports and official COVID-19 statistics in the United States (Gharavi et al. 2020).
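Lead times of this kind are typically estimated by correlating the social media series with the official series at a range of time shifts. The sketch below shows one simple way this might be done, using a Pearson correlation over candidate lags; it is not the method of the cited studies, and the two series are synthetic curves constructed so that the "cases" series trails the "posts" series by exactly seven days.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lead(posts, cases, max_lag):
    """Return (lag, r) maximising the correlation of posts[t] with cases[t + lag]."""
    results = []
    for lag in range(max_lag + 1):
        x = posts[:len(posts) - lag]   # social media signal
        y = cases[lag:]                # official counts, shifted back by `lag` days
        results.append((lag, pearson(x, y)))
    return max(results, key=lambda pair: pair[1])


# Synthetic daily series: the same bell-shaped curve, with cases delayed by 7 days.
days = range(90)
posts_series = [math.exp(-((t - 30) / 8) ** 2) for t in days]
cases_series = [math.exp(-((t - 37) / 8) ** 2) for t in days]

lag, r = best_lead(posts_series, cases_series, max_lag=21)
print(f"Posts lead cases by {lag} days (r = {r:.3f})")
```

With real data the correlation would be noisier, and a high lagged correlation alone does not establish that the online signal is a reliable predictor.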

Thus these social media data may be used alongside traditional surveillance data in models. For example, by using the percentage of Tweets which included phrases like “have flu”, “have the flu”, “have swine flu”, and “have the swine flu” during the 2009 H1N1 outbreak to parameterize their SEI compartmental model, Pawelek et al. (2014) were able to reproduce the peaks of both the percentage of Tweets and of surveillance data showing the number of infections.

Therefore, social media data may provide not only the basis of an early warning system (ahead of official statistics) at the start of an outbreak but may be used subsequently in models to allow for an assessment of the likely progression of the disease during the outbreak. At the very least, trends in the data, and correspondingly in the outbreak, may be detected. From a practical point of view, this sort of lead time gives public health authorities vital opportunities to make policy decisions, inform the public and put in place the necessary arrangements, for example, to test, trace and isolate and/or treat an influx of cases. From a modelling point of view, it is also possible that such data may be used as a proxy for conventional data in countries where testing is limited or where there is a significant delay in reporting numbers due to test processing time. For example, the effect of a change in non-pharmaceutical interventions (NPIs) on incidence may be gauged by an analysis of social media data before and after the change. If these data are used in a model to estimate parameters or to determine the form of a function, the model may then be used to forecast the effects of further changes related to those NPIs, at least with respect to social media sentiment.

5 Social Media and Internet-based Data—the Good, the Bad and the Ugly

Though the world has faced disease outbreaks before, one of the major differences with current and future outbreaks is the widespread use of social media. On the surface, the almost instantaneous, open access availability of some social media data, whereby modellers can bypass or eliminate the formal structures that were used previously to restrict access or share data, seems like a real boon to infectious disease modellers, particularly in countries where other sources of information are restricted and/or censored.

However, due to their volume and complexity, these data are generally difficult to process using traditional applications and tools. The analytic techniques needed differ from traditional statistical methods which generally cannot be used for analysis of social media data such as audio, images, video, and unstructured text. Suggestions for ways of incorporating multiple data sources in models such as by weighting data sources may be found in recent papers by De Angelis et al. (2015) and Gandomi and Haider (2015). The use of these “novel data streams” (Althouse et al. 2015) comes with concerns about the data themselves, particularly in terms of their representativeness, privacy concerns and the filtering of information by algorithms (De Angelis et al. 2015; Althouse et al. 2015; Lee et al. 2016).

In terms of sample validity, though social media data may not be limited to a particular geographic location, they are ultimately bound by the popularity of each platform in the region under consideration. In addition, different demographic groups use social media in different ways and to different degrees—traditionally, social media platforms have been the preserve of younger people. Thus the data cannot be assumed to be fully representative of the general population (Mellon and Prosser 2016). However, they may be more representative of important population sub-groups such as younger people whose behaviour may be particularly important to learn about since they are likely to have the most daily social contacts and are thus among the groups most likely to spread infection during a pandemic. This will have a bearing on how the data can be interpreted and used in public health terms.

Because there is a correlation between the occurrence of a disease and the likelihood that an individual posts about it or searches the internet for related information, it is tempting to make inferences regarding disease trends based solely on this online information. Researchers using the public health tool Google Flu Trends made such a misstep when they sought to provide real-time monitoring of influenza-like illness (ILI) activity exclusively through Google searches for influenza-related information. They overestimated the prevalence of influenza in the 2011–2012 and 2012–2013 seasons by more than 50% when compared to surveillance reports from the Centers for Disease Control and Prevention (CDC) (Lazer et al. 2014).

Though this tool was eventually discontinued, the lessons learnt still resonate today. An oft cited reason for the inability of Google Flu Trends to accurately project the prevalence of influenza cases lay in the method of deriving the search terms. Algorithmic dynamics allowed for the recommendation of searches (autosuggest feature) to users after entering a term like fever or cough. Since people are more likely to interact with content suggested to them, this had the effect of skewing the terms people searched for.

Another possible contributor to this inaccuracy was a type of “echo chamber” effect. Online search behaviour is not restricted to the people suffering from a disease. Increased media reports predicting an active influenza season resulted in a spike in the number of searches for influenza-related information by people anticipating becoming ill, and hence in a mistaken inference about the extent of seasonal influenza (Harris 2014).

Despite this, the use of data gleaned from online sources has been steadily increasing in models. A major challenge is that these data “are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis” (Lazer et al. 2014). However, hybrid systems are increasingly being utilized where Google searches (e.g. Google Trends) are used in conjunction with social media data—whose content is carefully analyzed for context (Panuganti et al. 2020)—alongside frequently updated, traditional, surveillance sources. For example, an increase in searches for a particular illness could signal an outbreak, while concomitant to this, social media may be scrutinized for references to this illness and then conventional surveillance sources mobilized to provide additional data.

The challenge lies in consolidating all this diverse information to discover and quantify causal relationships between phenomena such as public opinion, attitudes and reported behaviour, as well as in identifying relationships between indicators of the extent and severity of the outbreak and patterns of Twitter and other social media activity. This calls for collaboration between data scientists, epidemiologists, social scientists and modellers to unravel these relationships so that the data can be used in models with a clearer understanding of what they mean and how they should be interpreted.

6 Conclusion

With more than half of the world’s population currently using social media, now is an ideal time to rethink how we use data from these powerful platforms. Social media simultaneously reflect, forecast and shape behaviour. This spontaneous, “nontraditional” source of data may be mined to capture the public’s attitudes, beliefs, opinions, awareness, intentions and reported behaviour towards an infectious disease, which can then be incorporated with caution into disease models. However, social media platforms are not simply passive observers and reporters of trends, but they also interact with their users, for example, by increasing social awareness about non-pharmaceutical interventions and treatment practices. During these uncertain times when information changes quickly, it is imperative that modellers use the most up to date information sources to refine their models—social media data may be just the extra source needed to help accomplish this.