Co-occurrence networks of Twitter content after manual or automatic processing. A case-study on “gluten-free”

https://doi.org/10.1016/j.foodqual.2020.103993

Highlights

  • Co-occurrence networks of tweets were compared after manual coding and after cleaning only.

  • Cleaned and coded text gave networks with similar structure and term relevance.

  • Most tweets on gluten-free mention products: bread, cake, cookie, beer, and pizza.

  • Users share how to get gluten-free products (buying or preparing) or eating situations.

Abstract

Gathering information from social networks such as Twitter has emerged as a way to obtain spontaneous and direct opinions of users about a topic. This study focuses on using co-occurrence networks to analyse Twitter information. The objectives were to study the impact of text pre-treatment (coding based on qualitative analysis or just pre-cleaning) and to apply co-occurrence networks for analysing what is said on Twitter about specific topics like “gluten-free”. To this end, 16,386 tweets in Spanish containing the terms “sin-gluten” and “gluten-free” were collected. A subset of 3000 tweets was used to build co-occurrence networks in two ways: i) from the manually coded text and ii) from pre-cleaned text. Results indicate that the co-occurrence network from pre-cleaned text provides meaningful information, with a structure and term relevance similar to those of the network from coded text. The whole set of tweets was used to explore Twitter information on gluten-free, showing that users share information about products, occasions, social situations, and places but also product characteristics, sensations, and diet or health issues related to the products. Five product categories for which the lack of gluten is critical (bread, cake, cookie, beer, and pizza) occupied most tweets. According to the related terms, these tweets were intended to recommend how to get these gluten-free products (by buying or cooking) and to show what users prepare and eat (how, when, and where). These aspects differed among products, and separate co-occurrence networks for each product allowed them to be identified more clearly.

Introduction

In recent years, an increase in consumer demand has been observed for gluten-free products (Christoph et al., 2018, Missbach et al., 2015, Molina-Rosell, 2013). Research on gluten-free products has focused on strategies dealing with the negative impact a lack of gluten has on the quality properties of these products. Manufacturing gluten-free cereal products is a challenging task for the food industry (Capriles et al., 2016, Houben et al., 2012). Besides, according to Naqash, Gani, Gani, and Masoodi (2017), most approaches include the addition of functional ingredients to the formulation (gluten-free flours, starches, hydrocolloids, proteins, fats, and fibres) or the adoption of alternative processing methods (high pressure, extrusion, and sourdough fermentation) to produce gluten-free products with good sensory quality, especially a texture comparable to those containing gluten (Marston et al., 2016, Matos and Rosell, 2012, O’Shea et al., 2013, Penjumras et al., 2019).

However, according to do Nascimento, Fiates, and Teixeira (2017), consumers’ concerns about gluten-free products include the sensory quality of products and the issues they experience trying to have a “normal life”, especially in a social context. Still, information on the relevance of extrinsic properties of products, context aspects, and individual attitudes and opinions of gluten-free consumers is scarce. This is due to the difficulty in finding coeliac participants for consumer studies, as they account for an estimated 1–2% of the population (Sapone et al., 2012). Therefore, we believe what is said on social media networks may be a way to obtain opinions of this target group of consumers, allowing us to understand their motivations and interests when consuming gluten-free products.

Among social media platforms, Twitter is one of the most popular and dynamic microblogging services, with 500 million text-based messages, called “tweets”, generated by active users per day (Chae, 2015, Da Silva et al., 2014, Mention, 2018, Vidal et al., 2015). The informal and colloquial nature of tweets, together with the ease and instant access of the platform, makes its use widespread, giving rise to a huge volume of rapidly generated data (Fried et al., 2015, Moe and Schweidel, 2017). Unlike with other methods for gathering consumer opinions (such as surveys), social media users spontaneously post what they want when they want, which avoids the bias of being prompted to express an opinion.

Food represents one of the key themes discussed on Twitter (Platania & Spadoni, 2018) and, consequently, tweets are potentially valuable data sources for food-related consumer studies. To date, the exploration of user-generated content on Twitter has been useful to study food-related topics (food in general, influence of food choices, language of food, food chains, health food, different eating situations, and emotional responses to food and beverages) (Chen and Yang, 2014, Fried et al., 2015, He et al., 2013, Platania and Spadoni, 2018, Samoggia et al., 2019, Vidal et al., 2015, Vidal et al., 2016). However, no study has addressed the exploration and interpretation of a topic like gluten-free.

Different approaches have been used to analyse tweets; automatic word counting is the simplest method of gathering information from users. Calculating the frequency or occurrence of mentions for an individual word is a simple and rapid way of summarising the text according to the terms that are frequently mentioned. Nevertheless, the frequency of occurrence of individual words has several important limitations. A word counted in isolation may not represent the meaning it had in the dataset, which can lead to misleading conclusions because of the loss of the word’s context (Hsieh and Shannon, 2005, Vidal et al., 2015, Zhao et al., 2013). Therefore, a prior qualitative analysis of tweet contents, with individual reading, was proposed to analyse tweets in the context in which the words are mentioned. Thus, the content was classified into themes and sub-themes related to the specific topic (Nguyen et al., 2019, Platania and Spadoni, 2018, Samoggia et al., 2019, Vidal et al., 2015). Although implementing manual content analysis can be tedious and time-consuming because of the large amounts of text to be read, it proved to be successful at gaining better interpretation of Twitter content (He et al., 2013, He et al., 2017, Vidal et al., 2015). As an automatic alternative, text analysis based on machine learning algorithms has been used to extract meaningful information from the textual data, recording themes already established or commonly studied (Constantinides and Holleschovsky, 2016, Sengupta and Ghosh, 2020, van Zoonen and van Der Meer, 2016). However, to perform correctly, machine learning algorithms usually require a large external coded dataset to analyse the text units (Vidal, Ares, & Jaeger, 2018). Thus, the development and adjustment of the algorithms for new topics is complex or requires additional information.
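The word-counting approach described above can be illustrated with a minimal Python sketch. The example tweets and the `word_frequencies` helper are hypothetical and are not taken from the study's dataset; the tokenising rule (letters and hyphens only) is likewise an assumption made for illustration.

```python
from collections import Counter
import re

# Hypothetical example tweets (not taken from the study's dataset).
tweets = [
    "Gluten-free bread at the new bakery",
    "Baking gluten-free bread at home today",
    "This gluten-free beer is not bad",
]

def word_frequencies(texts):
    """Count how many times each lowercase word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Keep letters and hyphens so that "gluten-free" stays one token.
        counts.update(re.findall(r"[a-záéíóúñü-]+", text.lower()))
    return counts

frequencies = word_frequencies(tweets)
print(frequencies.most_common(3))
```

The count for "bread" says nothing about whether the bread was bought at a bakery or baked at home, which is the loss of context the paragraph refers to.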

Co-occurrence networks have been proposed as an approach to facilitate the understanding and visualisation of the structure of different text items and their content. Co-occurrence networks graphically represent the relevance of terms and the relatedness among them, identifying and displaying patterns of co-occurrence within the text (Ruiz and Barnett, 2015, Su and Lee, 2010). Although broadly applied in studies of bibliometric analysis to identify and visualise the existing connections among data (Skaf et al., 2020, van Eck and Waltman, 2018, Wen et al., 2017), co-occurrence networks can also be used for exploring connections of terms in different text documents.

Co-occurrence networks can be obtained with specific software such as VOSviewer and Gephi or by using the Python programming language. In the VOSviewer software used in this study, the construction of a map comprises three steps: i) A similarity matrix (association strength as a measure of similarity) is obtained from a co-occurrence matrix (van Eck and Waltman, 2007, van Eck et al., 2006). The similarity between two terms i and j is calculated as a ratio: the number of co-occurrences of i and j divided by the product of the total number of co-occurrences of i and the total number of co-occurrences of j. ii) The visualisation of similarities (VOS) mapping technique constructs a two-dimensional map in which the items are located in such a way that the distance between any pair of items reflects their similarity. This is done by minimising a weighted sum of the squared Euclidean distances between all pairs of items. The higher the similarity between two items, the higher the weight of their squared distance in the sum. iii) The obtained map is translated, rotated, and reflected to obtain consistent results (always the same map) regardless of the different solutions that can be reached in the optimisation process.
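Steps i) and ii) can be sketched in Python with NumPy as follows. This is an illustration of the ratio and the weighted sum as worded above, not VOSviewer's own code: the function names, the toy co-occurrence matrix, and the toy positions are assumptions, and any normalising constant the software may apply to the similarity is omitted.

```python
import numpy as np

def association_strength(cooc):
    """Similarity s_ij = c_ij / (c_i * c_j), following the ratio described above.

    cooc is a symmetric matrix of co-occurrence counts with a zero diagonal;
    c_i is the total number of co-occurrences of term i (its row sum).
    """
    cooc = np.asarray(cooc, dtype=float)
    totals = cooc.sum(axis=1)
    denom = np.outer(totals, totals)
    sim = np.zeros_like(cooc)
    # Terms with no co-occurrences get similarity 0 instead of dividing by zero.
    np.divide(cooc, denom, out=sim, where=denom > 0)
    return sim

def vos_objective(positions, sim):
    """Weighted sum of squared Euclidean distances over all pairs i < j."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    sq_dist = (diffs ** 2).sum(axis=-1)
    upper = np.triu_indices(len(positions), k=1)
    return float((sim[upper] * sq_dist[upper]).sum())

# Hypothetical co-occurrence counts for three terms (illustration only).
cooc = [[0, 4, 1],
        [4, 0, 0],
        [1, 0, 0]]
sim = association_strength(cooc)

# Hypothetical 2-D positions for the same three terms.
positions = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
print(vos_objective(positions, sim))
```

The sketch only evaluates the quantity to be minimised for a given layout. An actual mapping procedure would search for the positions that minimise it under a constraint that stops all items collapsing onto one point, which is not shown here.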

In the obtained network, the size of the label representing a term is proportional to its frequency of appearance in the text (occurrence). The thickness of the line connecting two terms indicates how often they co-occur within the same text unit. The distance between two terms offers an approximate indication of the relatedness of the terms (Cunillera and Guilera, 2018, Marinho et al., 2017, Sharma et al., 2018, van Eck et al., 2010). A dataset of 70 text documents describing flowers was created to illustrate the explanation (Fig. 1). The table in the figure includes the occurrences and co-occurrences of the seven terms. Each term is represented in the network as a circle of size proportional to the number of occurrences; for example, the labels and circles of the terms red colour and pink colour are the largest and the smallest because they are the most and least mentioned terms, respectively. The distribution of terms on the map reflects the relationships between items. Spring is a general term, co-mentioned with many terms (red, rose, poppy, and jasmine), and thus appears in the centre. The links with these four terms have the same thickness because the number of co-occurrences is the same (five). Terms that do not co-occur with each other are placed apart at the extremes, close to the terms with which they co-occur most. The term poppy, related to red, appears in the top-left, while the term jasmine, related to fragrance, appears in the bottom-right. Rose shows a strong link with red and fragrance, and appears in the bottom-left. The term pink links to rose but does not co-occur with any other term, and so appears separated at the bottom-left extreme of the network.
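The occurrences and co-occurrences that feed such a network can be counted as in the following Python sketch. The four short documents are hypothetical and are not the 70 documents of Fig. 1; the `count_terms` helper and the whitespace tokenisation are also assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def count_terms(documents, vocabulary):
    """Count term occurrences and pairwise co-occurrences per text unit.

    A term is counted once per document, and two terms co-occur when
    both appear in the same document.
    """
    occurrences = Counter()
    cooccurrences = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        # Sort so each pair is always stored in the same (alphabetical) order.
        present = sorted(words & vocabulary)
        occurrences.update(present)
        cooccurrences.update(combinations(present, 2))
    return occurrences, cooccurrences

# Hypothetical documents using the seven flower terms (illustration only).
docs = [
    "red rose with strong fragrance",
    "red poppy in spring",
    "pink rose",
    "jasmine fragrance in spring",
]
vocab = {"red", "pink", "rose", "poppy", "jasmine", "fragrance", "spring"}

occ, cooc = count_terms(docs, vocab)
print(occ)
print(cooc)
```

In a network drawn from these counts, `occ` would set the size of each circle and `cooc` the thickness of each link.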

In this study, we propose using co-occurrence networks as a tool for analysing terms in tweets to give more structured information than word counting. Using raw text directly from tweets would make the analysis almost automatic. However, a prior qualitative analysis of tweets and the corresponding coding of text could be necessary to capture the relevance of ideas expressed in many different terms and to avoid misunderstanding the text.

Therefore, the first aim was to study how pre-processing of tweet text (coding through qualitative analysis or just pre-cleaning) influences co-occurrence networks to determine if the process can be automated without losing relevant information. The second aim was to analyse tweets about “gluten-free” to gather information about the aspects that are relevant for this specific group of consumers, in general and in relation to specific products.

Retrieval of tweets

A total of 16,386 tweets containing the terms “sin-gluten” or “gluten-free”, posted by users writing in Spanish between September 2017 and January 2018, were retrieved with the rtweet package (Kearney, 2017) for R software (R Core Team, 2016) via Twitter’s Application Programming Interface (API). Re-tweets and repeated tweets were removed. Each retrieved tweet included an ID number, the username of the person posting, and the date and time when the tweet was published, among other information.
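The removal of re-tweets and repeated tweets can be sketched as follows. The study used the rtweet package in R; this Python version is only an illustration, and the field names (`id`, `text`, `is_retweet`), the "RT @" prefix rule, and the example tweets are assumptions rather than the authors' procedure.

```python
def remove_retweets_and_repeats(tweets):
    """Drop re-tweets and tweets whose text repeats an earlier one."""
    seen = set()
    kept = []
    for tweet in tweets:
        text = tweet["text"].strip()
        # Assumption: re-tweets are flagged or start with the "RT @" prefix.
        if tweet.get("is_retweet") or text.startswith("RT @"):
            continue
        # Compare texts ignoring case and repeated whitespace.
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(tweet)
    return kept

# Hypothetical tweets in Spanish (illustration only).
tweets = [
    {"id": 1, "text": "Pan sin gluten recién hecho"},
    {"id": 2, "text": "RT @someone: Pan sin gluten recién hecho"},
    {"id": 3, "text": "pan sin gluten  recién hecho"},
    {"id": 4, "text": "Cerveza gluten-free en el bar"},
]

kept = remove_retweets_and_repeats(tweets)
print([t["id"] for t in kept])
```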

Themes and sub-themes in tweets on gluten-free

Table 1 shows the content of the subset of 3000 tweets summarised into nine main themes: products, places, culinary preparations, product-related characteristics, ingredients, occasions, social context, diet/health, and sensory characteristics/sensations. Tweets relating to the themes products, places, and culinary preparations were the most frequent (>30%). Other themes, related to diet or health issues and sensory characteristics/sensations, were also found, although with lower frequency (<10%).

Treatment of the information obtained from Twitter

In this study, the analysis of tweet information was conducted using co-occurrence networks. The tweets were pre-processed in two ways: either by manually coding the content of each tweet or by using the raw tweet text directly (after just a pre-cleaning step). The networks from the manually coded and pre-cleaned text from the subset of 3000 tweets revealed similar main ideas about the topic “gluten-free”, but these were represented differently. When coding the information after reading, concepts and ideas were

Conclusions

Co-occurrence networks help in understanding the information on Twitter by showing the relevance of terms and how they are structured through co-occurrence connections.

This study shows that co-occurrence networks can be built, almost directly, from pre-cleaned data without losing relevant information. Furthermore, the study highlighted the importance of the number of tweets for obtaining relevant and dependable information.

This approach almost automatic based on co-occurrence networks from

CRediT authorship contribution statement

Patricia Puerta: Investigation, Formal analysis, Writing - original draft. Laura Laguna: Investigation, Writing - review & editing, Writing - original draft. Leticia Vidal: Methodology, Writing - review & editing, Writing - original draft. Gastón Ares: Conceptualization, Methodology, Writing - review & editing, Writing - original draft. Susana Fiszman: Conceptualization, Writing - review & editing, Writing - original draft. Amparo Tárrega: Conceptualization, Formal analysis, Methodology,

Acknowledgements

The authors are grateful to the Spanish Ministry of the Economy and Competitiveness for financial support (project AGL-2016-75403-R) and for the Juan de la Cierva contract for author Laura Laguna (IJCI-2016-27427), and to Generalitat Valenciana (Project Prometeo 2017/189). The authors are also grateful to Dr. Waltman for his valuable advice on using the VOSviewer technique.

References (61)

  • G.J. Kang et al. Semantic network analysis of vaccine sentiment in online social media. Vaccine (2017)
  • K. Marston et al. Effect of heat treatment of sorghum flour on the functional properties of gluten-free bread and cake. LWT - Food Science and Technology (2016)
  • F. Naqash et al. Gluten-free baking: Combating the challenges – A review. Trends in Food Science and Technology (2017)
  • B. Piqueras-Fiszman et al. Emotions associated to mealtimes: Memorable meals and typical evening meals. Food Research International (2015)
  • J.B. Ruiz et al. Exploring the presentation of HPV information online: A semantic network analysis of websites. Vaccine (2015)
  • L. Skaf et al. Applying network analysis to explore the global scientific literature on food security. Ecological Informatics (2020)
  • S. Spinelli et al. Investigating preferred coffee consumption contexts using open-ended questions. Food Quality and Preference (2017)
  • W. van Zoonen et al. Social media research: The application of supervised machine learning in organizational communication research. Computers in Human Behavior (2016)
  • L. Vidal et al. Use of emoticon and emoji in tweets for food-related emotional expression. Food Quality and Preference (2016)
  • L. Vidal et al. Using Twitter data for food-related consumer research: A case study on “what people say when tweeting about different eating situations”. Food Quality and Preference (2015)
  • B. Zhao et al. Identification of collective viewpoints on microblogs. Data and Knowledge Engineering (2013)
  • F.M. Begen et al. Consumer preferences for written and oral information about allergens when eating out. PLoS One (2016)
  • Constantinides, E., & Holleschovsky, N. I. (2016). Impact of online product reviews on purchasing decisions. WEBIST...
  • T. Cunillera et al. Twenty years of statistical learning: From language, back to machine learning. Scientometrics (2018)
  • K. Eriksson-Backa et al. Communicating diabetes and diets on Twitter – a semantic content analysis. International Journal of Networking and Virtual Organisations (2016)
  • Feinerer, I., & Hornik, K. (2017). Package “tm”: Text Mining Package. Version 0.7-1. CRAN.R-Project....
  • D. Fried et al. Analyzing the language of food on social media
  • Gómez-Corona, C., Ares, G., Spinelli, S., Veflen, N., & Stathopoulou, N. (2019). Social media in sensory and consumer...
  • R.J.T. Hamshaw et al. Tweeting and eating: The effect of links and likes on food-hypersensitive consumers’ perceptions of Tweets. Frontiers in Public Health (2018)
  • W. He et al. Application of social media analytics: A case of analyzing online hotel reviews. Online Information Review (2017)