Co-occurrence networks of Twitter content after manual or automatic processing. A case-study on “gluten-free”
Introduction
In recent years, an increase in consumer demand has been observed for gluten-free products (Christoph et al., 2018, Missbach et al., 2015, Molina-Rosell, 2013). Research on gluten-free products has focused on strategies for dealing with the negative impact that a lack of gluten has on the quality properties of these products. Manufacturing gluten-free cereal products is a challenging task for the food industry (Capriles et al., 2016, Houben et al., 2012). According to Naqash, Gani, Gani, and Masoodi (2017), most approaches include the addition of functional ingredients to the formulation (gluten-free flours, starches, hydrocolloids, proteins, fats, and fibres) or the adoption of alternative processing methods (high pressure, extrusion, and sourdough fermentation) to produce gluten-free products with good sensory quality, especially a texture comparable to that of products containing gluten (Marston et al., 2016, Matos and Rosell, 2012, O’Shea et al., 2013, Penjumras et al., 2019).
However, according to do Nascimento, Fiates, and Teixeira (2017), consumers’ concerns about gluten-free products include the sensory quality of the products and the issues they experience when trying to have a “normal life”, especially in a social context. Still, information on the relevance of extrinsic properties of products, context aspects, and individual attitudes and opinions of gluten-free consumers is scarce. This is because of the difficulty of finding coeliac participants for consumer studies, as they account for an estimated 1–2% of the population (Sapone et al., 2012). Therefore, we believe that what is said on social media networks may be a way to obtain the opinions of this target group of consumers, allowing us to understand their motivations and interests when consuming gluten-free products.
Among social media platforms, Twitter is one of the most popular and dynamic microblogging services, with 500 million text-based messages, called “tweets”, generated by active users per day (Chae, 2015, Da Silva et al., 2014, Mention, 2018, Vidal et al., 2015). The informal and colloquial nature of tweets, together with the ease of use and instant access of the platform, makes its use widespread, giving rise to a huge volume of rapidly generated data (Fried et al., 2015, Moe and Schweidel, 2017). Unlike in other methods of gathering consumer opinions, such as surveys, social media users spontaneously post what they want when they want, which avoids the bias introduced when people are prompted to express an opinion.
Food represents one of the key themes discussed on Twitter (Platania & Spadoni, 2018) and, consequently, tweets are potentially valuable data sources for gaining insight in food-related consumer studies. To date, the exploration of user-generated content on Twitter has been useful for studying food-related topics (food in general, influence of food choices, language of food, food chains, health food, different eating situations, and emotional responses to food and beverages) (Chen and Yang, 2014, Fried et al., 2015, He et al., 2013, Platania and Spadoni, 2018, Samoggia et al., 2019, Vidal et al., 2015, Vidal et al., 2016). However, no study has addressed the exploration and interpretation of a topic like gluten-free.
Different approaches have been used to analyse tweets; automatic word counting is the simplest method of gathering information from users. Calculating the frequency of occurrence of an individual word is a simple and rapid way of summarising the text according to its most frequently mentioned terms. Nevertheless, the frequency of occurrence of individual words has several important limitations. A word counted in isolation may not convey the meaning it has in the dataset, and the loss of the word’s context can lead to misleading conclusions (Hsieh and Shannon, 2005, Vidal et al., 2015, Zhao et al., 2013). Therefore, a prior qualitative analysis of tweet content, based on individual reading, was proposed to analyse tweets in the context in which the words are mentioned. In this way, the content was classified into themes and sub-themes related to the specific topic (Nguyen et al., 2019, Platania and Spadoni, 2018, Samoggia et al., 2019, Vidal et al., 2015). Although implementing manual content analysis can be tedious and time consuming because of the large amounts of text to be read, it has proved successful at gaining a better interpretation of Twitter content (He et al., 2013, He et al., 2017, Vidal et al., 2015). As an automatic alternative, text analysis based on machine learning algorithms has been used to extract meaningful information from textual data, recording themes already established or commonly studied (Constantinides and Holleschovsky, 2016, Sengupta and Ghosh, 2020, van Zoonen and van Der Meer, 2016). However, to perform correctly, machine learning algorithms usually require a large external coded dataset to analyse the text units (Vidal, Ares, & Jaeger, 2018). Thus, developing and adjusting the algorithms for new topics is complex or requires additional information.
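The limitation of counting words in isolation can be illustrated with a minimal Python sketch. The three example texts and the function name word_frequencies are invented for illustration and are not taken from the dataset of this study.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Keep hyphenated terms such as "gluten-free" together as one token.
        counts.update(re.findall(r"[\w-]+", text.lower()))
    return counts


# Hypothetical example texts (not from the study's dataset).
example_texts = [
    "this gluten-free bread is delicious",
    "gluten-free bread again, so dry",
    "no bread for me, I miss bread",
]

frequencies = word_frequencies(example_texts)
print(frequencies.most_common(3))
```

In this toy example, bread is the most frequent word, but the count alone does not reveal whether it was mentioned with praise, with a complaint, or as something missed, which is the loss of context described above.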
Co-occurrence networks have been proposed as an approach to facilitate the understanding and visualisation of the structure of different text items and their content. Co-occurrence networks graphically represent the relevance of terms and the relatedness among them, identifying and displaying patterns of co-occurrence within the text (Ruiz and Barnett, 2015, Su and Lee, 2010). Although broadly applied in studies of bibliometric analysis to identify and visualise the existing connections among data (Skaf et al., 2020, van Eck and Waltman, 2018, Wen et al., 2017), co-occurrence networks can also be used for exploring connections of terms in different text documents.
Co-occurrence networks can be obtained with specific software such as VOSviewer and Gephi or by using the Python programming language. In the VOSviewer software used in this study, the construction of a map comprises three steps: i) A similarity matrix (with association strength as the measure of similarity) is obtained from a co-occurrence matrix (van Eck and Waltman, 2007, van Eck et al., 2006). The similarity between two terms i and j is calculated as the number of co-occurrences of i and j divided by the product of the total number of co-occurrences of i and the total number of co-occurrences of j. ii) The visualisation of similarities (VOS) mapping technique constructs a two-dimensional map in which the items are located in such a way that the distance between any pair of items reflects their similarity. This is done by minimising a weighted sum of the squared Euclidean distances between all pairs of items. The higher the similarity between two items, the higher the weight of their squared distance in the sum. iii) The obtained map is translated, rotated, and reflected to obtain consistent results (always the same map) regardless of the different solutions that can be reached in the optimisation process.
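Steps i) and ii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not VOSviewer's implementation: the similarity is computed as s_ij = c_ij / (w_i * w_j), with w_i taken as the row total of the co-occurrence matrix, and the second function only evaluates the weighted sum of squared distances for a given layout. The optimisation itself, and the constraint needed to stop all items collapsing onto a single point, are not reproduced. The function names, the 3 x 3 matrix, and the coordinates are hypothetical.

```python
def association_strength(cooc):
    """Similarity s_ij = c_ij / (w_i * w_j) from a symmetric co-occurrence matrix.

    cooc is a square list of lists with a zero diagonal; w_i is the total
    number of co-occurrences of term i (the sum of row i).
    """
    n = len(cooc)
    totals = [sum(row) for row in cooc]
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and totals[i] > 0 and totals[j] > 0:
                sim[i][j] = cooc[i][j] / (totals[i] * totals[j])
    return sim


def weighted_squared_distances(sim, coords):
    """Weighted sum of squared Euclidean distances over all pairs i < j."""
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            total += sim[i][j] * (dx * dx + dy * dy)
    return total


# Hypothetical co-occurrence counts for three terms (illustration only).
example_cooc = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]
example_sim = association_strength(example_cooc)

# Hypothetical 2-D positions for the three terms.
example_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
example_value = weighted_squared_distances(example_sim, example_coords)
print(example_sim[0][1], example_value)
```

Because the similarity weights the squared distance, a layout that places strongly associated terms far apart is penalised more heavily, which is why related terms end up close together on the map.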
In the resulting network, the size of the label representing a term is proportional to its frequency of appearance in the text (occurrence). The thickness of the line connecting two terms indicates how often they co-occur within the same text unit. The distance between two terms offers an approximate indication of their relatedness (Cunillera and Guilera, 2018, Marinho et al., 2017, Sharma et al., 2018, van Eck et al., 2010). A dataset of 70 text documents describing flowers was created to illustrate this explanation (Fig. 1). The table in the figure includes the occurrences and co-occurrences of the seven terms. Each term is represented in the network as a circle whose size is proportional to its number of occurrences; for example, the labels and circles of the terms red colour and pink colour are the largest and the smallest because they are the most and least mentioned terms, respectively. The distribution of terms on the map reflects the relationships between items. Spring was a general term, co-mentioned with many terms (red, rose, poppy, and jasmine), and thus appears in the centre. The links with these four terms have the same thickness because the number of co-occurrences is the same (five). Terms that do not co-occur with each other are placed apart at the extremes of the map, each close to the terms with which it co-occurs most. The term poppy, related to red, appears in the top-left, while the term jasmine, related to fragrance, appears in the bottom-right. Rose shows a strong link with red and fragrance and appears in the bottom-left. The term pink links to rose but does not co-occur with any other term, and so appears separated at the bottom-left extreme of the network.
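The occurrence and co-occurrence counts that underlie such a network can be obtained with a short Python sketch. The five documents below are invented for illustration and do not reproduce the 70-document dataset or the table of Fig. 1; the function name count_cooccurrences is likewise hypothetical. Each document is treated as one text unit, and a term is counted at most once per unit.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(documents, terms):
    """Return (occurrences, cooccurrences) of the given terms.

    A term occurs in a document if it appears among its whitespace-separated
    words; two terms co-occur if both appear in the same document.
    Pairs are stored as alphabetically sorted tuples.
    """
    occurrences = Counter()
    cooccurrences = Counter()
    for doc in documents:
        words = set(doc.lower().split())
        present = sorted(t for t in terms if t in words)
        occurrences.update(present)
        cooccurrences.update(combinations(present, 2))
    return occurrences, cooccurrences


# Hypothetical mini-dataset (not the data behind Fig. 1).
flower_terms = ["red", "pink", "rose", "poppy", "jasmine", "fragrance", "spring"]
flower_docs = [
    "red rose in spring",
    "red poppy in spring",
    "pink rose",
    "jasmine fragrance in spring",
    "rose fragrance",
]

occ, cooc = count_cooccurrences(flower_docs, flower_terms)
print(occ["rose"], cooc[("red", "spring")], cooc[("pink", "rose")])
```

In a network drawn from these counts, the occurrence values would set the circle sizes and the co-occurrence values the line thicknesses; pairs that never appear together, such as pink and red here, would have no link.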
In this study, we propose using co-occurrence networks as a tool for analysing terms in tweets to give more structured information than word counting. Using raw text directly from tweets would make the analysis almost automatic; however, a prior qualitative analysis of tweets and the corresponding coding of text could be necessary to capture the relevance of ideas expressed in many different terms and to avoid misunderstanding the text.
Therefore, the first aim was to study how pre-processing of tweet text (coding through qualitative analysis or just pre-cleaning) influences co-occurrence networks to determine if the process can be automated without losing relevant information. The second aim was to analyse tweets about “gluten-free” to gather information about the aspects that are relevant for this specific group of consumers, in general and in relation to specific products.
Section snippets
Retrieval of tweets
A total of 16,386 tweets containing the terms “sin-gluten” or “gluten-free”, posted by users writing in Spanish between September 2017 and January 2018, were retrieved with the rtweet package (Kearney, 2017) for R software (R Core Team, 2016) via Twitter’s Application Programming Interface (API). Re-tweets and repeated tweets were removed. Each retrieved tweet included an ID number, the username of the person posting, and the date and time when the tweet was published, among other information.
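The removal of re-tweets and repeated tweets could be sketched as follows in Python. The criteria used in the study are not detailed here, so everything in this sketch is an assumption: tweets are represented as dictionaries with hypothetical "id", "text" and "is_retweet" keys, a re-tweet is recognised by that flag or by a leading "RT @", and two tweets count as repeated when their texts match after lowercasing and collapsing whitespace.

```python
def drop_retweets_and_repeats(tweets):
    """Keep the first copy of each distinct, non-retweet text.

    Assumed rules (not taken from the study): a tweet is a re-tweet if its
    "is_retweet" flag is true or its text starts with "RT @"; texts are
    compared after lowercasing and collapsing whitespace.
    """
    seen = set()
    kept = []
    for tweet in tweets:
        text = tweet["text"]
        if tweet.get("is_retweet") or text.startswith("RT @"):
            continue
        key = " ".join(text.lower().split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(tweet)
    return kept


# Hypothetical records (not real tweets).
example_tweets = [
    {"id": 1, "text": "Gluten-free bread at home", "is_retweet": False},
    {"id": 2, "text": "RT @user: Gluten-free bread at home", "is_retweet": True},
    {"id": 3, "text": "gluten-free  bread at home", "is_retweet": False},
    {"id": 4, "text": "Gluten-free pizza today", "is_retweet": False},
]

cleaned = drop_retweets_and_repeats(example_tweets)
print([t["id"] for t in cleaned])
```

A stricter or looser notion of a repeated tweet (for example, ignoring URLs or hashtags) would change which records survive, so the matching rule should be reported alongside the final tweet count.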
Themes and sub-themes in tweets on gluten-free
Table 1 shows the content of the subset of 3000 tweets summarised into nine main themes: products, places, culinary preparations, product-related characteristics, ingredients, occasions, social context, diet/health, and sensory characteristics/sensations. Tweets relating to the themes product, places, and culinary preparation were the most frequent (>30%). Although with lower frequency (<10%), other themes related to diet or health issues and sensory characteristics/sensations were also found.
Treatment of the information obtained from Twitter
In this study, the analysis of tweet information was conducted using co-occurrence networks. The tweets were pre-processed in two ways: either by manually coding the content of the tweets or by using the raw tweet text directly (after only a pre-cleaning step). The networks from the manually coded and pre-cleaned text from the subset of 3000 tweets revealed similar main ideas about the topic “gluten-free”, but they were represented differently. When coding the information after reading, concepts and ideas were
Conclusions
Co-occurrence networks help in understanding the information on Twitter by showing the relevance of terms and how they are structured through co-occurrence connections.
This study shows that co-occurrence networks can be built almost directly from pre-cleaned data without losing relevant information. Furthermore, the study highlighted the importance of the number of tweets for obtaining relevant and dependable information.
This approach almost automatic based on co-occurrence networks from
CRediT authorship contribution statement
Patricia Puerta: Investigation, Formal analysis, Writing - original draft. Laura Laguna: Investigation, Writing - review & editing, Writing - original draft. Leticia Vidal: Methodology, Writing - review & editing, Writing - original draft. Gastón Ares: Conceptualization, Methodology, Writing - review & editing, Writing - original draft. Susana Fiszman: Conceptualization, Writing - review & editing, Writing - original draft. Amparo Tárrega: Conceptualization, Formal analysis, Methodology,
Acknowledgements
The authors are grateful to the Spanish Ministry of the Economy and Competitiveness for financial support (project AGL-2016-75403-R) and for the Juan de la Cierva contract for author Laura Laguna (IJCI-2016-27427), and to the Generalitat Valenciana (Project Prometeo 2017/189). The authors are also grateful to Dr. Waltman for his valuable advice on using the VOSviewer technique.
References (61)
- et al. Connecting flavors in social media: A cross cultural study with beer pairing. Food Research International (2019)
- et al. Food for thought: Exploring how people think and talk about food online. Appetite (2018)
- et al. Gluten-free breadmaking: Improving nutritional and bioactive compounds. Journal of Cereal Science (2016)
- et al. Social media in product development. Food Quality and Preference (2015)
- Insights from hashtag #supplychain and Twitter analytics: Considering Twitter and Twitter data for supply chain practice and research. International Journal of Production Economics (2015)
- et al. Does food environment influence food choices? A geographical analysis through “tweets”. Applied Geography (2014)
- et al. Who values gluten-free? Dietary intake, behaviors, and sociodemographic characteristics of young adults who value gluten-free food. Journal of the Academy of Nutrition and Dietetics (2018)
- et al. Tweet sentiment analysis with classifier ensembles. Decision Support Systems (2014)
- et al. We want to be normal! Perceptions of a group of Brazilian consumers with coeliac disease on gluten-free bread buns. International Journal of Gastronomy and Food Science (2017)
- et al. Social media competitive analysis and text mining: A case study in the pizza industry. International Journal of Information Management (2013)
- Semantic network analysis of vaccine sentiment in online social media. Vaccine
- Effect of heat treatment of sorghum flour on the functional properties of gluten-free bread and cake. LWT - Food Science and Technology
- Gluten-free baking: Combating the challenges – A review. Trends in Food Science and Technology
- Emotions associated to mealtimes: Memorable meals and typical evening meals. Food Research International
- Exploring the presentation of HPV information online: A semantic network analysis of websites. Vaccine
- Applying network analysis to explore the global scientific literature on food security. Ecological Informatics
- Investigating preferred coffee consumption contexts using open-ended questions. Food Quality and Preference
- Social media research: The application of supervised machine learning in organizational communication research. Computers in Human Behavior
- Use of emoticon and emoji in tweets for food-related emotional expression. Food Quality and Preference
- Using Twitter data for food-related consumer research: A case study on “what people say when tweeting about different eating situations”. Food Quality and Preference
- Identification of collective viewpoints on microblogs. Data and Knowledge Engineering
- Consumer preferences for written and oral information about allergens when eating out. PLoS One
- Twenty years of statistical learning: From language, back to machine learning. Scientometrics
- Communicating diabetes and diets on Twitter – a semantic content analysis. International Journal of Networking and Virtual Organisations
- Analyzing the language of food on social media
- Tweeting and eating: The effect of links and likes on food-hypersensitive consumers’ perceptions of Tweets. Frontiers in Public Health
- Application of social media analytics: A case of analyzing online hotel reviews. Online Information Review