Highlighting keyphrases using senti-scoring and fuzzy entropy for unsupervised sentiment analysis
Introduction
Until recently, determining and monitoring people's sentiment across the conversations that take place on blogs, forums and social media about a product, service, event or any other entity was a dream for every organization. Sentiment analysis (SA) has since become a popular, in-demand technique in text analytics. It is used to understand the attitudes, opinions and emotions expressed within an online mention (Pang & Lee, 2008), and it is useful in many real-life settings: marketing, politics, finance, quality assurance, risk prevention, etc. Social media is a goldmine of consumer stories and opinion data, and online reviews have tremendous influence on consumers and companies compared to traditional data. SA gives an overview of the wider public opinion behind certain topics. However, online social posts are full of complex abbreviations, acronyms and emoticons (Vashishtha & Susan, 2019a). Human language is also highly evolved, so teaching a machine to analyse the grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is difficult. We use Natural Language Processing (NLP) (Cambria & White, 2014) and the sentiment lexicon SentiWordNet (Baccianella et al., 2010) to transform mountains of hashtags, slang and poor grammar into structured data and to extract useful insights that carry sentiment. There is also a shift towards a multimodal social web, where users post opinions as text on Facebook and Twitter and as audio or video clips on YouTube. Such reviews can be classified by multimodal sentiment classification using a fusion of fuzzy logic with acoustic and linguistic features (Vashishtha & Susan, 2020a, 2020c).
Online user reviews usually contain many subjective sentiment phrases that express opinions about specific targets; identifying and using the sentiments of these phrases is essential to opinion mining tasks (Liu, 2012). SA can be performed at three levels: document level, sentence level and aspect level. At document level, the whole document, which contains many sentences, is classified as positive or negative, while at sentence level, sentiment is evaluated for one sentence at a time. Aspect level focuses on all expressions of sentiment present within a given document and the aspect to which each refers (Pang & Lee, 2008). SA can be applied to text in two ways: word level and phrase level. Word level extracts and scores only individual words from the text, whereas phrase level extracts and scores phrases, that is, combinations of words (Cambria & White, 2014). In text-based SA, the system determines the sentiment expressed by examining the words (Vashishtha & Susan, 2020b), the phrases and the dependencies among them. These words and phrases are then used to classify the text into different sentiment classes. In this paper, we deploy phrase-level SA to classify online reviews into positive and negative polarity. Two types of machine learning techniques are generally used in SA: supervised and unsupervised learning. In supervised learning, a model is trained on a labelled dataset to produce an output that supports decision making. Unsupervised learning does not need labelled data, which also makes the task harder because no labels guide the process. This study demonstrates an unsupervised phrase-level SA approach that comprehensively formulates phrases and computes their senti-scores (sentiment scores) and polarity using the SentiWordNet lexicon and fuzzy linguistic hedges. The phrases are then filtered using a fuzzy measure, fuzzy entropy, together with k-means clustering; finally, the senti-scores and polarity of the selected phrases are used to determine the sentiment of the review.
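The phrase formulation and hedge-based scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexicon scores and modifier lists are invented stand-ins (the paper uses SentiWordNet and a POS tagger), the concentrator and dilator use Zadeh's classical operators (square and square root of a membership grade), and negators are assumed to flip polarity. Under the classical definition a concentrator lowers any grade below 1, so whether the paper applies the exponent to the senti-score directly or uses a different mapping is not stated in this excerpt.

```python
# Illustrative word lists and scores only; these are NOT SentiWordNet values
# and NOT the paper's actual hedge lists.
TOY_LEXICON = {            # word -> (senti-score in [0, 1], polarity)
    "good": (0.6, +1),
    "brilliant": (0.9, +1),
    "boring": (0.5, -1),
}
CONCENTRATORS = {"very", "extremely"}
DILATORS = {"somewhat", "slightly"}
NEGATORS = {"not", "never"}
MODIFIERS = CONCENTRATORS | DILATORS | NEGATORS


def score_phrases(tokens):
    """Return (phrase, senti_score, polarity) for each unigram, bigram or
    trigram formed by a lexicon word and up to two preceding modifiers."""
    phrases = []
    for i, token in enumerate(tokens):
        if token not in TOY_LEXICON:
            continue
        # Collect at most two modifiers directly before the sentiment word.
        modifiers = []
        j = i - 1
        while j >= 0 and len(modifiers) < 2 and tokens[j] in MODIFIERS:
            modifiers.insert(0, tokens[j])
            j -= 1
        score, polarity = TOY_LEXICON[token]
        # Apply modifiers from the one nearest the sentiment word outward.
        for modifier in reversed(modifiers):
            if modifier in CONCENTRATORS:
                score = score ** 2        # Zadeh's concentration
            elif modifier in DILATORS:
                score = score ** 0.5      # Zadeh's dilation
            else:
                polarity = -polarity      # assumed: negators flip polarity
        phrases.append((" ".join(modifiers + [token]), score, polarity))
    return phrases


print(score_phrases("the plot was not very good".split()))
```

Here "not very good" is treated as a single trigram: the concentrator acts on the score of "good" first, and the negator then reverses the polarity of the resulting phrase.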
The key contributions of the paper are as follows:
1. An unsupervised phrase-level SA approach is proposed to perform sentiment analysis on online reviews using n-gram techniques, specifically a combination of unigrams, bigrams and trigrams.
2. Phrases are constructed comprehensively using a part-of-speech (POS) tagger and lists of concentrators, dilators and negators. Their senti-scores and polarity are computed using the SentiWordNet lexicon and fuzzy linguistic hedges.
3. Document-level SA on online reviews is carried out by extracting high sentiment-bearing keyphrases, filtered using fuzzy entropy and k-means clustering, and then computing the sentiment of the review.
4. The performance of our fuzzy technique is evaluated using accuracy and F-score. The results indicate higher scores than the state-of-the-art.
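The keyphrase filtering step in the third contribution can be illustrated with a short sketch. It assumes the De Luca and Termini definition of fuzzy entropy (listed in the references), treats each senti-score as a membership grade, clusters the entropy values with a small one-dimensional k-means, and keeps the low-entropy cluster on the assumption that low entropy marks a crisp, strongly sentiment-bearing phrase. How the paper actually combines the two steps is not given in this excerpt, and the function names and example scores are hypothetical.

```python
import math


def fuzzy_entropy(mu):
    """De Luca-Termini entropy of one membership grade, in bits (K = 1)."""
    if mu <= 0.0 or mu >= 1.0:
        return 0.0
    return -(mu * math.log2(mu) + (1 - mu) * math.log2(1 - mu))


def kmeans_1d(values, k=2, iters=50):
    """Plain 1-D k-means with centroids initialised evenly over the range."""
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Recompute centroids; an empty cluster keeps its old centroid.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members
                           else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def select_keyphrases(scored_phrases):
    """Keep phrases in the low-entropy cluster (an assumed selection rule)."""
    entropies = [fuzzy_entropy(score) for _, score in scored_phrases]
    labels, centroids = kmeans_1d(entropies, k=2)
    keep = min(range(2), key=lambda c: centroids[c])
    return [phrase for (phrase, _), lab in zip(scored_phrases, labels)
            if lab == keep]


# Hypothetical (phrase, senti-score) pairs for one review.
scored = [("brilliant", 0.9), ("not bad", 0.55),
          ("very boring", 0.1), ("okay", 0.5)]
print(select_keyphrases(scored))
```

Fuzzy entropy peaks at a grade of 0.5 and falls to zero at 0 and 1, so phrases whose scores sit near the middle of the range are the ones this rule discards.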
The rest of the paper is organized as follows. Section 2 discusses research papers on sentiment analysis based on phrases and fuzzy logic. Section 3 describes the proposed fuzzy approach for phrase-level SA. Section 4.1 discusses the experimental setup of our approach, and Section 4.2 presents the results. Section 5 draws the overall conclusions.
Section snippets
Phrase-level sentiment analysis
Phrase-level sentiment analysis has attracted great interest over the past decade because of its practical utility in social sentiment analysis. Turney presented an unsupervised algorithm for classifying reviews into two classes: recommended or not recommended (Turney, 2002). He proposed phrase extraction patterns and computed the semantic orientation of each extracted phrase using the PMI-IR algorithm. PMI-IR combines Pointwise Mutual Information (PMI) and Information Retrieval (IR); it measures the
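As a rough illustration of the semantic orientation computed by PMI-IR, the sketch below estimates it from search hit counts as the log-ratio of a phrase's co-occurrence with "excellent" against its co-occurrence with "poor", normalised by the hit counts of the two reference words. The function name, argument names and the small smoothing constant are assumptions made for this example; the hit counts would come from an information-retrieval engine, which is not modelled here.

```python
import math


def semantic_orientation(hits_near_excellent, hits_near_poor,
                         hits_excellent, hits_poor, smoothing=0.01):
    """Semantic orientation of a phrase from hit counts.

    SO = log2( hits(phrase NEAR "excellent") * hits("poor")
             / (hits(phrase NEAR "poor") * hits("excellent")) )

    The smoothing term (an assumed value) avoids division by zero when a
    phrase never co-occurs with one of the reference words.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical counts: the phrase co-occurs far more often with "excellent".
print(semantic_orientation(100, 10, 1000, 1000) > 0)
```

A positive value suggests the phrase leans towards a recommendation, and a negative value suggests the opposite.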
Motivation
There are millions of online reviews on the internet covering various topics, events, products and services. Organizations want to analyse these reviews for sentiment, but doing so is challenging. Several works have tackled the problem by searching for different phrase patterns in text. They use all the extracted phrases to detect the sentiment, yet some of these phrases are not important, so the wrong sentiment can be detected. This motivated us to extract only important phrases i.e.
Experimental setup
In the experimental phase, the proposed system was implemented in Python on a machine with an Intel Core i5 processor, a 64-bit operating system and 8 GB RAM. We used two movie review datasets: the polarity dataset v2.0 by Pang and Lee, and the IMDB dataset. The first dataset contains 1000 positive and 1000 negative processed movie reviews. The IMDB dataset has a total of 50,000 reviews, with a training set of 25,000 labelled instances and a testing set of 25,000 labelled instances; we have merged both the sets since
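Because both datasets carry positive and negative labels, the accuracy and F-score used for evaluation can be computed by comparing predicted polarities with those labels. The sketch below is a generic illustration: the label strings, the function name and the choice of the positive class as the reference for the F-score are assumptions, since this excerpt does not say how the paper averages the F-score.

```python
def accuracy_and_f_score(y_true, y_pred, positive="pos"):
    """Return (accuracy, F-score of the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-score is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for four reviews.
gold = ["pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg"]
print(accuracy_and_f_score(gold, predicted))
```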
Conclusion
Sentiment analysis is the evaluation and study of people's opinions, attitudes and emotions towards an entity. The entity can represent individuals, events or topics, and these topics are most likely to be covered by reviews. Public sentiment regarding any social issue can be analysed easily using SA. In this paper, we have proposed an unsupervised sentiment classification system that comprehensively formulates phrases, computes their senti-scores (sentiment scores) and polarity using fuzzy
CRediT authorship contribution statement
Srishti Vashishtha: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Seba Susan: Conceptualization, Methodology, Investigation, Writing - review & editing, Supervision, Project administration, Funding acquisition.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Srishti Vashishtha received her B.Tech degree in Information Technology from Maharaja Surajmal Institute of Technology, GGSIPU, Delhi, and her M.Tech degree in Computer Science from University School of Information, Communication and Technology, GGSIPU, Delhi.
She is currently pursuing a Ph.D. in the Information Technology Department at Delhi Technological University, Delhi. Her areas of interest include Data Mining, Natural Language Processing and Machine Learning.
References (40)
- et al. A definition of a nonprobabilistic entropy in the setting of fuzzy sets theory. Information and Control (1972)
- et al. Senti-N-gram: An n-gram lexicon for sentiment analysis. Expert Systems with Applications (2018)
- et al. Finding significant keywords for document databases by two-phase maximum entropy partitioning. Pattern Recognition Letters (2019)
- et al. Classification of sentiment reviews using n-gram machine learning approach. Expert Systems with Applications (2016)
- et al. Fuzzy rule based unsupervised sentiment analysis from social media posts. Expert Systems with Applications (2019)
- et al. Inferring sentiments from supervised classification of text and speech cues using fuzzy rules. Procedia Computer Science (2020)
- The concept of a linguistic variable and its application to approximate reasoning-III. Information Sciences (1975)
- Fuzzy logic—a personal perspective. Fuzzy Sets and Systems (2015)
- et al. SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining
- et al. How to evaluate opinionated keyphrase extraction?
- Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine
- Opinion mining from online user reviews using fuzzy linguistic hedges. Applied Computational Intelligence and Soft Computing
- Fuzzy sentiment analysis on microblogs for movie revenue prediction
- Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system
- Semi-supervised keyphrase extraction on scientific article using fact-based sentiment. Telkomnika
- Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence
- Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies
- NLTK: The natural language toolkit
Seba Susan has received her Ph.D. from the Indian Institute of Technology, Delhi in 2014. She is currently an Associate Professor in the Department of Information Technology, Delhi Technological University, Delhi. Her current research interests include statistical inferencing and soft computing tools for Pattern Recognition and Machine Learning.