Highlighting keyphrases using senti-scoring and fuzzy entropy for unsupervised sentiment analysis

https://doi.org/10.1016/j.eswa.2020.114323

Highlights

  • An unsupervised sentiment classification system for online reviews using n-gram techniques

  • Formulation and senti-scoring of phrase patterns

  • Senti-scores of phrases are computed from SentiWordNet lexicon and fuzzy linguistic hedges.

  • A fuzzy entropy filter and k-means clustering are applied to extract and highlight keyphrases

  • Comparison with other state-of-the-art methods indicates higher scores for our system.

Abstract

Sentiment Analysis (SA) is a process that aids in assessing the performance of products or services from user-generated online posts. Many websites now allow customers to post reviews about movies, products, events or services, which has led to a large accumulation of reviews written in natural language. The availability of online reviews and rising end-user expectations have motivated the development of opinion mining systems that can automatically classify customers' reviews. In SA, highlighting the significant keyphrases that contribute towards correct sentiment cognition is a tedious task. In this paper, we propose an unsupervised sentiment classification system that comprehensively formulates phrases and computes their senti-scores (sentiment scores) and polarity using the SentiWordNet lexicon and fuzzy linguistic hedges. It then extracts the keyphrases significant for SA using a fuzzy entropy filter and k-means clustering. We perform document-level SA on online reviews using n-gram techniques, specifically a combination of unigrams, bigrams and trigrams. Experiments on two benchmark movie review datasets, the polarity dataset by Pang and Lee and the IMDB dataset, show that our approach achieves higher accuracy than other state-of-the-art methods for phrase-level SA.

Introduction

Until recent years, determining and monitoring the sentiment expressed in all the conversations that take place on blogs, forums and social media about a product, service, event or any other entity was a dream for every organization. Sentiment analysis (SA) has since become a very popular technique that is in high demand in the field of text analytics. It is used to gain an understanding of the attitudes, opinions and emotions expressed within an online mention (Pang & Lee, 2008). It is a useful tool in many real-life settings: marketing, politics, finance, quality assurance, risk prevention, etc. Social media is a goldmine of consumer stories and opinion data, and online reviews have tremendous influence on consumers and companies compared to traditional data. SA allows us to gain an overview of the wider public opinion behind certain topics. However, online social posts are full of complex abbreviations, acronyms and emoticons (Vashishtha & Susan, 2019a). Human language is also highly evolved: teaching a machine to analyse the grammatical nuances, cultural variations, slang and misspellings that occur in online mentions is a difficult process. We have used Natural Language Processing (NLP) (Cambria & White, 2014) and the sentiment lexicon SentiWordNet (Baccianella et al., 2010) to transform mountains of hashtags, slang and poor grammar into structured data and to extract useful insights that carry sentiment. There is also a shift towards a multimodal social web, where users post their opinions as text on Facebook and Twitter and as audio or video clips on YouTube. Such reviews can be classified by multimodal sentiment classification using a fusion of fuzzy logic with acoustic and linguistic features (Vashishtha & Susan, 2020a, 2020c).

Online user reviews usually contain many subjective sentiment phrases that express opinions about specific targets; identifying and utilizing the sentiments of such phrases is essential to opinion mining tasks (Liu, 2012). SA can be performed at three levels: document level, sentence level and aspect level. At document level, the whole document, which contains many sentences, is classified as positive or negative, while at sentence level, sentiment is evaluated for one sentence at a time. Aspect level focuses on all expressions of sentiment present within a given document and the aspects to which they refer (Pang & Lee, 2008). SA can be applied to text in two ways: word level and phrase level. Word level extracts and scores only individual words from the text, whereas phrase level extracts and scores phrases, that is, combinations of words (Cambria & White, 2014). In text-based SA, the system can determine the sentiment expressed by examining the words (Vashishtha & Susan, 2020b), the phrases and the dependencies among them. These words and phrases can then be used to classify the given text into different sentiment classes. In this paper, we deploy phrase-level SA to classify online reviews as positive or negative. Two types of machine learning techniques are generally used in SA: supervised and unsupervised. In supervised learning, a model is trained on a labelled dataset to produce output that supports decision making. Unsupervised learning does not need any labelled data, which makes such problems harder to solve. This study demonstrates an unsupervised phrase-level SA approach that comprehensively formulates phrases and computes their senti-scores (sentiment scores) and polarity using the SentiWordNet lexicon and fuzzy linguistic hedges. The scored phrases are filtered by a fuzzy measure, fuzzy entropy, and by k-means clustering; finally, the senti-scores and polarity of the selected phrases are used to determine the sentiment of the review.
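As a rough illustration of phrase formulation and senti-scoring, the sketch below builds unigrams, bigrams and trigrams from a tokenized review and scores those that end in a sentiment word, letting preceding hedge words modify the score. It is not the paper's formulation: `TOY_LEXICON` is a made-up stand-in for SentiWordNet lookups, the hedge word lists are hypothetical, and the power-function hedges (square root to strengthen, square to weaken) are an assumption chosen only to show the mechanism.

```python
from typing import Dict, List, Optional, Tuple

Phrase = Tuple[str, ...]

# Hypothetical stand-in for SentiWordNet: word -> (polarity, strength),
# with polarity in {+1, -1} and strength in [0, 1]. Values are made up.
TOY_LEXICON: Dict[str, Tuple[int, float]] = {
    "good": (1, 0.64),
    "funny": (1, 0.81),
    "boring": (-1, 0.49),
}
CONCENTRATORS = {"very", "extremely"}  # assumed to strengthen a score
DILATORS = {"slightly", "somewhat"}    # assumed to weaken a score
NEGATORS = {"not", "never"}            # assumed to flip polarity


def ngrams(tokens: List[str], n: int) -> List[Phrase]:
    """Return all contiguous n-token phrases in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def score_phrase(phrase: Phrase) -> Optional[float]:
    """Signed senti-score of a phrase, or None if it does not fit the pattern
    (zero or more hedge words followed by one lexicon word)."""
    *modifiers, head = phrase
    if head not in TOY_LEXICON:
        return None
    polarity, strength = TOY_LEXICON[head]
    for word in reversed(modifiers):  # apply the nearest modifier first
        if word in CONCENTRATORS:
            strength = strength ** 0.5  # assumption: push strength toward 1
        elif word in DILATORS:
            strength = strength ** 2    # assumption: pull strength toward 0
        elif word in NEGATORS:
            polarity = -polarity
        else:
            return None
    return polarity * strength


def scored_phrases(tokens: List[str]) -> Dict[Phrase, float]:
    """Score every unigram, bigram and trigram that fits the pattern."""
    result: Dict[Phrase, float] = {}
    for n in (1, 2, 3):
        for phrase in ngrams(tokens, n):
            score = score_phrase(phrase)
            if score is not None:
                result[phrase] = score
    return result
```

For example, `scored_phrases("the plot was not very good".split())` scores the unigram `("good",)`, the bigram `("very", "good")` and the trigram `("not", "very", "good")`. Overlapping phrases such as these can carry opposite signs, which is one reason a later selection step is needed before aggregating scores.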

The key contributions of the paper are as follows:

  1. An unsupervised phrase-level SA approach is proposed to perform sentiment analysis on online reviews using n-gram techniques, specifically a combination of unigrams, bigrams and trigrams.

  2. Phrases are constructed comprehensively using a part-of-speech (POS) tagger and lists of concentrators, dilators and negators. Their senti-scores and polarity are computed using the SentiWordNet lexicon and fuzzy linguistic hedges.

  3. Document-level SA on online reviews is performed by extracting keyphrases with high sentiment content, selected by fuzzy entropy and k-means clustering, and then computing the sentiment of the review.

  4. The performance of our fuzzy technique is evaluated using accuracy and F-score. The results indicate higher scores than the state-of-the-art.
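The keyphrase selection step can be pictured with a minimal sketch, under several assumptions that do not come from the paper: fuzzy entropy is taken to be the normalized De Luca-Termini entropy of a phrase's score magnitude, phrases above a hypothetical `max_entropy` threshold are dropped as ambiguous, and a two-cluster one-dimensional k-means (Lloyd's algorithm) keeps the cluster with the larger centroid. The review sentiment is then the sign of the summed scores of the selected phrases.

```python
import math
from typing import Dict, List, Tuple


def fuzzy_entropy(mu: float) -> float:
    """Normalized De Luca-Termini entropy of one membership value in [0, 1]."""
    if mu <= 0.0 or mu >= 1.0:
        return 0.0
    return -(mu * math.log(mu) + (1 - mu) * math.log(1 - mu)) / math.log(2)


def kmeans_1d_two_clusters(values: List[float], iterations: int = 50) -> Tuple[float, float]:
    """Lloyd's algorithm with k=2 on scalars; returns (low, high) centroids."""
    low, high = min(values), max(values)
    for _ in range(iterations):
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group) if high_group else high
        if (new_low, new_high) == (low, high):
            break
        low, high = new_low, new_high
    return low, high


def select_keyphrases(scored: Dict[str, float], max_entropy: float = 0.95) -> Dict[str, float]:
    """Keep low-entropy phrases that fall in the stronger of two clusters.

    `scored` maps a phrase to a signed score in [-1, 1]; `max_entropy` is a
    hypothetical threshold, not a value from the paper.
    """
    kept = {p: s for p, s in scored.items() if fuzzy_entropy(abs(s)) <= max_entropy}
    if not kept:
        return {}
    low, high = kmeans_1d_two_clusters([abs(s) for s in kept.values()])
    # Ties (for example when all magnitudes are equal) are kept.
    return {p: s for p, s in kept.items() if abs(abs(s) - high) <= abs(abs(s) - low)}


def classify_review(scored: Dict[str, float]) -> str:
    """Label a review from the summed scores of its selected keyphrases."""
    total = sum(select_keyphrases(scored).values())
    return "positive" if total >= 0 else "negative"
```

With the made-up scores `{"very good": 0.8, "good": 0.64, "slightly boring": -0.2401, "okay": 0.5, "not funny": -0.81}`, the entropy filter drops "okay" (maximally ambiguous at 0.5) and the clustering drops "slightly boring" (weak magnitude), leaving three keyphrases to decide the label.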

The rest of the paper is organized as follows. Section 2 discusses research papers on sentiment analysis based on phrases and fuzzy logic. Section 3 describes the proposed fuzzy approach for phrase-level SA. Section 4.1 discusses the experimental setup of our approach, and the results are presented in Section 4.2. The overall conclusions are drawn in Section 5.

Section snippets

Phrase-level sentiment analysis

Phrase-level sentiment analysis has attracted great interest over the past decade because of its practical utility in social sentiment analysis. Turney presented an unsupervised algorithm for classifying reviews into two classes: recommended or not recommended (Turney, 2002). He presented phrase extraction patterns; the semantic orientation of a phrase is then computed using the PMI-IR algorithm. PMI-IR combines Pointwise Mutual Information (PMI) and Information Retrieval (IR); it measures the
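For orientation, the semantic orientation used in Turney (2002) is commonly stated as the difference between a phrase's PMI with the reference word "excellent" and its PMI with "poor", estimated from search-engine hit counts. The sketch below follows that commonly stated form; the hit counts passed in are hypothetical, and the parameter names are ours.

```python
import math


def semantic_orientation(
    hits_near_excellent: float,
    hits_near_poor: float,
    hits_excellent: float,
    hits_poor: float,
    smoothing: float = 0.01,
) -> float:
    """Semantic orientation of a phrase from co-occurrence hit counts.

    SO = log2( hits(phrase NEAR excellent) * hits(poor)
               / (hits(phrase NEAR poor) * hits(excellent)) )
    A small constant is added to the co-occurrence counts to avoid
    division by zero.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)
```

A positive value suggests the phrase keeps company with "excellent" more than with "poor"; averaging this value over a review's phrases gives the recommended or not recommended decision described above.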

Motivation

There are millions of online reviews on the internet covering various topics, events, products and services. Analyzing these reviews for SA is in demand among organizations, but it is a challenging task. Several works have tackled this issue by searching for different phrase patterns in text. They use all the extracted phrases to detect the sentiment; because some of these phrases are not important, the wrong sentiment can be detected. This motivated us to extract only important phrases i.e.

Experimental setup

In the experimental phase, the proposed system was executed in Python on a machine with an Intel Core i5 processor, a 64-bit operating system and 8 GB of RAM. We have used two movie review datasets: the polarity dataset v2.0 by Pang and Lee and the IMDB dataset. The first dataset contains 1000 positive and 1000 negative processed movie reviews. The IMDB dataset has a total of 50,000 reviews, with a training set of 25,000 labelled instances and a testing set of 25,000 labelled instances; we have merged both the sets since
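The two evaluation measures named in the contributions, accuracy and F-score, can be computed from gold and predicted labels as in the sketch below. The label strings and the choice of the positive class as the F-score target are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(
    gold: Sequence[str], predicted: Sequence[str], positive: str = "pos"
) -> Tuple[float, float]:
    """Return (accuracy, F1 for the `positive` class) for two label lists."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equal in length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1
```

On a balanced dataset such as the 1000 positive and 1000 negative reviews described above, accuracy and F-score tend to move together; they diverge more on skewed data.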

Conclusion

Sentiment Analysis is the evaluation and study of people's opinions, attitudes and emotions towards an entity. The entity can represent individuals, events or topics. These topics are most likely to be covered by reviews. Public sentiments regarding any social issue can be analyzed easily using SA. In this paper, we have proposed an unsupervised sentiment classification system that comprehensively formulates phrases, computes their senti-scores (sentiment scores) and polarity using fuzzy

CRediT authorship contribution statement

Srishti Vashishtha: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. Seba Susan: Conceptualization, Methodology, Investigation, Writing - review & editing, Supervision, Project administration, Funding acquisition.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Srishti Vashishtha received a B.Tech degree in Information Technology from Maharaja Surajmal Institute of Technology, GGSIPU, Delhi, and an M.Tech degree in Computer Science from University School of Information, Communication and Technology, GGSIPU, Delhi.

She is currently pursuing a Ph.D. in the Information Technology Department at Delhi Technological University, Delhi. Her areas of interest include Data Mining, Natural Language Processing and Machine Learning.

References (40)

  • E. Cambria et al. Jumping NLP curves: A review of natural language processing research. IEEE Computational Intelligence Magazine (2014)
  • M.K. Dalal et al. Opinion mining from online user reviews using fuzzy linguistic hedges. Applied Computational Intelligence and Soft Computing (2014)
  • N. Gupta et al. Fuzzy sentiment analysis on microblogs for movie revenue prediction
  • S.S. Ho et al. Mining future spatiotemporal events and their sentiment from online news articles for location-aware recommendation system
  • F.C. Jonathan et al. Semi-supervised keyphrase extraction on scientific article using fact-based sentiment. Telkomnika (2018)
  • A. Kennedy et al. Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence (2006)
  • B. Liu. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies (2012)
  • E. Loper et al. NLTK: The natural language toolkit
  • MacQueen, J. (1967, June). Some methods for classification and analysis of multivariate observations. In Proceedings of...
  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space....

Seba Susan received her Ph.D. from the Indian Institute of Technology, Delhi in 2014. She is currently an Associate Professor in the Department of Information Technology, Delhi Technological University, Delhi. Her current research interests include statistical inferencing and soft computing tools for Pattern Recognition and Machine Learning.