Elsevier

Applied Soft Computing

Volume 98, January 2021, 106935

An ensemble-based hotel recommender system using sentiment analysis and aspect categorization of hotel reviews

https://doi.org/10.1016/j.asoc.2020.106935

Highlights

  • Designed a Hotel Recommendation System based on online reviews in the English language.

  • Ensemble of BERT and Random Forest models for classifying sentiments on reviews.

  • Used textual features like Word2Vec embeddings and TF–IDF scores.

  • Categorized the reviews based on aspects using Fuzzy logic and Cosine similarity.

  • Prepared a sentiment-tagged dataset of hotel reviews from Tripadvisor.

Abstract

Finding a suitable hotel that matches a user’s needs and budget is a complex decision-making process. The large number of online reviews now posted by customers can help in this regard. This motivates a promising research direction in tourism, the hotel recommendation system, which also helps consumers process information. Real-world reviews express different customer sentiments towards a hotel, and each review can be categorized by aspects such as cleanliness, value, and service. With this in mind, we propose a hotel recommendation system that uses Sentiment Analysis of hotel reviews and aspect-based review categorization, and that works on queries given by a user. Furthermore, we provide a new, rich, and diverse dataset of online hotel reviews crawled from Tripadvisor.com. We follow a systematic approach that first uses an ensemble of binary classifiers based on the Bidirectional Encoder Representations from Transformers (BERT) model, with three phases for positive–negative, neutral–negative, and neutral–positive sentiments, merged using a weight-assigning protocol. We then feed the pre-trained word embeddings generated by the BERT models, along with other textual features such as word vectors generated by Word2vec, TF–IDF scores of frequent words, and a subjectivity score, to a Random Forest classifier. After that, we group the reviews into different categories using an approach that involves fuzzy logic and cosine similarity. Finally, we build a recommender system from the aforementioned frameworks. Our model achieves a Macro F1-score of 84% and a test accuracy of 92.36% in the classification of sentiment polarities. The categorized reviews also form compact clusters. The results are promising and considerably better than those of state-of-the-art models.
The relevant codes and notebooks can be found here.
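The weight-based merging of the three pairwise sentiment models can be pictured with a small sketch. This is only an illustration under stated assumptions: the abstract does not give the weight-assigning protocol, so the weighted-vote rule, the function names `merge_pairwise` and `predict`, and the example probabilities and weights below are all hypothetical, and the BERT models themselves are replaced by fixed numbers.

```python
from typing import Dict, Tuple

LABELS = ("positive", "neutral", "negative")
Pair = Tuple[str, str]


def merge_pairwise(pair_probs: Dict[Pair, float],
                   weights: Dict[Pair, float]) -> Dict[str, float]:
    """Combine pairwise binary outputs into one normalized score per class.

    pair_probs[(a, b)] is the probability that a review belongs to class `a`
    rather than class `b`, as a binary model for that pair might output it.
    weights[(a, b)] is how much that model's vote counts (assumed values).
    """
    scores = {label: 0.0 for label in LABELS}
    for (a, b), p in pair_probs.items():
        w = weights[(a, b)]
        scores[a] += w * p          # weighted vote for the first class
        scores[b] += w * (1.0 - p)  # remaining mass goes to the second class
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def predict(pair_probs: Dict[Pair, float], weights: Dict[Pair, float]) -> str:
    """Return the class with the highest merged score."""
    scores = merge_pairwise(pair_probs, weights)
    return max(scores, key=scores.get)


# Hypothetical outputs of the three binary models for one review.
example_probs = {
    ("positive", "negative"): 0.90,
    ("neutral", "negative"): 0.70,
    ("neutral", "positive"): 0.30,
}
# Hypothetical weights; the paper's actual protocol is not reproduced here.
example_weights = {
    ("positive", "negative"): 0.50,
    ("neutral", "negative"): 0.25,
    ("neutral", "positive"): 0.25,
}

if __name__ == "__main__":
    print(merge_pairwise(example_probs, example_weights))
    print(predict(example_probs, example_weights))
```

In this toy case the positive class collects 0.625 of the weighted vote, neutral 0.25, and negative 0.125, so the merged prediction is positive. Any scheme that maps three pairwise decisions to one three-way label could be substituted for the weighted vote.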

Introduction

The arrival of the second generation of the World Wide Web, Web 2.0, and the exponential growth of social networks, enterprises, and individual users have led to a sharp increase in the use of content available on the web, which in turn helps us make well-informed judgments. Information processing from web resources also opens up many new research domains. For example, tourists check the past experiences and opinions of other travelers, available on different web platforms, when planning their own vacations. This rich and diverse publicly available data can be used by large tourism organizations as part of their field market research, which may range from carrying out polls to running focus groups of prospective customers for future endeavors. But the diversity of opinions in user-provided text adds complexity, since processing such a huge volume of data is next to impossible for humans. To this end, computer scientists provide data-mining tools and algorithms that help users extract relevant information from vast amounts of data. Considering the similar problem of creating a recommendation framework from opinionated texts available on the internet, this work focuses on developing a Hotel Recommendation System that can help both the tourism industry and individuals looking for hotels. In doing so, we use some advanced and comparatively new algorithms from the domain of textual data mining. A Hotel Recommendation System based on Sentiment Analysis of reviews is a very new research topic that has attracted researchers because of its wide application in the hotel industry and tourism. It has multiple aspects; for instance, a review may talk about different categories such as the location, rooms, and staff of a hotel. Several factors add complexity to studying such data and retrieving useful information from it.
For example, the requirements of one user differ vastly from those of another. The writing pattern of each reviewer also differs: for the same hotel, different customers may give feedback in completely different ways. The priority given to each aspect also varies a lot at a personal level. Some of us prefer food over hotel location, while others are willing to pay extra for a window view. Preferences vary with both gender and age. A millennial may prefer entertainment options and spacious rooms, whereas an older person may prefer better room service and clean rooms. Reviews are written accordingly. Another important factor is the size of the review set. Customers are keen on category-wise, personalized information found in the reviews and often use it as a basis for decision-making. We form an ensemble of transfer-learning models using BERT and a Random Forest classifier on different textual features of the reviews to classify the sentiments of the hotel reviews.
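One of the textual features named in the abstract is the TF–IDF score of frequent words. As a rough illustration of what such a score measures, the sketch below computes an unsmoothed variant, term frequency times ln(N / document frequency), over three toy tokenized reviews. The exact weighting used by the authors is not stated in this excerpt, so the formula choice, the function name `tf_idf`, and the sample reviews are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {term: score} mapping per tokenized document.

    Score = (count of term in doc / doc length) * ln(N / docs containing term).
    This is the plain, unsmoothed textbook form (an assumed variant).
    """
    n_docs = len(docs)
    doc_freq: Counter = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    scored = []
    for tokens in docs:
        counts = Counter(tokens)
        length = len(tokens)
        scored.append({
            term: (count / length) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scored


# Toy reviews, already lowercased and tokenized (illustrative data only).
reviews = [
    ["hotel", "clean", "room"],
    ["hotel", "dirty", "room"],
    ["hotel", "great", "view"],
]

if __name__ == "__main__":
    for review, scores in zip(reviews, tf_idf(reviews)):
        print(review, scores)
```

A word that appears in every review, such as "hotel" here, scores zero because it does not help tell reviews apart, while a word unique to one review, such as "view", scores highest.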

In this work, we focus on the Sentiment Analysis of reviews written by online consumers and crawled from the Tripadvisor website. We then group the data into predefined categories. These categories, such as ‘Location’, ‘Cleanliness’, and ‘Service’, are aspects that recur frequently in the review data; in real-world reviews these topics often overlap with each other. The remainder of the paper is organized as follows. Section 2 provides a literature survey of existing work on this topic along with a brief description of its performance. Section 3 discusses our motivation and briefly describes our contributions. Section 4 describes the datasets on which the proposed framework has been evaluated. Section 5 describes the methodology followed in designing our architecture. Section 6 presents a detailed performance analysis. Finally, concluding remarks are given in Section 7.
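The grouping of reviews into overlapping aspect categories, mentioned at the start of this passage, can be illustrated with a simplified sketch. It treats the cosine similarity between a review and a small keyword list per aspect as a graded, fuzzy-style membership in [0, 1], and assigns every aspect whose membership clears a threshold. This is not the authors' procedure: the seed keywords in `ASPECT_SEEDS`, the bag-of-words vectors (used here instead of learned embeddings), the threshold, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical seed words per aspect; not taken from the paper.
ASPECT_SEEDS: Dict[str, List[str]] = {
    "Location": ["location", "station", "walk", "near", "downtown"],
    "Cleanliness": ["clean", "dirty", "spotless", "dust", "smell"],
    "Service": ["staff", "service", "reception", "friendly", "helpful"],
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aspect_memberships(tokens: List[str]) -> Dict[str, float]:
    """Degree of membership of a tokenized review in each aspect."""
    review_vec = Counter(tokens)
    return {
        aspect: cosine(review_vec, Counter(seeds))
        for aspect, seeds in ASPECT_SEEDS.items()
    }


def assign_aspects(tokens: List[str], threshold: float = 0.1) -> List[str]:
    """Return every aspect whose membership reaches the (assumed) threshold."""
    memberships = aspect_memberships(tokens)
    return sorted(a for a, m in memberships.items() if m >= threshold)


if __name__ == "__main__":
    review = "the staff were friendly and the room was clean".split()
    print(aspect_memberships(review))
    print(assign_aspects(review))
```

Because memberships are graded rather than exclusive, the sample review lands in both ‘Cleanliness’ and ‘Service’, which reflects the overlap between topics noted above.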

Section snippets

Literature survey

Sentiment analysis is an area of natural language processing (NLP), computational linguistics, and text mining that aims to determine the emotions, personality, etc. of a writer with respect to specific topics. In recent years, many researchers have proposed sentiment analysis models for topics in tourism, finance, social media, etc. Much research has analyzed sentiment in the financial domain [1], [2], [3], [4]. In 2020 Zhao et al. [5] proposed a

Motivation and contributions

One of the major problems researchers face in designing a hotel recommender system is the lack of a properly labeled dataset. There are very few datasets containing hotel reviews, and even those cannot be used for sentiment analysis or categorization tasks: they contain no sentiment labels, which are required to train a model for the sentiment analysis task. This brings in the necessity for preparing a suitably labeled dataset containing the

Data crawling

High-quality datasets for hotel recommender systems are not publicly available. Suitability for Sentiment Analysis and the availability of properly balanced data are particular challenges. That is why, in this work, we were motivated to develop our own dataset based on our requirements. A new dataset is also a valuable resource for the research community. The data was crawled using the Tripadvisor API. The website Tripadvisor has a huge database of
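The results section names urllib, socket, and contextlib as the libraries used for crawling. A minimal sketch of a page fetch built on those standard-library modules might look like the following. It does not reproduce the authors' crawler or any Tripadvisor endpoint: the function names `fetch_page` and `fetch_many`, the timeout value, and the error-handling policy are assumptions, and the URLs are supplied by the caller.

```python
import socket
import urllib.error
import urllib.request
from contextlib import closing
from typing import Dict, Iterable, Optional


def fetch_page(url: str, timeout: float = 10.0, encoding: str = "utf-8") -> str:
    """Download one resource and return its body as text.

    The timeout keeps a single unresponsive server from stalling the crawl;
    `closing` guarantees the connection is released even if reading fails.
    """
    with closing(urllib.request.urlopen(url, timeout=timeout)) as response:
        return response.read().decode(encoding, errors="replace")


def fetch_many(urls: Iterable[str], timeout: float = 10.0) -> Dict[str, Optional[str]]:
    """Fetch several URLs, recording None for any that fail (assumed policy)."""
    pages: Dict[str, Optional[str]] = {}
    for url in urls:
        try:
            pages[url] = fetch_page(url, timeout=timeout)
        except (urllib.error.URLError, socket.timeout, ValueError):
            # URLError: unreachable or missing resource; socket.timeout: slow
            # server; ValueError: malformed URL string.
            pages[url] = None
    return pages


if __name__ == "__main__":
    # A data: URL keeps the demonstration offline; a real crawl would pass
    # page URLs it is permitted to access.
    print(fetch_page("data:text/plain,hello%20world"))
```

A real crawl would also need rate limiting and would have to respect the site's terms of use, neither of which is shown here.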

Pre-processing of review text data

Data pre-processing is carried out after the data has been collected. The review texts written by reviewers consist of various types of words in different forms. Not all of these words and forms are significant for classification purposes. So, to generalize the data and prevent unnecessary use of computational resources, these texts need to be pre-processed. The text pre-processing tasks carried out for the Sentiment Analysis classification include:

Results and analysis

In this work, we have proposed a hotel recommendation system based on Sentiment Analysis and categorization of hotel reviews. We prepared our own dataset by crawling data using the Tripadvisor API. The crawled dataset consists of 58,612 reviews. The results for both Sentiment Analysis and review categorization are discussed in detail. The libraries used for data crawling are urllib, socket, and contextlib. The gensim [34] library’s efficient Word2Vec
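The abstract reports both a Macro F1-score and a test accuracy for sentiment classification. The two can differ noticeably when classes are imbalanced, because Macro F1 averages the per-class F1 scores with equal weight. The sketch below computes both on a small made-up label set; the data and the function names are hypothetical and unrelated to the paper's results.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str],
                 labels: Sequence[str]) -> Dict[str, float]:
    """F1 for each label, computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
             labels: Sequence[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


LABELS = ("pos", "neu", "neg")
# Made-up, imbalanced toy data: the single neutral review is misclassified.
toy_true = ["pos"] * 8 + ["neu", "neg"]
toy_pred = ["pos"] * 8 + ["pos", "neg"]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_true, toy_pred))
    print("macro F1:", macro_f1(toy_true, toy_pred, LABELS))
```

On this toy data nine of ten predictions are correct, yet missing the only neutral review gives that class an F1 of zero and pulls the macro average well below the accuracy.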

Conclusion

Helping a user choose a suitable hotel, based on his/her requirements and budget, from online hotel reviews written by customers gives rise to an interesting research field called the hotel recommendation system. This lets customers make optimal travel decisions based on the input query. In this work, we have presented a novel approach for a user-query-based recommendation system which gives hotels and reviews corresponding to them if required as output as per the user

CRediT authorship contribution statement

Biswarup Ray: Conceptualization, Data curation, Formal analysis, Resources, Software, Methodology, Writing - original draft. Avishek Garain: Conceptualization, Data curation, Formal analysis, Resources, Software, Methodology, Writing - original draft. Ram Sarkar: Methodology, Supervision, Writing - original draft.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (44)

  • Devitt A. et al.
    Sentiment polarity identification in financial news: A cohesion-based approach
    (2007)
  • Li X. et al.
    News impact on stock price return via sentiment analysis
    Knowl.-Based Syst.
    (2014)
  • Hiew J.Z.G. et al.
    Bert-based financial sentiment index and LSTM-based stock return predictability
    (2019)
  • Zhao L. et al.
    A bert based sentiment analysis and key entity detection approach for online financial texts
    (2020)
  • Bhat M. et al.
    Sentiment analysis of social media response on the covid19 outbreak
    Brain Behav. Immun.
    (2020)
  • Manguri K. et al.
    Twitter sentiment analysis on worldwide COVID-19 outbreaks
    Kurdistan J. Appl. Res.
    (2020)
  • Ruz G. et al.
    Sentiment analysis of Twitter data during critical events through Bayesian networks classifiers
    Future Gener. Comput. Syst.
    (2020)
  • Amplayo R.K. et al.
    An adaptable fine-grained sentiment analysis for summarization of multiple short online reviews
    Data Knowl. Eng.
    (2017)
  • Abdi A. et al.
    Machine learning-based multi-documents sentiment-oriented summarization using linguistic treatment
    Expert Syst. Appl.
    (2018)
  • Ghosh D.
    A sentiment-based hotel review summarization
    (2020)
  • Mostafa L.
    Machine learning-based sentiment analysis for analyzing the travelers reviews on Egyptian hotels
    (2020)
  • Kasper W. et al.
    Sentiment analysis for hotel reviews
    (2011)