Identification of Sarcasm in Textual Data: A Comparative Study
Journal of Data and Information Science (IF 1.5), Pub Date: 2019-12-27, DOI: 10.2478/jdis-2019-0021
Pulkit Mehndiratta, Devpriya Soni

Abstract

Purpose: The ever-increasing penetration of the Internet into our lives has led to an enormous amount of multimedia content being generated online, and textual data accounts for a major share of the data generated on the World Wide Web. Understanding people's sentiment is an important aspect of natural language processing, but the inferred opinion can be biased or incorrect if people use sarcasm when commenting, posting status updates, or reviewing a product or a movie. It is therefore of utmost importance to detect sarcasm correctly and to predict people's intentions accurately.

Design/methodology/approach: This study evaluates various machine learning models, along with standard and hybrid deep learning models, across several standardized datasets. Text is vectorized using word embedding techniques, which convert the textual data into vectors for analysis. We used three standardized, publicly available datasets and three word embeddings, i.e., Word2Vec, GloVe, and fastText, to validate the hypothesis.

Findings: The results were analyzed and conclusions drawn. The key finding is that hybrid models that include Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) outperform conventional machine learning models as well as other deep learning models across all the datasets considered in this study, which validates our hypothesis.

Research limitations: Using data from different sources and customizing the models for each dataset slightly decreases the usability of the technique. Overall, however, the methodology provides an effective means of identifying the presence of sarcasm, with a minimum average accuracy of 80% or above for one dataset and results better than the current baselines for the other datasets.

Practical implications: The results give system developers solid insights for integrating this model into real-time analysis of any review or comment posted in the public domain. The study has various other practical implications for businesses that depend on user ratings and public opinion. It also provides a launching platform for researchers working on the problem of sarcasm identification in textual data.

Originality/value: This is a first-of-its-kind study to show the difference between conventional and hybrid methods of predicting sarcasm in textual data. The study also provides possible indicators that hybrid models perform better when applied to textual data for sarcasm analysis.
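
The vectorization step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny embedding table, the tokenizer, and the function names `tokenize`, `embed_sequence`, and `mean_vector` are all assumptions made for illustration. In practice the table would be replaced by pretrained Word2Vec, GloVe, or fastText vectors. The sketch shows two common ways of turning a sentence into numbers: a padded sequence of word vectors (the usual input shape for sequence models such as a Bi-LSTM or CNN) and a single averaged vector (a common input for conventional classifiers).

```python
from typing import Dict, List

# Hypothetical 3-dimensional embedding table (an assumption for illustration).
# Real Word2Vec, GloVe, or fastText vectors are much larger and loaded from disk.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "great": [1.0, 0.0, 2.0],
    "another": [0.0, 1.0, 0.0],
    "monday": [2.0, 2.0, 1.0],
}
DIM = 3


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?;:\"'") for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def embed_sequence(text: str, table: Dict[str, List[float]],
                   max_len: int, dim: int) -> List[List[float]]:
    """Map each token to its vector, then truncate or zero-pad to max_len.

    Out-of-vocabulary tokens are mapped to the zero vector.
    """
    zero = [0.0] * dim
    vectors = [list(table.get(tok, zero)) for tok in tokenize(text)][:max_len]
    vectors += [list(zero) for _ in range(max_len - len(vectors))]
    return vectors


def mean_vector(text: str, table: Dict[str, List[float]],
                dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens into one fixed-size vector."""
    known = [table[tok] for tok in tokenize(text) if tok in table]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


if __name__ == "__main__":
    sentence = "Great, another Monday!"
    print(embed_sequence(sentence, TOY_EMBEDDINGS, max_len=4, dim=DIM))
    print(mean_vector(sentence, TOY_EMBEDDINGS, dim=DIM))
```

Mapping unknown words to the zero vector and padding to a fixed length are simplifying choices for this sketch; the abstract does not state how the study handled either case.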

Updated: 2019-12-27