Sentiment Analysis in Drug Reviews using Supervised Machine Learning Algorithms,arXiv - CS - Computation and Language


Sentiment Analysis in Drug Reviews using Supervised Machine Learning Algorithms
arXiv - CS - Computation and Language. Pub Date: 2020-03-21, DOI: arxiv-2003.11643
Sairamvinay Vijayaraghavan, Debraj Basu

Sentiment analysis is an important task in Natural Language Processing that detects the sentiment expressed in a piece of text. In our project, we chose to analyze reviews of various drugs; each review is a piece of text accompanied by a rating on a scale from 1 to 10. We obtained this data set from the UCI Machine Learning Repository, which provides two data sets: train and test (split 75%/25%). We grouped the numeric drug ratings into three classes: positive (7-10), negative (1-4), or neutral (4-7). There are multiple reviews for drugs associated with a similar condition, and we decided to investigate how the different words used in reviews for different conditions affect the ratings of the drugs. Our main intention was to implement supervised machine learning classification algorithms that predict the class of the rating from the textual review. We primarily implemented different embeddings, such as Term Frequency-Inverse Document Frequency (TFIDF) and Count Vectors (CV). We trained models on the most popular conditions in the data set, such as "Birth Control", "Depression" and "Pain", and obtained good results when predicting on the test data sets.
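The labeling and vectorization steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names, the naive tokenizer, the toy reviews, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration (libraries typically use smoothed TF-IDF variants). Because the stated ranges overlap at 4 and 7, the sketch assumes a rating of 7 counts as positive and a rating of 4 counts as negative, which may differ from the authors' choice.

```python
import math
from collections import Counter


def rating_to_class(rating):
    """Map a 1-10 rating to a sentiment class.

    Assumption: the abstract's ranges overlap at 4 and 7; here 7 is treated
    as positive and 4 as negative. The authors' boundaries may differ.
    """
    if rating >= 7:
        return "positive"
    if rating <= 4:
        return "negative"
    return "neutral"


def tokenize(text):
    # Naive tokenizer (assumption): lowercase, split on whitespace,
    # strip surrounding punctuation, drop empty tokens.
    stripped = (word.strip(".,!?\"'()") for word in text.lower().split())
    return [word for word in stripped if word]


def count_vectors(docs):
    """Return (vocabulary, raw term-count vector per document)."""
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocab])
    return vocab, vectors


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) using unsmoothed tf * log(N / df)."""
    vocab, counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for vec in counts if vec[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / d) for d in df]
    return vocab, [[c * w for c, w in zip(vec, idf)] for vec in counts]


# Toy data, invented for illustration only.
reviews = [
    "This pill worked great, no side effects.",
    "Terrible side effects, stopped after a week.",
    "It worked, but the side effects were annoying.",
]
ratings = [9, 2, 5]

labels = [rating_to_class(r) for r in ratings]
vocab, cv = count_vectors(reviews)
_, tfidf = tfidf_vectors(reviews)
```

The resulting vectors and labels could then be passed to any supervised classifier. Note that a term appearing in every review (such as "side" in the toy data) receives zero weight under this unsmoothed TF-IDF, while the count vector still records it.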


Updated: 2020-03-27
