Utilizing Deep Learning to Identify Drug Use on Twitter Data
arXiv - CS - Computation and Language Pub Date : 2020-03-08 , DOI: arxiv-2003.11522
Joseph Tassone, Peizhi Yan, Mackenzie Simpson, Chetan Mendhe, Vijay Mago, Salimur Choudhury

The collection and examination of social media data has become a useful mechanism for studying the mental activity and behavioral tendencies of users. Through the analysis of collected Twitter data, models were developed for classifying drug-related tweets. A set of candidate tweets was gathered using topic-related keywords, such as slang terms and methods of drug consumption. These candidates were then preprocessed, resulting in a dataset of 3,696,150 rows. The classification power of multiple methods was compared, including support vector machines (SVM), XGBoost, and convolutional neural network (CNN) based classifiers. Rather than simple feature or attribute analysis, a deep learning approach was implemented to screen and analyze the tweets' semantic meaning. The two CNN-based classifiers produced the best results when compared against the other methods. The first was trained with 2,661 manually labeled samples, while the second also included synthetically generated tweets, for a total of 12,142 samples. Their accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91, respectively. Additionally, association rule mining showed that commonly mentioned drugs had a level of correspondence with frequently used illicit substances, supporting the practical usefulness of the system. Lastly, the synthetically generated set provided increased scores, improving the classification capability and demonstrating the worth of this methodology.
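The abstract mentions association rule mining over drug mentions without giving details. As a rough illustration only, the sketch below computes support and confidence for pairs of co-mentioned keywords in plain Python. The keyword list, the sample tweets, the `min_support` threshold, and the function names are all hypothetical and are not taken from the paper, whose actual pipeline may differ.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword list; the paper's actual keyword set is not given here.
KEYWORDS = {"weed", "xanax", "cocaine", "vape"}


def mentioned_terms(tweet: str) -> frozenset:
    """Return the keywords that appear as whole tokens in a tweet."""
    tokens = {tok.strip(".,!?#@").lower() for tok in tweet.split()}
    return frozenset(tokens & KEYWORDS)


def pair_rules(tweets, min_support=0.2):
    """Compute (support, confidence) for each ordered keyword pair A -> B.

    Only tweets mentioning at least one keyword count as transactions.
    """
    transactions = [t for t in map(mentioned_terms, tweets) if t]
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # confidence(A -> B) = count(A and B) / count(A)
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


# Made-up sample tweets for illustration only.
TWEETS = [
    "smoking weed and popping xanax tonight",
    "weed again lol",
    "xanax and weed, bad combo",
    "need a new vape",
    "nothing to see here",
]
RULES = pair_rules(TWEETS)
for (lhs, rhs), (support, confidence) in sorted(RULES.items()):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In a setting like the one described, the transactions would presumably come from tweets the classifier has flagged as drug-related rather than from a raw keyword match, and a dedicated library would likely replace this hand-rolled pair counting.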

Updated: 2020-03-26