A large-scale Twitter dataset for drug safety applications mined from publicly existing resources
arXiv - CS - Social and Information Networks Pub Date : 2020-03-31 , DOI: arxiv-2003.13900
Ramya Tekumalla and Juan M. Banda

With the growing popularity of deep learning models for natural language processing (NLP) tasks, the field of pharmacovigilance, and more specifically the identification of Adverse Drug Reactions (ADRs), has an inherent need for large-scale social-media datasets aimed at such tasks. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets and then annotate them manually; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we re-purpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, with the end result of a publicly available dataset of 1,181,993 tweets. We provide all code and a detailed procedure for extracting this dataset, along with the selected tweet IDs, for researchers to use.
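
The abstract does not spell out how the archive is filtered, so the following is only a minimal sketch of one plausible first pass: keeping tweets whose text mentions a term from a drug dictionary. The term list `DRUG_TERMS`, the function name `filter_drug_tweets`, and the assumption that tweets arrive as JSON lines with `id_str` and `text` fields are all illustrative and are not taken from the paper.

```python
import json
import re

# Hypothetical drug-term list; the paper's actual dictionary is not given in the abstract.
DRUG_TERMS = ["ibuprofen", "metformin", "xanax"]

# One case-insensitive pattern with word boundaries, so a term only matches as a whole word.
DRUG_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, DRUG_TERMS)) + r")\b",
    re.IGNORECASE,
)


def filter_drug_tweets(lines):
    """Yield (tweet_id, matched_terms) for each JSON line whose text mentions a listed term."""
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of aborting a long run
        if not isinstance(tweet, dict):
            continue
        text = tweet.get("text") or ""
        # Deduplicate and lowercase the matched terms for a stable result.
        matches = sorted({m.lower() for m in DRUG_PATTERN.findall(text)})
        if matches:
            yield tweet.get("id_str"), matches


if __name__ == "__main__":
    sample = [
        '{"id_str": "1", "text": "Took Ibuprofen and felt dizzy"}',
        '{"id_str": "2", "text": "Nice weather today"}',
    ]
    for tweet_id, terms in filter_drug_tweets(sample):
        print(tweet_id, terms)
```

A keyword pass like this only narrows the archive; the relevance validation with machine learning methods that the abstract mentions would be a separate step applied to its output.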

Updated: 2020-04-01