A Biomedically oriented automatically annotated Twitter COVID-19 Dataset, arXiv - CS - Information Retrieval


arXiv - CS - Information Retrieval, Pub Date: 2021-07-27, DOI: arxiv-2107.12565
Luis Alberto Robles Hernandez, Tiffany J. Callahan, Juan M. Banda

The use of social media data, such as Twitter, for biomedical research has gradually increased over the years. With the COVID-19 pandemic, researchers have turned to more nontraditional sources of clinical data to characterize the disease in near real-time, to study the societal implications of interventions, and to examine the sequelae that recovered COVID-19 patients present (Long-COVID). However, manually curated social media datasets are difficult to come by because manual annotation is expensive and identifying the correct texts takes considerable effort. When datasets are available, they are usually very small, and their annotations do not generalize well over time or to larger sets of documents. As part of the 2021 Biomedical Linked Annotation Hackathon, we release our dataset of over 120 million automatically annotated tweets for biomedical research purposes. Incorporating best practices, we identify tweets with potentially high clinical relevance. We evaluated our work by comparing several SpaCy-based annotation frameworks against a manually annotated gold-standard dataset. After selecting the best-performing method for automatic annotation, we annotated 120 million tweets and released them publicly for downstream use within the biomedical domain.
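The abstract does not say which metrics were used to compare the annotation frameworks against the gold standard. As a minimal sketch of one common approach, the following computes exact-match precision, recall, and F1 over annotation spans. The span format `(tweet_id, start, end, label)`, the labels `SYMPTOM` and `DRUG`, and the function name `score_annotations` are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Iterable, NamedTuple, Set, Tuple

# Hypothetical annotation format: (tweet_id, start_char, end_char, label).
Span = Tuple[str, int, int, str]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def score_annotations(predicted: Iterable[Span], gold: Iterable[Span]) -> Scores:
    """Exact-match scoring: a predicted span counts as correct only if the
    tweet id, character offsets, and label all match a gold span."""
    pred_set: Set[Span] = set(predicted)
    gold_set: Set[Span] = set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Toy data (invented for illustration, not from the released dataset).
gold = [
    ("t1", 0, 5, "SYMPTOM"),
    ("t1", 10, 17, "DRUG"),
    ("t2", 3, 9, "SYMPTOM"),
]
predicted = [
    ("t1", 0, 5, "SYMPTOM"),  # matches a gold span exactly
    ("t2", 3, 9, "DRUG"),     # right offsets, wrong label: counted as an error
]
scores = score_annotations(predicted, gold)
```

Running the same function once per candidate framework against the same gold set would give directly comparable scores. Exact matching is strict; a relaxed variant that credits overlapping spans is another reasonable choice.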


Updated: 2021-07-28
