RedMed: Extending drug lexicons for social media applications


Journal of Biomedical Informatics (IF 4.0), Pub Date: 2019-10-15, DOI: 10.1016/j.jbi.2019.103307
Adam Lavertu, Russ B Altman


Social media has been identified as a promising source of information for pharmacovigilance, but its adoption has been hindered by the massive and noisy nature of the data. Initial attempts to use social media data have relied on exact text matches to drugs of interest and therefore suffer from the gap between formal drug lexicons and the informal language of social media. The Reddit comment archive represents an ideal corpus for bridging this gap. We trained a word embedding model, RedMed, to facilitate the identification and retrieval of health entities from Reddit data. We compare the performance of our model, trained on a consumer-generated corpus, against publicly available models trained on expert-generated corpora. Our automated classification pipeline achieves an accuracy of 0.88 and a specificity of >0.9 across four different term classes. Of all drug mentions, an average of 79% (±0.5%) were exact matches to a generic or trademark drug name, 14% (±0.5%) were misspellings, 6.4% (±0.3%) were synonyms, and 0.13% (±0.05%) were pill marks. Our system thus captures an additional 20% of mentions that would have been missed by approaches relying solely on exact string matches. We provide a lexicon of misspellings and synonyms for 2978 drugs and a word embedding model trained on a health-oriented subset of Reddit.
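To make the gap between exact matching and an extended lexicon concrete, the following is a minimal Python sketch. It is not the authors' pipeline: the `LEXICON` entries, the `find_mentions` and `recovered_fraction` helpers, and the sample comments are all hypothetical and chosen only for illustration. The sketch tags each drug mention by match type and reports the share of mentions that an exact-match-only approach would miss. In the abstract, the corresponding share is roughly 14% + 6.4% + 0.13% ≈ 20.5%, in line with the reported additional 20%.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon: surface form -> (canonical drug, match type).
# Entries are illustrative only and are NOT taken from the RedMed lexicon.
LEXICON = {
    "ibuprofen": ("ibuprofen", "exact"),
    "advil": ("ibuprofen", "exact"),
    "ibuprofin": ("ibuprofen", "misspelling"),
    "alprazolam": ("alprazolam", "exact"),
    "xanax": ("alprazolam", "exact"),
    "xannax": ("alprazolam", "misspelling"),
    "xans": ("alprazolam", "synonym"),
}

# Simple lowercase alphanumeric tokenizer (an assumption, not the paper's).
TOKEN_RE = re.compile(r"[a-z0-9]+")


def find_mentions(text, lexicon=LEXICON):
    """Return (token, canonical drug, match type) for each lexicon hit."""
    tokens = TOKEN_RE.findall(text.lower())
    return [(tok, *lexicon[tok]) for tok in tokens if tok in lexicon]


def recovered_fraction(mentions):
    """Fraction of mentions that an exact-match-only approach would miss."""
    counts = Counter(kind for _, _, kind in mentions)
    total = sum(counts.values())
    return 0.0 if total == 0 else 1 - counts["exact"] / total


# Made-up comments standing in for social media text.
comments = [
    "Took some ibuprofin for the headache, Advil usually works.",
    "My doctor switched me from Xanax, but people here call them xans.",
]
mentions = [m for c in comments for m in find_mentions(c)]

for token, drug, kind in mentions:
    print(f"{token!r} -> {drug} ({kind})")
print("share missed by exact matching:", recovered_fraction(mentions))
```

In this toy input, two of the four mentions are a misspelling or a synonym, so an exact-match lexicon would find only half of them. A real system would also need to handle multi-word terms and ambiguous slang, which this sketch ignores.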


Updated: 2019-10-15
