An unsupervised and customizable misspelling generator for mining noisy health-related text sources.

Journal of Biomedical Informatics (IF 4.0), Pub Date: 2018-11-13, DOI: 10.1016/j.jbi.2018.11.007
Abeed Sarker, Graciela Gonzalez-Hernandez

BACKGROUND Data collection and extraction from noisy text sources such as social media typically rely on keyword-based searching/listening. However, health-related terms are often misspelled in such noisy text sources due to their complex morphology, resulting in the exclusion of relevant data for studies. In this paper, we present a customizable data-centric system that automatically generates common misspellings for complex health-related terms, which can improve the data collection process from noisy text sources.

MATERIALS AND METHODS The spelling variant generator relies on a dense vector model learned from large, unlabeled text, which is used to find semantically close terms to the original/seed keyword, followed by the filtering of terms that are lexically dissimilar beyond a given threshold. The process is executed recursively, converging when no new terms similar (lexically and semantically) to the seed keyword are found. The weighting of intra-word character sequence similarities allows further problem-specific customization of the system.

RESULTS On a dataset prepared for this study, our system outperforms the current state-of-the-art medication name variant generator with a best F1-score of 0.69 and F1/4-score of 0.78. Extrinsic evaluation of the system on a set of cancer-related terms demonstrated an increase of over 67% in retrieval rate from Twitter posts when the generated variants are included.

DISCUSSION Our proposed spelling variant generator has several advantages over past spelling variant generators: (i) it is capable of filtering out lexically similar but semantically dissimilar terms, (ii) the number of variants generated is low, as many low-frequency and ambiguous misspellings are filtered out, and (iii) the system is fully automatic, customizable and easily executable. While the base system is fully unsupervised, we show how supervision may be employed to adjust weights for task-specific customizations.

CONCLUSION The performance and relative simplicity of our proposed approach make it a much-needed spelling variant generation resource for health-related text mining from noisy sources. The source code for the system has been made publicly available for research.
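
The recursive procedure described under MATERIALS AND METHODS can be illustrated with a minimal sketch. This is not the authors' released code. The function names, the threshold values, and the toy two-dimensional vectors are hypothetical, and Python's difflib ratio stands in for whatever lexical similarity measure the system actually uses. The weighting of intra-word character sequences is omitted.

```python
from collections import deque
from difflib import SequenceMatcher
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def lexical_similarity(a, b):
    """Stand-in lexical measure in [0, 1] (difflib ratio; an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def generate_variants(seed, vectors, sem_threshold=0.9, lex_threshold=0.7):
    """Expand `seed` until no new term passes both similarity checks.

    `vectors` maps each vocabulary term to its dense vector. Both
    thresholds are illustrative defaults, not values from the paper.
    """
    if seed not in vectors:
        return set()
    variants = set()
    queue = deque([seed])
    while queue:  # converges once a pass adds no new terms
        term = queue.popleft()
        for candidate, vec in vectors.items():
            if candidate == seed or candidate in variants:
                continue
            # Semantic check: close to the term currently being expanded.
            if cosine(vectors[term], vec) < sem_threshold:
                continue
            # Lexical check: must still resemble the original seed keyword.
            if lexical_similarity(seed, candidate) < lex_threshold:
                continue
            variants.add(candidate)
            queue.append(candidate)  # expand the new variant in turn
    return variants


# Toy vectors invented purely for illustration.
TOY_VECTORS = {
    "chemotherapy": [1.0, 0.0],
    "chemoterapy": [0.95, 0.31],
    "chemotherpy": [0.75, 0.66],  # near "chemoterapy", farther from the seed
    "radiation": [0.98, 0.2],     # semantically close, lexically distant
    "chemistry": [0.0, 1.0],      # semantically distant in this toy space
}

print(sorted(generate_variants("chemotherapy", TOY_VECTORS)))
```

With this toy data, "chemotherpy" is reached only through the intermediate variant "chemoterapy", which is the point of the recursion. "radiation" passes the semantic check but fails the lexical one, and "chemistry" fails the semantic check.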

Updated: 2018-11-13
