Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes
Biometrics (IF 1.9), Pub Date: 2020-07-22, DOI: 10.1111/biom.13338
Brian L Egleston 1 , Tian Bai 2 , Richard J Bleicher 3 , Stanford J Taylor 4 , Michael H Lutz 4 , Slobodan Vucetic 2

The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed and popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, and document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to show how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering, which arises because each patient's unique health history produces patterns of repeated words within that patient's health record. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from the free text notes of electronic health records. Specifically, we use our methods to distinguish patients with and without type 2 diabetes mellitus in electronic health record free text data, using over 400 000 clinical notes from an academic medical center.
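
As a rough illustration of the quantities the abstract refers to, the sketch below computes a document-level PMI, log[P(w1, w2) / (P(w1) P(w2))], and a standard error that respects within-patient clustering. It is not the authors' estimator: the abstract does not specify their model, so this sketch assumes that co-occurrence means "both words appear in the same note" and substitutes a patient-level cluster bootstrap for the standard error. The function names `pmi` and `cluster_bootstrap_se`, the `(patient_id, tokens)` input layout, and the toy notes are all hypothetical.

```python
import math
import random
from collections import defaultdict


def pmi(docs, w1, w2):
    """Document-level PMI of two words.

    docs is a list of sets of tokens (one set per note). Assumption:
    probabilities are estimated as the share of notes containing the word(s).
    Returns NaN when any required count is zero.
    """
    n = len(docs)
    n1 = sum(1 for d in docs if w1 in d)
    n2 = sum(1 for d in docs if w2 in d)
    n12 = sum(1 for d in docs if w1 in d and w2 in d)
    if n == 0 or n1 == 0 or n2 == 0 or n12 == 0:
        return float("nan")
    return math.log((n12 / n) / ((n1 / n) * (n2 / n)))


def cluster_bootstrap_se(notes, w1, w2, n_boot=500, seed=0):
    """Bootstrap standard error of the PMI, resampling whole patients.

    notes is a list of (patient_id, tokens) pairs. Resampling patients rather
    than individual notes keeps each patient's notes together, so repeated
    wording within one record is not treated as independent evidence.
    """
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for patient_id, tokens in notes:
        by_patient[patient_id].append(set(tokens))
    patients = list(by_patient)

    estimates = []
    for _ in range(n_boot):
        drawn = rng.choices(patients, k=len(patients))  # with replacement
        sample = [doc for p in drawn for doc in by_patient[p]]
        est = pmi(sample, w1, w2)
        if not math.isnan(est):  # skip resamples where the PMI is undefined
            estimates.append(est)

    if len(estimates) < 2:
        return float("nan")
    mean = sum(estimates) / len(estimates)
    var = sum((e - mean) ** 2 for e in estimates) / (len(estimates) - 1)
    return math.sqrt(var)


# Toy, made-up notes: patient 1 has two notes that repeat the same wording.
toy_notes = [
    (1, ["metformin", "glucose"]),
    (1, ["metformin", "glucose"]),
    (2, ["fracture"]),
    (3, ["cough"]),
]
toy_docs = [set(tokens) for _, tokens in toy_notes]
point_estimate = pmi(toy_docs, "metformin", "glucose")
standard_error = cluster_bootstrap_se(toy_notes, "metformin", "glucose", n_boot=200)
```

Resampling at the patient level is one common way to handle clustered data; a model-based alternative such as a regression with cluster-robust variance would serve the same purpose, and the paper itself should be consulted for the approach the authors actually use.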

Updated: 2020-07-22