Is Multiclass Automatic Text De-Identification Worth the Effort?
Methods of Information in Medicine (IF 1.7), Pub Date: 2018-09-01, DOI: 10.3414/me18-01-0017
Duy Duc An Bui, David T. Redden, James J. Cimino

OBJECTIVES Automatic de-identification, which removes protected health information (PHI) from clinical text, can use a "binary" model that replaces redacted text with a generic tag (e.g., ""), or a "multiclass" model that retains more class information (e.g., ""). Binary models are easier to develop but produce text that is potentially less informative. We investigated whether building a multiclass de-identification model is worth the extra effort.

METHODS Using the 2014 i2b2 dataset, we compared the two kinds of model on accuracy and on their impact on document readability. In the first experiment, we generated one binary and two multiclass versions, all trained with the same machine-learning algorithm, Conditional Random Field (CRF). We measured accuracy (recall, precision, f-score) and secondary metrics (e.g., training time, testing time, minimum memory required). In the second experiment, three reviewers assessed the readability of two redacted documents produced with the binary and multiclass methods. We computed a pooled Kappa to estimate inter-rater agreement.

RESULTS The multiclass model did not demonstrate a clear accuracy advantage: recall was lower (-1.9%) and precision only slightly better (+0.6%), despite the additional computing resources it required. The three raters reached very high agreement (Kappa = 0.975, 95% Confidence Interval (0.946, 1.00), p < 0.0001) that the binary and multiclass models have the same impact on document readability.

CONCLUSIONS This study suggests that developing a more sophisticated classification of PHI may not be worth the effort, in terms of both system accuracy and the usefulness of the output.
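
To make the binary versus multiclass distinction concrete, here is a minimal sketch of the replacement step only. It is not the authors' CRF system: the `redact` function, the `[PHI]`, `[NAME]`, and `[DATE]` tags, and the sample sentence are all hypothetical and chosen purely for illustration, and the PHI spans are assumed to have already been found by some upstream model.

```python
def redact(text, spans, multiclass):
    """Replace each PHI span with a tag.

    spans: list of (start, end, label) character offsets, assumed non-overlapping.
    multiclass=False uses one generic tag; multiclass=True keeps the class label.
    Tag formats here are illustrative assumptions, not those used in the study.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])          # untouched text before the span
        pieces.append(f"[{label}]" if multiclass else "[PHI]")
        cursor = end                               # skip over the redacted text
    pieces.append(text[cursor:])                   # trailing untouched text
    return "".join(pieces)


# Hypothetical example note with two PHI spans.
note = "John Smith was seen on 2014-03-02."
spans = [(0, 10, "NAME"), (23, 33, "DATE")]

binary_output = redact(note, spans, multiclass=False)
multiclass_output = redact(note, spans, multiclass=True)
print(binary_output)      # [PHI] was seen on [PHI].
print(multiclass_output)  # [NAME] was seen on [DATE].
```

Both outputs hide the same text; the multiclass version only differs in telling the reader what kind of information was removed, which is the extra information whose value the study questions.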

Updated: 2018-09-01