Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples and multiple imputation, Annals of Applied Statistics


Annals of Applied Statistics (IF 1.3), Pub Date: 2020-06-29, DOI: 10.1214/20-aoas1343
Mark J Giganti, Pamela A Shaw, Guanhua Chen, Sally S Bebawy, Megan M Turner, Timothy R Sterling, Bryan E Shepherd

Data from electronic health records (EHR) are prone to errors, which are often correlated across multiple variables. The error structure is further complicated when analysis variables are derived as functions of two or more error-prone variables. Such errors can substantially impact estimates, yet we are unaware of methods that simultaneously account for errors in covariates and time-to-event outcomes. Using EHR data from 4217 patients, the hazard ratio for an AIDS-defining event associated with a 100 cells/mm³ increase in CD4 count at ART initiation was 0.74 (95% CI: 0.68–0.80) using unvalidated data and 0.60 (95% CI: 0.53–0.68) using fully validated data. Our goal is to obtain unbiased and efficient estimates after validating a random subset of records. We propose fitting discrete failure time models to the validated subsample and then multiply imputing values for unvalidated records. We demonstrate how this approach simultaneously addresses dependent errors in predictors, time-to-event outcomes, and inclusion criteria. Using the fully validated dataset as a gold standard, we compare the mean squared error of our estimates with those from the unvalidated dataset and the corresponding subsample-only dataset for various subsample sizes. With reasonably sized validated subsamples and appropriate imputation models, our approach improved estimation over both the naive analysis and the analysis using only the validation subsample.
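The abstract does not say how the estimates from the multiply imputed datasets are combined. A minimal sketch of the conventional choice, Rubin's rules, is given below, assuming each imputed dataset yields a log hazard ratio and its variance. The function name `pool_rubin` and all input numbers are hypothetical illustrations, not values from the paper.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool M completed-data estimates with Rubin's rules.

    Returns the pooled point estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)        # pooled point estimate
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (M - 1 denominator)
    total = within + (1 + 1 / m) * between
    return q_bar, total


# Hypothetical log hazard ratios and variances from M = 5 imputed datasets
# (made-up numbers for illustration only).
log_hrs = [-0.50, -0.48, -0.53, -0.51, -0.49]
variances = [0.004, 0.004, 0.005, 0.004, 0.004]

log_hr, total_var = pool_rubin(log_hrs, variances)
se = math.sqrt(total_var)

# Simplification: a normal approximation is used for the interval here;
# Rubin's rules proper use a t reference distribution.
hr = math.exp(log_hr)
ci = (math.exp(log_hr - 1.96 * se), math.exp(log_hr + 1.96 * se))
print(f"pooled HR = {hr:.3f}, approx. 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The between-imputation term is what separates this from averaging: it widens the interval to reflect uncertainty about the imputed values for unvalidated records.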


Updated: 2020-06-29
