Causal inference in the presence of missing data using a random forest‐based matching algorithm
Stat ( IF 1.7 ) Pub Date : 2020-10-23 , DOI: 10.1002/sta4.326
Tristan Hillis 1 , Maureen A. Guarcello 2 , Richard A. Levine 3, 4 , Juanjuan Fan 3, 4

Observational studies require matching across groups over multiple confounding variables. Across the literature, matching algorithms fail to handle the issue of missing data. Consequently, missing values are regularly imputed prior to being considered in the matching process. However, imputing is not always practical, forcing us to drop an observation due to the deficiency of the chosen algorithm, decreasing the power of the study and possibly failing to capture crucial latent information. We propose a missing data mechanism to incorporate within an iterative multivariate matching method. The underlying framework utilizes random forest as a natural tool in constructing a distance matrix, implemented with surrogate splits where there might be missing values. The output is then easily fed into an optimal matching algorithm. We apply this method to evaluate the effectiveness of supplemental instruction (SI) sessions, a voluntary program where students seek additional help, in a large enrollment, bottleneck introductory business statistics course. This is an observational study with two groups, those who attend multiple SI sessions and those who do not, and, as typical in educational data mining, challenged by missing data. Additionally, we perform a data simulation on missingness to further demonstrate the efficacy of our proposed approach.
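
The two-stage pipeline described in the abstract (a random forest supplies a distance matrix, which is then passed to an optimal matching step) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes the forest is fit to predict group membership, defines distance as one minus the fraction of trees in which two observations share a leaf, and uses SciPy's assignment solver as the optimal matcher. scikit-learn does not provide surrogate splits, so the sketch runs on complete synthetic data and does not reproduce the missing-data mechanism; the function names `forest_distance` and `match_pairs` are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.ensemble import RandomForestClassifier


def forest_distance(X, treated, n_trees=200, seed=0):
    """Distance = 1 - fraction of trees in which two rows land in the same leaf."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, min_samples_leaf=5, random_state=seed
    )
    # Assumption of this sketch: the forest predicts group membership.
    forest.fit(X, treated)
    leaves = forest.apply(X)  # leaf index per row and tree, shape (n, n_trees)
    same_leaf = leaves[:, None, :] == leaves[None, :, :]
    return 1.0 - same_leaf.mean(axis=2)


def match_pairs(dist, treated):
    """One-to-one matching that minimizes the total treated-control distance."""
    t_idx = np.flatnonzero(treated == 1)
    c_idx = np.flatnonzero(treated == 0)
    cost = dist[np.ix_(t_idx, c_idx)]
    rows, cols = linear_sum_assignment(cost)  # handles rectangular cost matrices
    return [(int(t_idx[r]), int(c_idx[c])) for r, c in zip(rows, cols)]


# Synthetic, fully observed data (illustrative only, not from the study).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
treated = (X[:, 0] + rng.normal(size=60) > 0.5).astype(int)

dist = forest_distance(X, treated)
pairs = match_pairs(dist, treated)
```

Each entry of `pairs` is a (treated index, control index) tuple. To follow the paper's approach on data with missing values, the forest step would need a tree implementation that supports surrogate splits, which this sketch leaves out.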

Updated: 2020-10-23