Removing the influence of group variables in high-dimensional predictive modelling
Journal of the Royal Statistical Society: Series A (Statistics in Society) (IF 1.5). Pub Date: 2021-04-15. DOI: 10.1111/rssa.12613
Emanuele Aliverti, Kristian Lum, James E. Johndrow, David B. Dunson

In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may contain spurious or otherwise undesirable correlations. Such correlations may arise from a variety of sources, including batch effects, systematic measurement errors or sampling bias. Without explicit adjustment, machine learning algorithms trained using these data can produce out-of-sample predictions which propagate these undesirable correlations. We propose a method to pre-process the training data, producing an adjusted dataset that is statistically independent of the nuisance variables with minimum information loss. We develop a conceptually simple approach for creating an adjusted dataset in high-dimensional settings based on a constrained form of matrix decomposition. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be statistically independent of the nuisance variables. We develop a scalable algorithm for implementing the method, along with theory support in the form of independence guarantees and optimality. The method is illustrated on some simulation examples and applied to two case studies: removing machine-specific correlations from brain scan data, and removing ethnicity information from a dataset used to predict recidivism. That the motivation for removing undesirable correlations is quite different in the two applications illustrates the broad applicability of our approach.
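The core idea — producing an adjusted dataset whose columns are statistically independent of a nuisance variable, then optionally compressing it with a matrix decomposition — can be illustrated with a simple linear sketch. The snippet below is not the authors' exact algorithm; it shows the basic building blocks (residualising the data matrix against the group variable, then a truncated SVD for a low-rank reconstruction), with all data simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n samples, p features, with a binary group variable z
# that linearly shifts the features (a nuisance/batch effect).
n, p = 200, 50
z = rng.integers(0, 2, size=n).astype(float)      # group labels
shift = rng.normal(size=p)                        # group-specific offset
X = rng.normal(size=(n, p)) + np.outer(z, shift)  # contaminated data

# Project each column of X onto the orthogonal complement of the span of
# the group variable (plus intercept): X_adj = (I - Z (Z'Z)^{-1} Z') X.
Z = np.column_stack([np.ones(n), z])
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)             # hat matrix for Z
X_adj = X - P @ X                                 # residualised data

# A truncated SVD then gives a low-rank reconstruction that retains
# most of the remaining (non-nuisance) structure.
k = 10
U, s, Vt = np.linalg.svd(X_adj, full_matrices=False)
X_lowrank = U[:, :k] * s[:k] @ Vt[:k]

# After adjustment, every feature has (numerically) zero sample
# correlation with the group variable.
corrs = np.array([np.corrcoef(z, X_adj[:, j])[0, 1] for j in range(p)])
print(np.abs(corrs).max())
```

This only removes *linear* association with `z`; the paper's contribution is a constrained decomposition with formal independence guarantees and scalability to high dimensions, which this sketch does not reproduce.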

Updated: 2021-04-15