Removing the influence of group variables in high-dimensional predictive modelling
Journal of the Royal Statistical Society, Series A (Statistics in Society). Pub Date: 2021-04-15, DOI: 10.1111/rssa.12613
Emanuele Aliverti, Kristian Lum, James E. Johndrow, David B. Dunson

In many application areas, predictive models are used to support or make important decisions. There is increasing awareness that these models may contain spurious or otherwise undesirable correlations. Such correlations may arise from a variety of sources, including batch effects, systematic measurement errors or sampling bias. Without explicit adjustment, machine learning algorithms trained using these data can produce out-of-sample predictions which propagate these undesirable correlations. We propose a method to pre-process the training data, producing an adjusted dataset that is statistically independent of the nuisance variables with minimum information loss. We develop a conceptually simple approach for creating an adjusted dataset in high-dimensional settings based on a constrained form of matrix decomposition. The resulting dataset can then be used in any predictive algorithm with the guarantee that predictions will be statistically independent of the nuisance variables. We develop a scalable algorithm for implementing the method, along with theoretical support in the form of independence and optimality guarantees. The method is illustrated on some simulation examples and applied to two case studies: removing machine-specific correlations from brain scan data, and removing ethnicity information from a dataset used to predict recidivism. That the motivation for removing undesirable correlations is quite different in the two applications illustrates the broad applicability of our approach.
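The paper's method is a constrained matrix decomposition with full statistical-independence guarantees; the abstract's core idea can nevertheless be illustrated with a much simpler linear sketch: residualize the feature matrix on group indicators, so every feature has zero mean within each group. This removes only linear (mean-level) group effects, a weaker guarantee than the paper's, and all names below are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: n samples, p features, with a group-specific mean shift
# (e.g. three scanner/batch IDs leaking into every feature).
n, p = 200, 5
groups = rng.integers(0, 3, size=n)
shift = np.array([[0.0], [2.0], [-1.5]])     # per-group offset
X = rng.normal(size=(n, p)) + shift[groups]

# One-hot design matrix Z for the group variable.
Z = np.eye(3)[groups]                        # shape (n, 3)

# Residualize: project X onto the orthogonal complement of col(Z),
#   X_adj = X - Z (Z^T Z)^{-1} Z^T X,
# which here amounts to subtracting each group's column means.
coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_adj = X - Z @ coef

# Each column of X_adj now has (numerically) zero mean within every group,
# so a linear predictor trained on X_adj cannot exploit group mean shifts.
for g in range(3):
    print(np.abs(X_adj[groups == g].mean(axis=0)).max())
```

A downstream model fit to `X_adj` is protected only against these linear group effects; the paper's decomposition-based adjustment is what delivers statistical independence from the nuisance variable.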
