Nonparametric variable importance assessment using machine learning techniques
Biometrics (IF 1.4), Pub Date: 2020-12-08, DOI: 10.1111/biom.13392
Brian D Williamson 1, Peter B Gilbert 1,2, Marco Carone 1,2, Noah Simon 1

In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often sub-optimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data-generating mechanism. Specifically, we discuss a generalization of the ANOVA variable importance measure, and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
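The abstract does not state the measure's formula, so the following is only a rough sketch under a stated assumption: that the generalized ANOVA measure can be read as the drop in population R² when a feature group is removed from the regression. The function names (`ols_fit_predict`, `r2_difference_importance`), the 50/50 sample split, and the least-squares stand-in regressor are all hypothetical choices made for illustration. This is a naive plug-in estimate; it is not the efficient estimator or the confidence interval the authors describe.

```python
import numpy as np


def ols_fit_predict(X_train, y_train, X_test):
    """Stand-in regressor (assumption): least squares with an intercept.

    Any function with this signature, e.g. one wrapping a machine
    learning model, could be passed in its place.
    """
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def r2_difference_importance(X, y, group, fit_predict=ols_fit_predict, seed=0):
    """Held-out R^2 of the full fit minus held-out R^2 without `group`.

    `group` is a list of column indices whose joint importance is assessed.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    half = len(y) // 2
    train, test = idx[:half], idx[half:]

    dropped = set(group)
    keep = [j for j in range(X.shape[1]) if j not in dropped]

    # Fit once with all features, once with the group of interest removed.
    pred_full = fit_predict(X[train], y[train], X[test])
    pred_reduced = fit_predict(X[train][:, keep], y[train], X[test][:, keep])

    mse_full = np.mean((y[test] - pred_full) ** 2)
    mse_reduced = np.mean((y[test] - pred_reduced) ** 2)

    # R^2_full - R^2_reduced simplifies to (mse_reduced - mse_full) / var(y).
    return (mse_reduced - mse_full) / np.var(y[test])


# Simulated data (illustrative only): feature 0 matters most, feature 2 not at all.
demo_rng = np.random.default_rng(1)
X_demo = demo_rng.normal(size=(2000, 3))
y_demo = 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + demo_rng.normal(size=2000)

for j in range(3):
    print(f"feature {j}: {r2_difference_importance(X_demo, y_demo, [j]):.3f}")
```

Because the quantity is defined through predictive performance rather than through model coefficients, swapping `ols_fit_predict` for a different regressor changes only how well the two regressions are estimated, not what the number means. Passing several indices in `group` gives the joint importance of a group of features.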

Updated: 2020-12-08