Random Forest Missing Data Algorithms, Statistical Analysis and Data Mining

Random Forest Missing Data Algorithms.
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2017-06-13, DOI: 10.1002/sam.11348
Fei Tang, Hemant Ishwaran

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They can handle mixed types of missing data, they adapt to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
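
The evaluation idea in the abstract (hide known values, impute them, compare against the truth) can be illustrated with a small sketch. This is not the authors' code and does not reproduce any of the algorithms they benchmarked. It assumes scikit-learn is available and substitutes its `IterativeImputer` with a `RandomForestRegressor`, which is only loosely missForest-like and handles continuous columns only. The toy data, the 25% missingness rate, the missing-completely-at-random masking, and all function names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer


def make_correlated_data(n_rows=200, seed=0):
    """Toy continuous data: four columns driven by one latent factor (an assumption)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_rows, 1))
    noise = 0.3 * rng.normal(size=(n_rows, 4))
    return latent @ np.array([[1.0, -0.8, 0.6, 1.2]]) + noise


def mask_completely_at_random(X, frac=0.25, seed=1):
    """Set roughly `frac` of the entries to NaN, independently of their values."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < frac
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, mask


def rf_impute(X_missing, seed=0):
    """Regress each column on the others with a random forest, for a few rounds."""
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=seed),
        max_iter=3,  # kept small for speed; may emit a convergence warning
        random_state=seed,
    )
    return imputer.fit_transform(X_missing)


def rmse_on_masked(X_true, X_imputed, mask):
    """Root-mean-square error restricted to the entries that were hidden."""
    return float(np.sqrt(np.mean((X_true[mask] - X_imputed[mask]) ** 2)))


X_true = make_correlated_data()
X_missing, mask = mask_completely_at_random(X_true)
X_rf = rf_impute(X_missing)
print("RF imputation RMSE on masked entries:", rmse_on_masked(X_true, X_rf, mask))
```

Raising the noise scale in `make_correlated_data` weakens the correlation between columns, which gives a rough way to explore the abstract's statement that performance improves with increasing correlation.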

Updated: 2017-06-13