Unsupervised random forests, Statistical Analysis and Data Mining


Unsupervised random forests
Statistical Analysis and Data Mining (IF 2.1), Pub Date: 2021-02-05, DOI: 10.1002/sam.11498
Alejandro Mantero, Hemant Ishwaran


sidClustering is a new unsupervised machine learning algorithm based on random forests. Its first step is what is called sidification of the features: the features are staggered so that they have mutually exclusive ranges (giving the staggered interaction data [SID] main features), and then all pairwise interactions are formed (giving the SID interaction features). A multivariate random forest, able to handle both continuous and categorical variables, is then used to predict the SID main features. We establish uniqueness of sidification and show how multivariate impurity splitting is able to identify clusters. The proposed sidClustering method is adept at finding clusters arising from categorical and continuous variables and retains all the important advantages of random forests. The method is illustrated using simulated and real data as well as two in-depth case studies, one from a large multi‐institutional study of esophageal cancer and the other involving hospital charges for cardiovascular patients.
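
The sidification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `sidify`, the `gap` parameter, the shift-by-minimum staggering rule, and the use of products as pairwise interactions are all assumptions made for illustration, and the sketch handles continuous features only.

```python
import itertools

import numpy as np


def sidify(X, gap=1.0):
    """Stagger columns onto disjoint ranges, then form pairwise interactions.

    Hypothetical helper for illustration only. Assumptions: all features are
    continuous, `gap` separates consecutive ranges, and an interaction is the
    product of two staggered features.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    main = np.empty_like(X)

    # Shift each column so it starts just above the previous column's maximum.
    lower = gap  # start above zero so every staggered value is positive
    for j in range(d):
        col = X[:, j]
        main[:, j] = col - col.min() + lower
        lower = main[:, j].max() + gap

    # All unordered pairs of staggered features.
    pairs = list(itertools.combinations(range(d), 2))
    if pairs:
        inter = np.column_stack([main[:, a] * main[:, b] for a, b in pairs])
    else:
        inter = np.empty((n, 0))
    return main, inter, pairs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))
    main, inter, pairs = sidify(X)
    print(pairs)        # index pairs behind each interaction column
    print(main.shape)   # staggered main features, one per original feature
    print(inter.shape)  # one interaction column per feature pair
```

In this sketch, `main` plays the role of the SID main features and `inter` that of the SID interaction features. Under the abstract's description, a multivariate random forest would then be fit to predict the main features; that step is not shown here.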


Updated: 2021-03-15
