A comparison of two dissimilarity functions for mixed-type predictor variables in the $\delta$-machine, Advances in Data Analysis and Classification


A comparison of two dissimilarity functions for mixed-type predictor variables in the $\delta$-machine
Advances in Data Analysis and Classification (IF 1.4), Pub Date: 2021-09-16, DOI: 10.1007/s11634-021-00463-6
Beibei Yuan, Willem Heiser, Mark de Rooij

The $\delta$-machine is a statistical learning tool for classification based on dissimilarities or distances between the profiles of the observations and the profiles of a representation set; it was proposed by Yuan et al. (J Classif 36(3): 442–470, 2019). So far, the $\delta$-machine has been restricted to continuous predictor variables. In this article, we extend the $\delta$-machine to handle continuous, ordinal, nominal, and binary predictor variables. We use a tailored dissimilarity function for mixed-type variables defined by Gower, which has the properties of a Manhattan distance. In a similar vein, we develop a Euclidean dissimilarity function for mixed-type variables. In simulation studies, we compare the performance of the two dissimilarity functions, and we compare the predictive performance of the $\delta$-machine with that of logistic regression models. We generated data according to two population distributions, varying the type of predictor variables, the distribution of the categorical variables, and the number of predictor variables. We investigated the performance of the $\delta$-machine using the two dissimilarity functions and different types of representation set. The simulation studies showed that the adjusted Euclidean dissimilarity function performed better than the adjusted Gower dissimilarity function; that the $\delta$-machine outperformed logistic regression; and that, for constructing the representation set, K-medoids clustering yielded fewer active exemplars than K-means clustering while maintaining accuracy. We also applied the $\delta$-machine to an empirical example, discussed its interpretation in detail, and compared its classification performance with that of five other classification methods. The results showed that the $\delta$-machine strikes a good balance between accuracy and interpretability.
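To make the contrast between the two dissimilarity families concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the classic Gower form (range-scaled absolute differences for numeric variables, simple mismatch for nominal and binary variables, averaged over variables) and a hypothetical Euclidean analogue that takes the root of the summed squared per-variable terms. The "adjusted" versions studied in the article are not specified in the abstract and may differ. Treating ordinal variables as numeric ranks is also an assumption, and all function and variable names are illustrative.

```python
import math

def per_variable_diffs(a, b, kinds, ranges):
    """Return per-variable dissimilarities in [0, 1] for two profiles.

    kinds[k] is "num" (continuous, or ordinal coded as ranks: an assumption)
    or "cat" (nominal or binary, scored by simple mismatch).
    ranges[k] is the observed range of a numeric variable (unused for "cat").
    """
    diffs = []
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            # Range-scaled absolute difference; a constant variable contributes 0.
            diffs.append(abs(x - y) / rng if rng > 0 else 0.0)
        else:
            # Simple matching: 0 if the categories agree, 1 otherwise.
            diffs.append(0.0 if x == y else 1.0)
    return diffs

def gower_dissimilarity(a, b, kinds, ranges):
    """Classic Gower form: mean of the per-variable terms (Manhattan-like)."""
    d = per_variable_diffs(a, b, kinds, ranges)
    return sum(d) / len(d)

def euclidean_mixed_dissimilarity(a, b, kinds, ranges):
    """Hypothetical Euclidean analogue: root of the summed squared terms."""
    d = per_variable_diffs(a, b, kinds, ranges)
    return math.sqrt(sum(v * v for v in d))

# Illustrative profiles: two numeric variables, one nominal, one binary.
kinds = ["num", "num", "cat", "cat"]
ranges = [10.0, 4.0, None, None]
observation = [2.0, 1.0, "red", 0]
exemplar = [7.0, 3.0, "blue", 0]

gower_value = gower_dissimilarity(observation, exemplar, kinds, ranges)
euclid_value = euclidean_mixed_dissimilarity(observation, exemplar, kinds, ranges)
print(gower_value, euclid_value)
```

In a $\delta$-machine-style workflow, values like these would be computed between each observation and each member of the representation set, and the resulting dissimilarities would then serve as inputs to the classifier.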


Updated: 2021-09-16
