Simultaneous Semiparametric Estimation of Clustering and Regression
Journal of Computational and Graphical Statistics (IF 2.4), Pub Date: 2022-01-04, DOI: 10.1080/10618600.2021.2000872
Matthieu Marbac, Mohammed Sedki, Christophe Biernacki, Vincent Vandewalle

ABSTRACT

We investigate parameter estimation for regression models with fixed group effects when the group variable is missing but group-related variables are available. This problem involves clustering, to infer the missing group variable from the group-related variables, and regression, to build a model of the target variable given the group and possibly some additional variables. It can therefore be formulated as modeling the joint distribution of the target and the group-related variables. The usual parameter estimation strategy for this joint model is a two-step approach that first learns the group variable (clustering step) and then plugs in its estimator to fit the regression model (regression step). However, this approach is suboptimal (in particular, it yields biased regression estimates) because it does not use the target variable for clustering. We therefore advise estimating clustering and regression simultaneously, in a semiparametric framework. Numerical experiments illustrate the benefits of our proposal across wide ranges of distributions and regression models. The relevance of our new method is illustrated on real data dealing with problems associated with high blood pressure prevention. The proposed approach is implemented in the R package ClusPred, available on CRAN. Supplementary materials containing the technical details and the R code are available online.
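
To make the contrast between the two strategies concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption made for illustration: the toy data-generating process, the function names (`simulate`, `kmeans2`, `group_regression`, `joint_em`), and above all the joint estimator, which is a plain parametric Gaussian EM used as a stand-in. It is not the semiparametric method of the paper and does not use or reproduce the ClusPred API.

```python
import numpy as np

def simulate(n=600, seed=0):
    """Toy data: latent group z, group-related variables x, covariate u, target y."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n)                  # group, unobserved in practice
    x = rng.normal(size=(n, 2)) + 1.5 * z[:, None]  # two overlapping clusters
    u = rng.normal(size=n)                          # additional covariate
    y = np.where(z == 1, 2.0, -2.0) + 0.5 * u + rng.normal(scale=0.5, size=n)
    return x, u, y, z

def kmeans2(x, n_iter=50):
    """Lloyd's algorithm for two clusters with a deterministic start."""
    s = x.sum(axis=1)
    centers = x[[s.argmin(), s.argmax()]]           # two extreme points
    for _ in range(n_iter):
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        centers = np.stack([x[labels == k].mean(axis=0) for k in range(2)])
    return labels

def group_regression(labels, u, y):
    """OLS with one intercept per group and a shared slope on u."""
    design = np.column_stack([labels == 0, labels == 1, u]).astype(float)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef                                     # [intercept_0, intercept_1, slope]

def joint_em(x, u, y, labels, n_iter=100):
    """Gaussian EM on (x, y) jointly: the target y also informs the clustering.
    Assumes unit spherical covariance for x and a common residual variance."""
    n = len(y)
    blocks = []
    for k in range(2):
        onehot = np.zeros((n, 2))
        onehot[:, k] = 1.0
        blocks.append(np.column_stack([onehot, u]))
    design = np.vstack(blocks)                      # (2n, 3): one block per group
    y_rep = np.tile(y, 2)
    resp = np.eye(2)[labels]                        # hard start from the clustering step
    for _ in range(n_iter):
        # M-step: proportions, cluster means, weighted least squares
        pi = resp.mean(axis=0)
        mu = (resp.T @ x) / resp.sum(axis=0)[:, None]
        w = resp.T.ravel()                          # weights aligned with the blocks
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y_rep * sw, rcond=None)
        sigma2 = np.sum(w * (y_rep - design @ coef) ** 2) / n
        # E-step: posterior group probabilities from x and from the residual of y
        resid = y[:, None] - coef[:2][None, :] - coef[2] * u[:, None]
        logp = (np.log(pi)
                - 0.5 * ((x[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
                - 0.5 * resid ** 2 / sigma2)
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)
    return coef

x, u, y, z = simulate()
labels = kmeans2(x)                                 # clustering step ignores y
coef_two_step = group_regression(labels, u, y)      # regression step on plugged-in labels
coef_joint = joint_em(x, u, y, labels)              # simultaneous (parametric stand-in)
coef_oracle = group_regression(z, u, y)             # reference fit with the true groups
print("two-step:", coef_two_step)
print("joint   :", coef_joint)
print("oracle  :", coef_oracle)
```

In this toy setup the true group intercepts are -2 and 2. Because the clustering step misassigns some observations, the plugged-in labels pull the two estimated intercepts toward each other, which is the kind of bias the abstract refers to. Letting the target enter the group assignment reduces that effect here, though the result depends on the simulated design and on the parametric assumptions of the stand-in.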




Updated: 2022-01-04