Simultaneous feature selection and outlier detection with optimality guarantees, Biometrics


Biometrics (IF 1.4), Pub Date: 2021-08-26, DOI: 10.1111/biom.13553
Luca Insolia (1, 2), Ana Kenney (3), Francesca Chiaromonte (2, 3), Giovanni Felici (4)


Biomedical research is increasingly data rich, with studies comprising ever growing numbers of features. The larger a study, the higher the likelihood that a substantial portion of the features may be redundant and/or contain contamination (outlying values). This poses serious challenges, which are exacerbated in cases where the sample sizes are relatively small. Effective and efficient approaches to perform sparse estimation in the presence of outliers are critical for these studies, and have received considerable attention in the last decade. We contribute to this area considering high-dimensional regressions contaminated by multiple mean-shift outliers affecting both the response and the design matrix. We develop a general framework and use mixed-integer programming to simultaneously perform feature selection and outlier detection with provably optimal guarantees. We prove theoretical properties for our approach, that is, a necessary and sufficient condition for the robustly strong oracle property, where the number of features can increase exponentially with the sample size; the optimal estimation of parameters; and the breakdown point of the resulting estimates. Moreover, we provide computationally efficient procedures to tune integer constraints and warm-start the algorithm. We show the superior performance of our proposal compared to existing heuristic methods through simulations and use it to study the relationships between childhood obesity and the human microbiome.
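The joint problem described in the abstract can be illustrated on a toy scale. The sketch below is not the authors' mixed-integer programming method: it swaps in exhaustive enumeration, which is only feasible for very small problems, and it assumes the problem can be written as least squares with a mean-shift vector phi, under cardinality bounds on the coefficients and on phi. The names `joint_select_and_detect`, `k_p`, and `k_n` are hypothetical, and only outliers in the response are simulated (the paper also covers contamination of the design matrix).

```python
import itertools
import numpy as np

def joint_select_and_detect(X, y, k_p, k_n):
    """Brute-force sketch of: min ||y - X b - phi||^2
    subject to ||b||_0 <= k_p and ||phi||_0 <= k_n (assumed formulation).

    For a fixed outlier set, phi absorbs the residuals of those rows, so the
    objective reduces to ordinary least squares on the remaining rows.
    Using exactly k_p features and k_n outliers is enough, because the
    residual sum of squares never increases when either set grows.
    """
    n, p = X.shape
    best = (np.inf, None, None, None)
    for outliers in itertools.combinations(range(n), k_n):
        keep = np.setdiff1d(np.arange(n), outliers)
        for feats in itertools.combinations(range(p), k_p):
            A = X[np.ix_(keep, list(feats))]
            coef, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
            rss = float(np.sum((y[keep] - A @ coef) ** 2))
            if rss < best[0]:
                best = (rss, feats, outliers, coef)
    return best

# Toy data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p = 12, 5
X = rng.standard_normal((n, p))
beta = np.array([2.0, 0.0, -3.0, 0.0, 0.0])   # active features: 0 and 2
y = X @ beta + 0.01 * rng.standard_normal(n)
y[[3, 7]] += 10.0                             # mean-shift outliers in rows 3 and 7

rss, feats, outliers, coef = joint_select_and_detect(X, y, k_p=2, k_n=2)
print("selected features:", feats)
print("flagged outliers:", outliers)
```

Enumeration costs C(n, k_n) * C(p, k_p) least-squares fits, which is why a mixed-integer solver with good bounds and warm starts, as the abstract describes, is needed beyond toy sizes.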


Updated: 2021-08-26
