Improving sample and feature selection with principal covariates regression
Machine Learning: Science and Technology ( IF 6.3 ) Pub Date : 2021-07-13 , DOI: 10.1088/2632-2153/abfe7c
Rose K Cersonsky 1 , Benjamin A Helfrecht 1 , Edgar A Engel 2 , Sergei Kliavinek 1 , Michele Ceriotti 1

Selecting the most relevant features and samples out of a large set of candidates is a task that occurs very often in the context of automated data analysis, where it improves the computational performance and often the transferability of a model. Here we focus on two popular subselection schemes applied to this end: CUR decomposition, derived from a low-rank approximation of the feature matrix, and farthest point sampling (FPS), which relies on the iterative identification of the most diverse samples and discriminating features. We modify these unsupervised approaches, incorporating a supervised component following the same spirit as the principal covariates (PCov) regression method. We show how this results in selections that perform better in supervised tasks, demonstrating with models of increasing complexity, from ridge regression to kernel ridge regression and finally feed-forward neural networks. We also present adjustments to minimise the impact of any subselection when performing unsupervised tasks. We demonstrate the significant improvements associated with PCov-CUR and PCov-FPS selections for applications to chemistry and materials science, typically reducing by a factor of two the number of features and samples required to achieve a given level of regression accuracy.
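As a rough illustration of the selection schemes described above, the following is a minimal sketch of greedy farthest point sampling over samples, with an optional supervised term. The abstract does not give the PCov-FPS formulation, so everything specific here is an assumption made for illustration: the function name `fps`, the mixing weight `alpha`, the use of scalar targets `y`, and the combined squared distance `alpha * ||x_i - x_j||^2 + (1 - alpha) * (y_i - y_j)^2`. With `alpha=1.0` (or `y=None`) the sketch reduces to ordinary unsupervised FPS.

```python
def fps(X, n_select, y=None, alpha=1.0, start=0):
    """Greedy farthest point sampling over the rows of X.

    X        : list of feature vectors (lists of floats)
    n_select : number of samples to pick
    y        : optional list of scalar targets (illustrative assumption)
    alpha    : hypothetical mixing weight; 1.0 uses features only,
               0.0 uses targets only
    start    : index of the first selected sample
    """
    n = len(X)
    if not 0 < n_select <= n:
        raise ValueError("n_select must be between 1 and len(X)")

    def sq_dist(i, j):
        # Squared Euclidean distance in feature space.
        d = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
        if y is not None:
            # Assumed supervised mix: blend in the squared target distance.
            d = alpha * d + (1.0 - alpha) * (y[i] - y[j]) ** 2
        return d

    selected = [start]
    # min_d[i] is the squared distance from sample i to its nearest selected sample.
    min_d = [sq_dist(i, start) for i in range(n)]
    while len(selected) < n_select:
        # Pick the sample farthest from everything selected so far.
        nxt = max(range(n), key=min_d.__getitem__)
        selected.append(nxt)
        for i in range(n):
            min_d[i] = min(min_d[i], sq_dist(i, nxt))
    return selected


X = [[0.0], [1.0], [10.0], [11.0]]
print(fps(X, 2))  # [0, 3]
```

Tracking only the distance to the nearest selected sample keeps each iteration linear in the number of candidates. Lowering `alpha` in this sketch makes the selection favour samples whose targets differ, which is one simple way to read "incorporating a supervised component".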




Updated: 2021-07-13