Nonstationary Gaussian Process Discriminant Analysis With Variable Selection for High-Dimensional Functional Data
Journal of Computational and Graphical Statistics (IF 2.4), Pub Date: 2022-09-14, DOI: 10.1080/10618600.2022.2098136
Weichang Yu 1 , Sara Wade 2 , Howard D. Bondell 3 , Lamiae Azizi 4

ABSTRACT

High-dimensional classification and feature selection tasks are ubiquitous with the recent advancement in data acquisition technology. In several application areas such as biology, genomics, and proteomics, the data are often functional in their nature and exhibit a degree of roughness and nonstationarity. These structures pose additional challenges to commonly used methods that rely mainly on a two-stage approach performing variable selection and classification separately. We propose in this work a novel Gaussian process discriminant analysis (GPDA) that combines these steps in a unified framework. Our model is a two-layer nonstationary Gaussian process coupled with an Ising prior to identify differentially-distributed locations. Scalable inference is achieved via developing a variational scheme that exploits advances in the use of sparse inverse covariance matrices. We demonstrate the performance of our methodology on simulated datasets and two proteomics datasets: breast cancer and SARS-CoV-2. Our approach distinguishes itself by offering explainability as well as uncertainty quantification in addition to low computational cost, which are crucial to increase trust and social acceptance of data-driven tools. Supplementary materials for this article are available online.




Updated: 2022-09-14