Latent-space embedding of expression data identifies gene signatures from sputum samples of asthmatic patients
BMC Bioinformatics ( IF 2.9 ) Pub Date : 2020-10-15 , DOI: 10.1186/s12859-020-03785-y
Shaoke Lou 1, 2 , Tianxiao Li 1, 2 , Daniel Spakowicz 1, 2, 3 , Xiting Yan 4 , Geoffrey Lowell Chupp 4 , Mark Gerstein 1, 2

The pathogenesis of asthma is a complex process involving multiple genes and pathways. Identifying biomarkers from asthma datasets, especially those that include heterogeneous subpopulations, is challenging. Potentially, autoencoders provide ideal frameworks for such tasks as they can embed complex, noisy high-dimensional gene expression data into a low-dimensional latent space in an unsupervised fashion, enabling us to extract distinguishing features from expression data. Here, we developed a framework combining a denoising autoencoder and a supervised learning classifier to identify gene signatures related to asthma severity. Using the trained autoencoder with 50 hidden units, we found that hierarchical clustering on the low-dimensional embedding corresponds well with previously defined and clinically relevant clusters of patients. Moreover, each hidden unit has contributions from each of the genes, and pathway analysis of these contributions shows that the hidden units are significantly enriched in known asthma-related pathways. We then used genes that contribute most to the hidden units to develop a secondary random-forest classifier for directly predicting asthma severity. The feature importance metric from this classifier identified a signature based on 50 key genes, which are associated with severity. Furthermore, we can use these key genes to successfully estimate FEV1/FVC ratios across patients, via support-vector-machine regression. We found that the denoising autoencoder framework can extract meaningful patterns corresponding to functional gene groups and patient clusters from the gene expression of asthma patients.
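The first stage described above (a denoising autoencoder whose hidden units each receive a contribution from every gene) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The toy data, the masking-noise corruption, the tanh activation, the hyperparameters, and the use of absolute encoder weights as the "contribution" of a gene to a hidden unit are all assumptions made for illustration. The function names are hypothetical, and the sketch uses 5 hidden units on 30 synthetic genes rather than the 50 hidden units reported in the abstract.

```python
import numpy as np


def train_denoising_autoencoder(X, n_hidden=5, noise_rate=0.2, lr=0.01,
                                epochs=500, seed=0):
    """Full-batch gradient descent on a one-hidden-layer denoising autoencoder.

    The input is corrupted by randomly zeroing entries (masking noise, an
    assumption), and the network is trained to reconstruct the clean input.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W_enc = rng.normal(0.0, 0.1, (n_genes, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, (n_hidden, n_genes))
    b_dec = np.zeros(n_genes)
    losses = []
    for _ in range(epochs):
        keep = rng.random(X.shape) >= noise_rate   # True where the value survives
        X_noisy = X * keep
        H = np.tanh(X_noisy @ W_enc + b_enc)       # latent embedding
        X_hat = H @ W_dec + b_dec                  # linear reconstruction
        err = X_hat - X                            # compare with the CLEAN input
        losses.append(0.5 * float(np.sum(err ** 2)) / n_samples)
        # Backpropagation for the loss 0.5 * sum(err^2) / n_samples.
        g_out = err / n_samples
        g_hid = (g_out @ W_dec.T) * (1.0 - H ** 2)  # tanh'(a) = 1 - tanh(a)^2
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X_noisy.T @ g_hid)
        b_enc -= lr * g_hid.sum(axis=0)
    return W_enc, b_enc, losses


def embed(X, W_enc, b_enc):
    """Project uncorrupted samples into the latent space."""
    return np.tanh(X @ W_enc + b_enc)


def top_genes_per_unit(W_enc, k=3):
    """Union of the k genes with the largest |encoder weight| for each hidden unit."""
    order = np.argsort(-np.abs(W_enc), axis=0)[:k]  # shape (k, n_hidden)
    return np.unique(order)


def make_toy_expression(n_per_group=20, n_genes=30, seed=1):
    """Synthetic stand-in for expression data: two groups that differ in 5 genes."""
    rng = np.random.default_rng(seed)
    X = rng.normal(5.0, 1.0, (2 * n_per_group, n_genes))
    X[n_per_group:, :5] += 3.0                      # group-specific shift
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)                     # min-max scale each gene to [0, 1]


X = make_toy_expression()
W_enc, b_enc, losses = train_denoising_autoencoder(X)
Z = embed(X, W_enc, b_enc)                          # samples x hidden units
genes = top_genes_per_unit(W_enc, k=3)              # candidate signature genes
print(Z.shape, len(genes), losses[0], losses[-1])
```

In a pipeline like the one the abstract outlines, the embedding `Z` would be the input to hierarchical clustering, and the selected gene indices would be passed to a downstream classifier or regressor. Those later stages are omitted here.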

Updated: 2020-10-16