Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions
bioRxiv - Bioinformatics Pub Date : 2020-10-16 , DOI: 10.1101/2020.10.15.340802
Yannis Pantazis , Christos Tselas , Kleanthi Lakiotaki , Vincenzo Lagani , Ioannis Tsamardinos

High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow transcriptomic profiles to be quantified precisely, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low-dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks to 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly neural-based models, seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Our results show that low-dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets, despite the limited sample size of each dataset and the biological / technological heterogeneity across studies. The created space is two to three orders of magnitude smaller than the raw data, while still capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.
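The reconstruction test in (a) can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes synthetic data in place of real expression datasets, covers only the linear PCA case (not kernel PCA or autoencoders), and all names (`simulate`, `fit_pca`, `encode`, `decode`, `relative_reconstruction_error`) are hypothetical. The latent space is learned on one set of samples and evaluated on held-out samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic stand-in for expression data (samples x genes),
# generated from a handful of hidden factors plus measurement noise.
n_train, n_val, n_genes, n_factors = 200, 50, 1000, 5
loadings = rng.normal(size=(n_factors, n_genes))


def simulate(n_samples):
    """Draw samples whose gene values are linear mixes of hidden factors."""
    factors = rng.normal(size=(n_samples, n_factors))
    noise = 0.1 * rng.normal(size=(n_samples, n_genes))
    return factors @ loadings + noise


def fit_pca(X, k):
    """Return the training mean and the top-k principal directions."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def encode(X, mean, components):
    """Project samples into the k-dimensional latent space."""
    return (X - mean) @ components.T


def decode(Z, mean, components):
    """Map latent coordinates back to gene space."""
    return Z @ components + mean


def relative_reconstruction_error(X, mean, components):
    """Fraction of variation around the training mean lost after encode/decode."""
    X_hat = decode(encode(X, mean, components), mean, components)
    return np.sum((X - X_hat) ** 2) / np.sum((X - mean) ** 2)


X_train = simulate(n_train)
X_val = simulate(n_val)  # held out: not used to build the latent space

mean, components = fit_pca(X_train, k=n_factors)
Z_val = encode(X_val, mean, components)
val_error = relative_reconstruction_error(X_val, mean, components)
print(Z_val.shape, float(val_error))
```

In this toy setting 1000 features are compressed to 5 latent coordinates. The held-out error stays small only because the synthetic data is low-rank by construction, which real transcriptomic data need not be.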

Updated: 2020-10-17