Compressing gene expression data using multiple latent space dimensionalities learns complementary biological representations
Genome Biology (IF 12.3), Pub Date: 2020-05-11, DOI: 10.1186/s13059-020-02021-3
Gregory P Way, Michael Zietz, Vincent Rubinetti, Daniel S Himmelstein, Casey S Greene

Background: Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.

Results: We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensions. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensions.

Conclusions: There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.
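
The idea of fitting models at several latent dimensionalities and pooling their features can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it uses only non-negative matrix factorization with Lee-Seung multiplicative updates written in NumPy, a synthetic matrix in place of real expression data, and arbitrary dimensionalities. The function names `nmf_compress` and `compress_across_dimensionalities` are hypothetical.

```python
import numpy as np


def nmf_compress(x, k, n_iter=200, seed=0):
    """Factor non-negative x (samples x genes) as w @ h with k latent features.

    Uses Lee-Seung multiplicative updates for the squared-error objective.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = x.shape
    w = rng.random((n_samples, k)) + 1e-3
    h = rng.random((k, n_genes)) + 1e-3
    eps = 1e-10  # guards against division by zero
    for _ in range(n_iter):
        h *= (w.T @ x) / (w.T @ w @ h + eps)
        w *= (x @ h.T) / (w @ h @ h.T + eps)
    return w, h


def compress_across_dimensionalities(x, ks):
    """Fit one model per latent dimensionality and pool the sample features."""
    return np.hstack([nmf_compress(x, k)[0] for k in ks])


# Synthetic stand-in for a samples-by-genes expression matrix (assumption).
rng = np.random.default_rng(42)
expression = rng.random((50, 200))

ks = [2, 5, 10]  # hypothetical latent dimensionalities
pooled = compress_across_dimensionalities(expression, ks)
print(pooled.shape)  # (50, 17): 2 + 5 + 10 pooled features per sample
```

Each column of `pooled` is one latent feature from one model, so downstream analyses such as gene set association tests could draw on features from every dimensionality rather than from a single fit.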

Updated: 2020-05-11