Weighted dimensionality reduction and robust Gaussian mixture model based cancer patient subtyping from gene expression data, Journal of Biomedical Informatics


Journal of Biomedical Informatics (IF 4.0), Pub Date: 2020-11-11, DOI: 10.1016/j.jbi.2020.103620
Omar Rafique, A H Mir

Background:

Cancer is heterogeneous, so patients need to be subtyped into distinct, well-separated subgroups. Gene expression data, however, is high dimensional, noisy, and contains outliers, which creates computational difficulties. As a result, subtyping cancer patients from gene expression data often yields highly overlapping Kaplan–Meier (KM) survival plots, and it becomes difficult to distinguish clearly among the discovered subtypes. Here we attempt to achieve greater separation among the subtypes through a robust clustering pipeline.

Methods:

We propose a robust framework to achieve better separation among the discovered subtypes. The framework combines dimensionality reduction of a weighted gene expression matrix, using t-distributed Stochastic Neighbor Embedding (t-SNE), with a robust Gaussian mixture model based clustering approach. Before dimensionality reduction, every gene is weighted according to its median absolute deviation (MAD). The results are quantified by measuring the minimum pairwise separation among the KM plots and the minimum hazard ratio among the subtypes. We also introduce a novel measure, called cumulative survival separation, to quantify the separation among the discovered subtypes.
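The steps named in this paragraph (MAD weighting, t-SNE, mixture-model clustering) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the weighting formula, so scaling each gene by its MAD normalised to the largest MAD is an assumption, and scikit-learn's plain `GaussianMixture` is used as a stand-in for the robust Gaussian mixture model. The function names and the defaults for `n_subtypes`, `perplexity` and `seed` are hypothetical.

```python
import numpy as np


def mad_per_gene(expr):
    """Median absolute deviation of each gene (column) across patients (rows)."""
    expr = np.asarray(expr, dtype=float)
    med = np.median(expr, axis=0)
    return np.median(np.abs(expr - med), axis=0)


def weight_by_mad(expr):
    """Scale every gene column by its MAD, normalised so the largest weight is 1.

    Assumption: the abstract only says genes are weighted "according to" MAD,
    so this particular scaling is illustrative.
    """
    expr = np.asarray(expr, dtype=float)
    mad = mad_per_gene(expr)
    top = mad.max()
    weights = mad / top if top > 0 else np.ones_like(mad)
    return expr * weights, weights


def embed_and_cluster(expr, n_subtypes=3, perplexity=30.0, seed=0):
    """MAD weighting -> 2-D t-SNE embedding -> one mixture label per patient."""
    # scikit-learn is imported here so the MAD helpers above need only NumPy.
    from sklearn.manifold import TSNE
    from sklearn.mixture import GaussianMixture

    weighted, _ = weight_by_mad(expr)
    # perplexity must be smaller than the number of patients (rows).
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=seed).fit_transform(weighted)
    # Stand-in only: a standard GMM, NOT the robust variant the paper describes.
    gmm = GaussianMixture(n_components=n_subtypes, random_state=seed)
    return gmm.fit_predict(embedding)
```

With this weighting, a gene whose expression barely varies across patients receives a weight near zero and contributes little to the distances that t-SNE works from, while the median-based statistic keeps a few outlying patients from inflating a gene's weight.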

Results:

To validate the proposed methodology, we obtained five cancer gene expression datasets from The Cancer Genome Atlas (TCGA). Comparisons with Consensus Clustering (CC), Consensus non-negative matrix factorization (CNMF), fast density-aware spectral clustering (Spectrum) and Neighborhood based Multi-Omics clustering (NEMO) show that the proposed method achieves greater separation than these methods from the literature. For instance, the minimum pairwise life expectancy difference between the discovered subtypes for GBM is 61 days for the proposed methodology with MAD scores, whereas it is only approximately 33, 19, 49 and 33 days for CC, Spectrum, NEMO and CNMF respectively. Comparisons of the proposed framework with and without the MAD scores show that the MAD score significantly improves the subtype separation. Hazard ratio analysis also shows that the proposed methodology performs better. Furthermore, pathway over-representation analyses were carried out to identify relevant genetic pathways that may be possible targets for treatment.
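To make the "minimum pairwise life expectancy difference" figure concrete, here is a minimal sketch of one way such a number could be computed. The abstract does not define the calculation, so this assumes that each subtype's life expectancy is the area under its Kaplan–Meier curve up to a fixed horizon (a restricted mean survival time); the function names and the `horizon` parameter are hypothetical.

```python
from itertools import combinations

import numpy as np


def km_curve(time, event):
    """Kaplan-Meier estimate: distinct event times and S(t) just after each one."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)  # True = death observed, False = censored
    event_times = np.unique(time[event])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)


def restricted_mean(time, event, horizon):
    """Area under the KM step function on [0, horizon], in the units of `time`."""
    t, s = km_curve(time, event)
    keep = t < horizon
    knots = np.concatenate(([0.0], t[keep], [horizon]))
    heights = np.concatenate(([1.0], s[keep]))
    return float(np.sum(np.diff(knots) * heights))


def min_pairwise_separation(time, event, labels, horizon):
    """Smallest absolute difference in restricted mean survival between any two subtypes."""
    time, event, labels = map(np.asarray, (time, event, labels))
    means = {k: restricted_mean(time[labels == k], event[labels == k], horizon)
             for k in np.unique(labels)}
    # Needs at least two subtypes; min() raises ValueError otherwise.
    return min(abs(means[a] - means[b]) for a, b in combinations(means, 2))
```

Taking the minimum over all pairs means the score reflects the two subtypes that are hardest to tell apart, so a large value indicates that every pair of subtypes is well separated.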

Conclusion:

The results suggest that weighting genes by median absolute deviation and using a robust clustering methodology help achieve greater separation among the subtypes, with better statistical and clinical significance.
