Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest
Genes & Genomics (IF 2.1), Pub Date: 2021-06-07, DOI: 10.1007/s13258-021-01057-4
Weiwen Zhang, Lianglun Cheng, Guoheng Huang

Background

Population stratification modeling is essential in Genome-Wide Association Studies.

Objective

In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry.

Methods

Kernel principal component analysis (kernel PCA) and a random forest are used to build the population stratification model, together with parameter optimization. We explore different PCA methods, including standard PCA and kernel PCA, to extract relevant features from genotype data transformed by vcf2geno, a pipeline from the LASER software. The extracted features are fed into a random forest for ensemble learning. Parameter tuning jointly selects the optimal number of principal components, the kernel function for PCA, and the parameters of the random forest.
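
The workflow described above can be sketched as a single pipeline whose PCA and random forest settings are searched jointly. This is a minimal illustration, not the authors' implementation: the use of scikit-learn, the synthetic 0/1/2 genotype matrix, the three made-up populations, and every grid value are assumptions. A linear-kernel `KernelPCA` stands in for standard PCA so that all variants share one search space.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for genotype data (NOT HGDP): 3 hypothetical populations,
# 60 individuals each, 200 variants coded as 0/1/2 alternate-allele counts.
n_pops, n_per_pop, n_variants = 3, 60, 200
allele_freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_variants))
X = np.vstack([
    rng.binomial(2, allele_freqs[k], size=(n_per_pop, n_variants))
    for k in range(n_pops)
]).astype(float)
y = np.repeat(np.arange(n_pops), n_per_pop)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature extraction (kernel PCA) followed by the ensemble classifier.
pipe = Pipeline([
    ("kpca", KernelPCA()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Hypothetical grid: kernel, number of components, and forest parameters
# are tuned together rather than one stage at a time.
param_grid = {
    "kpca__kernel": ["linear", "rbf", "sigmoid"],  # "linear" ~ standard PCA
    "kpca__n_components": [2, 5, 10],
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 5],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("held-out accuracy:", test_accuracy)
```

Searching over the whole pipeline means each candidate kernel and component count is scored by the downstream classifier's cross-validated accuracy, which is one way to realize the joint tuning the abstract mentions.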

Results

Experiments on the HGDP dataset show that kernel PCA with a sigmoid or Gaussian kernel achieves higher prediction accuracy than standard PCA. Compared with standard PCA using two principal components, KPCA-Sigmoid with the optimal number of principal components improves accuracy by around 100% for East Asian populations and around 200% for European populations.
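
The abstract does not say how these improvement percentages are computed. One common reading is relative improvement over the baseline accuracy, and the sketch below shows that arithmetic under this assumption. The function name and the accuracy values are hypothetical and are not results from the paper.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative improvement of new_acc over baseline_acc, in percent."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (new_acc - baseline_acc) / baseline_acc * 100.0


# Hypothetical accuracies, chosen only to illustrate the arithmetic.
doubled = relative_improvement(0.30, 0.60)  # accuracy doubles
tripled = relative_improvement(0.30, 0.90)  # accuracy triples
print(f"{doubled:.0f}% and {tripled:.0f}%")
```

Under this reading, a 100% improvement means the accuracy doubles and a 200% improvement means it triples, so the corresponding baseline accuracies could be at most 50% and about 33%.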

Conclusion

With the optimal parameter configuration for both PCA and the random forest, the proposed method infers individual genetic ancestry from an individual's variants more accurately.



Updated: 2021-06-07