Breed identification using breed-informative SNPs and machine learning based on whole genome sequence data and SNP chip data
Journal of Animal Science and Biotechnology (IF 7), Pub Date: 2023-06-01, DOI: 10.1186/s40104-023-00880-x
Changheng Zhao 1 , Dan Wang 1 , Jun Teng 1 , Cheng Yang 1 , Xinyi Zhang 1 , Xianming Wei 1 , Qin Zhang 1

Breed identification is useful in a variety of biological contexts. It usually involves two stages: detection of breed-informative SNPs and breed assignment. Several methods have been proposed for each stage, but which combination of these methods is optimal remains unclear. In this study, using the whole genome sequence data available for 13 cattle breeds from Run 8 of the 1,000 Bull Genomes Project, we compared combinations of three methods (Delta, FST, and In) for breed-informative SNP detection and five machine learning methods (KNN, SVM, RF, NB, and ANN) for breed assignment, with respect to different reference population sizes and different numbers of the most breed-informative SNPs. In addition, we evaluated the accuracy of breed identification using SNP chip data of different densities. All combinations performed quite well, with identification accuracies over 95% in all scenarios. However, no single combination was both the best-performing and robust across all scenarios. We therefore proposed integrating the three breed-informative SNP detection methods into a method named DFI, and integrating three of the machine learning methods (KNN, SVM, and RF) into a method named KSR. The combination of these two integrated methods outperformed the other combinations, with accuracies over 99% in most cases, and was very robust in all scenarios. The accuracies from using SNP chip data were only slightly lower than those from using sequence data in most cases. The current study showed that the combination of DFI and KSR was the optimal strategy. Using sequence data resulted in higher accuracies than using chip data in most cases, but the differences were generally small. In view of the cost of genotyping, using chip data is also a good option for breed identification.
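
To make the two stages concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes biallelic SNPs summarized by one reference-allele frequency per breed. Delta is taken as the absolute allele-frequency difference between two breeds, FST as Wright's fixation index with equally weighted breeds, and In as the informativeness-for-assignment statistic; the abstract does not define any of the three, so these definitions are assumptions. The abstract also does not say how DFI and KSR combine their component methods, so the union of top-ranked SNPs and the majority vote below are illustrative guesses, and all function names are hypothetical.

```python
import math
from collections import Counter


def delta(p_a, p_b):
    """Absolute allele-frequency difference between two breeds (assumed definition)."""
    return abs(p_a - p_b)


def fst(freqs):
    """Wright's F_ST for one biallelic SNP, weighting all breeds equally."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity of the pooled population
    if h_total == 0:
        return 0.0  # SNP is monomorphic overall, so it carries no breed signal
    h_within = sum(2 * p * (1 - p) for p in freqs) / k  # mean within-breed heterozygosity
    return (h_total - h_within) / h_total


def _xlogx(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness(freqs):
    """Informativeness for assignment (I_n) of one biallelic SNP across K breeds."""
    k = len(freqs)
    total = 0.0
    # Sum over both alleles: the reference allele and its complement.
    for allele_freqs in (freqs, [1 - p for p in freqs]):
        p_bar = sum(allele_freqs) / k
        total += -_xlogx(p_bar) + sum(_xlogx(p) for p in allele_freqs) / k
    return total


def top_snps(scores, n):
    """Return the IDs of the n highest-scoring SNPs from a {snp_id: score} dict."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])


def combine_snp_sets(*snp_sets):
    """Hypothetical DFI-style integration: union of each method's top SNPs."""
    return set().union(*snp_sets)


def majority_vote(predictions):
    """Hypothetical KSR-style integration: the most common predicted breed wins.

    Ties go to the label that appears first in the input list.
    """
    return Counter(predictions).most_common(1)[0][0]
```

As a usage illustration, a SNP fixed for opposite alleles in two breeds gives `fst([1.0, 0.0]) == 1.0`, while a SNP with identical frequencies gives `0.0`. Calling `majority_vote(["Angus", "Holstein", "Angus"])` on the labels from three hypothetical classifiers returns `"Angus"`.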

Updated: 2023-06-01