Principal Component Analysis Applications in COVID-19 Genome Sequence Studies
Cognitive Computation ( IF 4.3 ) Pub Date : 2021-01-13 , DOI: 10.1007/s12559-020-09790-w
Bo Wang 1 , Lin Jiang 2

Coronavirus RNA genomes can be as long as 32 kilobases, and the long genome of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus behind the coronavirus disease 2019 (COVID-19) pandemic, makes sequence analysis difficult. More than 20,000 sequences have been submitted to GISAID, and the number grows quickly each day, which adds to the difficulty of data analysis; yet genome sequence analysis is critical for understanding COVID-19 and preventing the spread of the disease. In this study, principal component analysis (PCA) was applied to large aligned genome sequences, with the letters converted to numerical values using a published method designed for protein sequence cluster analysis. The study began with a test on a short list of sequences: the PCA score plot showed high tolerance of low-quality data, and the major virus sequences from humans were separated from the pangolin and bat samples. Our study also built a model for more than 20,000 sequences, which indicates potential mutation directions for COVID-19 and can serve as a pretreatment step for detailed studies such as decision tree-based methods. In summary, our study provides a fast tool for analyzing high-volume genome sequence sets such as those of COVID-19, and its successful application to more than 20,000 sequences may provide mutation direction information for COVID-19 studies.
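
The pipeline described in the abstract (aligned sequences, letters converted to numbers, then PCA scores) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the published letter-to-number method, so a plain one-hot encoding is swapped in here, the function names `one_hot_encode` and `pca_scores` are hypothetical, and the toy sequences are made up rather than taken from GISAID.

```python
import numpy as np

# Assumption: one-hot encoding stands in for the paper's unspecified
# letter-to-number method. Any symbol outside ACGT (N, gaps) maps to zeros.
ALPHABET = "ACGT"


def one_hot_encode(aligned_seqs):
    """Turn equal-length aligned sequences into an (n_seqs, 4 * length) matrix."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    index = {base: i for i, base in enumerate(ALPHABET)}
    X = np.zeros((len(aligned_seqs), length * len(ALPHABET)))
    for row, seq in enumerate(aligned_seqs):
        for pos, base in enumerate(seq.upper()):
            col = index.get(base)
            if col is not None:  # ambiguous bases stay all-zero
                X[row, pos * len(ALPHABET) + col] = 1.0
    return X


def pca_scores(X, n_components=2):
    """Project the rows of X onto the top principal components via SVD."""
    Xc = X - X.mean(axis=0)  # center each feature column
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)  # fraction of variance per component
    return scores, explained[:n_components]


# Toy aligned sequences (invented for illustration, not real genome data).
toy = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGTACGNAC",  # one ambiguous base, mimicking a low-quality read
    "TTGTACCTAG",
    "TTGTACCTAG",
]
scores, explained = pca_scores(one_hot_encode(toy))
print(scores)
print(explained)
```

In this toy setup, the sequence with a single ambiguous base stays close to its group on the first component, while the two groups fall far apart, which is the kind of behavior a score plot would be read for. For tens of thousands of genome-length sequences, a truncated or randomized SVD would likely be needed instead of the full decomposition used here.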




Updated: 2021-01-13