The application of deep learning for the classification of correct and incorrect SNP genotypes from whole-genome DNA sequencing pipelines
Journal of Applied Genetics ( IF 2.4 ) Pub Date : 2020-09-29 , DOI: 10.1007/s13353-020-00586-0
Krzysztof Kotlarz , Magda Mielczarek , Tomasz Suchocki , Bartosz Czech , Bernt Guldbrandtsen , Joanna Szyda

A downside of next-generation sequencing (NGS) technology is its high technical error rate. We built a tool that uses array-based genotype information to classify NGS-based SNPs into correct and incorrect calls. The deep learning algorithms were implemented in Keras. Five algorithms were tested: (i) a basic, naïve algorithm; (ii) the naïve algorithm modified by pre-imposing different weights on the incorrect and correct SNP classes when calculating the loss metric; and (iii)–(v) the naïve algorithm modified by random re-sampling (with replacement) of the incorrect SNPs so that their number matched 30%, 60%, or 100% of the number of correct SNPs. The training data set comprised data from three bulls, with 2,227,995 correct (97.94%) and 46,920 incorrect SNPs; the validation data set comprised data from one bull, with 749,506 correct (98.05%) and 14,908 incorrect SNPs. For a rare-event classification problem such as the detection of incorrect SNPs in NGS data, the most parsimonious naïve model and the model with weighting of SNP classes provided the best classification of the validation data set. Both classified 19% of truly incorrect SNPs as incorrect and 99% of truly correct SNPs as correct, yielding an F1 score of 0.21, the highest among the compared algorithms. We conclude that the basic models were less adapted to the specifics of the training data set and therefore classified the independent validation data set better than the other tested models.
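The two imbalance-handling strategies named in the abstract (class weighting in the loss, and re-sampling the incorrect SNPs with replacement) can be illustrated with a minimal sketch. This is not the authors' code: the abstract does not give the weight values or the exact re-sampling procedure, so the inverse-frequency weighting, the choice to keep the original incorrect SNPs and add random draws on top, and all function names below are assumptions made for illustration only.

```python
import random


def class_weights(n_correct, n_incorrect):
    """Inverse-frequency class weights (assumed scheme; the abstract
    does not state which weights were used). Class 0 = correct SNP,
    class 1 = incorrect SNP."""
    total = n_correct + n_incorrect
    return {0: total / (2 * n_correct), 1: total / (2 * n_incorrect)}


def oversample_incorrect(incorrect, n_correct, ratio, seed=0):
    """Grow the incorrect class to ratio * n_correct items by drawing
    extra items with replacement (assumed procedure: originals are kept
    and random duplicates are appended)."""
    rng = random.Random(seed)
    target = round(ratio * n_correct)
    extra = max(0, target - len(incorrect))
    return list(incorrect) + rng.choices(incorrect, k=extra)


def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Class counts of the training set reported in the abstract.
weights = class_weights(2_227_995, 46_920)

# Toy data: 100 incorrect items re-sampled to 30% of 1000 correct items.
resampled = oversample_incorrect(list(range(100)), n_correct=1000, ratio=0.3)
```

A dictionary shaped like `weights` is the form accepted by the `class_weight` argument of Keras `Model.fit`, which is one plausible way to apply variant (ii). Calling `oversample_incorrect` with `ratio` set to 0.3, 0.6, and 1.0 would correspond to variants (iii)–(v).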




Updated: 2020-09-30