Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life.
BMC Bioinformatics ( IF 3 ) Pub Date : 2020-09-21 , DOI: 10.1186/s12859-020-03744-7
Zhengqiao Zhao 1 , Alexandru Cristian 2 , Gail Rosen 1

It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated by genome sequencing projects, such as the exponentially growing NCBI RefSeq bacterial genome database. When new reference sequences are added to the training data, statically trained classifiers must be rerun on all data, a highly inefficient process. The rich literature on “incremental learning” addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier on all data. We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground-truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model’s knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in one quarter of the non-incremental time with no accuracy loss. It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be updated efficiently without the cost of reprocessing or access to the existing database, thereby saving storage as well as computation resources.
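The key property that makes naïve Bayes amenable to incremental learning is that its parameters are sufficient statistics (per-class feature counts), which can be updated in place when a new reference sequence arrives, without touching previously processed genomes. The sketch below illustrates this idea on k-mer counts; it is a minimal illustration of the general technique, not the authors' implementation, and the class name, `k = 4`, and the toy sequences are assumptions for demonstration only.

```python
from collections import defaultdict
import math


class IncrementalNB:
    """Minimal incremental multinomial naive Bayes over k-mer counts.

    A new reference sequence is folded into the stored per-class counts,
    so the model absorbs a new snapshot without reprocessing or even
    retaining the genomes it has already seen.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing
        self.counts = defaultdict(lambda: defaultdict(int))  # class -> kmer -> count
        self.totals = defaultdict(int)              # class -> total k-mers seen
        self.vocab = set()                          # all k-mers observed so far

    @staticmethod
    def kmers(seq, k=4):
        # Slide a window of width k over the sequence.
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def partial_fit(self, seq, label, k=4):
        """Incremental update: add one new sequence's counts to its class."""
        for km in self.kmers(seq, k):
            self.counts[label][km] += 1
            self.totals[label] += 1
            self.vocab.add(km)

    def predict(self, seq, k=4):
        """Return the class with the highest log-likelihood (uniform prior)."""
        best, best_lp = None, -math.inf
        v = len(self.vocab)
        for label in self.counts:
            denom = self.totals[label] + self.alpha * v
            lp = sum(
                math.log((self.counts[label][km] + self.alpha) / denom)
                for km in self.kmers(seq, k)
            )
            if lp > best_lp:
                best, best_lp = label, lp
        return best


nb = IncrementalNB()
nb.partial_fit("ACGTACGTACGT", "taxonA")   # initial snapshot
nb.partial_fit("GGGTTTGGGTTT", "taxonB")   # later snapshot: update in place
print(nb.predict("ACGTACG"))               # prints "taxonA"
```

Because `partial_fit` only increments counters, adding a year's worth of new RefSeq genomes costs time proportional to the new data alone, which is the source of the speedup the abstract reports over full retraining.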

Updated: 2020-09-21