The NK Landscape as a Versatile Benchmark for Machine Learning Driven Protein Engineering
bioRxiv - Synthetic Biology, Pub Date: 2020-10-06, DOI: 10.1101/2020.09.30.319780
Adam C. Mater, Mahakaran Sandhu, Colin Jackson

Machine learning (ML) has the potential to revolutionize protein engineering. However, the field currently lacks standardized and rigorous evaluation benchmarks for sequence-fitness prediction, which makes accurate evaluation of the performance of different architectures difficult. Here we propose a unifying framework for ML-driven sequence-fitness prediction. Using simulated (the NK model) and empirical sequence landscapes, we define four key performance metrics: interpolation within the training domain, extrapolation outside the training domain, robustness to sparse training data, and ability to cope with epistasis/ruggedness. We show that architectural differences between algorithms consistently affect performance against these metrics across both experimental and theoretical landscapes. Moreover, landscape ruggedness is revealed to be the greatest determinant of the accuracy of sequence-fitness prediction. We hope that this benchmarking method and the code that accompanies it will enable robust evaluation and comparison of novel architectures in this emerging field and assist in the adoption of ML for protein engineering.
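To make the role of ruggedness concrete, the following is a minimal sketch of an NK landscape. It is not the authors' accompanying code. The function names (`make_nk_landscape`, `count_local_optima`), the binary alphabet default, and the choice of adjacent neighbours with wrap-around are all assumptions made for illustration. Each of the N sites contributes a random value that depends on its own state and the states of K other sites, and fitness is the mean contribution, so K = 0 gives a smooth additive landscape and larger K gives more epistasis.

```python
import itertools

import numpy as np


def make_nk_landscape(n, k, alphabet_size=2, seed=0):
    """Return a fitness function for one random NK landscape (illustrative).

    Assumption: site i interacts with the k sites to its right, wrapping
    around the end of the sequence. Other neighbourhood schemes exist.
    """
    rng = np.random.default_rng(seed)
    # One lookup table per site, with one axis per interacting position.
    tables = rng.random((n,) + (alphabet_size,) * (k + 1))

    def fitness(seq):
        total = 0.0
        for i in range(n):
            idx = tuple(seq[(i + j) % n] for j in range(k + 1))
            total += tables[i][idx]
        return total / n

    return fitness


def count_local_optima(n, k, alphabet_size=2, seed=0):
    """Count sequences with no fitter single-site mutant (a ruggedness proxy)."""
    fitness = make_nk_landscape(n, k, alphabet_size, seed)
    seqs = list(itertools.product(range(alphabet_size), repeat=n))
    fit = {s: fitness(s) for s in seqs}
    count = 0
    for s in seqs:
        neighbours = (
            s[:i] + (a,) + s[i + 1:]
            for i in range(n)
            for a in range(alphabet_size)
            if a != s[i]
        )
        if all(fit[nb] <= fit[s] for nb in neighbours):
            count += 1
    return count


if __name__ == "__main__":
    # Exhaustive enumeration is only feasible for small n.
    for k in (0, 2, 7):
        print(f"N=8, K={k}: {count_local_optima(8, k)} local optima")
```

Counting local optima by exhaustive enumeration works only for short sequences, but it shows the knob the abstract refers to: with K = 0 the landscape is additive and has a single peak, whereas higher K typically produces many peaks.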

Updated: 2020-10-07