Redundancy-weighting the PDB for detailed secondary structure prediction using deep-learning models.
Bioinformatics (IF 4.4), Pub Date: 2020-03-18, DOI: 10.1093/bioinformatics/btaa196
Tomer Sidi 1 , Chen Keasar 1

Motivation
The Protein Data Bank (PDB), the ultimate source for data in structural biology, is inherently imbalanced. To alleviate biases, virtually all structural biology studies use non-redundant subsets of the PDB, which include only a fraction of the available data. An alternative approach, dubbed redundancy-weighting, down-weights redundant entries rather than discarding them. This approach may be particularly helpful for Machine Learning (ML) methods that use the PDB as their source for data. Methods for Secondary Structure Prediction (SSP) have greatly improved over the years, with recent studies achieving above 70% accuracy for 8-class (DSSP) prediction. As these methods typically incorporate machine learning techniques, training on redundancy-weighted datasets might improve accuracy, as well as pave the way toward larger and more informative secondary structure alphabets.
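The idea of down-weighting rather than discarding can be illustrated with a minimal sketch. The abstract does not spell out the weighting scheme, so the code below assumes one simple choice: each entry gets weight 1 / (size of its redundancy cluster), so every cluster contributes a total weight of 1. The function names and the toy cluster labels are hypothetical and are not taken from the article.

```python
from collections import Counter


def redundancy_weights(cluster_ids):
    """Return one weight per entry: 1 / (size of the entry's cluster).

    Assumption for illustration only: entries are already grouped into
    redundancy clusters, e.g. by sequence similarity.
    """
    sizes = Counter(cluster_ids)
    return [1.0 / sizes[c] for c in cluster_ids]


def weighted_accuracy(y_true, y_pred, weights):
    """Fraction of the total weight that falls on correct predictions."""
    total = sum(weights)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Hypothetical toy data: three near-identical chains in cluster "a",
# one unique chain in cluster "b".
clusters = ["a", "a", "a", "b"]
weights = redundancy_weights(clusters)
```

Under this assumed scheme the three redundant entries together count as much as the single unique one, so no data are thrown away while the over-represented cluster no longer dominates. The same per-entry weights could be passed as sample weights to a training loss.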
Results
This article compares the SSP performances of Deep Learning (DL) models trained on either redundancy-weighted or non-redundant datasets. We show that training on redundancy-weighted sets consistently results in better prediction of 3-class (HCE), 8-class (DSSP) and 13-class (STR2) secondary structures.
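For readers unfamiliar with the alphabets, the sketch below shows one widely used convention for reducing the 8-state DSSP alphabet to three states (helix H, strand E, coil C). The abstract does not state which reduction the authors used, so this mapping is an assumption for illustration. The 13-class STR2 alphabet is not covered here.

```python
# One common 8-to-3 reduction (assumed here, not confirmed by the abstract):
# helices H, G, I -> H; strands E, B -> E; everything else -> C.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", "-": "C",
}


def reduce_to_three(dssp_string):
    """Map a string of 8-state DSSP codes to the 3-state H/E/C alphabet.

    Unknown symbols are treated as coil in this sketch.
    """
    return "".join(DSSP8_TO_3.get(s, "C") for s in dssp_string)
```

For example, `reduce_to_three("HGIEBTS-")` yields `"HHHEECCC"` under this convention.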
Availability
Data and DL models are available at http://meshi1.cs.bgu.ac.il/rw.


Updated: 2020-03-19