Mathematical Basis of Predicting Dominant Function in Protein Sequences by a Generic HMM–ANN Algorithm
Acta Biotheoretica (IF 1.3), Pub Date: 2018-04-26, DOI: 10.1007/s10441-018-9327-x
Siddhartha Kundu 1, 2

The accurate annotation of an unknown protein sequence depends on extant data of template sequences. These data may be empirical or may comprise sets of reference sequences, and they provide an exhaustive pool of probable functions. Individual methods of predicting dominant function have shortcomings such as varying degrees of inter-sequence redundancy, arbitrary domain inclusion thresholds, heterogeneous parameterization protocols, and ill-conditioned input channels. Here, I present a rigorous theoretical derivation of the various steps of a generic algorithm that integrates several statistical methods to predict the dominant function in unknown protein sequences. The accompanying mathematical proofs, interval definitions, analysis, and numerical computations are meant not only to offer insights into the specificity and accuracy of predictions, but also to provide details of the operative mechanisms involved in the integration and its ensuing rigor. The algorithm uses numerically modified raw hidden Markov model (HMM) scores of well-defined sets of training sequences and clusters them on the basis of known function. The results are then fed into an artificial neural network (ANN), the predictions of which can be refined using the available data. This pipeline is trained recursively and can be used to discern the dominant function and thereby annotate an unknown protein sequence. Whilst the approach is complex, the specificity of the final predictions can help laboratory workers design their experiments with greater confidence.
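
The pipeline described in the abstract (modified HMM scores, clustering by known function, then a neural network) might be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the paper's derivation: min-max rescaling stands in for the unspecified numerical modification of the raw scores, a per-function mean stands in for the clustering step, a single-layer softmax network stands in for the ANN, and all names (`rescale`, `cluster_by_function`, `TinyNet`, `predict_dominant_function`) and all scores and function labels are hypothetical.

```python
import math
import random

def rescale(raw_scores):
    """Min-max rescale raw HMM scores to [0, 1].
    Assumption: a stand-in for the paper's numerical modification."""
    lo, hi = min(raw_scores), max(raw_scores)
    if hi == lo:
        return [0.0 for _ in raw_scores]
    return [(s - lo) / (hi - lo) for s in raw_scores]

def cluster_by_function(scores, hmm_functions, functions):
    """Average the scores of all HMMs that share a known function label."""
    features = []
    for f in functions:
        members = [s for s, g in zip(scores, hmm_functions) if g == f]
        features.append(sum(members) / len(members) if members else 0.0)
    return features

def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

class TinyNet:
    """Single-layer softmax network: a deliberately minimal ANN."""
    def __init__(self, n_in, n_out, seed=0):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                  for _ in range(n_out)]
        self.b = [0.0] * n_out

    def forward(self, x):
        logits = [sum(wi * xi for wi, xi in zip(row, x)) + b
                  for row, b in zip(self.w, self.b)]
        return softmax(logits)

    def train(self, data, epochs=200, lr=0.5):
        """Plain stochastic gradient descent on the cross-entropy loss."""
        for _ in range(epochs):
            for x, target in data:
                p = self.forward(x)
                for k in range(len(self.w)):
                    grad = p[k] - (1.0 if k == target else 0.0)
                    for j in range(len(x)):
                        self.w[k][j] -= lr * grad * x[j]
                    self.b[k] -= lr * grad

def predict_dominant_function(net, raw_scores, hmm_functions, functions):
    """Return the most probable function label and all class probabilities."""
    x = cluster_by_function(rescale(raw_scores), hmm_functions, functions)
    probs = net.forward(x)
    best = max(range(len(functions)), key=probs.__getitem__)
    return functions[best], probs

# Hypothetical toy data: four profile HMMs, two known functions.
functions = ["hydrolase", "transferase"]
hmm_functions = ["hydrolase", "hydrolase", "transferase", "transferase"]
training = [([120.0, 95.0, 10.0, 5.0], 0), ([8.0, 12.0, 110.0, 140.0], 1)]
data = [(cluster_by_function(rescale(raw), hmm_functions, functions), t)
        for raw, t in training]

net = TinyNet(n_in=len(functions), n_out=len(functions))
net.train(data)
label, probs = predict_dominant_function(
    net, [100.0, 80.0, 20.0, 15.0], hmm_functions, functions)
print(label, [round(p, 3) for p in probs])
```

Averaging scores per function keeps the network input at one value per candidate function regardless of how many HMMs are used, which is one simple way to reduce the effect of redundant training sequences; the paper's own interval definitions and refinement steps are not reproduced here.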

Updated: 2018-04-26