Computational Method for Identifying Malonylation Sites by Using Random Forest Algorithm.
Combinatorial Chemistry & High Throughput Screening ( IF 1.8 ) Pub Date : 2020-05-01 , DOI: 10.2174/1386207322666181227144318
ShaoPeng Wang 1 , JiaRui Li 1 , Xijun Sun 1 , Yu-Hang Zhang 2 , Tao Huang 2 , Yudong Cai 1

Background: Protein malonylation, a newly uncovered post-translational modification of the ε-amino group of lysine residues, has been found to be involved in metabolic pathways and certain diseases. Apart from experimental approaches, several computational methods based on machine learning algorithms were recently proposed to predict malonylation sites. However, previous methods failed to address the imbalance in size between the positive and negative sample sets.

Objective: In this study, we identified the significant features of malonylation sites using a novel computational method that applies machine learning algorithms and balances the positive and negative sample sizes with the synthetic minority over-sampling technique (SMOTE).
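The core idea of the synthetic minority over-sampling technique can be sketched in a few lines: each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The sketch below is illustrative only and is not the authors' implementation; the function name `smote`, the parameter `k`, and the toy data are assumptions made for this example.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    minority: list of feature vectors (lists of floats), at least two samples.
    k: number of nearest minority neighbours to choose from (assumed value).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to x.
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy minority class: four hypothetical two-dimensional samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies between two real minority samples, the over-sampled class stays inside the region the original samples already occupy.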

Method: Four types of features, namely, amino acid (AA) composition, position-specific scoring matrix (PSSM), AA factor, and disorder, were used to encode residues in protein segments. Then, a two-step feature selection procedure, consisting of maximum relevance minimum redundancy (mRMR) followed by incremental feature selection (IFS), was applied to the resulting hybrid feature vector together with the random forest algorithm.
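Incremental feature selection, as generally practised, takes a ranked feature list (here assumed to come from maximum relevance minimum redundancy), evaluates a classifier on the top 1, top 2, ... top N features, and keeps the prefix with the best score. A minimal sketch follows. The `evaluate` callback stands in for whatever scoring is used (for instance a cross-validated random forest), and the feature names and scores are hypothetical placeholders, not values from the study.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Return the best-scoring prefix of a ranked feature list and its score.

    ranked_features: feature names ordered from most to least important.
    evaluate: callable taking a list of features and returning a score
              (higher is better); assumed to wrap classifier training.
    """
    best_score, best_subset = float("-inf"), []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, list(subset)
    return best_subset, best_score

# Hypothetical ranked features and made-up scores, for illustration only.
ranked = ["pssm_a", "disorder_b", "aa_factor_c", "aa_composition_d"]
toy_scores = {1: 0.20, 2: 0.31, 3: 0.35, 4: 0.33}

best_subset, best_score = incremental_feature_selection(
    ranked, evaluate=lambda subset: toy_scores[len(subset)]
)
# best_subset holds the first three features; best_score is 0.35
```

Passing the scorer as a callback keeps the selection loop independent of the classifier, so the same loop works with any model or metric.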

Results: An optimal classifier was built from the optimal feature subset and achieved an F1-measure of 0.356. Feature analysis was performed on several of the selected important features.
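For reference, the F1-measure is the harmonic mean of precision and recall, which makes it more informative than accuracy when negative samples greatly outnumber positive ones. The short sketch below computes it from confusion-matrix counts; the counts are hypothetical and chosen only to make the arithmetic easy to follow, and they do not come from the study.

```python
def f1_measure(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall)."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: precision = 30/100 = 0.3, recall = 30/50 = 0.6,
# so F1 = 2 * 0.3 * 0.6 / 0.9 = 0.4.
example_f1 = f1_measure(tp=30, fp=70, fn=20)
```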

Conclusion: The results showed that certain types of PSSM and disorder features may be closely associated with the malonylation of lysine residues. Our study contributes to the development of computational approaches for predicting malonyllysine and provides insights into the molecular mechanism of malonylation.



Updated: 2020-05-01