Accurate prediction of DNA N4-methylcytosine sites via boost-learning various types of sequence features.
BMC Genomics ( IF 4.4 ) Pub Date : 2020-09-11 , DOI: 10.1186/s12864-020-07033-8
Zhixun Zhao 1 , Xiaocai Zhang 1 , Fang Chen 2 , Liang Fang 3 , Jinyan Li 1

DNA N4-methylcytosine (4mC) is a critical epigenetic modification with various roles in the restriction-modification system. Because experimental laboratory detection is costly, computational methods that combine sequence characteristics with machine learning algorithms have been explored to identify 4mC sites from DNA sequences. However, state-of-the-art methods have limited performance because they lack effective sequence features and choose their learning algorithms ad hoc. This paper proposes a new sequence feature space and a machine learning algorithm with a feature selection scheme to address the problem. The feature importance score distributions in datasets of six species are first reported and analyzed. The impact of feature selection on model performance is then evaluated by independent testing on benchmark datasets, where accuracy (ACC) and Matthews correlation coefficient (MCC) increase by 2.3% to 9.7% and by 0.05 to 0.19, respectively, after feature selection. The proposed method is compared with three state-of-the-art predictors using an independent test and 10-fold cross-validation, and it outperforms them on all datasets, improving ACC by 3.02% to 7.89% and MCC by 0.06 to 0.15 in the independent test. In two detailed case studies, the proposed method confirmed its strong overall performance, correctly identifying 24 of 26 4mC sites from the C. elegans gene and 126 of 137 4mC sites from the D. melanogaster gene. The results show that the proposed feature space and learning algorithm with feature selection can improve the performance of DNA 4mC prediction on the benchmark datasets. The two case studies demonstrate the effectiveness of the method in practical situations.
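The abstract reports ACC and MCC without defining them. As a minimal sketch, assuming the standard confusion-matrix definitions (the abstract does not state which formulas the authors used), the two measures can be computed as below. The function names are illustrative and not taken from the paper. The abstract gives only the number of correctly identified sites in the case studies, so the example derives just the fraction of known sites recovered (sensitivity) from those counts.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct (standard ACC)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal total is zero, a common convention
    (an assumption here, not something stated in the abstract).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def sensitivity(tp, fn):
    """Fraction of true 4mC sites that were identified."""
    return tp / (tp + fn)


# Case-study counts quoted in the abstract: identified sites out of known sites.
c_elegans = sensitivity(tp=24, fn=26 - 24)
d_melanogaster = sensitivity(tp=126, fn=137 - 126)

print(f"C. elegans sensitivity: {c_elegans:.4f}")            # 0.9231
print(f"D. melanogaster sensitivity: {d_melanogaster:.4f}")  # 0.9197
```

MCC is often reported alongside ACC because it accounts for all four confusion-matrix cells and stays informative when the positive and negative classes are imbalanced.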

Updated: 2020-09-11