Artificial intelligence paradigm for ligand-based virtual screening on the drug discovery of type 2 diabetes mellitus
Journal of Big Data (IF 8.1), Pub Date: 2021-05-26, DOI: 10.1186/s40537-021-00465-3
Alhadi Bustamam, Haris Hamzah, Nadya A. Husna, Sarah Syarofina, Nalendra Dwimantara, Arry Yanuar, Devvi Sarwinda

Background

New dipeptidyl peptidase-4 (DPP-4) inhibitors with few adverse effects are needed for the treatment of type 2 diabetes mellitus. This study aims to build quantitative structure-activity relationship (QSAR) models using the artificial intelligence paradigm. Rotation Forest and a Deep Neural Network (DNN) are used to build the QSAR models. We compared principal component analysis (PCA) with sparse PCA (SPCA) as the transformation method within Rotation Forest. K-modes clustering with Levenshtein distance was used to select molecules, and CatBoost was used for feature selection.
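
To make the molecule-selection step more concrete, here is a minimal sketch of the Levenshtein (edit) distance together with the assignment step of a clustering loop that uses it. The abstract does not say which string representation of a molecule is compared, so the use of SMILES-like strings is an assumption, and the names `levenshtein` and `nearest_center` are hypothetical helpers for illustration, not the authors' code.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def nearest_center(item: str, centers: Sequence[str]) -> int:
    """Index of the cluster center closest to item (assignment step only)."""
    return min(range(len(centers)), key=lambda k: levenshtein(item, centers[k]))


# Hypothetical SMILES-like strings, used only to exercise the functions.
centers = ["CCN", "c1ccccc1"]
print(nearest_center("CCO", centers))
```

A full clustering run would alternate this assignment step with an update of the cluster centers; how the centers are updated in the study is not stated in the abstract, so that part is left out.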

Results

Molecule selection with the K-modes clustering algorithm yielded 1020 DPP-4 inhibitor molecules, with logP values ranging from -1.6693 to 4.99044. Extended connectivity fingerprints and functional class fingerprints with diameters of 4 and 6 were used to construct four fingerprint datasets: ECFP_4, ECFP_6, FCFP_4, and FCFP_6. The four fingerprint datasets have 1024 features, from which features are then selected using the CatBoost method. With CatBoost feature selection, the QSAR models perform well for both the machine learning and deep learning methods: Sensitivity, Specificity, Accuracy, and Matthews correlation coefficient are all above 70% at feature importance levels of 60%, 70%, 80%, and 90%.
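
For reference, the four evaluation metrics named above can all be computed from the confusion matrix of a binary classifier. The sketch below uses the standard definitions; the function name `classification_metrics` and the counts in the example are hypothetical and are not results from the study. Note that the Matthews correlation coefficient ranges from -1 to 1, so "above 70%" is read here as a value above 0.70, which is an assumption about the abstract's wording.

```python
import math


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is set to 0 when any marginal sum is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```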

Conclusion

The K-modes clustering algorithm can produce a representative subset of DPP-4 inhibitor molecules. Feature selection on the fingerprint datasets using CatBoost is best applied before building QSAR Classification and QSAR Regression models. QSAR Classification using Machine Learning and QSAR Classification using Deep Learning each achieve an accuracy above 70%. The QSAR RFC-PCA and QSAR RFR-PCA models performed better than the QSAR RFC-SPCA and QSAR RFR-SPCA models because they are more time-efficient.




Updated: 2021-05-26