Deep Learning-Based Imbalanced Data Classification for Drug Discovery.
Journal of Chemical Information and Modeling (IF 5.6), Pub Date: 2020-06-23, DOI: 10.1021/acs.jcim.9b01162
Selçuk Korkmaz

Drug discovery has become increasingly expensive and time-consuming. In its early phase, an extensive search is performed to find drug-like compounds, which can then be optimized over time to become marketed drugs. One conventional way of detecting active compounds is a high-throughput screening (HTS) experiment. As of July 2019, the PubChem repository contained 1.3 million bioassays generated through HTS experiments, which makes it a valuable resource for applying machine learning algorithms to develop classification models that detect active compounds for drug discovery. However, data sets obtained from PubChem are highly imbalanced, and this imbalance has a negative impact on the classification performance of machine learning algorithms. Here, we explored the classification performance of deep neural networks (DNNs) on imbalanced compound data sets after applying various data balancing methods. We used five confirmatory HTS bioassays from the PubChem repository and applied one undersampling method and three oversampling methods to balance the data. We used a fully connected DNN with two hidden layers to classify active and inactive molecules. To evaluate the network, we calculated six performance metrics: balanced accuracy, precision, recall, F1 score, Matthews correlation coefficient, and area under the ROC curve. The results showed that the effect of imbalanced data on network performance could be mitigated to a degree by applying the data balancing methods. The level of imbalance, however, has a negative effect on the performance of the network.
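As a rough illustration of the workflow the abstract describes, the sketch below balances a binary data set and computes the six listed metrics. The abstract does not name the specific balancing methods, so plain random undersampling and random oversampling are used here as assumptions, not as the paper's methods. All function names are hypothetical, labels are assumed to be 1 (active) and 0 (inactive), and both classes are assumed to be present.

```python
import math
import random


def _split_indices(y):
    """Return (minority, majority) index lists for binary labels 0/1."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    return (pos, neg) if len(pos) <= len(neg) else (neg, pos)


def random_undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    keep = minority + rng.sample(majority, len(minority))
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_oversample(X, y, seed=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    rng = random.Random(seed)
    minority, majority = _split_indices(y)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    keep = minority + majority + extra
    rng.shuffle(keep)
    return [X[i] for i in keep], [y[i] for i in keep]


def classification_metrics(y_true, y_pred):
    """Threshold-based metrics computed from the binary confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


def roc_auc(y_true, scores):
    """AUC as the chance that a random active outscores a random inactive.

    Ties count as half a win. Quadratic in the data size, so this is only
    suitable for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a setup like this, resampling would normally be applied to the training split only, so that the metrics are computed on data with the original class ratio. Balanced accuracy and the Matthews correlation coefficient are useful here because plain accuracy can look high on an imbalanced set even when every compound is predicted inactive.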

Updated: 2020-06-23