Undersampling: case studies of flaviviral inhibitory activities.
Journal of Computer-Aided Molecular Design (IF 3.5), Pub Date: 2019-11-26, DOI: 10.1007/s10822-019-00255-3
Stephen J Barigye 1, José Manuel García de la Vega 1, Juan A Castillo-Garit 2

Imbalanced datasets, comprising more inactive than active compounds, are a common challenge in ligand-based model building workflows for drug discovery. This is particularly true for neglected tropical diseases, since efforts to identify therapeutics for these diseases are often limited. In this report, we analyze the performance of several undersampling strategies in modeling the Dengue Virus 2 (DENV2) inhibitory activity, as well as the anti-flaviviral activities for the West Nile (WNV) and Zika (ZIKV) viruses. To this end, we build datasets comprising 1218 (159 actives and 1059 inactives), 1044 (132 actives and 912 inactives) and 302 (75 actives and 227 inactives) molecules with known DENV2, WNV and ZIKV inhibitory activity profiles, respectively. We develop ensemble classifiers for these endpoints and compare the performance of the different undersampling algorithms on external sets. We observe that data pruning algorithms yield superior performance relative to data selection algorithms. The best overall performance is provided by the one-sided selection algorithm, with test set balanced accuracy (BACC) values of 0.84, 0.74 and 0.77 for the DENV2, WNV and ZIKV inhibitory activities, respectively. For the model building, we use the recently proposed GT-STAF information indices and compare the predictivity of three molecular fragmentation approaches: connected subgraphs, substructures and ALogP atom types, which show comparable performance. A combination of indices based on these fragmentation strategies, however, enhances the predictivity of the built ensembles. The built models could be useful for screening new molecules with possible DENV, WNV and ZIKV inhibitory activities. ADMET modelers are encouraged to adopt undersampling algorithms in their workflows when dealing with imbalanced datasets.
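To illustrate why the abstract reports balanced accuracy rather than plain accuracy, the following minimal Python sketch uses the class counts quoted above. It assumes the standard definition of BACC as the mean of sensitivity and specificity, which the abstract does not spell out. The "always inactive" classifier and the function names are hypothetical, introduced only for illustration, and the counts refer to the full datasets rather than the paper's external test sets.

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on actives) and specificity (recall on inactives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


def accuracy(tp, fn, tn, fp):
    """Fraction of all molecules that are classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Class counts as reported in the abstract: (actives, inactives).
DATASETS = {
    "DENV2": (159, 1059),
    "WNV": (132, 912),
    "ZIKV": (75, 227),
}

for name, (actives, inactives) in DATASETS.items():
    # Hypothetical trivial classifier that labels every molecule inactive:
    # no true positives, every active is a false negative.
    tp, fn, tn, fp = 0, actives, inactives, 0
    print(
        f"{name}: inactive/active ratio = {inactives / actives:.2f}, "
        f"accuracy = {accuracy(tp, fn, tn, fp):.3f}, "
        f"BACC = {balanced_accuracy(tp, fn, tn, fp):.3f}"
    )
```

On the DENV2 counts, this trivial classifier reaches an accuracy of about 0.87 without identifying a single active, while its BACC is 0.5. That gap is the reason imbalance-aware metrics and resampling strategies such as undersampling matter for datasets of this shape.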

Updated: 2019-11-26