DBCSMOTE: a clustering-based oversampling technique for data-imbalanced warfarin dose prediction
BMC Medical Genomics (IF 2.1), Pub Date: 2020-10-22, DOI: 10.1186/s12920-020-00781-2
Yanyun Tao, Yuzhen Zhang, Bin Jiang

Warfarin, a vitamin K antagonist, is the most classical and widely used oral anticoagulant, offering a reliable anticoagulant effect, wide clinical indications and a low price. Warfarin dosage requirements vary widely between patients. In warfarin daily dosage prediction, data imbalance in the dataset leads to inaccurate predictions for patients with rare genotypes, who usually require a large stable dose. Balancing the dataset of patients treated with warfarin and improving the predictive accuracy requires an appropriate partition of majority and minority groups, together with an oversampling method. To solve this data-imbalance problem, we developed a clustering-based oversampling technique, denoted DBCSMOTE, which combines density-based spatial clustering of applications with noise (DBSCAN) and the synthetic minority oversampling technique (SMOTE). DBCSMOTE automatically finds the minority groups by capturing the association between samples in terms of the clinical features/genotypes and the warfarin dosage, and creates an extended dataset by adding new synthetic samples of the majority and minority groups. Two ensemble models, boosted regression tree (BRT) and random forest (RF), are built on the extended dataset generated by DBCSMOTE and perform the warfarin daily dosage prediction. DBCSMOTE and the comparison methods were tested on datasets derived from our hospital and from the International Warfarin Pharmacogenetics Consortium (IWPC). DBCSMOTE-BRT obtained the highest R-squared (R2) of 0.424 and the smallest mean squared error (mse) of 1.08. In terms of 20%-p, the percentage of patients whose predicted warfarin dose is within 20% of the actual stable therapeutic dose, DBCSMOTE-BRT achieved the largest value among the predictive models, 47.8%.
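The three evaluation measures named above (R2, mse and 20%-p) can be illustrated with a minimal sketch. The function names and the toy dose values are assumptions made for illustration only, and treating "within 20%" as an inclusive bound is also an assumption; none of this is taken from the authors' software.

```python
# Illustrative sketch only: function names and toy data are hypothetical.

def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted doses."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def pct_within_20(actual, predicted):
    """20%-p: percentage of patients whose predicted dose lies within
    20% of the actual stable dose (bound assumed inclusive)."""
    hits = sum(1 for a, p in zip(actual, predicted) if abs(p - a) <= 0.2 * a)
    return 100.0 * hits / len(actual)


# Toy doses for four hypothetical patients (not real data).
actual_dose = [2.0, 3.0, 5.0, 4.0]
predicted_dose = [2.2, 3.9, 4.5, 4.0]

r2 = r_squared(actual_dose, predicted_dose)
mse = mean_squared_error(actual_dose, predicted_dose)
p20 = pct_within_20(actual_dose, predicted_dose)
print(r2, mse, p20)
```

In this toy case three of the four predictions fall inside the 20% band, so 20%-p is 75%.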
More importantly, DBCSMOTE saved about 68% of the computational time while achieving the same or better performance than Evolutionary SMOTE, which has been the best oversampling method for warfarin dose prediction so far. DBCSMOTE was also found to be more effective when combined with BRT than with RF. The genotypes CYP2C9 and VKORC1 clearly contribute to the predictive accuracy. Including left atrium diameter, glutamic pyruvic transaminase and serum creatinine in the model also improved the predictive accuracy, and when congestive heart failure, diabetes mellitus and valve replacement were left out of DBCSMOTE-BRT/RF, its predictive accuracy decreased. The oversampling ratio and the number of minority clusters have a large impact on the effect of oversampling. In our tests, the predictive accuracy was high when the number of minority clusters was 6 to 8. The oversampling ratio should be large (> 1.2) for small minority clusters and small (< 0.2) for large minority clusters. If the dataset becomes larger, DBCSMOTE should be re-optimized and its BRT/RF model re-trained. DBCSMOTE-BRT/RF outperformed the commonly used tool Warfarindosing. Compared with the Evolutionary SMOTE-BRT and RF models, the DBCSMOTE-BRT and RF models need only a small computational time to achieve the same or higher performance in many cases. In terms of predictive accuracy, RF is not as good as BRT; however, RF can still generate a highly accurate model as the dataset grows. The software "WarfarinSeer v2.0", a test version, packages DBCSMOTE-BRT/RF and could be a convenient tool for clinical application in warfarin treatment.
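The size-dependent oversampling ratio described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' DBCSMOTE: cluster labels are taken as given (for example from a DBSCAN run, with -1 marking noise), every labelled cluster is treated as a minority cluster, all features are numeric, and the names `smote_cluster`, `oversample_by_cluster`, `size_threshold`, `small_ratio` and `large_ratio` as well as their default values are hypothetical. Only the bounds (> 1.2 for small clusters, < 0.2 for large ones) come from the abstract.

```python
import random


def smote_cluster(points, n_new, k=3, rng=None):
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a random cluster member and one of its k nearest neighbours."""
    rng = rng or random.Random(0)
    if len(points) < 2:
        return []  # a single point has no neighbour to interpolate with
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(points)
        others = [p for p in points if p is not base]
        # Sort the remaining members by squared Euclidean distance to base.
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def oversample_by_cluster(samples, labels, size_threshold=5,
                          small_ratio=1.5, large_ratio=0.1, seed=0):
    """Return the original samples plus synthetic ones, using a large ratio
    for small clusters and a small ratio for large clusters.
    size_threshold and both ratio values are hypothetical choices."""
    rng = random.Random(seed)
    clusters = {}
    for sample, label in zip(samples, labels):
        if label == -1:  # assumed noise label; noise is not oversampled
            continue
        clusters.setdefault(label, []).append(sample)
    extended = list(samples)
    for members in clusters.values():
        ratio = small_ratio if len(members) <= size_threshold else large_ratio
        n_new = round(ratio * len(members))
        extended.extend(smote_cluster(members, n_new, rng=rng))
    return extended


# Toy data: each row is [hypothetical feature, dose]; not real patients.
small = [[1.0, 8.0], [1.2, 8.5], [0.9, 9.0], [1.1, 8.2]]
large = [[5.0 + 0.1 * i, 3.0 + 0.05 * i] for i in range(10)]
samples = small + large
labels = [0] * 4 + [1] * 10

extended = oversample_by_cluster(samples, labels)
print(len(samples), "->", len(extended))
```

With these toy settings the 4-member cluster gains 6 synthetic samples and the 10-member cluster gains 1. Categorical variables such as genotypes would need separate handling, which this sketch does not attempt.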

Updated: 2020-10-26