Identification of Suitable Technologies for Drinking Water Quality Prediction: A Comparative Study of Traditional, Ensemble, Cost-Sensitive, Outlier Detection Learning Models and Sampling Algorithms
ACS ES&T Water ( IF 4.8 ) Pub Date : 2021-07-08 , DOI: 10.1021/acsestwater.1c00037
Xingguo Chen 1, 2, 3 , Houtao Liu 1 , Xiuying Xu 1 , Luoyuan Zhang 1 , Tianchi Lin 1 , Min Zuo 2 , Yichao Huang 4, 5 , Ruqin Shen 5, 6 , Da Chen 5 , Yongfeng Deng 5, 6

Drinking water quality data sets used to train learning models are often highly imbalanced, which weakens the models' ability to predict drinking water quality. Although some efforts have been made to address this imbalance, little is known about which technologies are suitable for drinking water quality prediction. Here, 16 common learning models were applied individually to a large-scale, highly imbalanced drinking water quality data set to compare their prediction performance. The results showed that ensemble and cost-sensitive learning models achieved higher F1-scores and were therefore more suitable for predicting drinking water quality than the other models tested in this study. In addition, model performance could be enhanced by introducing either of two mainstream sampling algorithms: the synthetic minority oversampling technique (SMOTE) combined with the Tomek links technique (TLTE), denoted SMOTE + TLTE, or SMOTE combined with the edited nearest neighbor technique (ENNTE), denoted SMOTE + ENNTE. In particular, the F1-scores of deep cascade forest (DCF) with SMOTE + TLTE and SMOTE + ENNTE reached 94.54 ± 2.51% and 94.68 ± 2.72%, respectively. Consequently, DCF combined with either of these sampling algorithms has greater potential for drinking water quality monitoring and prediction, as well as for other fields that suffer from imbalanced data.
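To make the two ideas in the abstract concrete (F1-score as the metric for imbalanced data, and SMOTE followed by Tomek link removal), here is a minimal pure-Python sketch. It is not the authors' implementation: the function names, the toy 2-D points, the choice of `k=3`, and the decision to drop both members of each Tomek link are all illustrative assumptions.

```python
import math
import random

def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def remove_tomek_links(X, y):
    """Drop both points of every Tomek link, i.e. every pair of mutual
    nearest neighbours that carry different labels (an assumed policy)."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    nn = [nearest(i) for i in range(len(X))]
    linked = {i for i in range(len(X)) if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(len(X)) if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]

# Hypothetical toy data: 20 majority points (label 0), 4 minority points (label 1).
majority = [(float(i), float(j)) for i in range(5) for j in range(4)]
minority = [(10.0, 10.0), (10.5, 10.2), (11.0, 10.0), (10.2, 11.0)]

# Oversample the minority class until both classes have the same size,
# then clean the combined set by removing Tomek links.
new_pts = smote(minority, n_new=len(majority) - len(minority), rng=random.Random(42))
X_all = majority + minority + new_pts
y_all = [0] * len(majority) + [1] * (len(minority) + len(new_pts))
X_bal, y_bal = remove_tomek_links(X_all, y_all)

# Why F1 rather than accuracy: a model that always predicts the majority
# class is 95% accurate on a 95:5 split but has an F1-score of 0.
y_true = [0] * 95 + [1] * 5
always_majority = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_majority)) / len(y_true)
f1 = f1_score(y_true, always_majority)
```

In practice one would more likely reach for an existing library; for example, imbalanced-learn provides `SMOTETomek` and `SMOTEENN` in `imblearn.combine`. Whether the study used that library is not stated in the abstract.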

Updated: 2021-08-13