Machine learning methods for imbalanced data set for prediction of faecal contamination in beach waters
Water Research ( IF 11.4 ) Pub Date : 2021-07-23 , DOI: 10.1016/j.watres.2021.117450
Mathias Bourel 1 , Angel M Segura 2 , Carolina Crisci 2 , Guzmán López 2 , Lia Sampognaro 2 , Victoria Vidal 2 , Carla Kruk 3 , Claudia Piccini 4 , Gonzalo Perera 2

Predicting water contamination with statistical models is a useful tool for managing health risk at recreational beaches. Extreme contamination events, i.e. those exceeding the regulatory limit, are generally rare relative to normal bathing conditions, and thus the data are said to be imbalanced. Modeling and predicting those rare events presents unique challenges. Here we introduce and evaluate several machine learning techniques and metrics to model imbalanced data and evaluate model performance. We do so using a) simulated data sets and b) a real database with records of faecal coliform abundance monitored for 10 years at 21 recreational beaches in Uruguay (N ≈ 19000), using in situ and meteorological variables. We discuss the advantages and disadvantages of the methods and provide a simple guide to fitting models for a general audience. We also provide R code to reproduce model fitting and testing. We found that most machine learning techniques are sensitive to imbalance and require specific data pre-treatment (e.g. upsampling) to improve performance. Accuracy (i.e. correctly classified cases over total cases) is not adequate for evaluating model performance on imbalanced data sets. Instead, true positive rates (TPR) and false positive rates (FPR) are recommended. Among the 52 candidate algorithms tested, the stratified random forest performed best, improving TPR by 50% with respect to baseline (0.4), and it outperformed the baseline in the evaluated metrics. Support vector machines combined with upsampling or the synthetic minority oversampling technique (SMOTE) performed well, similar to AdaBoost with SMOTE. These results suggest that combining modeling strategies is necessary to improve our capacity to anticipate water contamination and avoid health risk.
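To illustrate why accuracy can mislead on imbalanced data, the following is a minimal Python sketch, not the authors' R code. The 95/5 class split, the "always clean" classifier, and the function names `rates` and `upsample` are assumptions made up for this example. It computes accuracy, TPR and FPR from binary labels, and shows plain random upsampling of the minority class (a simpler relative of SMOTE, which instead synthesizes new minority points).

```python
import random


def rates(y_true, y_pred):
    """Return (accuracy, TPR, FPR) for binary labels; 1 marks a contamination event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # share of real events that were detected
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # share of clean samples flagged as events
    return accuracy, tpr, fpr


def upsample(rows, labels, seed=0):
    """Random upsampling: redraw minority rows with replacement until classes balance."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:  # nothing to resample from
        return list(rows), list(labels)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    idx = majority + minority + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


# Hypothetical monitoring record: 95 clean samples and 5 exceedances.
y_true = [0] * 95 + [1] * 5
always_clean = [0] * 100  # a useless classifier that never predicts an event

acc, tpr, fpr = rates(y_true, always_clean)
print(f"accuracy={acc:.2f} TPR={tpr:.2f} FPR={fpr:.2f}")

# Balance the training labels before fitting any model.
_, y_balanced = upsample(list(range(100)), y_true)
print(sum(y_balanced), len(y_balanced) - sum(y_balanced))
```

In this toy case the classifier reaches 95% accuracy while detecting none of the events (TPR of 0), which is the failure mode that TPR and FPR expose. Upsampling should be applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.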




Updated: 2021-08-02