Effects of class imbalance on resampling and ensemble learning for improved prediction of cyanobacteria blooms
Ecological Informatics (IF 5.1), Pub Date: 2020-11-09, DOI: 10.1016/j.ecoinf.2020.101202
Jihoon Shin , Seonghyeon Yoon , YoungWoo Kim , Taeho Kim , ByeongGeon Go , YoonKyung Cha

This study aimed to explicitly explore the effects of the degree of class imbalance on predicting infrequently occurring events, i.e., cyanobacteria blooms. Although class imbalance poses a major issue in binary classification schemes, few efforts have been made to relate model performance to real-life applications. The data utilized herein were collected from 2013 to 2019 at 13 sites within three major rivers in South Korea; a variety of physicochemical and hydrometeorological factors were obtained as input variables, and the occurrence of cyanobacteria blooms (indicated by a cell count ≥ 1000 cells/mL) was included as a response variable. The imbalance ratio (IR) for cyanobacteria blooms differed significantly by site, ranging widely from 0.93 to 9.32. The study results suggested that class imbalance negatively affected model performance, with an increase in the IR significantly increasing the false negative (FN) rate. The application of resampling decreased the FN rate while simultaneously increasing the true positive (TP) rate, and these improvements tended to grow with increasing IRs. Ensemble classifiers, which combine multiple single classifiers into an integrated classifier, could not by themselves successfully address the class imbalance problem; however, in combination with resampling, they consistently outperformed single classifiers. Among the ensemble classifiers, AdaBoost yielded the most stable performance across a range of IRs, irrespective of whether resampling was applied. A variable importance analysis indicated that temperature was usually the primary influencing factor of cyanobacteria blooms. These findings highlight the effectiveness of resampling for addressing class imbalance, while providing useful guidelines for learning from imbalanced data, including the selection of classification algorithms and model evaluation metrics.
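The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the IR is the number of non-bloom samples divided by the number of bloom samples (the abstract does not define it), and it uses plain random oversampling as a stand-in for whichever resampling method the study applied. All function names and the sample cell counts are hypothetical; only the 1000 cells/mL threshold comes from the abstract.

```python
import random

BLOOM_THRESHOLD = 1000  # cells/mL, the bloom criterion stated in the abstract


def label_blooms(cell_counts):
    """Return 1 for a bloom (count >= threshold), 0 otherwise."""
    return [1 if c >= BLOOM_THRESHOLD else 0 for c in cell_counts]


def imbalance_ratio(labels):
    """Assumed definition: non-bloom samples divided by bloom samples."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no bloom samples; the ratio is undefined")
    return negatives / positives


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes match in size."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = sorted([pos_idx, neg_idx], key=len)
    if not minority:
        raise ValueError("one class is empty; nothing to oversample")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


def fn_tp_rates(y_true, y_pred):
    """Return (FN rate, TP rate), both computed over the actual bloom samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    positives = tp + fn
    if positives == 0:
        raise ValueError("no bloom samples in y_true")
    return fn / positives, tp / positives


# Hypothetical cell counts (cells/mL) for one monitoring site.
counts = [200, 500, 1500, 300, 50, 1000, 700, 900, 100, 400]
labels = label_blooms(counts)
balanced_rows, balanced_labels = random_oversample(counts, labels)
```

Because the FN rate and the TP rate are both taken over the actual bloom samples, they sum to one, which is why a drop in one appears as a rise in the other. In practice the resampling step would be applied to the training split only, so that the evaluation data keep the original class ratio.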




Updated: 2020-11-21