Comparison of resampling algorithms to address class imbalance when developing machine learning models to predict foodborne pathogen presence in agricultural water
Frontiers in Environmental Science (IF 3.3), Pub Date: 2021-06-15, DOI: 10.3389/fenvs.2021.701288
Daniel Lowell Weller , Tanzy M. T. Love , Martin Wiedmann

Recent studies have shown that predictive models can supplement or provide alternatives to E. coli testing for assessing foodborne pathogen presence in water used for produce production. Since these studies used balanced training data and focused on enteric pathogens, research is needed to determine (i) if predictive models can be used to assess Listeria contamination status in agricultural water, and (ii) how resampling (to deal with imbalanced data) affects model performance. To address these knowledge gaps, this study developed models that predict nonpathogenic Listeria spp. (excluding L. monocytogenes) and L. monocytogenes presence in agricultural water using various combinations of learner (e.g., random forest, regression), feature type, and resampling (none, oversampling, SMOTE). Four feature types were used in model training: microbial, physicochemical, spatial, and weather. “Full models” were trained using all four feature types, while “nested models” used between one and three types. In total, 45 full (15 learners*3 resampling approaches) and 108 nested (5 learners*9 feature sets*3 resampling approaches) models were trained per outcome. Model performance was compared against baseline models where E. coli concentration was the sole predictor. Overall, the machine learning models outperformed the baseline E. coli models, with random forests outperforming models built using other learners (e.g., rule-based learners). Resampling produced more accurate models than not resampling, with SMOTE models outperforming, on average, oversampling models. Regardless of resampling method, spatial and physicochemical water quality features drove accurate predictions for the nonpathogenic Listeria spp. and L. monocytogenes models, respectively. Overall, these findings (i) illustrate the need for alternatives to existing E. coli-based monitoring programs for assessing produce safety hazards in agricultural water, and (ii) suggest that predictive models may be one such alternative. Moreover, these findings provide a conceptual framework for how such models can be developed in the future. For example, future studies should consider using random forest learners, SMOTE resampling, and spatial features to develop models to predict the presence of foodborne pathogens, such as L. monocytogenes, in agricultural water.
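
The three resampling options named in the abstract (none, oversampling, SMOTE) differ only in how extra minority-class rows are produced before training. The sketch below is a minimal, dependency-free illustration of that difference and is not the authors' implementation: the function names (`random_oversample`, `smote`, `balance`), the two feature columns, and the toy values are all assumptions made for the example. Random oversampling duplicates existing minority rows, while SMOTE interpolates between a minority row and one of its nearest minority-class neighbours.

```python
import math
import random


def random_oversample(minority, n_new, rng):
    """Duplicate randomly chosen minority rows (sampling with replacement)."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote(minority, n_new, k, rng):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(X, y, method, k=5, seed=0):
    """Return (X, y) with the minority class topped up to the majority count."""
    rng = random.Random(seed)
    counts = {label: y.count(label) for label in set(y)}
    minority_label = min(counts, key=counts.get)
    n_new = max(counts.values()) - counts[minority_label]
    minority = [row for row, label in zip(X, y) if label == minority_label]
    if method == "none" or n_new == 0:
        extra = []
    elif method == "oversample":
        extra = random_oversample(minority, n_new, rng)
    elif method == "smote":
        extra = smote(minority, n_new, k, rng)
    else:
        raise ValueError(f"unknown method: {method}")
    return X + extra, y + [minority_label] * len(extra)


# Hypothetical toy data: two made-up water-quality features per sample,
# label 1 = pathogen detected (the rare class), label 0 = not detected.
X = [[0.1, 7.2], [0.3, 7.0], [0.2, 6.9], [0.4, 7.4], [0.2, 7.1], [0.5, 7.3],
     [2.1, 8.0], [2.4, 8.3]]
y = [0, 0, 0, 0, 0, 0, 1, 1]

X_none, y_none = balance(X, y, "none")
X_os, y_os = balance(X, y, "oversample")
X_sm, y_sm = balance(X, y, "smote")
```

In a real workflow, resampling would be applied to the training split only, so that duplicated or synthetic rows never leak into the data used to evaluate the model.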

Updated: 2021-06-15