Coping with imbalanced data problem in digital mapping of soil classes, European Journal of Soil Science

Coping with imbalanced data problem in digital mapping of soil classes
European Journal of Soil Science (IF 4.0), Pub Date: 2023-04-10, DOI: 10.1111/ejss.13368
Amin Sharififar, Fereydoon Sarmadian

An unsolved problem in the digital mapping of categorical soil variables and soil types is the imbalanced number of observations among classes, which reduces accuracy and can cause the minority class (the class with a significantly lower number of observations compared to other classes) to be lost from the final map. So far, synthetic over- and under-sampling techniques have been explored in soil science; however, more efficient approaches are needed that avoid the drawbacks of these techniques and guarantee retention of the minority classes in the produced map. The approaches suggested in the present study for digital mapping of soil classes are ensemble gradient boosting, cost-sensitive learning, and one-class classification (OCC) of the minority class combined with multi-class classification. Specifically, extreme gradient boosting (XGB) as an ensemble gradient learner, a cost-sensitive decision tree (CSDT) within the C5.0 algorithm, and a one-class support vector machine combined with multi-class classification (OCCM) were investigated to map eight soil great groups with a naturally imbalanced frequency of observations in northwest Iran. A total of 453 profile data points were used for mapping the soil great groups of the study area. The data were split manually for each class separately, giving an overall 70% of the data for calibration and 30% for validation. A bootstrapping approach to calibration (with 10 runs) was used to produce multiple maps for each model. The 10 bootstrap runs were evaluated against the hold-out validation dataset. The average values of the accuracy measures, including Kappa (K), overall accuracy (OA), producer's accuracy (PA) and user's accuracy (UA), were examined. In addition, the results of this study were compared with a previous study in the same area, in which resampling techniques were used to deal with imbalanced data for digital soil class mapping.
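The per-class 70/30 split and the 10 bootstrap calibration runs described above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract states that the split was done manually, whereas this sketch shuffles at random within each class, and the function names, the seed, and the toy class labels are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, calib_fraction=0.7, seed=0):
    """Split sample indices class by class into calibration and validation sets.

    Assumption: a random shuffle within each class stands in for the manual
    per-class split mentioned in the abstract.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    calib, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_calib = round(calib_fraction * len(idxs))
        if len(idxs) > 1:
            # Keep at least one point of the class on each side of the split.
            n_calib = min(max(n_calib, 1), len(idxs) - 1)
        calib.extend(idxs[:n_calib])
        valid.extend(idxs[n_calib:])
    return sorted(calib), sorted(valid)


def bootstrap_samples(calib_indices, n_runs=10, seed=0):
    """Draw n_runs resamples (with replacement) of the calibration indices."""
    rng = random.Random(seed)
    n = len(calib_indices)
    return [[rng.choice(calib_indices) for _ in range(n)] for _ in range(n_runs)]


# Hypothetical imbalanced labels: one frequent class and two rare ones.
labels = ["A"] * 20 + ["B"] * 7 + ["C"] * 3
calib_idx, valid_idx = stratified_split(labels)
runs = bootstrap_samples(calib_idx, n_runs=10)
print(len(calib_idx), len(valid_idx), len(runs))
```

Each bootstrap run would calibrate one model, and every run would then be scored on the same hold-out validation indices. A plain bootstrap can by chance omit a very rare class from a resample, so a per-class resampling step may be preferable in practice; the abstract does not say how this was handled.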
The findings show that all three suggested methods can deal well with the imbalanced classification problem, with OCCM showing the highest K (= 0.76) and OA (= 82%) in the validation stage. This model can also guarantee the retention of the minority classes in the final map. Comparing the present approaches with the approach of the previous study shows that the three newly suggested methods can markedly increase both overall and individual class accuracy for mapping.
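The four accuracy measures named in the abstract can all be computed from paired observed and predicted class labels. The sketch below uses the standard definitions (OA is the share of correct predictions, K is Cohen's kappa, PA is per-class correct count divided by the observed count, UA is per-class correct count divided by the predicted count). The function name and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def accuracy_measures(observed, predicted):
    """Return overall accuracy, Cohen's kappa, and per-class PA and UA."""
    n = len(observed)
    classes = sorted(set(observed) | set(predicted))
    pairs = Counter(zip(observed, predicted))  # confusion-matrix cells
    obs_tot = Counter(observed)                # reference totals per class
    pred_tot = Counter(predicted)              # map (predicted) totals per class
    correct = {c: pairs[(c, c)] for c in classes}

    oa = sum(correct.values()) / n
    # Agreement expected by chance, from the marginal totals.
    pe = sum(obs_tot[c] * pred_tot[c] for c in classes) / n ** 2
    kappa = (oa - pe) / (1 - pe) if pe < 1 else float("nan")

    nan = float("nan")
    # Producer's accuracy: how much of each observed class was mapped correctly.
    pa = {c: correct[c] / obs_tot[c] if obs_tot[c] else nan for c in classes}
    # User's accuracy: how much of each mapped class is actually that class.
    ua = {c: correct[c] / pred_tot[c] if pred_tot[c] else nan for c in classes}
    return {"OA": oa, "K": kappa, "PA": pa, "UA": ua}


# Hypothetical validation labels for three classes.
observed = ["A", "A", "A", "A", "B", "B", "C", "C"]
predicted = ["A", "A", "A", "B", "B", "B", "C", "A"]
print(accuracy_measures(observed, predicted))
```

A minority class that is lost from the map shows up here as a PA of zero for that class, which OA alone can hide when the class has few observations. Averaging each measure over the 10 bootstrap runs would give the kind of summary values reported in the abstract.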

Updated: 2023-04-10
