Synthetic resampling strategies and machine learning for digital soil mapping in Iran
European Journal of Soil Science (IF 4.2), Pub Date: 2020-01-08, DOI: 10.1111/ejss.12893
Ruhollah Taghizadeh-Mehrjardi 1, 2 , Karsten Schmidt 1, 3 , Kamran Eftekhari 4 , Thorsten Behrens 1 , Mohammad Jamshidi 4 , Naser Davatgar 4 , Norair Toomanian 5 , Thomas Scholten 1, 3

Most common machine learning (ML) algorithms usually work well on balanced training sets, that is, datasets in which all classes are approximately represented equally. Otherwise, the accuracy estimates may be unreliable and classes with only a few values are often misclassified or neglected. This is known as a class imbalance problem in machine learning and datasets that do not meet this criterion are referred to as imbalanced data. Most datasets of soil classes are, therefore, imbalanced data. One of our main objectives is to compare eight resampling strategies that have been developed to counteract the imbalanced data problem. We compared the performance of five of the most common ML algorithms with the resampling approaches. The highest increase in prediction accuracy was achieved with SMOTE (the synthetic minority oversampling technique). In comparison to the baseline prediction on the original dataset, we achieved an increase of about 10, 20 and 10% in the overall accuracy, kappa index and F‐score, respectively. Regarding the ML approaches, random forest (RF) showed the best performance with an overall accuracy, kappa index and F‐score of 66, 60 and 57%, respectively. Moreover, the combination of RF and SMOTE improved the accuracy of the individual soil classes, compared to RF trained on the original dataset and allowed better prediction of soil classes with a low number of samples in the corresponding soil profile database, in our case for Chernozems. Our results show that balancing existing soil legacy data using synthetic sampling strategies can significantly improve the prediction accuracy in digital soil mapping (DSM).
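As a rough illustration of the synthetic oversampling idea named in the abstract, the sketch below generates new minority-class samples by interpolating between an existing sample and one of its nearest neighbours within the same class, which is the core step of SMOTE. It is a minimal, standard-library-only sketch and not the authors' implementation; the function name `smote`, its parameters, and the toy covariate values are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length numeric feature vectors from ONE class.
    This is an illustrative sketch, not the procedure used in the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point within the minority class
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # the new point lies on the segment between base and neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: two covariates per soil profile (not from the paper).
rare_class = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4], [0.9, 2.1]]
new_points = smote(rare_class, n_new=6, k=3, seed=42)
```

In a workflow like the one the abstract describes, the synthetic points would be appended to the training set of the under-represented soil class before fitting a classifier such as random forest, while the evaluation data would be left unresampled so that accuracy estimates reflect the original class distribution.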

Updated: 2020-01-08