Improving the predictions of soil properties from VNIR–SWIR spectra in an unlabeled region using semi-supervised and active learning

doi:10.1016/j.geoderma.2020.114830

Geoderma

Volume 387, 1 April 2021, 114830

https://doi.org/10.1016/j.geoderma.2020.114830

Highlights

• The semi-supervised learning framework of LapSVR is presented.
• LapSVR outperforms other supervised techniques in VNIR-SWIR spectroscopy.
• The active learning framework is presented along with a novel strategy.
• Active learning identifies samples from the unknown region that ought to be labeled.
• The proposed methodologies statistically outperform their counterparts.

Abstract

Monitoring the status of the soil ecosystem, in order to identify the spatio-temporal extent of the pressures exerted on it and to mitigate the effects of climate change and land degradation, requires reliable and cost-effective solutions. To address this need, soil spectroscopy in the visible, near- and shortwave-infrared (VNIR–SWIR) has emerged as a viable alternative to traditional analytical approaches. To this end, large-scale soil spectral libraries coupled with advanced machine learning tools have been developed to infer soil properties from hyperspectral signatures. However, models developed from one region may exhibit diminished performance when applied to a new region unseen by the model, owing to the large inherent soil variability (e.g. pedogenetic differences, diverse soil types). Given an existing spectral library with labeled data and a new unlabeled region (i.e. one where no soil samples have been analytically measured), the question then becomes how best to develop a model that predicts the soil properties of the unlabeled region more accurately.

This paper proposes a machine learning technique that combines semi-supervised learning, which exploits the distribution of the predictors in the unlabeled dataset, with active learning, which selects a small set of samples from the unlabeled dataset as a spiking subset, in order to develop a more robust model. The semi-supervised learning approach is the Laplacian Support Vector Regression (LapSVR), which follows the manifold regularization framework. For the active learning component, the pool-based approach is used because it best matches the aforementioned use case: it iteratively selects a subset of data from the unlabeled region to spike the calibration set. As a query strategy, a novel machine learning–based strategy is proposed to identify the spiking subset at each iteration. The experimental analysis used data from the Land Use and Coverage Area Frame Survey (LUCAS) of 2009, which covered most of the then member states of the European Union, focusing on the mineral cropland soil samples from five different countries. The statistical analysis confirmed the efficacy of the approach when compared with the current state of the art in soil spectroscopy.

Introduction

Soil is an important non-renewable resource that provides for agriculture, improves water quality, and buffers greenhouse gases in the atmosphere. The preservation and sustainable management of soils is crucial for tackling the main challenges that humanity is facing, such as the increasing demand for food by a growing population, climate change, environmental degradation, water scarcity, and loss of biodiversity (Hatfield and Walthall, 2015, Montanarella et al., 2016, Hatfield et al., 2017, Amundson et al., 2015). Soil-related consequences of the aforementioned pressures include significant changes in soil properties, surface water and groundwater quality, food security, water supplies, human health, energy, agriculture, and the sustainability of ecosystems including that of the soil biota (Qafoku, 2015). Preserving the soil ecosystem requires accurate information about the soil's status in order to monitor its spatio-temporal variability. Assessing the state of the soil, however, requires complex analytical approaches that are costly in time and resources.

Soil spectroscopy in the visible, near-infrared and shortwave-infrared (VNIR–SWIR, 0.35 μm to 2.5 μm) has been established as a rapid and cost-efficient alternative to the traditional analytical methods required for the estimation of some key physical and chemical soil constituents (Stenberg et al., 2010, Nocita et al., 2015). In the past decades a number of large soil spectral libraries (SSLs) have been developed throughout the world, meticulously recording the VNIR–SWIR spectra, the physicochemical properties, and other metadata for thousands of soil samples (e.g. Rossel et al., 2016, Orgiazzi et al., 2018, Tziolas et al., 2019). These have enabled researchers to calibrate chemometric models that link the spectral signature to the known reference values of the soil properties, thus facilitating the operational application of soil spectroscopy. However, applying models calibrated using data from one SSL to a new region or local site unseen by the model remains a challenging task. This may occur because the unique characteristics of the new region are not appropriately captured by the calibration set, leading to erroneous predictions by the model (Guerrero et al., 2010, Guerrero et al., 2014, Seidel et al., 2019).

Machine learning gives systems the ability to learn and improve automatically from experience. Supervised machine learning approaches rely on identifying patterns in a set of labeled data, i.e. records consisting of a set of predictive features coupled with a target output (label) provided by an expert. In soil spectroscopy the predictive features are the diffuse reflectance spectral signatures, while the labels are the soil properties. With the advent of artificial intelligence and deep learning techniques that cope well with big data, many efforts have focused on using datasets comprising hundreds of thousands of records. Supervised approaches such as multi-kernel support vector machines (Tsakiridis et al., 2020), fuzzy rule-based systems (Tsakiridis et al., 2019), convolutional neural networks (Padarian et al., 2018, Tsakiridis et al., 2020), Gaussian processes (Ramirez-Lopez et al., 2013, Tziolas et al., 2019), and local partial least squares regression (Nocita et al., 2014) have shown their potential to deal with soil hyperspectral data. However, in many applications (e.g. speech recognition or medical imaging) data labeling can be expensive in terms of both financial cost and labeling time. This includes soil spectroscopy, where data labeling refers to the measurement of soil properties through analytical techniques in a chemical laboratory. In these applications it is common to have few labeled and a plethora of unlabeled data. Semi-supervised learning aims to exploit the prior knowledge contained in unlabeled data to enhance the performance of the learning algorithm, and it is particularly well suited to cases where the labeled data are few and the unlabeled data numerous. Some semi-supervised learning approaches are based on the manifold assumption (Belkin et al., 2006), which states that the marginal distribution underlying the data can be described using a low-dimensional manifold embedded in a higher-dimensional space. The Laplacian support vector machine (LapSVM) is a graph-based model that follows this assumption (Belkin et al., 2006).
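
As an illustration of the manifold assumption, the general manifold regularization objective of Belkin et al. (2006) can be written as below. This is the generic framework rather than the exact LapSVR formulation of this paper; the symbols $l$ (number of labeled samples), $u$ (number of unlabeled samples), $V$ (loss function), $\gamma_A$ and $\gamma_I$ (regularization weights), and $L$ (graph Laplacian) follow the notation of that reference and are assumptions with respect to the present text.

$$
f^{*} = \arg\min_{f \in \mathcal{H}_K} \; \frac{1}{l} \sum_{i=1}^{l} V(x_i, y_i, f) + \gamma_A \lVert f \rVert_K^{2} + \frac{\gamma_I}{(u+l)^{2}} \, \mathbf{f}^{\top} L \, \mathbf{f}
$$

Here $\mathbf{f} = [f(x_1), \dots, f(x_{l+u})]^{\top}$ collects the predictions on both labeled and unlabeled samples, and $L = D - W$ is the Laplacian of a similarity graph with edge weights $W$ and diagonal degree matrix $D$. The last term penalizes functions that vary strongly between neighboring samples on the graph, which is how the unlabeled spectra influence the fitted model. Choosing the ε-insensitive loss $V(x_i, y_i, f) = \max(0, |y_i - f(x_i)| - \varepsilon)$ would yield a regression variant of this objective.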

Active learning, on the other hand, is a technique that aims to select the most informative samples from the pool of unlabeled data; these are subsequently labeled by a human or machine annotator and then incorporated into the calibration set. It is motivated by the observation that a model may improve its performance while using fewer labeled data for training if it can choose the data on which it trains (Cohn et al., 1996). In scenarios where labeled data are few, a learner that is able to choose the examples to be labeled can develop a model with substantially fewer training data than required in normal supervised learning (with a priori labeling of the data). Various query strategies have been proposed in pool-based active learning, such as: (i) uncertainty sampling, where the samples for which the model is most uncertain are selected for labeling; (ii) query-by-committee, where a group of models predicts the unknown samples and the samples on which they most disagree are selected for labeling; and (iii) estimated error reduction, which estimates the expected future error that would result if some instances were labeled and incorporated into the calibration set.
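
To make the query-by-committee idea concrete for a regression setting, the following minimal sketch scores each pool sample by the variance of the committee's predictions and selects the samples with the largest disagreement. It is an illustration only and not the novel strategy proposed in this paper; the function names `committee_disagreement` and `select_queries` are hypothetical, and using the prediction variance as the disagreement measure is an assumption.

```python
from statistics import pvariance


def committee_disagreement(predictions_per_model):
    """Variance of the committee's predictions for every pool sample.

    predictions_per_model: one list of pool predictions per committee member,
    all of equal length (hypothetical input layout).
    """
    # zip(*...) regroups the predictions so that each tuple holds
    # all committee predictions for one pool sample.
    return [pvariance(sample_preds) for sample_preds in zip(*predictions_per_model)]


def select_queries(predictions_per_model, n_queries):
    """Indices of the n_queries pool samples with the largest disagreement."""
    scores = committee_disagreement(predictions_per_model)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n_queries]


# Three committee members predicting three pool samples.
preds = [
    [1.0, 2.0, 3.0],
    [1.0, 2.5, 5.0],
    [1.0, 1.5, 1.0],
]
queries = select_queries(preds, n_queries=2)
```

In this toy example the committee agrees perfectly on the first sample, so it would never be queried, while the third sample attracts the widest spread of predictions and is ranked first.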

Various approaches have been proposed to account for the less accurate predictions at local or unseen sites. They mostly focus on either (i) constraining the SSL by selecting a subset of samples based on spectral or other similarities and then building a new model using only this subset (e.g. Nocita et al., 2014, Xu et al., 2016, Lobsey et al., 2017, Castaldi et al., 2018), or (ii) augmenting the SSL with site-specific samples (the so-termed spiking approach) (e.g. Sankey et al., 2008, Guerrero et al., 2010, Wetterlind and Stenberg, 2010, Jiang et al., 2017), or a combination thereof. In the first case, no data from the new site are labeled using analytical methods, which may not work well if the new site is not well represented in the existing SSL. In the latter case, the most important question is which and how many site-specific samples should be selected for labeling (Lucà et al., 2017, Nawar and Mouazen, 2018).
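
The two families of approaches described above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions, not a reproduction of any cited method: spectral similarity is taken to be the Euclidean distance between raw spectra, the function names `constrain_library` and `spike_library` are hypothetical, and the optional `weight` argument mimics extra-weighting by simply repeating the spiking samples.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two spectra of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def constrain_library(ssl_spectra, target_spectra, k):
    """Constrain the SSL: keep the indices of library samples that are among
    the k nearest neighbours of at least one target-site spectrum."""
    keep = set()
    for target in target_spectra:
        ranked = sorted(
            range(len(ssl_spectra)),
            key=lambda i: euclidean(ssl_spectra[i], target),
        )
        keep.update(ranked[:k])
    return sorted(keep)


def spike_library(ssl_X, ssl_y, local_X, local_y, weight=1):
    """Spike the SSL: append labeled site-specific samples to the calibration
    set, repeated `weight` times to give them extra weight."""
    X = list(ssl_X) + list(local_X) * weight
    y = list(ssl_y) + list(local_y) * weight
    return X, y


# Toy two-band "spectra": four library samples and two target-site samples.
library = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
targets = [[0.9, 1.0], [5.2, 5.2]]
subset = constrain_library(library, targets, k=1)

X, y = spike_library([[0.0]], [1.0], [[9.0]], [2.0], weight=2)
```

Constraining needs no new laboratory measurements but depends on the library already containing samples similar to the target site, whereas spiking requires labels for the local samples it adds.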

Our approach is motivated by the need to partially alleviate the cost of the laboratory analyses required (i) for accurate site-specific calibrations, and (ii) when establishing large SSLs. A particularly noteworthy application of the second case may be the development of the soil system for Africa under the Soils4Africa project, which is funded by the European Union under the Horizon 2020 SFS-35–2019-2020 call and will take place in the next four years (2020–2023). In particular, it may help identify, based on their spectral properties, which portion of the large number of soil samples to be collected (ca. 20,000) can be predicted very accurately and which portion needs to be included in the calibration set, thus partially alleviating the costs associated with the analytical techniques.

The major contributions of this work are as follows:

• It explores the use of semi-supervised learning in the domain of soil spectroscopy to identify whether it can enhance the prediction accuracy compared to traditionally used supervised approaches.
• A novel machine learning–based strategy is proposed to select the samples from the unlabeled region that ought to be labeled (i.e. queried) in order to attain a more robust model and compared with existing strategies.
• It explores the use of active learning and compares its application in soil spectroscopy with other approaches proposed in the literature.

The rest of the paper is organized as follows. Section 2 briefly presents the semi-supervised Laplacian Support Vector Regression algorithm. Section 3 lays out the concepts of active learning and introduces the novel machine learning–based strategy proposed herein. Section 4 presents the experimental setup, including the dataset formulation, the algorithms against which our methodology is compared, and their training. The experimental results are given in Section 5, while the conclusions drawn from this work and suggestions for future work may be found in Section 7.

Section snippets

Support vector regression

The support vector machine was introduced by Vapnik (Vapnik, 1998) in the context of structural risk minimization theory, and it aims to maximize its generalization ability by solving a quadratic optimization problem (Vapnik, 1992). Given a set of N training data $D_{trn} = \{(x_{1}, y_{1}), \dots, (x_{N}, y_{N})\} \subset X \times \mathbb{R}$, where $(x_{i}, y_{i})$ denotes a pair of input variables and real-valued output, and X represents the space of input patterns (e.g. $X = \mathbb{R}^{d}$, with d being the dimensionality of the input patterns $x_{i}$), the goal is to
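
For orientation, support vector regression is conventionally built on the ε-insensitive loss, which ignores residuals smaller than a tolerance ε and penalizes larger ones linearly. The minimal sketch below illustrates that loss only; the function name `epsilon_insensitive_loss` and the default `epsilon=0.1` are hypothetical choices, and the snippet above is truncated before stating the formulation the paper actually uses.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample epsilon-insensitive loss: max(0, |y - f(x)| - epsilon).

    Residuals inside the epsilon tube cost nothing; residuals outside it
    are penalized in proportion to how far they exceed epsilon.
    """
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


# The first residual (0.05) lies inside the tube, the second (0.5) outside.
losses = epsilon_insensitive_loss([1.0, 2.0], [1.05, 2.5], epsilon=0.1)
```

Because small residuals contribute nothing, only the training samples lying on or outside the tube end up determining the fitted function, which is what makes the solution sparse in terms of support vectors.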

Pool-based active learning

This active learning scenario is suited to situations where a large pool of unlabeled data has already been gathered, so that all instances can be evaluated in terms of their informativeness in order to decide which ones should be labeled. For soil spectroscopy, this scenario concerns, for example, the collection of a number of soil samples from a new region and the measurement of their spectral signatures, which form the pool of unlabeled data. Given an existing SSL the question is
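
The pool-based scenario can be summarized as a loop: fit a model, score every pool sample, send the highest-scoring ones to the laboratory, add them to the calibration set, and refit. The sketch below is a generic skeleton under stated assumptions and does not implement the query strategy of this paper; the callables `fit`, `score`, and `oracle` are hypothetical placeholders for the learner, the informativeness measure, and the analytical measurement, respectively.

```python
def pool_based_active_learning(labeled, pool, fit, score, oracle,
                               batch_size, n_iterations):
    """Generic pool-based active learning loop.

    labeled: list of (x, y) pairs forming the initial calibration set
    pool:    list of unlabeled inputs x from the new region
    fit:     callable(labeled) -> model
    score:   callable(model, x) -> informativeness (higher = query first)
    oracle:  callable(x) -> y, standing in for the laboratory measurement
    """
    labeled, pool = list(labeled), list(pool)
    model = fit(labeled)
    for _ in range(n_iterations):
        if not pool:
            break
        # Rank the whole pool and take the most informative samples.
        ranked = sorted(pool, key=lambda x: score(model, x), reverse=True)
        for x in ranked[:batch_size]:
            labeled.append((x, oracle(x)))  # spike the calibration set
            pool.remove(x)
        model = fit(labeled)  # refit on the augmented calibration set
    return model, labeled, pool


# Toy stand-ins: the "model" is just the list of labeled inputs, and a pool
# sample is informative if it lies far from every labeled input.
def toy_fit(labeled):
    return [x for x, _ in labeled]


def toy_score(model, x):
    return min(abs(x - seen) for seen in model)


def toy_oracle(x):
    return 2.0 * x


model, labeled, remaining = pool_based_active_learning(
    labeled=[(0.0, 0.0)], pool=[1.0, 5.0, 10.0],
    fit=toy_fit, score=toy_score, oracle=toy_oracle,
    batch_size=1, n_iterations=2,
)
```

With these toy choices the loop first queries the sample farthest from the calibration set (10.0) and then the one midway between the labeled points (5.0), leaving 1.0 unlabeled.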

The LUCAS SSL

The statistical office of the European Union (EUROSTAT) organizes a triennial survey of land use, land cover and changes over time across the EU member-states, known as the Land Use and Coverage Area Frame Survey (LUCAS), with the latest survey conducted in 2018. Since 2009, topsoil assessment has been a key component of LUCAS. Its main goal is to establish a harmonized and comparable dataset of topsoil properties at the EU scale by collecting soil samples using a common sampling protocol and

Efficacy of semi-supervised learning

In this section the results of the first experiment detailed in Section 4.5 are presented. To demonstrate how well the semi-supervised learning approach of LapSVR performs, we compared it with SVR, PLS, and SBL across all datasets. The results are given in Table 2 with the RPIQ metric across all experiments further visualized in Fig. 8. It can be readily identified that the Clay content is predicted consistently with smaller error (RPIQ $> 2$ ) than the OC content whose prediction accuracy is
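
The RPIQ metric referred to above is commonly defined as the ratio of performance to interquartile distance, i.e. the interquartile range of the observed values divided by the root mean square error of prediction, so that larger values indicate better models. The sketch below follows that common definition; the function name `rpiq` is hypothetical, and the quartile convention (Python's `inclusive` method) is an assumption, since the snippet does not state which one the paper uses.

```python
import math
from statistics import quantiles


def rpiq(y_true, y_pred):
    """Ratio of performance to interquartile distance: IQR(y_true) / RMSE."""
    # Quartiles of the observed values; "inclusive" interpolates linearly
    # between data points, treating the data as the full population.
    q1, _, q3 = quantiles(y_true, n=4, method="inclusive")
    rmse = math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
    return (q3 - q1) / rmse


# Observed values with IQR = 2 and predictions that are all off by 1.
value = rpiq([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 3.0, 4.0, 5.0, 6.0])
```

Unlike a metric normalized by the standard deviation, the interquartile range is less sensitive to skewed distributions, which are common for properties such as organic carbon content.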

Discussion

In this work, the synergy of active and semi-supervised learning was explored in the domain of soil spectroscopy to capitalize on existing efforts to establish large SSLs when predicting the soil properties of a new region unseen by the model. This is a common case in this domain, where labeling data through analytical techniques carries a large cost. Other approaches in the past have focused on either constraining the SSL or selecting a subset of spiking samples using a priori

Conclusions

In this paper, the semi-supervised learning approach (and in particular the LapSVR algorithm) was applied to derive local spectroscopic calibrations of soil properties in an unknown region using an existing SSL of another region. Compared to other standard supervised approaches (i.e. SVR, PLS, and SBL), this technique yielded statistically better results in terms of accuracy of prediction. Moreover, the synergistic use of LapSVR with the pool-based active learning scenario was also examined to

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The LUCAS topsoil dataset used in this work was made available by the European Commission through the European Soil Data Centre managed by the Joint Research Centre (JRC), http://esdac.jrc.ec.europa.eu/.

References (59)

V. Bellon-Maurel et al. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils - Critical review and research perspectives. Soil Biol. Biochem. (2011)
J.C. Bezdek et al. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences (1984)
F. Castaldi et al. Evaluating the capability of the Sentinel 2 data for soil organic carbon prediction in croplands. ISPRS Journal of Photogrammetry and Remote Sensing (2019)
C. Guerrero et al. Spiking of NIR regional models using samples from target sites: Effect of model size on prediction accuracy. Geoderma (2010)
Q. Jiang et al. Estimation of soil organic carbon and total nitrogen in different soil layers using VNIR spectroscopy: Effects of spiking on model applicability. Geoderma (2017)
F. Lucà et al. Effect of calibration set size on prediction at local scale of soil carbon by vis-NIR spectroscopy. Geoderma (2017)
S. Nawar et al. Optimal sample selection for measurement of soil organic carbon using on-line vis-NIR spectroscopy. Computers Electron. Agric. (2018)
M. Nocita et al. Prediction of soil organic carbon content by diffuse reflectance spectroscopy using a local partial least square regression approach. Soil Biol. Biochem. (2014)
M. Nocita et al. Soil Spectroscopy: An Alternative to Wet Chemistry for Soil Monitoring. Adv. Agron. (2015)
L. Ramirez-Lopez et al. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma (2013)
R.A. Rossel et al. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma (2010)
J.B. Sankey et al. Comparing local vs. global visible and near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) calibrations for the prediction of soil clay, organic C and inorganic C. Geoderma (2008)
M. Seidel et al. Strategies for the efficient estimation of soil organic carbon at the field scale with vis-NIR spectroscopy: Spectral libraries and spiking vs. local calibrations. Geoderma (2019)
N.L. Tsakiridis et al. Using interpretable fuzzy rule-based models for the estimation of soil organic carbon from VNIR/SWIR spectra and soil texture. Chemometrics Intell. Lab. Syst. (2019)
N.L. Tsakiridis et al. An evolutionary fuzzy rule-based system applied to the prediction of soil organic carbon from soil spectral libraries. Appl. Soft Comput. (2019)
N. Tziolas et al. A memory-based learning approach utilizing combined spectral sources and geographical proximity for improved VIS-NIR-SWIR soil properties estimation. Geoderma (2019)
A. Agrawal et al. A rewriting system for convex optimization problems. J. Control Decision (2018)
R. Amundson et al. Soil and human security in the 21st century. Science (2015)
L. Anjos et al. World reference base for soil resources 2014. (2015)
M. Belkin et al. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research (2006)
L. Breiman. Random forests. Mach. Learn. (2001)
F. Castaldi et al. Estimation of soil organic carbon in arable soil in Belgium and Luxembourg with the LUCAS topsoil database. Eur. J. Soil Sci. (2018)
D.A. Cohn et al. Active learning with statistical models. Journal of Artificial Intelligence Research (1996)
J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006)
S. Diamond et al. CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. (2016)
H. Drucker et al. Support vector regression machines.
M. Friedman. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. (1937)
C. Guerrero et al. Assessment of soil organic carbon at local scale with spiked NIR calibrations: effects of selection and extra-weighting on the spiking subset. Eur. J. Soil Sci. (2014)
Hatfield, J.L., Sauer, T.J., Cruse, R.M., 2017. Soil: The Forgotten Piece of the Water, Food, Energy Nexus. (pp. 1–46)....
