Improving the predictions of soil properties from VNIR–SWIR spectra in an unlabeled region using semi-supervised and active learning
Introduction
Soil is an important non-renewable resource that provides for agriculture, improves water quality, and buffers greenhouse gases in the atmosphere. The preservation and sustainable management of soils is crucial to tackle the main challenges that humanity is facing, such as increasing demands for food by a growing population, climate change, environmental degradation, water scarcity, and loss of biodiversity (Hatfield and Walthall, 2015, Montanarella et al., 2016, Hatfield et al., 2017, Amundson et al., 2015). Soil-related consequences of the aforementioned pressures include significant changes in soil properties, surface water and groundwater quality, food security, water supplies, human health, energy, agriculture, and the sustainability of ecosystems including that of the soil biota (Qafoku, 2015). Preserving the soil ecosystem requires accurate information about the soil's status in order to monitor its spatio-temporal variability. Assessing the state of the soil, however, requires complex analytical approaches which are costly in time and resources.
Soil spectroscopy in the visible, near-infrared and shortwave-infrared (VNIR–SWIR, 0.35 μm to 2.5 μm) has been established as a rapid and cost-efficient alternative to the traditional analytical methods required for the estimation of some key physical and chemical soil constituents (Stenberg et al., 2010, Nocita et al., 2015). In the past decades a number of large soil spectral libraries (SSLs) have been developed throughout the world and have meticulously recorded the VNIR–SWIR spectra, the physicochemical properties, and other metadata for thousands of soil samples (e.g. Rossel et al., 2016, Orgiazzi et al., 2018, Tziolas et al., 2019). These have enabled researchers to calibrate chemometric models that link the spectral signature to the known reference values of the soil properties, thus facilitating the operational application of soil spectroscopy. However, applying a model calibrated on one SSL to a new region or local site that the model has not seen remains a challenging task. The difficulty may arise because the unique characteristics of the new region are not adequately captured by the calibration set, which leads to erroneous predictions by the model (Guerrero et al., 2010, Guerrero et al., 2014, Seidel et al., 2019).
Machine learning provides systems the ability to automatically learn and improve from experience. Supervised machine learning approaches rely on the identification of patterns from a set of labeled data, which are records consisting of a set of predictive features coupled with a target output (label) provided by an expert. In soil spectroscopy the predictive features are the diffuse reflectance spectral signatures while the labels are the soil properties. With the advent of artificial intelligence and deep learning techniques which can cope well with big data, many efforts have focused on using datasets comprising hundreds of thousands of records. Supervised approaches like multi-kernel support vector machines (Tsakiridis et al., 2020), fuzzy rule-based systems (Tsakiridis et al., 2019), convolutional neural networks (Padarian et al., 2018, Tsakiridis et al., 2020), Gaussian processes (Ramirez-Lopez et al., 2013, Tziolas et al., 2019), and local partial least squares regression (Nocita et al., 2014) have exhibited their potential to deal with soil hyperspectral data. However, data labeling can be expensive both in terms of financial cost and labeling time in many applications (e.g. for speech recognition or medical imaging). This includes the problem tackled by soil spectroscopy, where data labeling refers to the measurement of soil properties through analytical techniques in a chemical laboratory. In these applications it is common to have few labeled and a plethora of unlabeled data. Semi-supervised learning aims at exploiting the information contained in the unlabeled data to enhance the performance of the learning algorithm, and it is particularly well suited to cases where the labeled data are few while the unlabeled data are numerous. Some approaches of semi-supervised learning are based on the manifold assumption (Belkin et al., 2006), which states that the marginal distribution underlying the data can be described using a low-dimensional manifold embedded in a higher dimensional space. The Laplacian support vector machine (LapSVM) is a graph-based model following this assumption (Belkin et al., 2006).
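To make the manifold assumption concrete, the following minimal sketch builds a nearest-neighbour graph over labeled and unlabeled spectra and uses its Laplacian as a smoothness penalty. Note that it implements linear Laplacian regularized least squares, a simpler relative of LapSVM from the same manifold regularization framework, and not the LapSVM or LapSVR algorithm itself. The function names, the binary k-nearest-neighbour weighting, the absence of a bias term, and the default hyperparameters are illustrative assumptions.

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized Laplacian L = D - W of a symmetrized k-NN graph (binary weights)."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                    # a sample is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :k]             # k nearest neighbours per row
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                          # make the graph undirected
    return np.diag(W.sum(axis=1)) - W

def fit_linear_laprls(X_lab, y_lab, X_unl, gamma_a=1e-2, gamma_i=1e-1, k=5):
    """Minimize ||X_lab w - y||^2 + gamma_a ||w||^2 + gamma_i (Xw)^T L (Xw).

    The last term uses ALL samples (labeled and unlabeled) and penalizes
    predictions that differ between neighbouring spectra.
    """
    X_all = np.vstack([X_lab, X_unl])
    L = knn_graph_laplacian(X_all, k)
    d = X_all.shape[1]
    A = X_lab.T @ X_lab + gamma_a * np.eye(d) + gamma_i * X_all.T @ L @ X_all
    return np.linalg.solve(A, X_lab.T @ y_lab)
```

Setting `gamma_i=0` reduces the sketch to ordinary ridge regression on the labeled data, which makes the contribution of the unlabeled samples easy to isolate.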
Active learning, on the other hand, is a technique that aims to select the most informative samples from the pool of unlabeled data, which are subsequently labeled by a human or a machine annotator and then incorporated into the calibration set. It is motivated by the observation that a model may enhance its performance while simultaneously using fewer labeled data for training, if it can choose the data on which it trains (Cohn et al., 1996). In scenarios where the labeled data are few, a learner that is able to choose the examples to be labeled can develop a model with substantially fewer training data than required in normal supervised learning (with a priori labeling of the data). Various query strategies have been proposed in pool-based active learning, such as: (i) uncertainty sampling, where the samples for which the model is most uncertain are selected for labeling, (ii) query-by-committee, where a group of models predicts the unknown samples, and the samples on which they most disagree are selected for labeling, and (iii) estimated error reduction, which estimates the expected future error that would result if candidate instances were labeled and incorporated into the calibration set.
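As an illustration of strategy (ii), the sketch below implements a simple query-by-committee step for regression. The committee members are ridge regressors trained on bootstrap resamples of the labeled set, and disagreement is measured as the variance of their predictions; both choices, as well as all function and parameter names, are assumptions made for illustration and do not reproduce the strategy proposed in this work.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression weights (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def query_by_committee(X_lab, y_lab, X_pool, n_query=5, n_members=10,
                       alpha=1.0, seed=0):
    """Return indices of the pool samples on which the committee disagrees most."""
    rng = np.random.default_rng(seed)
    n = X_lab.shape[0]
    preds = np.empty((n_members, X_pool.shape[0]))
    for m in range(n_members):
        boot = rng.integers(0, n, size=n)          # bootstrap resample of the labeled set
        w = ridge_fit(X_lab[boot], y_lab[boot], alpha)
        preds[m] = X_pool @ w                      # this member's predictions on the pool
    disagreement = preds.var(axis=0)               # spread of predictions per pool sample
    return np.argsort(-disagreement)[:n_query]     # largest disagreement first
```

In a pool-based loop, the returned samples would be sent for laboratory analysis, appended to the labeled set, and the procedure repeated until the labeling budget is exhausted.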
Various approaches have been proposed to account for the less accurate predictions at local or unseen sites. They mostly focus on either (i) constraining the SSL by selecting a subset of samples based on spectral or other similarities and then building a new model using only this subset of data (e.g. Nocita et al., 2014, Xu et al., 2016, Lobsey et al., 2017, Castaldi et al., 2018), or (ii) augmenting the SSL with site-specific samples (the so-called spiking approach) (e.g. Sankey et al., 2008, Guerrero et al., 2010, Wetterlind and Stenberg, 2010, Jiang et al., 2017), or a combination thereof. In the first case, no data from the new site are labeled using analytical methods, which may not work well if the new site is not well-represented in the existing SSL. In the latter case, the most important question that arises is which and how many site-specific samples should be selected for labeling (Lucà et al., 2017, Nawar and Mouazen, 2018).
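The two families of approaches can be sketched in a few lines. The example below is a minimal illustration only: the similarity criterion (Euclidean distance to the mean spectrum of the site), the replication-based extra weighting, and all function names are assumptions, and the cited studies use a variety of other criteria.

```python
import numpy as np

def constrain_library(X_ssl, X_site, k):
    """Approach (i): keep the k SSL spectra closest to the site's mean spectrum."""
    centre = X_site.mean(axis=0)
    dist = np.linalg.norm(X_ssl - centre, axis=1)
    return np.argsort(dist)[:k]                    # indices into the SSL, nearest first

def spike_library(X_ssl, y_ssl, X_spike, y_spike, extra_weight=1):
    """Approach (ii): append labeled site samples to the SSL.

    Each spiking sample is repeated `extra_weight` times so that a small
    spiking subset is not swamped by a much larger library.
    """
    X_rep = np.repeat(X_spike, extra_weight, axis=0)
    y_rep = np.repeat(y_spike, extra_weight)
    return np.vstack([X_ssl, X_rep]), np.concatenate([y_ssl, y_rep])
```

The two functions can be combined by first constraining the library and then spiking the retained subset, which corresponds to the combination of approaches mentioned above.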
Our approach is motivated by the need to partially alleviate the cost of the laboratory analyses that are required (i) for accurate site-specific calibrations, and (ii) when establishing large SSLs. A particularly noteworthy application of the second case may be the development of the soil system for Africa under the Soils4Africa project, which is funded by the European Union under the Horizon 2020 SFS-35–2019-2020 call and will take place over the next four years (2020–2023). In particular, it may help identify, based on spectral properties, which portion of the large number of soil samples that will be collected (ca. 20,000) can be predicted very accurately and which portion needs to be included in the calibration set, thus partially alleviating the costs associated with the analytical techniques.
The major contributions of this work are as follows:
- It explores the use of semi-supervised learning in the domain of soil spectroscopy to identify whether it can enhance the prediction accuracy compared to traditionally used supervised approaches.
- A novel machine learning–based strategy is proposed to select the samples from the unlabeled region that ought to be labeled (i.e. queried) in order to attain a more robust model, and it is compared with existing strategies.
- It explores the use of active learning and compares its application in soil spectroscopy with other approaches proposed in the literature.
The rest of the paper is organized as follows. Section 2 briefly presents the semi-supervised Laplacian Support Vector Regression algorithm. Section 3 lays out the concepts of active learning and introduces the novel machine learning–based strategy proposed herein. In Section 4 the experimental setup is presented, including the dataset formulation, the algorithms against which our methodology is compared, and their training. The experimental results are given in Section 5, while the conclusions drawn from this work and suggestions for future work may be found in Section 7.
Section snippets
Support vector regression
The support vector machine was introduced by Vapnik (Vapnik, 1998) in the context of structural risk minimization theory, and it aims to maximize its generalization ability through solving a quadratic optimization problem (Vapnik, 1992). Given a set of N training data $\{(x_i, y_i)\}_{i=1}^{N}$, where $(x_i, y_i) \in X \times \mathbb{R}$ denotes the pairs of input variables and real output, while $X$ represents the space of input patterns (e.g. $X = \mathbb{R}^d$, with $d$ being the dimensionality of the input patterns), the goal is to
Pool-based active learning
This active learning scenario is suited to situations where a large pool of unlabeled data has already been gathered and it is therefore possible to evaluate all instances in terms of their informativeness in order to decide which ones should be labeled. For soil spectroscopy, this scenario concerns, for example, the collection of a number of soil samples from a new region and the measurement of their spectral signatures, which form the pool of unlabeled data. Given an existing SSL the question is
The LUCAS SSL
The statistical office of the European Union (EUROSTAT) organizes a triennial survey of land use, land cover and changes over time across the EU member-states, known as the Land Use and Coverage Area Frame Survey (LUCAS), with the latest survey conducted in 2018. Since 2009, topsoil assessment has been a key component of LUCAS. Its main goal is to establish a harmonized and comparable dataset of topsoil properties at the EU scale by collecting soil samples using a common sampling protocol and
Efficacy of semi-supervised learning
In this section the results of the first experiment detailed in Section 4.5 are presented. To demonstrate how well the semi-supervised learning approach of LapSVR performs, we compared it with SVR, PLS, and SBL across all datasets. The results are given in Table 2 with the RPIQ metric across all experiments further visualized in Fig. 8. It can be readily identified that the Clay content is predicted consistently with smaller error (RPIQ ) than the OC content whose prediction accuracy is
Discussion
In this work, the synergy of active and semi-supervised learning was explored in the domain of soil spectroscopy to utilize the existing efforts establishing large SSLs when predicting the soil properties of a new region that the model has not seen. This is a common case in this domain, where a large cost is associated with labeling data through analytical techniques. Other approaches in the past have focused on either constraining the SSL or selecting a subset of spiking samples using a priori
Conclusions
In this paper, the semi-supervised learning approach (and in particular the LapSVR algorithm) was applied to derive local spectroscopic calibrations of soil properties in an unknown region using an existing SSL of another region. Compared to other standard supervised approaches (i.e. SVR, PLS, and SBL), this technique yielded statistically better results in terms of accuracy of prediction. Moreover, the synergistic use of LapSVR with the pool-based active learning scenario was also examined to
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
The LUCAS topsoil dataset used in this work was made available by the European Commission through the European Soil Data Centre managed by the Joint Research Centre (JRC), http://esdac.jrc.ec.europa.eu/.
References (59)
- et al. Near-infrared (NIR) and mid-infrared (MIR) spectroscopic techniques for assessing the amount of carbon stock in soils - Critical review and research perspectives. Soil Biol. Biochem. (2011)
- et al. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences (1984)
- et al. Evaluating the capability of the Sentinel 2 data for soil organic carbon prediction in croplands. ISPRS Journal of Photogrammetry and Remote Sensing (2019)
- et al. Spiking of NIR regional models using samples from target sites: Effect of model size on prediction accuracy. Geoderma (2010)
- et al. Estimation of soil organic carbon and total nitrogen in different soil layers using VNIR spectroscopy: Effects of spiking on model applicability. Geoderma (2017)
- et al. Effect of calibration set size on prediction at local scale of soil carbon by vis-NIR spectroscopy. Geoderma (2017)
- et al. Optimal sample selection for measurement of soil organic carbon using on-line vis-NIR spectroscopy. Computers Electron. Agric. (2018)
- et al. Prediction of soil organic carbon content by diffuse reflectance spectroscopy using a local partial least square regression approach. Soil Biol. Biochem. (2014)
- et al. Soil Spectroscopy: An Alternative to Wet Chemistry for Soil Monitoring. Adv. Agron. (2015)
- et al. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma (2013)
- Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma
- Comparing local vs. global visible and near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) calibrations for the prediction of soil clay, organic C and inorganic C. Geoderma
- Strategies for the efficient estimation of soil organic carbon at the field scale with vis-NIR spectroscopy: Spectral libraries and spiking vs. local calibrations. Geoderma
- Using interpretable fuzzy rule-based models for the estimation of soil organic carbon from VNIR/SWIR spectra and soil texture. Chemometrics Intell. Lab. Syst.
- An evolutionary fuzzy rule-based system applied to the prediction of soil organic carbon from soil spectral libraries. Appl. Soft Comput.
- A memory-based learning approach utilizing combined spectral sources and geographical proximity for improved VIS-NIR-SWIR soil properties estimation. Geoderma
- A rewriting system for convex optimization problems. J. Control Decision
- Soil and human security in the 21st century. Science
- World reference base for soil resources 2014
- Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research
- Random forests. Mach. Learn.
- Estimation of soil organic carbon in arable soil in Belgium and Luxembourg with the LUCAS topsoil database. Eur. J. Soil Sci.
- Active learning with statistical models. Journal of Artificial Intelligence Research
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
- CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res.
- Support vector regression machines
- The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc.
- Assessment of soil organic carbon at local scale with spiked NIR calibrations: effects of selection and extra-weighting on the spiking subset. Eur. J. Soil Sci.