Geoderma

Volume 387, 1 April 2021, 114830
Improving the predictions of soil properties from VNIR–SWIR spectra in an unlabeled region using semi-supervised and active learning

https://doi.org/10.1016/j.geoderma.2020.114830

Highlights

  • The semi-supervised learning framework of LapSVR is presented.

  • LapSVR outperforms other supervised techniques in VNIR-SWIR spectroscopy.

  • The active learning framework is presented along with a novel strategy.

  • Active learning identifies samples from the unknown region that ought to be labeled.

  • The proposed methodologies statistically outperform their counterparts.

Abstract

Monitoring the status of the soil ecosystem, both to identify the spatio-temporal extent of the pressures exerted on it and to mitigate the effects of climate change and land degradation, requires reliable and cost-effective solutions. To address this need, soil spectroscopy in the visible, near- and shortwave-infrared (VNIR–SWIR) has emerged as a viable alternative to traditional analytical approaches. To this end, large-scale soil spectral libraries coupled with advanced machine learning tools have been developed to infer soil properties from hyperspectral signatures. However, models developed from one region may exhibit diminished performance when applied to a new region that the model has not seen, owing to the large inherent variability of soils (e.g. pedogenetic differences, diverse soil types). Given an existing spectral library with labeled data and a new unlabeled region (i.e. one where no soil samples have been analytically measured), the question then becomes how best to develop a model that predicts the soil properties of the unlabeled region more accurately.

In this paper, we propose a machine learning technique that combines semi-supervised learning, which exploits the distribution of the predictors in the unlabeled dataset, with active learning, which selects a small set of samples from the unlabeled dataset as a spiking subset, in order to develop a more robust model. The semi-supervised learning approach is Laplacian Support Vector Regression (LapSVR), which follows the manifold regularization framework. For the active learning component, the pool-based approach is used because it best matches the aforementioned use case: it iteratively selects a subset of data from the unlabeled region to spike the calibration set. As a query strategy, a novel machine learning–based strategy is proposed to identify the spiking subset at each iteration. The experimental analysis used data from the Land Use and Coverage Area Frame Survey of 2009, which covered most of the then member-states of the European Union, focusing on the mineral cropland soil samples from 5 different countries. The statistical analysis confirmed the efficacy of our approach compared with the current state of the art in soil spectroscopy.

Introduction

Soil is an important non-renewable resource that provides for agriculture, improves water quality, and buffers greenhouse gases in the atmosphere. The preservation and sustainable management of soils is crucial for tackling the main challenges that humanity is facing, such as increasing demand for food from a growing population, climate change, environmental degradation, water scarcity, and loss of biodiversity (Hatfield and Walthall, 2015, Montanarella et al., 2016, Hatfield et al., 2017, Amundson et al., 2015). Soil-related consequences of these pressures include significant changes in soil properties, surface water and groundwater quality, food security, water supplies, human health, energy, agriculture, and the sustainability of ecosystems, including that of the soil biota (Qafoku, 2015). Preserving the soil ecosystem requires accurate information about the soil’s status in order to monitor its spatio-temporal variability. Assessing the state of the soil, however, requires complex analytical approaches that are costly in time and resources.

Soil spectroscopy in the visible, near-infrared and shortwave-infrared (VNIR–SWIR, 0.35 μm to 2.5 μm) has been established as a rapid and cost-efficient alternative to the traditional analytical methods required for the estimation of some key physical and chemical soil constituents (Stenberg et al., 2010, Nocita et al., 2015). In the past decades a number of large soil spectral libraries (SSLs) have been developed throughout the world, meticulously recording the VNIR–SWIR spectra, the physicochemical properties, and other metadata for thousands of soil samples (e.g. Rossel et al., 2016, Orgiazzi et al., 2018, Tziolas et al., 2019). These have enabled researchers to calibrate chemometric models that link the spectral signature to the known reference values of the soil properties, thus facilitating the operational application of soil spectroscopy. However, applying models calibrated on data from one SSL to a new region or local site that the model has not seen remains a challenging task. The difficulty arises when the unique characteristics of the new region are not appropriately captured by the calibration set, which leads to erroneous predictions by the model (Guerrero et al., 2010, Guerrero et al., 2014, Seidel et al., 2019).

Machine learning provides systems with the ability to automatically learn and improve from experience. Supervised machine learning approaches rely on identifying patterns in a set of labeled data, i.e. records consisting of a set of predictive features coupled with a target output (label) provided by an expert. In soil spectroscopy the predictive features are the diffuse reflectance spectral signatures, while the labels are the soil properties. With the advent of artificial intelligence and deep learning techniques that cope well with big data, many efforts have focused on using datasets comprising hundreds of thousands of records. Supervised approaches such as multi-kernel support vector machines (Tsakiridis et al., 2020), fuzzy rule-based systems (Tsakiridis et al., 2019), convolutional neural networks (Padarian et al., 2018, Tsakiridis et al., 2020), Gaussian processes (Ramirez-Lopez et al., 2013, Tziolas et al., 2019), and local partial least squares regression (Nocita et al., 2014) have demonstrated their potential for handling soil hyperspectral data. However, in many applications (e.g. speech recognition or medical imaging) data labeling can be expensive in terms of both financial cost and labeling time. This includes the problem tackled by soil spectroscopy, where data labeling refers to the measurement of soil properties through analytical techniques in a chemical laboratory. In these applications it is common to have few labeled and a plethora of unlabeled data. Semi-supervised learning aims to exploit the prior knowledge contained in unlabeled data to enhance the performance of the learning algorithm, and it is particularly useful when the labeled data are few and the unlabeled data numerous. Some semi-supervised learning approaches are based on the manifold assumption (Belkin et al., 2006), which states that the marginal distribution underlying the data can be described using a low-dimensional manifold embedded in a higher-dimensional space. The Laplacian support vector machine (LapSVM) is a graph-based model following this assumption (Belkin et al., 2006).
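To make the graph-based idea concrete, the following is a minimal sketch (not the authors' implementation) of how a graph Laplacian can be built over labeled and unlabeled spectra together, and how the smoothness penalty f^T L f used in manifold regularization is evaluated. The function names, the k-nearest-neighbour graph, the Gaussian edge weights, and the parameters `k` and `sigma` are assumptions chosen for illustration.

```python
import numpy as np

def graph_laplacian(X, k=3, sigma=1.0):
    """Unnormalized graph Laplacian L = D - W of a symmetric kNN graph.

    X holds one sample per row (labeled and unlabeled spectra together).
    k and sigma are illustrative hyperparameters, not values from the paper.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between all samples.
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # The k nearest neighbours of sample i, excluding i itself.
        order = np.argsort(d2[i])
        nbrs = [j for j in order if j != i][:k]
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    # Keep an edge if either endpoint selected it, so W is symmetric.
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    return D - W

def smoothness_penalty(f, L):
    """f^T L f: small when predictions vary little between graph neighbours."""
    return float(f @ L @ f)
```

The penalty equals half the weighted sum of squared differences between predictions at connected samples, so a constant prediction vector gives zero and a prediction that changes sharply between spectrally similar samples is penalized.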

Active learning, on the other hand, is a technique that aims to select the most informative samples from the pool of unlabeled data, which are subsequently labeled by a human or a machine annotator and then incorporated into the calibration set. It is motivated by the observation that a model may improve its performance while using fewer labeled data for training if it can choose the data on which it trains (Cohn et al., 1996). In scenarios where the labeled data are few, a learner that is able to choose the examples to be labeled can develop a model with substantially fewer training data than required in normal supervised learning (with a priori labeling of the data). Various query strategies have been proposed for pool-based active learning, such as: (i) uncertainty sampling, where the samples for which the model is most uncertain are selected for labeling, (ii) query-by-committee, where a group of models predicts the unknown samples and the samples on which they most disagree are selected for labeling, and (iii) estimated error reduction, which estimates the expected future error that would result if some instances were labeled and incorporated into the calibration set.
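Of the strategies listed above, query-by-committee is straightforward to sketch for a regression task. The example below is an illustration under stated assumptions, not the novel strategy proposed in this paper: it measures disagreement as the variance of the committee's predictions for each pool sample, and the function name and array layout are hypothetical.

```python
import numpy as np

def query_by_committee(committee_preds, n_queries):
    """Return the indices of the pool samples on which the committee disagrees most.

    committee_preds: array of shape (n_models, n_pool), one row of predictions
    per committee member. Disagreement is taken to be the per-sample variance
    across members (an assumption; other disagreement measures exist).
    """
    committee_preds = np.asarray(committee_preds, dtype=float)
    disagreement = np.var(committee_preds, axis=0)
    # argsort is ascending, so reverse it and keep the first n_queries entries.
    return np.argsort(disagreement)[::-1][:n_queries]

# Three committee members predicting three pool samples.
preds = [[1.0, 2.0, 3.0],
         [1.0, 2.5, 9.0],
         [1.0, 1.5, 0.0]]
selected = query_by_committee(preds, n_queries=2)
```

Here the members agree exactly on the first sample and disagree most on the third, so the third and second samples are the ones proposed for labeling.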

To account for the less accurate predictions at local or unseen sites, various approaches have been proposed. They mostly focus on either (i) constraining the SSL by selecting a subset of samples based on spectral or other similarities and then building a new model using only this subset of data (e.g. Nocita et al., 2014, Xu et al., 2016, Lobsey et al., 2017, Castaldi et al., 2018), or (ii) augmenting the SSL with site-specific samples (the so-termed spiking approach) (e.g. Sankey et al., 2008, Guerrero et al., 2010, Wetterlind and Stenberg, 2010, Jiang et al., 2017), or a combination thereof. In the first case, no data from the new site are labeled using analytical methods, which may not work well if the new site is not well represented in the existing SSL. In the second case, the most important question is which and how many site-specific samples should be selected for labeling (Lucà et al., 2017, Nawar and Mouazen, 2018).

Our approach is motivated by the need to partially alleviate the cost of the laboratory analyses required (i) for accurate site-specific calibrations and (ii) when establishing large SSLs. A particularly noteworthy application of the second case may be the development of the soil system for Africa under the Soils4Africa project, which is funded by the European Union under the Horizon 2020 SFS-35-2019-2020 call and will take place in the next four years (2020–2023). In particular, it may help identify which portion of the large number of soil samples that will be collected (ca. 20,000) can be predicted very accurately and which portion needs to be included in the calibration set based on its spectral properties, thus partially alleviating the costs associated with the analytical techniques.

The major contributions of this work are as follows:

  • It explores the use of semi-supervised learning in the domain of soil spectroscopy to identify whether it can enhance the prediction accuracy compared to traditionally used supervised approaches.

  • A novel machine learning–based strategy is proposed to select the samples from the unlabeled region that ought to be labeled (i.e. queried) in order to attain a more robust model, and it is compared with existing strategies.

  • It explores the use of active learning and compares its application in soil spectroscopy with other approaches proposed in the literature.

The rest of the paper is organized as follows. Section 2 briefly presents the semi-supervised Laplacian Support Vector Regression algorithm. Section 3 lays out the concepts of active learning and introduces the novel machine learning–based strategy proposed herein. Section 4 presents the experimental setup, including the dataset formulation, the algorithms against which our methodology is compared, and their training. The experimental results are given in Section 5, while the conclusions drawn from this work and suggestions for future work may be found in Section 7.

Support vector regression

The support vector machine was introduced by Vapnik (1998) in the context of structural risk minimization theory, and it aims to maximize its generalization ability by solving a quadratic optimization problem (Vapnik, 1992). Given a set of N training data $D_{trn} = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subset X \times \mathbb{R}$, where $(x_i, y_i)$ denotes a pair of input variables and real-valued output, and $X$ represents the space of input patterns (e.g. $X = \mathbb{R}^d$, with $d$ being the dimensionality of the input patterns $x_i$), the goal is to

Pool-based active learning

This active learning scenario is suited to situations where a large pool of unlabeled data has already been gathered, so that all instances can be evaluated in terms of their informativeness in order to decide which ones should be labeled. For soil spectroscopy, this scenario concerns, for example, the collection of a number of soil samples from a new region and the measurement of their spectral signatures, which form the pool of unlabeled data. Given an existing SSL the question is
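The pool-based scenario described above can be sketched as a simple loop: score the pool, send the selected samples for labeling, add them to the calibration set, and refit. The code below is a minimal illustration under loud assumptions. Ridge regression stands in for the calibration model, the query rule (pick the pool samples farthest from the current calibration set) is a placeholder rather than the strategy proposed in this paper, and `oracle` is a hypothetical callback representing the laboratory analysis.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression; a stand-in for any calibration model."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def pool_based_active_learning(X_lib, y_lib, X_pool, oracle, n_iter=3, batch=2):
    """Iteratively spike the library (X_lib, y_lib) with samples from X_pool.

    oracle(i) returns the measured label of pool sample i (hypothetical).
    Returns the final model weights and the list of queried pool indices.
    """
    X_cal, y_cal = X_lib.copy(), y_lib.copy()
    remaining = list(range(len(X_pool)))
    queried = []
    w = fit_ridge(X_cal, y_cal)
    for _ in range(n_iter):
        if not remaining:
            break
        # Placeholder query rule: distance of each pool sample to its
        # nearest calibration sample; the farthest samples are queried.
        P = X_pool[remaining]
        dists = np.linalg.norm(P[:, None, :] - X_cal[None, :, :], axis=-1)
        nearest = dists.min(axis=1)
        picks = [remaining[i] for i in np.argsort(nearest)[::-1][:batch]]
        # Label the selected samples and add them to the calibration set.
        y_new = np.array([oracle(i) for i in picks])
        X_cal = np.vstack([X_cal, X_pool[picks]])
        y_cal = np.concatenate([y_cal, y_new])
        remaining = [i for i in remaining if i not in picks]
        queried.extend(picks)
        # Refit the model on the spiked calibration set.
        w = fit_ridge(X_cal, y_cal)
    return w, queried
```

Any query strategy can be substituted for the placeholder rule without changing the structure of the loop; only the scoring of the remaining pool samples differs.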

The LUCAS SSL

The statistical office of the European Union (EUROSTAT) organizes a triennial survey of land use, land cover and changes over time across the EU member-states, known as the Land Use and Coverage Area Frame Survey (LUCAS), with the latest survey conducted in 2018. Since 2009, topsoil assessment has been a key component of LUCAS. Its main goal is to establish a harmonized and comparable dataset of topsoil properties at the EU scale by collecting soil samples using a common sampling protocol and

Efficacy of semi-supervised learning

In this section the results of the first experiment detailed in Section 4.5 are presented. To demonstrate how well the semi-supervised learning approach of LapSVR performs, we compared it with SVR, PLS, and SBL across all datasets. The results are given in Table 2 with the RPIQ metric across all experiments further visualized in Fig. 8. It can be readily identified that the Clay content is predicted consistently with smaller error (RPIQ >2) than the OC content whose prediction accuracy is
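For readers unfamiliar with the RPIQ metric, the sketch below computes it using its commonly used definition, the interquartile range of the observed values divided by the root mean squared error of prediction. This excerpt does not show the paper's own formula, so the definition and the function name are assumptions made for illustration.

```python
import numpy as np

def rpiq(y_true, y_pred):
    """Ratio of performance to interquartile distance: (Q3 - Q1) / RMSE.

    Higher values indicate better predictions relative to the spread of
    the observed data.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # First and third quartiles of the observed values.
    q1, q3 = np.percentile(y_true, [25, 75])
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return (q3 - q1) / rmse

# Observed values with quartiles 2 and 4, and predictions that are off by 1.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 3.0, 4.0, 5.0, 6.0]
score = rpiq(observed, predicted)
```

In this toy case the interquartile range is 2 and the RMSE is 1, so the score is 2.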

Discussion

In this work, the synergy of active and semi-supervised learning was explored in the domain of soil spectroscopy to utilize the existing efforts establishing large SSLs when predicting the soil properties of a new region that the model has not seen. This is a common case in this domain, where a large cost is associated with labeling data through analytical techniques. Other approaches in the past have focused on either constraining the SSL or selecting a subset of spiking samples using a priori

Conclusions

In this paper, the semi-supervised learning approach (and in particular the LapSVR algorithm) was applied to derive local spectroscopic calibrations of soil properties in an unknown region using an existing SSL of another region. Compared to other standard supervised approaches (i.e. SVR, PLS, and SBL), this technique yielded statistically better results in terms of accuracy of prediction. Moreover, the synergistic use of LapSVR with the pool-based active learning scenario was also examined to

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The LUCAS topsoil dataset used in this work was made available by the European Commission through the European Soil Data Centre managed by the Joint Research Centre (JRC), http://esdac.jrc.ec.europa.eu/.

References (59)

  • R.A. Rossel et al. Using data mining to model and interpret soil diffuse reflectance spectra. Geoderma (2010)
  • J.B. Sankey et al. Comparing local vs. global visible and near-infrared (VisNIR) diffuse reflectance spectroscopy (DRS) calibrations for the prediction of soil clay, organic C and inorganic C. Geoderma (2008)
  • M. Seidel et al. Strategies for the efficient estimation of soil organic carbon at the field scale with vis-NIR spectroscopy: Spectral libraries and spiking vs. local calibrations. Geoderma (2019)
  • N.L. Tsakiridis et al. Using interpretable fuzzy rule-based models for the estimation of soil organic carbon from VNIR/SWIR spectra and soil texture. Chemometrics Intell. Lab. Syst. (2019)
  • N.L. Tsakiridis et al. An evolutionary fuzzy rule-based system applied to the prediction of soil organic carbon from soil spectral libraries. Appl. Soft Comput. (2019)
  • N. Tziolas et al. A memory-based learning approach utilizing combined spectral sources and geographical proximity for improved VIS-NIR-SWIR soil properties estimation. Geoderma (2019)
  • A. Agrawal et al. A rewriting system for convex optimization problems. J. Control Decision (2018)
  • R. Amundson et al. Soil and human security in the 21st century. Science (2015)
  • L. Anjos et al. World reference base for soil resources 2014. (2015)
  • M. Belkin et al. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research (2006)
  • L. Breiman. Random forests. Mach. Learn. (2001)
  • F. Castaldi et al. Estimation of soil organic carbon in arable soil in Belgium and Luxembourg with the LUCAS topsoil database. Eur. J. Soil Sci. (2018)
  • D.A. Cohn et al. Active learning with statistical models. Journal of Artificial Intelligence Research (1996)
  • J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research (2006)
  • S. Diamond et al. CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res. (2016)
  • H. Drucker et al. Support vector regression machines.
  • M. Friedman. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. (1937)
  • C. Guerrero et al. Assessment of soil organic carbon at local scale with spiked NIR calibrations: effects of selection and extra-weighting on the spiking subset. Eur. J. Soil Sci. (2014)
  • Hatfield, J.L., Sauer, T.J., Cruse, R.M., 2017. Soil: The Forgotten Piece of the Water, Food, Energy Nexus. (pp. 1–46)....