Predicting crop root concentration factors of organic contaminants with machine learning models

https://doi.org/10.1016/j.jhazmat.2021.127437

Highlights

  • FCNN model achieved the best prediction performance for RCFs.

  • Machine learning models performed better than the traditional linear regression model.

  • Machine learning can identify important property descriptors for predicting RCFs.

  • Machine learning can learn complex relationships in contaminant-soil-plant systems.

Abstract

Accurate prediction of uptake and accumulation of organic contaminants by crops from soils is essential to assessing human exposure via the food chain. However, traditional empirical or mechanistic models frequently show variable performance due to complex interactions among contaminants, soils, and plants. Thus, in this study, different machine learning algorithms were compared and applied to predict root concentration factors (RCFs) based on a dataset comprising 57 chemicals and 11 crops, followed by comparison with a traditional linear regression model as the benchmark. The RCF patterns and predictions were investigated by unsupervised t-distributed stochastic neighbor embedding and four supervised machine learning models including Random Forest, Gradient Boosting Regression Tree, Fully Connected Neural Network, and Support Vector Regression based on 15 property descriptors. The Fully Connected Neural Network demonstrated superior prediction performance for RCFs (R² = 0.79, mean absolute error [MAE] = 0.22) over other machine learning models (R² = 0.68–0.76, MAE = 0.23–0.26). All four machine learning models performed better than the traditional linear regression model (R² = 0.62, MAE = 0.29). Four key property descriptors were identified in predicting RCFs. Specifically, increasing root lipid content and decreasing soil organic matter content increased RCFs, while increasing excess molar refractivity and molecular volume of contaminants decreased RCFs. These results show that machine learning models can improve prediction accuracy by learning nonlinear relationships between RCFs and properties of contaminants, soils, and plants.

Introduction

Release of organic chemicals into the agroecosystem occurs intentionally through agrichemical application and unintentionally through atmospheric deposition, contaminated soil amendments (such as manure or biosolids) or the use of contaminated irrigation water (Jiang et al., 2009, Li et al., 2011, Qi et al., 2020, Yurdakul et al., 2019). Plants can act as vectors for transferring chemicals from the environment to the food chain. Contaminant uptake by crop roots from contaminated soils is key to subsequent translocation and accumulation in plants (Pullagurala et al., 2018). Thus, extensive laboratory and field studies have been conducted to measure the transfer of many chemicals from soil to plant tissues (Doucette et al., 2018). Specifically, the transfer of organic contaminants from soils to crop roots is usually evaluated by root concentration factors (RCFs). An RCF is defined as the ratio of the contaminant concentration in the root to that in the soil, assuming an equilibrium state among contaminants sorbed by the soil, dissolved in soil pore water, and accumulated in plant roots (Torralba Sanchez et al., 2017, McKone and Maddalena, 2007). Therefore, the interactions among chemicals, soils, and plants collectively determine the RCFs. These interactions include sorption–desorption of contaminants between soil and pore water, and plant root uptake of contaminants from soils (Li et al., 2019). The complex interactions of contaminants in plant-soil-water systems have most commonly been studied in experimental systems with limited soil and plant types for a small number of chemicals. The limited data and poor coverage of both chemical and environmental (plant and soil) properties do not allow for a systematic evaluation that links the influences of chemical, soil, and plant properties to RCFs. In addition, new chemicals are being developed and discharged to the environment every year, and their uptake by crops may not be measured in a timely manner. Therefore, an alternative approach to lab and field experiments is to develop reliable prediction models, which can be used as a rapid screening tool to initially evaluate the potential transfer of contaminants from soils to crops (McKone and Maddalena, 2007, Mamy et al., 2015).
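
As a minimal illustration of the RCF definition above, the ratio can be computed as in the following Python sketch. The function name, the input values, and the log10 transform are assumptions made for illustration; the measurements are hypothetical and are not taken from the study's dataset.

```python
import math


def root_concentration_factor(c_root: float, c_soil: float) -> float:
    """Ratio of contaminant concentration in root to that in soil.

    Both concentrations must share the same units (e.g. mg/kg). Whether
    they are on a dry- or fresh-weight basis is left to the caller.
    """
    if c_soil <= 0:
        raise ValueError("soil concentration must be positive")
    if c_root < 0:
        raise ValueError("root concentration cannot be negative")
    return c_root / c_soil


# Hypothetical measurements, for illustration only.
rcf = root_concentration_factor(c_root=2.5, c_soil=10.0)

# A log10 transform is a common convenience for ratio data; using it here
# is an assumption, not something stated in the text above.
log_rcf = math.log10(rcf)
print(f"RCF = {rcf:.2f}, log10(RCF) = {log_rcf:.2f}")
```

An RCF below 1, as in this made-up case, would mean the contaminant is less concentrated in the root than in the surrounding soil.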

Most existing models that predict plant uptake of organic contaminants from soil are either empirical regression models or mechanistic models. Empirical models have traditionally relied upon a limited set of physicochemical properties (e.g., logKow and molecular weight) (Collins et al., 2006), while mechanistic models were developed based on the assumption of several uptake processes (Feng et al., 2019, Fantke et al., 2011). For example, Topp et al. (1986) proposed a simple linear regression model based on molecular weight to predict plant uptake of organic chemicals from soil. Ryan et al. (1988) combined empirical relationships of Briggs et al. (1982) with a simple partition model to build a screening model for assessing the uptake of non-ionic chemicals from soils. Trapp and Matthies (1995) developed a one-compartment model for uptake of organic chemicals by foliar vegetables incorporating a number of different uptake processes and potentially significant loss mechanisms. Chiou et al. (2001a) proposed a mechanistic partition-limited model for the passive root uptake of contaminants from soils. These models are limited either by the number of properties considered or by the oversimplification of the complex soil-plant-chemical interactions into several governing equations based on the assumed uptake processes. Hence, it is challenging to generalize these models to a broader spectrum of chemicals, plants, and soils.
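
To make the idea of a single-descriptor empirical regression concrete, the sketch below fits an ordinary least squares line in plain Python. It is not the model of Topp et al. (1986) or of any other cited study. The descriptor values, the response values, and the function name `fit_simple_linear` are synthetic or hypothetical and chosen only so the arithmetic is easy to follow.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic data: one descriptor (a made-up molecular weight) and a
# made-up uptake response that happens to fall on a straight line.
mol_weight = [100.0, 200.0, 300.0, 400.0]
response = [0.6, 0.2, -0.2, -0.6]

slope, intercept = fit_simple_linear(mol_weight, response)
predicted = slope * 250.0 + intercept
print(f"slope = {slope:.4f}, intercept = {intercept:.2f}")
print(f"prediction at 250: {predicted:.2f}")
```

A model of this form has only two parameters, which is why it cannot represent interactions among chemical, soil, and plant properties.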

In contrast, machine learning models do not rely on the assumed mechanisms to discover data patterns and can learn complex relationships between input features and predicted targets (Ahmad et al., 2021, Chen et al., 2021). They can learn the complex functions directly from data by training large numbers of parameters, which makes it possible to produce accurate predictions when underlying mechanisms are not completely understood or parameterized. There are in general two types of machine learning models, depending on whether training labels (e.g., RCF values in this study) are used. Supervised machine learning models utilize labels to train the model, while unsupervised machine learning models do not require labels. Common supervised machine learning models include Gradient Boosting Regression Tree (GBRT) (Friedman, 2001), Fully Connected Neural Network (FCNN) (Rumelhart et al., 1986), Support Vector Regression (SVR) (Cybenko, 1989), and Random Forest (RF) (Breiman, 2001) among many others. In fact, various machine learning models were developed based on these classic models. For example, LightBoost and XGBoost are variants of GBRT, the convolutional neural network is one type of neural network (NN), and DeepForest is a variant of RF. Recently, machine learning models have been increasingly used in environmental applications with varying performance. For example, RF was previously shown to perform the best among GBRT, FCNN, SVR, and RF in predicting chemical ecotoxicity (HC50) (Hou et al., 2020a, Hou et al., 2020c). Deep neural network models were widely used to predict chemical properties with superior performance (Feinberg et al., 2018, Walters and Barzilay, 2020). More specifically, Bagheri et al. (2020) developed an NN model to predict RCFs from hydroponic studies using multiple physicochemical properties including molecular weight, logKow, rotatable bonds, hydrogen bond donor, hydrogen bond acceptor, and polar surface area. Gao et al. (2021) built a GBRT model with ECFP4 fingerprints to predict RCFs from soil concentrations. However, there has been no systematic comparison of various machine learning models for predicting RCFs, which is essential to the future application of machine learning in risk assessment.
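
A systematic comparison of models needs shared metrics. The abstract reports R² and mean absolute error (MAE), so a minimal Python sketch of both follows. It assumes R² means the coefficient of determination (1 minus the ratio of residual to total sum of squares), which this excerpt does not state explicitly. The observed values and the two sets of predictions are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally sized")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally sized")
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("observed values must not all be identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out observations and predictions from two models.
observed = [0.0, 1.0, 2.0, 3.0]
model_a = [0.1, 0.9, 2.1, 2.9]
model_b = [0.5, 0.5, 2.5, 2.5]

for name, pred in (("model A", model_a), ("model B", model_b)):
    print(name, f"R2 = {r_squared(observed, pred):.3f}",
          f"MAE = {mean_absolute_error(observed, pred):.2f}")
```

Scoring every candidate model on the same held-out data with the same two functions is what makes figures such as those in the abstract comparable.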

Additionally, unsupervised machine learning models can visualize and discover patterns from data. However, in previous studies supervised learning was usually used to predict RCFs, whereas unsupervised methods are less explored. For example, t-SNE is a popular unsupervised learning method and has been successfully used in visualizing high dimensional data (e.g., discovering cell patterns from single cell sequencing data) (Belkina et al., 2019, Kobak and Berens, 2019). Recently, it has also been applied to visualize C-F bond energy patterns of per- and polyfluoroalkyl substances based on molecular descriptors (Raza et al., 2019). The ability of t-SNE to reveal local similarities of data points in the dataset makes it a valuable tool for exploring high dimensional data (Van der Maaten and Hinton, 2008), which can be used for evaluating RCFs from a multitude of chemical, soil, and plant properties. Finally, the crop uptake of organic contaminants from soils involves the complex interactions among contaminants, soils, and plants. However, the analysis of important properties or features for RCF prediction has not been fully studied. Identifying important properties related to RCF prediction can enhance our understanding of plant root uptake of organic contaminants.
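
For readers unfamiliar with how t-SNE captures "local similarities", its standard formulation (Van der Maaten and Hinton, 2008) is summarized below. The symbols are introduced here for illustration: x_i is the descriptor vector of data point i, y_i is its low-dimensional (e.g., 2D) coordinate, n is the number of data points, and sigma_i is a per-point bandwidth set through the perplexity hyperparameter.

```latex
\begin{align*}
% Conditional similarity of point j to point i in the descriptor space
p_{j|i} &= \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
                {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \\[4pt]
% Similarity in the low-dimensional map (Student t kernel, one degree of freedom)
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\[4pt]
% Objective minimized over the map coordinates y_1, ..., y_n
C &= \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align*}
```

Only the input descriptors enter these expressions. Labels such as RCF values play no role in building the map and can be overlaid afterwards, for example as point colors.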

To fill the aforementioned knowledge gaps, this study aimed to predict crop RCFs from the properties (features) of contaminants, soils, and plants using both supervised and unsupervised machine learning methods. Briefly, the patterns in the collected crop RCF dataset were first investigated with an unsupervised machine learning algorithm, namely t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008). The uptake of organic chemicals by crops from soils was then predicted using four supervised machine learning models including GBRT, FCNN, RF, and SVR. The performances of the four machine learning models were systematically compared. Furthermore, two different feature importance analysis methods and individual conditional expectation analysis were performed. These analyses identified key parameters for predicting RCFs and revealed the complex relationships between RCFs and the properties of target chemicals, plants, and soils.
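
This excerpt does not name the two feature importance methods. Permutation importance appears in the reference list (Altmann et al., 2010), so the sketch below uses it as one plausible example, together with an individual conditional expectation (ICE) curve. Everything in the code is hypothetical: `toy_model` stands in for a trained regressor, the three features are random numbers rather than real descriptors, and the function names are chosen for illustration.

```python
import random


def toy_model(row):
    """Stand-in for a trained regressor; uses features 0 and 1, ignores 2."""
    return 2.0 * row[0] - row[1] ** 2


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean increase in MAE after shuffling one feature column at a time."""
    rng = random.Random(seed)
    baseline = mae(y, [model(r) for r in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X, column)]
            increases.append(mae(y, [model(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def ice_curve(model, row, feature, grid):
    """Predictions for one sample as a single feature sweeps over a grid."""
    return [model(row[:feature] + [v] + row[feature + 1:]) for v in grid]


# Synthetic data: 50 samples, 3 features drawn uniformly from [0, 1].
data_rng = random.Random(42)
X = [[data_rng.uniform(0.0, 1.0) for _ in range(3)] for _ in range(50)]
y = [toy_model(r) for r in X]

importances = permutation_importance(toy_model, X, y)
curve = ice_curve(toy_model, X[0], feature=1, grid=[0.0, 0.5, 1.0])
print("permutation importances:", [round(v, 3) for v in importances])
print("ICE curve for feature 1:", [round(v, 3) for v in curve])
```

Shuffling a feature the model ignores leaves the error unchanged, so its importance is zero. The ICE curve shows how the prediction for one sample responds, possibly nonlinearly, as one feature is varied and the others are held fixed.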

RCF dataset

An RCF database for crop uptake of organic contaminants was compiled. The data were initially collected and screened from previous peer-reviewed studies (published 1959–2020) using Web of Science with search terms that included “organic contaminants”, “plant uptake from soils”, “bioaccumulation factors”, and “bioconcentration factors”. Only studies that reported soil organic matter content (f_om) and accessible crop root lipid content (f_lipid) were included in the dataset. These two soil and

t-SNE plot of the RCF dataset

The RCF dataset was first explored with t-SNE to examine the effectiveness of property representation of chemicals, plants and soils. As an unsupervised machine learning method, t-SNE can find patterns of RCFs from other input features without using any RCF data. The 243 data points based on their property descriptors were clustered in Fig. 2 using t-SNE and then colored according to their corresponding RCF values. By reducing the high dimensional data into 2D space, data points with similar

Conclusions

With increasing exposure of organic chemicals to agroecosystems, it is essential to assess their uptake and accumulation in food crops. Accurate prediction of RCFs from soil has been challenging due to the complex interactions among chemicals, soils, and plants. Existing empirical regression models and mechanistic models were usually limited by their prediction abilities for diverse types of chemicals, plants, and soils. Emerging machine learning models provide new methodologies for predicting

CRediT authorship contribution statement

Feng Gao: Conceptualization, Methodology, Formal analysis, Software, Investigation, Resources, Visualization, Writing – original draft, Writing – review & editing. Yike Shen: Conceptualization, Investigation, Writing – review & editing. J. Brett Sallach: Conceptualization, Writing – review & editing. Hui Li: Conceptualization, Writing – review & editing. Wei Zhang: Conceptualization, Writing – review & editing. Yuanbo Li: Conceptualization, Funding, Writing – review & editing. Cun Liu:

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work was supported by the National Key Research and Development Program of China (2019YFC1604503 and 2016YFD0800403).

References (62)

  • Q. Liu et al.

    Uptake kinetics and accumulation of pesticides in wheat (Triticum aestivum L.): impact of chemical and plant properties

    Environ. Pollut.

    (2021)
  • H. Li et al.

    Fishpond sediment-borne DDTs and HCHs in the Pearl River Delta: characteristics, environmental risk and fate following the use of the sediment as plant growth media

    J. Hazard. Mater.

    (2011)
  • H.-X. Luo et al.

    Comparison of machine learning algorithms for mapping mango plantations based on Gaofen-1 imagery

    J. Integr. Agric.

    (2020)
  • V.L.R. Pullagurala et al.

    Plant uptake and translocation of contaminants of emerging concern in soil

    Sci. Total Environ.

    (2018)
  • P. Qi et al.

    Investigation of polycyclic aromatic hydrocarbons in soils from Caserta provincial territory, southern Italy: Spatial distribution, source apportionment, and risk assessment

    J. Hazard. Mater.

    (2020)
  • J. Ryan et al.

    Plant uptake of non-ionic organic chemicals from soils

    Chemosphere

    (1988)
  • E. Topp et al.

    Factors affecting the uptake of 14C-labeled organic chemicals by plants from soil

    Ecotoxicol. Environ. Saf.

    (1986)
  • Z. Yang et al.

    Performance of the partition-limited model on predicting ryegrass uptake of polycyclic aromatic hydrocarbons

    Chemosphere

    (2007)
  • S. Yurdakul et al.

    Levels, temporal/spatial variations and sources of PAHs and PCBs in soil of a highly industrialized area

    Atmos. Pollut. Res.

    (2019)
  • X. Zhan et al.

    Influence of plant root morphology and tissue composition on phenanthrene uptake: stepwise multiple linear regression analysis

    Environ. Pollut.

    (2013)
  • X. Zhu et al.

    The application of machine learning methods for prediction of metal sorption onto biochars

    J. Hazard. Mater.

    (2019)
  • A. Altmann et al.

    Permutation importance: a corrected feature importance measure

    Bioinformatics

    (2010)
  • A.C. Belkina et al.

    Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets

    Nat. Commun.

    (2019)
  • L. Breiman

    Random forests

    Mach. Learn.

    (2001)
  • G.G. Briggs et al.

    Relationships between lipophilicity and root uptake and translocation of non‐ionised chemicals by barley

    Pestic. Sci.

    (1982)
  • L.J. Carter et al.

    Fate and uptake of pharmaceuticals in soil–plant systems

    J. Agric. Food Chem.

    (2014)
  • C.T. Chiou et al.

    A partition-limited model for the plant uptake of organic contaminants from soil and water

    Environ. Sci. Technol.

    (2001)
  • Collins, C., Martin, I., Fryer, M., 2006. Evaluation of models for predicting plant uptake of chemicals from soil,...
  • G. Cybenko

    Approximation by superpositions of a sigmoidal function

    Math. Control Signals Syst.

    (1989)
  • W.J. Doucette et al.

    A review of measured bioaccumulation data on terrestrial plants for organic chemicals: metrics, variability, and the need for standardized measurement protocols

    Environ. Toxicol. Chem.

    (2018)
  • P. Ertl et al.

    Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties

    J. Med. Chem.

    (2000)
1 Equal contribution.
