Predicting crop root concentration factors of organic contaminants with machine learning models
Introduction
Release of organic chemicals into the agroecosystem occurs intentionally through agrichemical application and unintentionally through atmospheric deposition, contaminated soil amendments (such as manure or biosolids) or the use of contaminated irrigation water (Jiang et al., 2009, Li et al., 2011, Qi et al., 2020, Yurdakul et al., 2019). Plants can act as vectors for transferring chemicals from the environment to the food chain. Contaminant uptake by crop roots from contaminated soils is key to subsequent translocation and accumulation in plants (Pullagurala et al., 2018). Thus, extensive laboratory and field studies have been conducted to measure the transfer of many chemicals from soil to plant tissues (Doucette et al., 2018). Specifically, the transfer of organic contaminants from soils to crop roots is usually evaluated by root concentration factors (RCFs). RCFs are defined as the ratio of contaminant concentration in root to that in soil by assuming an equilibrium state for contaminants sorbed by the soil, dissolved in soil pore water, and accumulated in plant roots (Torralba Sanchez et al., 2017, McKone and Maddalena, 2007). Therefore, the interactions among chemicals, soils, and plants collectively determine the RCFs. These interactions include sorption–desorption of contaminants between soil and pore water, and plant root uptake of contaminants from soils (Li et al., 2019). The complex interactions of contaminants in plant-soil-water systems have most commonly been studied in experimental systems with limited soil and plant types for a small number of chemicals. The limited data and poor coverage of both chemical and environmental (plant and soil) properties do not allow for a systematic evaluation that links the influences of chemical, soil and plant properties to RCFs. In addition, new chemicals are being developed and discharged to the environment every year, and their uptake by crops may not be measured in a timely manner. 
Therefore, an alternative approach to lab and field experiments is to develop reliable prediction models, which can be used as a rapid screening tool to initially evaluate the potential transfer of contaminants from soils to crops (McKone and Maddalena, 2007, Mamy et al., 2015).
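The RCF defined above is a simple concentration ratio, and the following minimal Python sketch makes the calculation explicit. The function name, the assumption that both concentrations share the same units (for example mg/kg), and the sample values are hypothetical and are not taken from this study.

```python
def root_concentration_factor(c_root: float, c_soil: float) -> float:
    """Return RCF = C_root / C_soil, with both concentrations in the same units."""
    if c_soil <= 0:
        raise ValueError("soil concentration must be positive")
    if c_root < 0:
        raise ValueError("root concentration cannot be negative")
    return c_root / c_soil


# Hypothetical concentrations, for illustration only (not measured data).
rcf = root_concentration_factor(c_root=2.4, c_soil=1.2)
print(f"RCF = {rcf:.2f}")
```

An RCF above 1 in this sketch would indicate that the contaminant is more concentrated in the root than in the surrounding soil.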
Most existing models that predict plant uptake of organic contaminants from soil are either empirical regression models or mechanistic models. Empirical models have traditionally relied upon limited physicochemical properties (e.g., logKow and molecular weight) (Collins et al., 2006), while mechanistic models were developed based on the assumption of several uptake processes (Feng et al., 2019, Fantke et al., 2011). For example, Topp et al. (1986) proposed a simple linear regression model based on molecular weight to predict plant uptake of organic chemicals from soil. Ryan et al. (1988) combined empirical relationships of Briggs et al. (1982) with a simple partition model to build a screening model for assessing the uptake of non-ionic chemicals from soils. Trapp and Matthies (1995) developed a one-compartment model for uptake of organic chemicals by foliar vegetables incorporating a number of different uptake processes and potentially significant loss mechanisms. Chiou et al. (2001a) proposed a mechanistic partition-limited model for the passive root uptake of contaminants from soils. These models are limited either by the small number of properties considered or by the oversimplification of the complex soil-plant-chemical interactions into several governing equations based on the assumed uptake processes. Hence, it is challenging to generalize these models to a broader spectrum of chemicals, plants, and soils.
In contrast, machine learning models do not rely on the assumed mechanisms to discover data patterns and can learn complex relationships between input features and predicted targets (Ahmad et al., 2021, Chen et al., 2021). They can learn the complex functions directly from data by training large numbers of parameters, which makes it possible to produce accurate predictions when underlying mechanisms are not completely understood or parameterized. There are in general two types of machine learning models, depending on whether training labels (e.g., RCF values in this study) are used. Supervised machine learning models utilize labels to train the model, while unsupervised machine learning models do not require labels. Common supervised machine learning models include Gradient Boosting Regression Tree (GBRT) (Friedman, 2001), Fully Connected Neural Network (FCNN) (Rumelhart et al., 1986), Support Vector Regression (SVR) (Cybenko, 1989), and Random Forest (RF) (Breiman, 2001), among many others. In fact, various machine learning models were developed based on these classic models. For example, LightBoost and XGBoost are variants of GBRT, the convolutional neural network is one type of neural network (NN), and DeepForest is a variant of RF. Recently, machine learning models have been increasingly used in environmental applications with varying performance. For example, RF was previously shown to perform the best among GBRT, FCNN, SVR, and RF in predicting chemical ecotoxicity (HC50) (Hou et al., 2020a, Hou et al., 2020c). Deep neural network models were widely used to predict chemical properties with superior performance (Feinberg et al., 2018, Walters and Barzilay, 2020). More specifically, Bagheri et al. (2020) developed an NN model to predict RCFs from hydroponic studies using multiple physicochemical properties including molecular weight, logKow, rotatable bonds, hydrogen bond donor, hydrogen bond acceptor, and polar surface area. Gao et al. (2021) built a GBRT model with ECFP4 fingerprints to predict RCFs from soil concentrations. However, there has been no systematic comparison of various machine learning models for predicting RCFs, which is essential to the future application of machine learning in risk assessment.
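A systematic comparison of the four supervised model families named above could be set up roughly as in the following sketch. It assumes scikit-learn is available, uses synthetic stand-in data rather than the RCF dataset, and takes scikit-learn's MLPRegressor as a stand-in for an FCNN. The hyperparameters, the 5-fold cross-validation scheme, and the R2 metric are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the RCF dataset): 120 samples, 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=120)

# One representative estimator per model family; settings are assumptions.
models = {
    "GBRT": GradientBoostingRegressor(random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    # SVR and the neural network are scale-sensitive, so features are standardized.
    "SVR": make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    "FCNN": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    ),
}

# Evaluate every model on the same folds so the scores are comparable.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, scoring="r2")
    scores[name] = float(r2.mean())
    print(f"{name}: mean R2 = {scores[name]:.3f}")
```

Sharing one KFold object across models keeps the train/test splits identical, so differences in the scores reflect the models rather than the partitioning.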
Additionally, unsupervised machine learning models can visualize and discover patterns from data. However, previous studies usually relied on supervised learning to predict RCFs, whereas unsupervised methods remain less explored. For example, t-SNE is a popular unsupervised learning method and has been successfully used in visualizing high dimensional data (e.g., discovering cell patterns from single cell sequencing data) (Belkina et al., 2019, Kobak and Berens, 2019). Recently, it has also been applied to visualize per- and polyfluoroalkyl substances C-F bond energy patterns based on molecular descriptors (Raza et al., 2019). The ability of t-SNE to reveal local similarities of data points in the dataset makes it a valuable tool for exploring high dimensional data (Van der Maaten and Hinton, 2008), which can be used for evaluating RCFs from a multitude of chemical, soil, and plant properties. Finally, the crop uptake of organic contaminants from soils involves the complex interactions among contaminants, soils, and plants. However, the analysis of important properties or features for RCF prediction has not been fully studied. Identifying important properties related to RCF prediction can enhance our understanding of plant root uptake of organic contaminants.
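The way t-SNE can be used to explore such a dataset is sketched below with scikit-learn's TSNE. The feature matrix here is random placeholder data, and the sample count, perplexity, and initialization are assumptions chosen only so that the example runs quickly. The key point is that the target values are never passed to the embedding and would serve only to color the resulting 2D points.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Placeholder descriptors (NOT real chemical, soil, or plant properties).
rng = np.random.default_rng(0)
n_samples, n_features = 60, 8
features = rng.normal(size=(n_samples, n_features))

# Placeholder targets: used only for coloring a plot, never for fitting.
rcf_values = rng.lognormal(size=n_samples)

# Standardize so that no single descriptor dominates the pairwise distances.
scaled = StandardScaler().fit_transform(features)

# Perplexity must be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=15, init="pca", random_state=0)
embedding = tsne.fit_transform(scaled)

print(embedding.shape)
```

Each row of `embedding` could then be drawn as a point and colored by the corresponding entry of `rcf_values` to check whether neighboring points share similar targets.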
To fill the aforementioned knowledge gaps, this study aimed to predict crop RCFs from the properties (features) of contaminants, soils, and plants using both supervised and unsupervised machine learning methods. Briefly, the patterns from the collected crop RCF dataset were first investigated with an unsupervised machine learning algorithm, namely t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten and Hinton, 2008). The uptake of organic chemicals by crops from soils was then predicted using four supervised machine learning models including GBRT, FCNN, RF, and SVR. The performances of the four machine learning models were systematically compared. Furthermore, two different feature importance analysis methods and individual conditional expectation analysis were performed. These analyses identified key parameters for predicting RCFs and revealed the complex relationships between RCFs and the properties of the target chemicals, plants, and soils.
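This section does not name the two feature importance methods. Permutation importance is one widely used option, and the following standard-library sketch illustrates the idea under that assumption. The toy model, the synthetic data, and all function names are hypothetical and serve only to show the mechanics: a feature matters if shuffling its column degrades the predictions.

```python
import random
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred))


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other features untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(mean(increases))
    return importances


# Toy model: strong dependence on feature 0, weak on feature 1, none on feature 2.
def toy_model(row):
    return 3.0 * row[0] + 0.5 * row[1]


data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]

scores = permutation_importance(toy_model, X, y)
for j, score in enumerate(scores):
    print(f"feature {j}: importance = {score:.3f}")
```

Because the toy model ignores feature 2, shuffling that column leaves the predictions unchanged and its importance is zero, while feature 0 receives the largest score.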
RCF dataset
An RCF database for crop uptake of organic contaminants was compiled. The data were initially collected and screened from previous peer-reviewed studies (published 1959–2020) using Web of Science with search terms that included “organic contaminants”, “plant uptake from soils”, “bioaccumulation factors”, and “bioconcentration factors”. Only studies that reported soil organic matter content (fom) and accessible crop root lipid content (flipid) were included in the dataset. These two soil and
t-SNE plot of the RCF dataset
The RCF dataset was first explored with t-SNE to examine the effectiveness of property representation of chemicals, plants and soils. As an unsupervised machine learning method, t-SNE can find patterns of RCFs from other input features without using any RCF data. The 243 data points based on their property descriptors were clustered in Fig. 2 using t-SNE and then colored according to their corresponding RCF values. By reducing the high dimensional data into 2D space, data points with similar
Conclusions
With increasing exposure of organic chemicals to agroecosystems, it is essential to assess their uptake and accumulation in food crops. Accurate prediction of RCFs from soil has been challenging due to the complex interactions among chemicals, soils, and plants. Existing empirical regression models and mechanistic models were usually limited by their prediction abilities for diverse types of chemicals, plants, and soils. Emerging machine learning models provide new methodologies for predicting
CRediT authorship contribution statement
Feng Gao: Conceptualization, Methodology, Formal analysis, Software, Investigation, Resources, Visualization, Writing – original draft, Writing – review & editing. Yike Shen: Conceptualization, Investigation, Writing – review & editing. J. Brett Sallach: Conceptualization, Writing – review & editing. Hui Li: Conceptualization, Writing – review & editing. Wei Zhang: Conceptualization, Writing – review & editing. Yuanbo Li: Conceptualization, Funding, Writing – review & editing. Cun Liu:
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The work was supported by the National Key Research and Development Program of China, China (2019YFC1604503 and 2016YFD0800403).
References (62)
- et al. Adsorption of Indigo Carmine dye onto the surface-modified adsorbent prepared from municipal waste and simulation using deep neural network. J. Hazard. Mater. (2021)
- et al. Examining plant uptake and translocation of emerging contaminants using machine learning: implications to food security. Sci. Total Environ. (2020)
- et al. Evaluation of different boosting ensemble machine learning models and novel deep learning and boosting framework for head-cut gully erosion susceptibility. J. Environ. Manag. (2021)
- et al. Mechanistic study on uptake and transport of pharmaceuticals in lettuce from water. Environ. Int. (2019)
- et al. Plant uptake of pesticides and human health: dynamic modeling of residues in wheat and ingestion intake. Chemosphere (2011)
- et al. Dynamic modeling of famoxadone and oxathiapiprolin residue on cucumber and Chinese cabbage based on tomato and lettuce archetypes. J. Hazard. Mater. (2019)
- et al. Comparison between random forest and gradient boosting machine methods for predicting Listeria spp. prevalence in the environment of pastured poultry farms. Food Res. Int. (2019)
- et al. Estimate ecotoxicity characterization factors for chemicals in life cycle assessment using machine learning models. Environ. Int. (2020)
- et al. Lipid–water partition coefficients and correlations with uptakes by algae of organic compounds. J. Hazard. Mater. (2014)
- et al. Occurrence, distribution and possible sources of organochlorine pesticides in agricultural soil of Shanghai, China. J. Hazard. Mater. (2009)
- Uptake kinetics and accumulation of pesticides in wheat (Triticum aestivum L.): impact of chemical and plant properties. Environ. Pollut.
- Fishpond sediment-borne DDTs and HCHs in the Pearl River Delta: characteristics, environmental risk and fate following the use of the sediment as plant growth media. J. Hazard. Mater.
- Comparison of machine learning algorithms for mapping mango plantations based on Gaofen-1 imagery. J. Integr. Agric.
- Plant uptake and translocation of contaminants of emerging concern in soil. Sci. Total Environ.
- Investigation of polycyclic aromatic hydrocarbons in soils from Caserta provincial territory, southern Italy: Spatial distribution, source apportionment, and risk assessment. J. Hazard. Mater.
- Plant uptake of non-ionic organic chemicals from soils. Chemosphere
- Factors affecting the uptake of 14C-labeled organic chemicals by plants from soil. Ecotoxicol. Environ. Saf.
- Performance of the partition-limited model on predicting ryegrass uptake of polycyclic aromatic hydrocarbons. Chemosphere
- Levels, temporal/spatial variations and sources of PAHs and PCBs in soil of a highly industrialized area. Atmos. Pollut. Res.
- Influence of plant root morphology and tissue composition on phenanthrene uptake: stepwise multiple linear regression analysis. Environ. Pollut.
- The application of machine learning methods for prediction of metal sorption onto biochars. J. Hazard. Mater.
- Permutation importance: a corrected feature importance measure. Bioinformatics
- Automated optimized parameters for T-distributed stochastic neighbor embedding improve visualization and analysis of large datasets. Nat. Commun.
- Random forests. Mach. Learn.
- Relationships between lipophilicity and root uptake and translocation of non‐ionised chemicals by barley. Pestic. Sci.
- Fate and uptake of pharmaceuticals in soil–plant systems. J. Agric. Food Chem.
- A partition-limited model for the plant uptake of organic contaminants from soil and water. Environ. Sci. Technol.
- Approximation by superpositions of a sigmoidal function. Math. Control. Signals, Syst.
- A review of measured bioaccumulation data on terrestrial plants for organic chemicals: metrics, variability, and the need for standardized measurement protocols. Environ. Toxicol. Chem.
- Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J. Med. Chem.
1 Equal contribution.