Research Paper
Machine learning-based biomarkers identification from toxicogenomics – Bridging to regulatory relevant phenotypic endpoints
Introduction
Genotoxicity is of great concern because of its link to mutagenicity, carcinogenicity, and cancer, and there is an urgent demand for genotoxicity screening and risk assessment in various environmental and health applications (USEPA, 2016, Lan et al., 2016, Gou et al., 2014, Gou and Gu, 2011). In the absence of, or combined with, in-vivo carcinogenicity data, in-vitro or cell-based genotoxicity assays provide supporting data for cancer risk assessment (Ahn et al., 2009). Recently, toxicogenomics has emerged as a promising technology that reveals molecular-level activities, at the gene, protein, or metabolite level of organisms, in response to environmental contaminants and may represent the underlying cellular network mechanisms of toxicity responses (Lan et al., 2016, Altenburger et al., 2012). This also responds to the Tox21 vision, which promotes a systematic transition from current in-vivo whole animal-based testing to more in-vitro mechanistic pathway-based assays using high-throughput screening and tiered testing (National Research Council, 2007, Andersen and Krewski, 2009). However, one of the major challenges in realizing and implementing the Tox21 vision is the urgent need to establish a quantitative link between the molecular endpoints of in-vitro toxicogenomic assays and in-vivo, regulatory-relevant phenotypic endpoints.
Establishing quantitative causal relationships between in-vitro assay endpoints and regulatory-relevant apical endpoints holds the key to the realization of predictive toxicology through practical and widespread implementation of in-vitro assay-based toxicity screening schemes and strategies for environmental and health applications (Benigni, 2016, Thomas et al., 2009, Thomas et al., 2012, Ankley et al., 2006, Kramer et al., 2011, Muth-Köhne et al., 2016, Bradbury et al., 2004, Conolly et al., 2017). The adverse outcome pathway (AOP) framework is the state-of-the-art approach to link toxicity mechanisms with the phenotypic adverse outcome, which would enable the assessment of health risks as well as ecotoxicological risks from exposure to pollutants and their mixtures (Blalock et al., 2018, Groh et al., 2015, Ankley et al., 2010, Ankley and Edwards, 2018, Carusi et al., 2018). Combining effective biomarkers with a proper predictive framework would enable more cost-effective and wider implementation of toxicomics in monitoring genotoxicity and predicting adverse toxic responses (Harrill et al., 2009, Stahl et al., 2015, Angrish et al., 2016). Consequently, proper selection and validation of predictive biomarkers play a crucial role in our ability to link molecular-level effects recorded in in-vitro assays to the in-vivo regulatory-relevant phenotypic endpoints, or system-level impacts, in many fields such as environmental toxicity, disease prediction, and health risk identification (Garcia-Reyero et al., 2009, Strimbu and Tavel, 2010, Christin et al., 2013, Ma and Huang, 2005).
The rapid advancement in bioinformatics and machine learning methods enables more sophisticated biomarker identification (Thomas et al., 2009, Abeel et al., 2010, Wei et al., 2014). Current biomarker identification from toxicomics data employs feature selection and classification methods. The two general approaches to feature selection are filter and wrapper methods. Filter methods often provide simpler and faster alternatives for selecting the most important features: the features are selected, or filtered, based on their relevance in differentiating a target outcome from others (Ding and Peng, 2005). Wrapper methods combine feature selection with the classification method, so that features are judged by their ability to increase the accuracy of the classification model (Xiong et al., 2001, Kohavi and John, 1997). However, wrapper methods are often associated with extensive computational cost and are prone to overfitting when the sample size is relatively small (Radovic et al., 2017, Suto et al., 2016). In addition, since the features selected by filter methods are independent of the classification method, they often have higher relevance to the target outcome than those derived from a wrapper method (Ding and Peng, 2005, Radovic et al., 2017). Filter-based feature selection methods that have been applied to toxicomics data (e.g., gene and protein expression data) include mutual information, statistical tests (t-test, F-test, chi-square), information gain (Bolón-Canedo et al., 2014), gain ratio (Wei et al., 2014), and ReliefF (Bolón-Canedo et al., 2014), among others. Though most of these algorithms find the important biomarkers based on their relevance and correlation to the target outcome, they do not address redundancy and overfitting issues.
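As an illustration of the filter approach, the following minimal sketch ranks features by the absolute value of a two-sample (Welch) t-statistic between the positive and negative classes. It is not the implementation used in this study; the function names (`welch_t`, `rank_by_t_stat`), the protein names, and the expression values are hypothetical and chosen only to make the idea concrete.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(pos, neg):
    """Welch's t-statistic between two groups of expression values."""
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return (mean(pos) - mean(neg)) / se


def rank_by_t_stat(features, labels):
    """Rank feature names by |t|, most relevant first.

    `features` maps a feature name to its list of values, one per sample;
    `labels` holds 1 for positive and 0 for negative samples.
    """
    scores = {}
    for name, values in features.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        scores[name] = abs(welch_t(pos, neg))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: three proteins measured for six chemicals.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "protein_A": [2.1, 2.4, 2.2, 1.0, 1.1, 0.9],  # separates the classes well
    "protein_B": [1.5, 1.0, 1.3, 1.2, 1.4, 1.1],  # barely informative
    "protein_C": [1.9, 2.6, 2.0, 1.2, 1.0, 1.3],  # separates, but is noisier
}
ranking = rank_by_t_stat(features, labels)
print(ranking)
```

Because each feature is scored independently of the others and of any classifier, the ranking is cheap to compute, but two highly correlated features would both receive high scores.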
The maximum relevance and minimum redundancy (MRMR) algorithm aims to reduce the redundancy in datasets, while also identifying the most relevant features and biomarkers to predict the outcome (Ding and Peng, 2005, Radovic et al., 2017). Furthermore, using the right classification algorithm for a specific problem is important to avoid overfitting (Statnikov et al., 2005). The classification algorithms that have been used in the past to classify toxicogenomics data include k-nearest neighbor, naïve Bayes, and support vector machines (SVMs). SVMs have been shown to yield reliable and efficient classification performance while limiting overfitting, particularly in cases where the number of features is higher than the number of samples (as is often seen with toxicogenomic data) (Abeel et al., 2010).
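The relevance-redundancy trade-off of MRMR can be sketched as a greedy selection. The sketch below assumes relevance is measured by the absolute t-statistic and redundancy by the mean absolute Pearson correlation with the features already selected, combined either as a difference or as a quotient, in the spirit of the schemes of Ding and Peng (2005). The exact scoring used in this study may differ, and all function names and data values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_stat(values, labels):
    """Absolute Welch t-statistic of one feature between the two classes."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    se = sqrt(variance(pos) / len(pos) + variance(neg) / len(neg))
    return abs(mean(pos) - mean(neg)) / se


def abs_pearson(x, y):
    """Absolute Pearson correlation between two features."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return abs(cov) / norm


def mrmr(features, labels, k, scheme="quotient"):
    """Greedily pick k features, trading relevance against redundancy."""
    relevance = {name: t_stat(vals, labels) for name, vals in features.items()}
    # The first pick is simply the most relevant feature.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < k:
        best_name, best_score = None, float("-inf")
        for name in features:
            if name in selected:
                continue
            # Redundancy: mean |correlation| with the features chosen so far.
            redundancy = mean(
                [abs_pearson(features[name], features[s]) for s in selected]
            )
            if scheme == "difference":
                score = relevance[name] - redundancy
            else:  # quotient; the small constant guards against division by zero
                score = relevance[name] / (redundancy + 1e-12)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: protein_A_dup nearly duplicates protein_A.
labels = [1, 1, 1, 0, 0, 0]
features = {
    "protein_A": [2.3, 2.6, 2.9, 1.3, 1.6, 1.9],
    "protein_A_dup": [2.3, 2.5, 2.9, 1.3, 1.7, 1.9],
    "protein_C": [2.2, 1.9, 1.6, 1.4, 1.1, 0.8],
}
selected = mrmr(features, labels, k=2, scheme="quotient")
print(selected)
```

In this toy case, protein_A_dup has a higher t-statistic than protein_C, yet the quotient scheme selects protein_C second because protein_A_dup adds almost no information beyond protein_A.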
Although a few isolated biomarkers have been used for genotoxicity detection in both environmental and human health applications, such as CYP1A1, CYP1B1, and CYP-R (Ellinger-Ziegelbauer et al., 2009), RAD54 (Walmsley et al., 1997), and A2m, Ca3, Cxcl1, and Cyp8b1 (Huang and Tung, 2017), their correlation with phenotypic genotoxicity endpoints or carcinogenicity has not been quantified. Furthermore, the temporal dependencies of toxicogenomics responses have not been considered in most cases, since most studies record only a snapshot of the responses (Ellinger-Ziegelbauer et al., 2009). Identifying relevant toxicogenomics-based biomarkers that quantitatively link in-vitro responses to regulatory-relevant in-vivo toxicity endpoints, utilizing the temporal molecular response patterns, remains an open research area.
In this study, we applied MRMR feature selection and an SVM classification algorithm to identify an ensemble of biomarkers from temporal toxicogenomic assays, for genotoxicity and carcinogenicity prediction and for bridging to regulatory-relevant phenotypic endpoints via AOP. Under the AOP framework, a molecular initiating event for DNA damage would link to an adverse outcome of genotoxicity at the organism or population level that is relevant to risk assessment (Kramer et al., 2011, Ankley et al., 2010, Brockmeier et al., 2017). We proposed and developed a novel quantitative toxicogenomics assay to evaluate mechanistic genotoxicity through the detection and quantification of molecular-level changes in proteins involved in known DNA damage repair pathways, to comply with the AOP concept (Lan et al., 2016, Milanowska et al., 2010, Hohmann and Mager, 2007). The selected key proteins involved in all known DNA damage and repair stress response pathways are conserved among yeast and other eukaryotes, including humans, and are therefore expected to capture AOP molecular effects at sub-cytotoxic dose levels that lead to phenotypic changes and adverse outcomes (Lan et al., 2016, Ankley et al., 2010, Simmons et al., 2009). The protein expression changes upon exposure to each chemical are monitored with a genotoxicity assay using GFP-tagged yeast reporter strains, covering 38 selected protein biomarkers indicative of all seven known DNA damage repair pathways (Lan et al., 2016, O'Connor et al., 2012). Two separate case studies, i) in-vivo rodent carcinogenicity and ii) Ames test-based genotoxicity prediction, are performed to identify the biomarker ensembles for chemically induced genotoxicity and carcinogenicity endpoint prediction. For each case study, six concentrations of 20 chemicals are selected, including model genotoxic compounds with reported endpoints and negative controls without any reported genotoxic effects.
Both in-vivo rodent carcinogenicity and Ames-based genotoxicity are among the most widely used endpoints for genotoxicity assessment; they are used in the National Toxicology Program (Bucher, 2002) and in preparing toxicity databases such as the Carcinogenic Potency Database (CPDB) (Gold et al., 2000). The performance of the prediction models is evaluated by estimating the area under the receiver operating characteristic curve (AUC), as well as the classification accuracy, sensitivity, and specificity. The number and identities of the selected top-ranked biomarkers and their relationship with the prediction performance are assessed.
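For readers less familiar with these performance measures, the following minimal sketch computes accuracy, sensitivity, specificity, and AUC for a binary classifier. AUC is computed here as the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half, which is equivalent to the area under the ROC curve. The labels, scores, and the 0.5 decision threshold are hypothetical and are not results from this study.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical classifier scores for eight chemicals (1 = positive endpoint).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, scores)
print(metrics, area)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds.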
Section snippets
Materials
A time-series toxicogenomic assay of 20 chemicals is evaluated in the current study, including model genotoxic compounds and negative controls without any reported genotoxic effect. Details of the chemicals are provided in Table S1. Two types of genotoxicity endpoints are investigated: i) in-vivo rodent carcinogenicity and ii) the Ames genotoxicity assay. Both carcinogenicity and genotoxicity endpoint data are collected from the existing literature and are summarized in Table S1 (Lan et
Identification of toxicogenomics-based biomarkers for in-vivo carcinogenicity prediction
The most relevant protein biomarkers are identified based on their rank measures and their ability to differentiate the altered expression level between the carcinogenicity-positive and -negative compounds. Three separate scores are used, one for each of the three ranking criteria, namely t-stat, MRMR-TCD, and MRMR-TCQ, as described in the methods section. Fig. 1 shows the most relevant biomarkers, which have higher scores and presumably higher relevance to in-vivo carcinogenicity. The higher t-stat score
CRediT authorship contribution statement
Sheikh Mokhlesur Rahman: Conceptualization, Methodology, Data curation, Formal analysis, Validation, Visualization, Writing – original draft, Writing – review & editing. JiaQi Lan: Methodology, Data curation, Formal analysis, Validation, Writing – review & editing. David Kaeli: Data processing and analysis, Formal analysis. Jennifer Dy: Data analysis methods, Formal analysis, Visualization, Writing – review & editing. Akram Alshawabkeh: Funding acquisition, Project administration, Resources,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This study is funded by the National Science Foundation (NSF, CBET-1437257 and IIS-1546428), National Institute of Environmental Health Sciences (NIEHS) grants P42ES017198 and P50ES026049, and U.S. Environmental Protection Agency (EPA) grant R83615501. The authors would like to thank Guangyu Li for his contribution during the revision process of this article.
References (86)
- et al. Prediction and classification of the modes of genotoxic actions using bacterial biosensors specific for DNA damages. Biosens. Bioelectron. (2009)
- et al. Comparison of the performance of multiclass classifiers in chemical data: addressing the problem of overfitting with the permutation test. Chemom. Intell. Lab. Syst. (2020)
- et al. The adverse outcome pathway: a multifaceted framework supporting 21st century toxicology. Curr. Opin. Toxicol. (2018)
- et al. A review of microarray datasets and applied feature selection methods. Inf. Sci. (2014)
- et al. Harvesting the promise of AOPs: an assessment and recommendations. Sci. Total Environ. (2018)
- et al. A comprehensive survey on support vector machine classification: applications, challenges and trends. Neurocomputing (2020)
- et al. A critical assessment of feature selection methods for biomarker discovery in clinical proteomics. Mol. Cell. Proteom. (2013)
- et al. Application of toxicogenomics to study mechanisms of genotoxicity and carcinogenicity. Toxicol. Lett. (2009)
- An introduction to ROC analysis. Pattern Recogn. Lett. (2006)
- et al. Development and application of the adverse outcome pathway framework for understanding and predicting chronic toxicity: I. Challenges and research needs in ecotoxicology. Chemosphere (2015)