Research Paper
Machine learning-based biomarkers identification from toxicogenomics – Bridging to regulatory relevant phenotypic endpoints

https://doi.org/10.1016/j.jhazmat.2021.127141

Highlights

  • The MRMR feature selection method was employed to identify the most relevant biomarkers for genotoxicity prediction.

  • A machine learning-based classification method enabled phenotypic toxicity prediction using the selected optimal biomarkers.

  • Key biomarkers associated with DNA-damage and repair pathways could predict in-vivo carcinogenicity and Ames genotoxicity.

  • Molecular endpoints were validated and correlated with regulatory relevant phenotypic endpoints.

Abstract

One of the major challenges in realizing and implementing the Tox21 vision is the urgent need to establish a quantitative link between in-vitro assay molecular endpoints and in-vivo regulatory-relevant phenotypic toxicity endpoints. Current toxicomics approaches still mostly rely on a large number of redundant markers without pre-selection or ranking. Selecting relevant biomarkers with minimal redundancy would therefore reduce the number of markers to be monitored and reduce the cost, time, and complexity of toxicity screening and risk monitoring. Here, we demonstrate that, using a time-series toxicomics in-vitro assay along with machine learning-based feature selection (maximum relevance and minimum redundancy (MRMR)) and classification (support vector machine (SVM)), an “optimal” number of biomarkers with minimum redundancy can be identified for prediction of phenotypic toxicity endpoints with good accuracy. We include two case studies, one for in-vivo carcinogenicity prediction and one for Ames genotoxicity prediction, using 20 selected chemicals comprising model genotoxic chemicals and negative controls. The results suggest that, following the adverse outcome pathway (AOP) concept, molecular endpoints based on a relatively small, properly selected biomarker ensemble involved in the DNA-damage and repair pathways conserved among eukaryotes were able to predict both Ames genotoxicity endpoints and in-vivo carcinogenicity in rats. A prediction accuracy of 76% with AUC = 0.81 was achieved when predicting in-vivo carcinogenicity with the top-ranked five biomarkers. For Ames genotoxicity prediction, the top-ranked five biomarkers achieved a prediction accuracy of 70% with AUC = 0.75. However, the specific top-ranked five biomarkers differ between the two phenotypic genotoxicity assays. The top-ranked biomarkers for in-vivo carcinogenicity prediction are mainly associated with double-strand break repair and DNA recombination, whereas those for Ames genotoxicity prediction are associated with base- and nucleotide-excision repair. The method developed in this study will help fill the knowledge gap in phenotypic anchoring and predictive toxicology, and contribute to progress in implementing the Tox21 vision for environmental and health applications.

Introduction

Genotoxicity is of great concern because of its link to mutagenicity and carcinogenicity, as well as cancer, and there is an urgent demand for genotoxicity screening and risk assessment for various environmental and health applications (USEPA, 2016, Lan et al., 2016, Gou et al., 2014, Gou and Gu, 2011). In the absence of, or combined with, in-vivo carcinogenicity data, in-vitro or cell-based genotoxicity assays provide supporting data for cancer risk assessment (Ahn et al., 2009). Recently, toxicogenomics has emerged as a promising technology that reveals molecular-level activities, at the gene, protein, or metabolite level of organisms, in response to environmental contaminants and may represent the underlying cellular network mechanisms of toxicity responses (Lan et al., 2016, Altenburger et al., 2012). It also responds to the Tox21 vision, which promotes a systematic transition from current in-vivo whole animal-based testing to more in-vitro mechanistic pathway-based assays using high-throughput screening and tiered testing (National Research Council, 2007, Andersen and Krewski, 2009). However, one of the major challenges in realizing and implementing the Tox21 vision is the urgent need to establish a quantitative link between in-vitro toxicogenomic assay molecular endpoints and in-vivo regulatory-relevant phenotypic endpoints.

Establishing quantitative causal relationships between in-vitro assay endpoints and regulatory-relevant apical endpoints holds the key to realizing predictive toxicology through practical and widespread implementation of in-vitro assay-based toxicity screening schemes and strategies for environmental and health applications (Benigni, 2016, Thomas et al., 2009, Thomas et al., 2012, Ankley et al., 2006, Kramer et al., 2011, Muth-Köhne et al., 2016, Bradbury et al., 2004, Conolly et al., 2017). The adverse outcome pathway (AOP) framework is the state-of-the-art approach for linking toxicity mechanisms with phenotypic adverse outcomes, enabling the assessment of health risks as well as ecotoxicological risks from exposure to pollutants and their mixtures (Blalock et al., 2018, Groh et al., 2015, Ankley et al., 2010, Ankley and Edwards, 2018, Carusi et al., 2018). Combining effective biomarkers with a proper prediction framework would enable more cost-effective and wider implementation of toxicomics for monitoring genotoxicity and predicting adverse toxic responses (Harrill et al., 2009, Stahl et al., 2015, Angrish et al., 2016). Consequently, proper selection and validation of predictive biomarkers play a crucial role in our ability to link molecular-level effects recorded in in-vitro assays to in-vivo regulatory-relevant phenotypic endpoints, or to system-level impacts, in fields such as environmental toxicity, disease prediction, and health risk identification (Garcia-Reyero et al., 2009, Strimbu and Tavel, 2010, Christin et al., 2013, Ma and Huang, 2005).

The rapid advancement in bioinformatics and machine learning methods enables more sophisticated biomarker identification (Thomas et al., 2009, Abeel et al., 2010, Wei et al., 2014). Current biomarker identification from toxicomics data employs feature selection and classification methods. The two general approaches to feature selection are filter and wrapper methods. Filter methods often provide simpler and faster alternatives: features are selected, or filtered, based on their relevance in differentiating a target outcome from others (Ding and Peng, 2005). Wrapper methods combine feature selection with the classification method, so that features are judged by their ability to increase the accuracy of the classification model (Xiong et al., 2001, Kohavi and John, 1997). However, wrapper methods are often associated with extensive computational cost and are prone to overfitting when the sample size is relatively small (Radovic et al., 2017, Suto et al., 2016). In addition, since the features selected by filter methods are independent of the classification method, they often have higher relevance to the target outcome than those derived from a wrapper method (Ding and Peng, 2005, Radovic et al., 2017). Filter-based feature selection methods that have been applied to toxicomics data (e.g., gene and protein expression data) include mutual information, statistical tests (t-test, F-test, chi-square), information gain (Bolón-Canedo et al., 2014), gain ratio (Wei et al., 2014), and ReliefF (Bolón-Canedo et al., 2014), among others. Although most of these algorithms find important biomarkers based on their relevance and correlation to the target outcome, they do not address redundancy and overfitting. The maximum relevance and minimum redundancy (MRMR) algorithm aims to reduce redundancy in datasets while also identifying the features and biomarkers most relevant for predicting the outcome (Ding and Peng, 2005, Radovic et al., 2017). Furthermore, choosing the right classification algorithm for a specific problem is important in order to avoid overfitting by the model (Statnikov et al., 2005). Classification algorithms that have been used to classify toxicogenomics data include k-nearest neighbor, naïve Bayes, and support vector machines (SVMs). SVMs have been shown to yield reliable and efficient classification performance while limiting overfitting, particularly when the number of features is higher than the number of samples, as is often the case with toxicogenomic data (Abeel et al., 2010).
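To make the MRMR idea concrete, the following is a minimal sketch of a greedy MRMR ranking for a two-class problem. It is not the authors' implementation. It assumes relevance is the absolute Welch t-statistic between the two classes and redundancy is the mean absolute Pearson correlation with the already selected features, combined either as a difference ("TCD") or a quotient ("TCQ") in the spirit of Ding and Peng (2005). The function names, the feature names (P1, P1_like, P2), and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic (assumed form of the t-test relevance)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def mrmr_rank(features, labels, k, scheme="TCD"):
    """Greedily rank up to k features.

    features: dict mapping feature name -> list of values, one per sample
    labels:   list of 0/1 class labels, one per sample
    scheme:   "TCD" (relevance - redundancy) or "TCQ" (relevance / redundancy)
    """
    relevance = {}
    for name, values in features.items():
        pos = [v for v, y in zip(values, labels) if y == 1]
        neg = [v for v, y in zip(values, labels) if y == 0]
        relevance[name] = abs(welch_t(pos, neg))

    # Start with the single most relevant feature.
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < min(k, len(features)):
        best_name, best_score = None, -math.inf
        for name, values in features.items():
            if name in selected:
                continue
            redundancy = mean(abs(pearson(values, features[s])) for s in selected)
            if scheme == "TCD":
                score = relevance[name] - redundancy
            else:
                score = relevance[name] / (redundancy + 1e-12)
            if score > best_score:
                best_name, best_score = name, score
        selected.append(best_name)
    return selected


# Hypothetical toy data: 4 positive and 4 negative samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy = {
    "P1": [5.0, 6.0, 5.0, 6.0, 1.0, 2.0, 1.0, 2.0],
    "P1_like": [5.0, 6.0, 5.0, 7.0, 1.0, 2.0, 1.0, 2.0],  # near copy of P1
    "P2": [5.0, 4.0, 5.0, 4.0, 2.0, 1.0, 2.0, 1.0],       # less correlated with P1
}
print(mrmr_rank(toy, labels, 2, "TCD"))
print(mrmr_rank(toy, labels, 2, "TCQ"))
```

On this toy data the two schemes disagree on the second feature: the difference form keeps the near copy because the t-statistic scale dominates the correlation penalty, while the quotient form prefers the less correlated feature. This illustrates why the choice of combination scheme can change which biomarkers are ranked highest.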

Although a few isolated biomarkers have been used for genotoxicity detection in both environmental and human health applications, such as CYP1A1, CYP1B1, and CYP-R (Ellinger-Ziegelbauer et al., 2009), RAD54 (Walmsley et al., 1997), and A2m, Ca3, Cxcl1, and Cyp8b1 (Huang and Tung, 2017), their correlation with phenotypic genotoxicity endpoints or carcinogenicity has not been quantified. Furthermore, the temporal dependencies of toxicogenomics responses have not been considered in most cases, since most studies record only a snapshot of the responses (Ellinger-Ziegelbauer et al., 2009). Identifying relevant toxicogenomics-based biomarkers that quantitatively link in-vitro responses to regulatory-relevant in-vivo toxicity endpoints, utilizing the temporal molecular response patterns, remains an open research area.

In this study, we applied MRMR feature selection and an SVM classification algorithm to identify an ensemble of biomarkers from temporal toxicogenomic assays for genotoxicity and carcinogenicity prediction and for bridging to regulatory-relevant phenotypic endpoints via AOP. Under the AOP framework, a molecular initiating event for DNA damage links to an adverse outcome of genotoxicity at the organism or population level that is relevant to risk assessment (Kramer et al., 2011, Ankley et al., 2010, Brockmeier et al., 2017). We proposed and developed a novel quantitative toxicogenomics assay that evaluates mechanistic genotoxicity by detecting and quantifying molecular-level changes in proteins involved in known DNA damage repair pathways, consistent with the AOP concept (Lan et al., 2016, Milanowska et al., 2010, Hohmann and Mager, 2007). The selected key proteins, which are involved in all known DNA damage and repair stress response pathways, are conserved among yeast and other eukaryotes including humans, and are therefore expected to capture AOP molecular effects at sub-cytotoxic dose levels that lead to phenotypic changes and adverse outcomes (Lan et al., 2016, Ankley et al., 2010, Simmons et al., 2009). The protein expression changes in response to each chemical are monitored with a genotoxicity assay using GFP-tagged yeast reporter strains, covering 38 selected protein biomarkers indicative of all seven known DNA damage repair pathways (Lan et al., 2016, O'Connor et al., 2012). Two separate case studies, i) in-vivo rodent carcinogenicity prediction and ii) Ames test-based genotoxicity prediction, are performed to identify the biomarker ensembles for chemically induced genotoxicity and carcinogenicity endpoint prediction. For each case study, six concentrations of 20 chemicals are selected, including model genotoxic compounds with reported endpoints and negative controls without any reported genotoxic effects. Both in-vivo rodent carcinogenicity and Ames-based genotoxicity are among the most widely used endpoints for genotoxicity assessment; they are used in the National Toxicology Program (Bucher, 2002) and in toxicity databases such as the Carcinogenic Potency Database (CPDB) (Gold et al., 2000). The performance of the prediction models is evaluated by estimating the area under the receiver operating characteristic curve (AUC), as well as the classification accuracy, sensitivity, and specificity. The number and identities of the selected top-ranked biomarkers and their relationship with prediction performance are assessed.
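The evaluation measures named above can be illustrated with a short sketch. It is not taken from the study. It assumes binary labels (1 = positive endpoint, 0 = negative), a hypothetical decision threshold of 0.5 on a classifier score, and computes AUC through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. All names and values are illustrative.

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = confusion_metrics(y_true, y_pred)
print(metrics)              # accuracy, sensitivity, specificity are all 0.75
print(auc(y_true, scores))  # 0.9375
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two kinds of measure are usually reported together.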

Section snippets

Materials

A time-series toxicogenomic assay of 20 chemicals is evaluated in the current study, including model genotoxic compounds and negative controls without any reported genotoxic effect. Details of the chemicals are provided in Table S1. Two types of genotoxicity endpoints are investigated: i) in-vivo rodent carcinogenicity and ii) the Ames genotoxicity assay. Both carcinogenicity and genotoxicity endpoint data are collected from the existing literature and are summarized in Table S1 (Lan et

Identification of toxicogenomics-based biomarkers for in-vivo carcinogenicity prediction

The most relevant protein biomarkers are identified based on their rank measures and their ability to differentiate the altered expression levels between the carcinogenicity-positive and -negative compounds. Three separate scores are used, corresponding to the three ranking criteria described in the methods section: t-stat, MRMR-TCD, and MRMR-TCQ. Fig. 1 shows the most relevant biomarkers, which have higher scores and presumably higher relevance to in-vivo carcinogenicity. The higher t-stat score

CRediT authorship contribution statement

Sheikh Mokhlesur Rahman: Conceptualization, Methodology, Data curation, Formal analysis, Validation, Visualization, Writing – original draft, Writing – review & editing. JiaQi Lan: Methodology, Data curation, Formal analysis, Validation, Writing – review & editing. David Kaeli: Data processing and analysis, Formal analysis. Jennifer Dy: Data analysis methods, Formal analysis, Visualization, Writing – review & editing. Akram Alshawabkeh: Funding acquisition, Project administration, Resources,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This study is funded by the National Science Foundation (NSF, CBET-1437257 and IIS-1546428), National Institute of Environmental Health Sciences (NIEHS) grants P42ES017198 and P50ES026049, and U.S. Environmental Protection Agency (EPA) grant R83615501. The authors would like to thank Guangyu Li for his contribution during the revision of this article.

References (86)

  • R. Kohavi et al.

    Wrappers for feature subset selection

    Artif. Intell.

    (1997)
  • C.E. Metz

    Basic principles of ROC analysis

    Semin. Nucl. Med.

    (1978)
  • D. Moraes et al.

    Low false positive learning with support vector machines

    J. Vis. Commun. Image Represent.

    (2016)
  • E. Muth-Köhne et al.

    Linking the response of endocrine regulated genes to adverse effects on sex differentiation improves comprehension of aromatase inhibition in a fish sexual development test

    Aquat. Toxicol.

    (2016)
  • P.A. Neale et al.

    In vitro bioassays to assess drinking water quality

    Curr. Opin. Environ. Sci. Health

    (2019)
  • S.D. Richardson et al.

    Occurrence, genotoxicity, and carcinogenicity of regulated and emerging disinfection by-products in drinking water: a review and roadmap for research

    Mutat. Res. Rev. Mutat. Res.

    (2007)
  • M. Srivastava et al.

    DNA double-strand break repair inhibitors as cancer therapeutics

    Chem. Biol.

    (2015)
  • S.H. Stahl et al.

    Systems toxicology: modelling biomarkers of glutathione homeostasis and paracetamol metabolism

    Drug Discov. Today: Technol.

    (2015)
  • D. Stalter et al.

    Fingerprinting the reactive toxicity pathways of 50 drinking water disinfection by-products

    Water Res.

    (2016)
  • A. Tubbs et al.

    Endogenous DNA damage as a source of genomic instability in cancer

    Cell

    (2017)
  • E.A. Zanaty

    Support vector machines (SVMs) versus multilayer perception (MLP) in data classification

    Egypt. Inform. J.

    (2012)
  • H. Zeinvand-Lorestani et al.

    Comparative study of in vitro prooxidative properties and genotoxicity induced by aflatoxin B1 and its laccase-mediated detoxification products

    Chemosphere

    (2015)
  • T. Abeel et al.

    Robust biomarker identification for cancer diagnosis with ensemble feature selection methods

    Bioinformatics

    (2010)
  • R. Altenburger et al.

    Mixture toxicity revisited from a toxicogenomic perspective

    Environ. Sci. Technol.

    (2012)
  • R. Altenburger et al.

    Future water quality monitoring: improving the balance between exposure and toxicity assessments of real-world pollutant mixtures

    Environ. Sci. Eur.

    (2019)
  • D.G. Altman et al.

    Diagnostic tests. 1: sensitivity and specificity

    BMJ

    (1994)
  • B.N. Ames et al.

    Carcinogens are mutagens: a simple test system combining liver homogenates for activation and bacteria for detection

    Proc. Natl. Acad. Sci. USA

    (1973)
  • M.E. Andersen et al.

    Toxicity testing in the 21st century: bringing the vision to life

    Toxicol. Sci.

    (2009)
  • M.M. Angrish et al.

    Taxonomic applicability of inflammatory cytokines in adverse outcome pathway (AOP) development

    J. Toxicol. Environ. Health A

    (2016)
  • G. Ankley et al.

    Pathway-based approaches for environmental monitoring and risk assessment

    Environ. Sci. Technol.

    (2016)
  • G.T. Ankley et al.

    Toxicogenomics in regulatory ecotoxicology

    Environ. Sci. Technol.

    (2006)
  • G.T. Ankley et al.

    Adverse outcome pathways: a conceptual framework to support ecotoxicology research and risk assessment

    Environ. Toxicol. Chem.

    (2010)
  • M. Ashburner et al.

    Gene ontology: tool for the unification of biology

    Nat. Genet.

    (2000)
  • R. Benigni

    Predictive toxicology today: the transition from biological knowledge to practicable models

    Expert Opin. Drug Metab. Toxicol.

    (2016)
  • B.J. Blalock et al.

    Transcriptomic and network analyses reveal mechanistic-based biomarkers of endocrine disruption in the marine mussel, Mytilus edulis

    Environ. Sci. Technol.

    (2018)
  • S.P. Bradbury et al.

    Meeting the scientific needs of ecological risk assessment in a regulatory context

    Environ. Sci. Technol.

    (2004)
  • E.K. Brockmeier et al.

    The role of omics in the application of adverse outcome pathways for chemical risk assessment

    Toxicol. Sci.

    (2017)
  • J.R. Bucher

    The National Toxicology Program rodent bioassay: designs, interpretations, and scientific contributions

    Ann. N. Y. Acad. Sci.

    (2002)
  • R. Burbidge et al.

    An introduction to support vector machines for data mining

    Keynote Papers, Young OR12

    (2001)
  • R.B. Conolly et al.

    Quantitative adverse outcome pathways and their application to predictive toxicology

    Environ. Sci. Technol.

    (2017)
  • Davenport, M.A.; Baraniuk, R.G.; Scott, C.D.(2006). Controlling false alarms with support vector machines. In:...
  • C. Ding et al.

    Minimum redundancy feature selection from microarray gene expression data

    J. Bioinform. Comput. Biol.

    (2005)
  • P.A. Flach et al.

    A coherent interpretation of AUC as a measure of aggregated classification performance

    ICML

    (2011)