Trends in Genetics
Review
Opening the Black Box: Interpretable Machine Learning for Geneticists
Section snippets
Importance of Interpretable Machine Learning (ML) and Overview of Strategies
Biological big data [1,2] have driven progress in fields ranging from population genetics [3] to precision medicine [4]. Much of this progress has been made possible by advances in ML (Box 1) [5–10], ‘[a] field of study that gives computers the ability to learn without being explicitly programmed’ [11]. ML works by identifying patterns in data in the form of a model (see Glossary) that can be used to make predictions about new data. While powerful, a common criticism is that …
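The pattern-finding loop described above (fit a model to labeled data, then predict labels for new data) can be sketched with a toy example. The features, effect sizes, and data here are invented for illustration, and plain least squares stands in for whatever ML algorithm one might actually use:

```python
import numpy as np

# Hypothetical toy data: predict an expression level from two made-up
# promoter features (e.g., motif counts). All numbers are invented.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))  # instances x features
y_train = 2.0 * X_train[:, 0] - 1.0 * X_train[:, 1] + rng.normal(scale=0.1, size=100)

# "Learning" = estimating model parameters from data; here the model is
# linear and the parameters are fit by least squares.
weights, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# The fitted model can now make predictions about new, unseen instances.
X_new = np.array([[1.0, 0.5]])
y_pred = X_new @ weights
```

The recovered weights approximate the true effects (2.0 and -1.0) used to simulate the data, which is exactly the property the interpretation strategies below try to exploit.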
Probing Strategies Dissect the Inner Structure of ML Models
Training an ML model involves identifying the set of parameters best able to predict the label of an instance (e.g., gene Y is upregulated). After training, these parameters can be ‘probed’ (i.e., inspected) to better understand what the model learned (Figure 1B). Probing strategies generally provide global interpretations, with some exceptions (see later). Because the types of parameters (e.g., coefficient weights, decision nodes) and how those parameters relate to each other (e.g., linear combination, …
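A minimal sketch of probing, using a linear model whose parameters are directly readable. The feature names and effect sizes are invented, and least squares stands in for the trained model; the point is only that the fitted parameters themselves carry a global interpretation:

```python
import numpy as np

# Hypothetical data: three made-up genomic features with different
# simulated effects on the label.
rng = np.random.default_rng(1)
features = ["TF_motif", "GC_content", "intron_count"]
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Train the model (least squares as a stand-in for any linear learner).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Probing: inspect the learned parameters. Because the model combines
# features linearly and the features are on comparable scales, the
# magnitude of each coefficient ranks feature importance globally.
ranked = sorted(zip(features, coef), key=lambda t: abs(t[1]), reverse=True)
```

Here `ranked` puts the simulated driver (`TF_motif`) first; for nonlinear models the parameters relate to each other differently, which is why probing must be tailored to the algorithm.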
Perturbing Strategies for Interpreting ML Models
Perturbing strategies involve modifying the input data and observing the resulting change in the model output (Figure 1C). Because modifications to the input data can be made regardless of the ML algorithm used, perturbing strategies are generally model agnostic. We discuss two general perturbation-based strategies: sensitivity analysis and what-if methods (Figure 3).
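One common perturbation scheme, permutation importance, can be sketched in a few lines: shuffle one feature at a time and measure how much the model's error grows. The data are simulated and a least-squares model stands in for the black box; note that the perturbation loop never looks inside the model, which is what makes the strategy model agnostic:

```python
import numpy as np

# Hypothetical data: only the first of three features truly drives the label.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)    # any fitted model would do


def mse(X_):
    # Error of the fitted model on (possibly perturbed) inputs.
    return np.mean((X_ @ coef - y) ** 2)


baseline = mse(X)
importance = []
for j in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # perturb feature j only
    importance.append(mse(X_perm) - baseline)     # error increase = importance
```

Permuting the informative feature inflates the error substantially, while permuting the uninformative ones barely moves it.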
Surrogate Strategies for Interpreting ML Models
Imagine you have an ML model that is truly a black box, meaning that it cannot be probed, and perturbation strategies do not provide useful information. In such a case, one can train a more interpretable model to approximate the black box model. Examples of interpretable models include linear models, where coefficients reflect feature importance, or decision trees, where mean decrease in node impurity can be calculated. These inherently interpretable models are referred to as surrogate models. For …
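A minimal sketch of a global surrogate, under invented assumptions: a `black_box` function stands in for the opaque model (we only ever look at its outputs), and a one-split decision stump serves as the interpretable surrogate trained to reproduce its predictions:

```python
import numpy as np

# Hypothetical setup: one feature, 500 simulated instances.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(500, 1))


def black_box(X_):
    # Stand-in for an uninterpretable model: we treat it as opaque and
    # use only its predictions (here, a hidden nonlinear rule).
    return (np.tanh(5 * X_[:, 0]) > 0).astype(float)


y_bb = black_box(X)  # the surrogate is trained on the black box's outputs

# Fit the surrogate stump: choose the single threshold that best
# reproduces the black box's predictions.
thresholds = np.linspace(-1, 1, 201)
agreement = [np.mean((X[:, 0] > t) == y_bb) for t in thresholds]
best_t = thresholds[int(np.argmax(agreement))]
# Interpretation: "the model predicts class 1 when the feature exceeds best_t"
```

The surrogate's value depends on how faithfully it mimics the black box, so its agreement with the black box (here near 1.0) should always be reported alongside the interpretation.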
Concluding Remarks
Interpretability is critical for applications of ML in genetics and beyond and will therefore see substantial advances in the coming years. Just as there is no single universally best ML algorithm, there is unlikely to be a single interpretation strategy that works best for all data or all questions. Rather, the interpretation strategy should be tailored to what one wants to learn from the ML model, and confidence in an interpretation will come when multiple approaches tell the same story.
Acknowledgments
We thank Yuying Xie for his insightful feedback on this review. This work was partly supported by the National Science Foundation (NSF) Graduate Research Fellowship (Fellow ID: 2015196719) to C.B.A.; NSF (IIS-1907704, IIS-1845081, CNS-1815636) grants to J.T.; and the US Department of Energy Great Lakes Bioenergy Research Center (BER DE-SC0018409) and NSF (IOS-1546617, DEB-1655386) grants to S-H.S.
Glossary
- Algorithm: the procedure taken to solve a problem/build a model.
- Decision tree: a model made up of a hierarchical series of true/false questions.
- Deep learning: a subset of ML algorithms, inspired by the structure of the brain, that can find complex, nonlinear patterns in data.
- Feature: an explanatory (i.e., independent) variable used to build a model.
- Global interpretation: a type of ML interpretation that explains the overall relationship between the features and the label for all instances.
- Instance: a single …
References (65)
- Supervised machine learning for population genetics: a new paradigm. Trends Genet. (2018)
- Machine learning for big data analytics in plants. Trends Plant Sci. (2014)
- Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. (2019)
- Methods for interpreting and understanding deep neural networks. Digit. Signal Process. (2018)
- Rule extraction from support vector machines: a review. Neurocomputing (2010)
- Deep learning for visual understanding: a review. Neurocomputing (2016)
- Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Model. (2002)
- Biology: the big challenges of big data. Nature (2013)
- Big data: astronomical or genomical? PLoS Biol. (2015)
- From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med. Genet. (2015)
- Deep learning for computational biology. Mol. Syst. Biol.
- Ten quick tips for machine learning in computational biology. BioData Min.
- Machine learning methods for analysis of metabolic data and metabolic pathway modeling. Metabolites
- Machine learning applications in genetics and genomics. Nat. Rev. Genet.
- Machine learning and its applications to biology. PLoS Comput. Biol.
- Some studies in machine learning using the game of checkers. IBM J. Res. Dev.
- The mythos of model interpretability. Commun. ACM
- A survey of methods for explaining black box models. ACM Comput. Surv.
- Interpretable Machine Learning: A Guide for Making Black Box Models Explainable
- Elements of Causal Inference
- Learning natural selection from the site frequency spectrum. Genetics
- Support vector machines and kernels for computational biology. PLoS Comput. Biol.
- The spectrum kernel: a string kernel for SVM protein classification
- Accurate splice site detection for Caenorhabditis elegans
- Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput. Biol.
- POIMs: positional oligomer importance matrices—understanding support vector machine-based signal detectors. Bioinformatics
- Random forests. Mach. Learn.
- The boosting approach to machine learning: an overview
- Greedy function approximation: a gradient boosting machine. Ann. Stat.
- Integrative random forest for gene regulatory network inference. Bioinformatics
- Cis-regulatory code for predicting plant cell-type transcriptional response to high salinity. Plant Physiol.
- Iterative random forests to discover predictive and stable high-order interactions. Proc. Natl. Acad. Sci. U. S. A.