Trends in Genetics
Volume 36, Issue 6, June 2020, Pages 442-455

Review
Opening the Black Box: Interpretable Machine Learning for Geneticists

https://doi.org/10.1016/j.tig.2020.03.005

Highlights

  • Machine learning (ML) has emerged as a powerful tool for harnessing big biological data. The complex structure underlying ML models can potentially provide insights into the problems they are used to solve.

  • Because of their complexity, the inner logic of ML models is not readily intelligible to humans, hence the common critique of ML models as 'black boxes'.

  • However, advances in the field of interpretable ML have made it possible to identify important patterns and features underlying an ML model using various strategies.

  • These interpretation strategies have been applied in genetics and genomics to derive novel biological insights from ML models.

  • This area of research is becoming increasingly important as more complex and difficult-to-interpret ML approaches (i.e., deep learning) are being adopted by biologists.

Because of its ability to find complex patterns in high-dimensional and heterogeneous data, machine learning (ML) has emerged as a critical tool for making sense of the growing amount of genetic and genomic data available. While the complexity of ML models is what makes them powerful, it also makes them difficult to interpret. Fortunately, efforts to develop approaches that make the inner workings of ML models understandable to humans have improved our ability to derive novel biological insights. Here, we discuss the importance of interpretable ML, different strategies for interpreting ML models, and examples of how these strategies have been applied. Finally, we identify challenges and promising future directions for interpretable ML in genetics and genomics.

Section snippets

Importance of Interpretable Machine Learning (ML) and Overview of Strategies

Biological big data [1,2] have driven progress in fields ranging from population genetics [3] to precision medicine [4]. Much of this progress is possible because of advances in ML (Box 1) [5–10], ‘[a] field of study that gives computers the ability to learn without being explicitly programmed’ [11]. ML works by identifying patterns in data in the form of a model (see Glossary) that can be used to make predictions about new data. While powerful, a common criticism is that …
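
As a concrete illustration of this workflow, the minimal sketch below fits a model to synthetic data standing in for gene features and labels and then scores it on held-out instances; the data, feature counts, and the choice of a scikit-learn random forest are assumptions made for illustration, not part of the review.

    # Minimal sketch (illustrative, not from the review): learn a mapping from
    # instance features (e.g., properties of a gene) to a label, then predict
    # labels for held-out instances.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))           # 500 instances x 20 synthetic features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic label, e.g., upregulated or not

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
    print("held-out accuracy:", model.score(X_test, y_test))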

Probing Strategies Dissect the Inner Structure of ML Models

Training an ML model involves identifying the set of parameters best able to predict the label of an instance (e.g., gene Y is upregulated). After training, these parameters can be 'probed' (i.e., inspected) to better understand what the model learned (Figure 1B). Probing strategies provide global interpretations, with some exceptions (see later). Because the types of parameters (e.g., coefficient weights, decision nodes) and how those parameters relate to each other (e.g., linear combination, …
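
To make probing tangible, the sketch below inspects the learned parameters of two common model types on placeholder data: the coefficients of a logistic regression and the impurity-based feature importances of a random forest. The data and feature names (motif_0, motif_1, ...) are hypothetical.

    # Minimal sketch of probing: inspect the parameters a trained model learned.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (2 * X[:, 0] - X[:, 2] > 0).astype(int)
    features = [f"motif_{i}" for i in range(5)]

    # Linear model: coefficients give a global view of each feature's weight and direction.
    linear = LogisticRegression(max_iter=1000).fit(X, y)
    for name, coef in zip(features, linear.coef_[0]):
        print(f"{name}: coefficient = {coef:+.2f}")

    # Tree ensemble: feature_importances_ reports the mean decrease in node impurity.
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    for name, imp in zip(features, forest.feature_importances_):
        print(f"{name}: importance = {imp:.2f}")

Note that impurity-based importances summarize magnitude but not direction, whereas coefficients convey both; which parameters are available to probe depends on the algorithm used.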

Perturbing Strategies for Interpreting ML Models

Perturbing strategies involve modifying the input data and observing some change in the model output (Figure 1C). Because modifications to the input data can be made regardless of the ML algorithm used, perturbing strategies are generally model agnostic. We discuss two general perturbation-based strategies: sensitivity analysis and what-if methods (Figure 3).
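
The sketch below illustrates both strategies on placeholder data: a permutation-based sensitivity analysis (shuffle one feature at a time and measure the drop in accuracy) and a simple what-if query (alter one feature of a single instance and compare predictions). The gradient boosting model and the data are assumptions for illustration; the same loops work for any model that exposes scoring and prediction functions, which is what makes perturbation model agnostic.

    # Minimal sketch of two perturbation-based interpretations (model agnostic).
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    y = (X[:, 1] > 0).astype(int)
    model = GradientBoostingClassifier(random_state=0).fit(X, y)

    # Sensitivity analysis (permutation importance): shuffle one feature at a time
    # and record how much the model's accuracy drops.
    baseline = model.score(X, y)
    for j in range(X.shape[1]):
        X_perm = X.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        print(f"feature {j}: accuracy drop = {baseline - model.score(X_perm, y):.3f}")

    # What-if query: modify one feature of a single instance and compare predictions.
    instance = X[:1].copy()
    before = model.predict_proba(instance)[0, 1]
    instance[0, 1] += 2.0  # e.g., 'increase' a hypothetical motif count
    after = model.predict_proba(instance)[0, 1]
    print(f"predicted probability: {before:.2f} -> {after:.2f}")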

Surrogate Strategies for Interpreting ML Models

Imagine you have an ML model that is truly a black box, meaning that it cannot be probed and perturbation strategies do not provide useful information. In such a case, one can train a more interpretable model to approximate the black box model. Examples of interpretable models include linear models, where coefficients reflect feature importance, or decision trees, where the mean decrease in node impurity can be calculated. These inherently interpretable models are referred to as surrogate models. For …
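
A minimal sketch of one surrogate approach, a global surrogate, is given below: a shallow decision tree is fit to the predictions of a 'black box' neural network, its fidelity to the black box is checked, and its learned rules are printed. The network, tree depth, and data are illustrative assumptions.

    # Minimal sketch of a global surrogate: approximate a black box model with
    # an interpretable decision tree trained on the black box's predictions.
    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 4))
    y = ((X[:, 0] > 0) & (X[:, 2] < 0)).astype(int)

    black_box = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500,
                              random_state=0).fit(X, y)
    bb_predictions = black_box.predict(X)  # surrogate is trained on these, not on y

    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, bb_predictions)
    print("fidelity to black box:", surrogate.score(X, bb_predictions))
    print(export_text(surrogate, feature_names=[f"feature_{i}" for i in range(4)]))

Fidelity, i.e., how well the surrogate reproduces the black box's predictions, should be checked before trusting the surrogate's rules as an explanation.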

Concluding Remarks

Interpretability is critical for applications of ML in genetics and beyond and will therefore see substantial advances in the coming years. Just as there is no single universally best ML algorithm, there is unlikely to be a single interpretation strategy that works best for all data or all questions. Rather, the interpretation strategy should be tailored to what one wants to learn from the ML model, and confidence in an interpretation will come when multiple approaches tell the same story.

Acknowledgments

We thank Yuying Xie for his insightful feedback on this review. This work was partly supported by the National Science Foundation (NSF) Graduate Research Fellowship (Fellow ID: 2015196719) to C.B.A.; NSF (IIS-1907704, IIS-1845081, CNS-1815636) grants to J.T.; and the US Department of Energy Great Lakes Bioenergy Research Center (BER DE-SC0018409) and NSF (IOS-1546617, DEB-1655386) grants to S-H.S.

Glossary

Algorithm
the procedure taken to solve a problem/build a model.
Decision tree
a model made up of a hierarchical series of true/false questions.
Deep learning
a subset of ML algorithms inspired by the structure of the brain that can find complex, nonlinear patterns in data.
Feature
an explanatory (i.e., independent) variable used to build a model.
Global interpretation
a type of ML interpretation that explains the overall relationship between the features and the label for all instances.
Instance
a single entity (e.g., a gene) in a dataset, described by its feature values.

References (65)

  • C. Angermueller, Deep learning for computational biology, Mol. Syst. Biol. (2016)
  • D. Chicco, Ten quick tips for machine learning in computational biology, BioData Min. (2017)
  • M. Cuperlovic-Culf, Machine learning methods for analysis of metabolic data and metabolic pathway modeling, Metabolites (2018)
  • M.W. Libbrecht et al., Machine learning applications in genetics and genomics, Nat. Rev. Genet. (2015)
  • A.L. Tarca, Machine learning and its applications to biology, PLoS Comput. Biol. (2007)
  • A.L. Samuel, Some studies in machine learning using the game of checkers, IBM J. Res. Dev. (1959)
  • Z.C. Lipton, The mythos of model interpretability, Commun. ACM (2018)
  • R. Guidotti, A survey of methods for explaining black box models, ACM Comput. Surv. (2018)
  • C. Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable (2019)
  • J. Peters, Elements of Causal Inference (2017)
  • R. Ronen, Learning natural selection from the site frequency spectrum, Genetics (2013)
  • A. Ben-Hur, Support vector machines and kernels for computational biology, PLoS Comput. Biol. (2008)
  • C. Leslie, The spectrum kernel: a string kernel for SVM protein classification
  • B. Schölkopf, Accurate splice site detection for Caenorhabditis elegans
  • M. Ghandi, Enhanced regulatory sequence prediction using gapped k-mer features, PLoS Comput. Biol. (2014)
  • S. Sonnenburg, POIMs: positional oligomer importance matrices—understanding support vector machine-based signal detectors, Bioinformatics (2008)
  • L. Breiman, Random forests, Mach. Learn. (2001)
  • R.E. Schapire, The boosting approach to machine learning: an overview
  • J.H. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Stat. (2001)
  • F. Petralia, Integrative random forest for gene regulatory network inference, Bioinformatics (2015)
  • S. Uygun, Cis-regulatory code for predicting plant cell-type transcriptional response to high salinity, Plant Physiol. (2019)
  • S. Basu, Iterative random forests to discover predictive and stable high-order interactions, Proc. Natl. Acad. Sci. U. S. A. (2018)