A novel approach to modeling multifactorial diseases using Ensemble Bayesian Rule classifiers

https://doi.org/10.1016/j.jbi.2020.103455
Under an Elsevier user license
open archive

Highlights

  • Modeling high-dimensional, multifactorial disease datasets is challenging.

  • Ensemble Bayesian Rule Learning classifier efficiently models such tasks.

  • Bagging has better predictive performance than Bayesian model averaging.

  • Bayesian model combination is better than bagging when the ensemble contains poor base classifiers.

  • A novel visualization method is proposed to interpret an ensemble of BRL models.

Abstract

Modeling the factors that influence disease phenotypes from biomarker profiling datasets is a critical task in biomedicine. Such datasets are typically generated by high-throughput 'omic' technologies, which help examine disease mechanisms at an unprecedented resolution. These datasets are challenging because they are high-dimensional. The disease mechanisms they study are also complex because many diseases are multifactorial, resulting from the collective activity of several factors, each with a small effect. Bayesian rule learning (BRL) infers a rule model by learning Bayesian networks from data, and has been shown to be effective in modeling high-dimensional datasets. However, BRL is not efficient at modeling multifactorial diseases because it suffers from data fragmentation during learning. In this paper, we overcome this limitation by implementing and evaluating three ensemble model combination strategies with BRL: uniform combination (UC; same as Bagging), Bayesian model averaging (BMA), and Bayesian model combination (BMC), collectively called Ensemble Bayesian Rule Learning (EBRL). We also introduce a novel method to visualize EBRL models, called the Bayesian Rule Ensemble Visualizing tool (BREVity), which helps extract and interpret the most important rule patterns guiding the predictions made by the ensemble model. Our results using twenty-five public, high-dimensional gene expression datasets of multifactorial diseases suggest that EBRL models using UC and BMC both achieve better predictive performance than BMA and other classic machine learning methods. Furthermore, BMC is found to be more reliable than UC when the ensemble includes sub-optimal models resulting from the stochasticity of the model search process.
Together, EBRL and BREVity provide researchers with a promising and novel tool for modeling multifactorial diseases from high-dimensional datasets, one that leverages the strengths of ensemble methods for predictive performance while also providing interpretable explanations for its predictions.
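
To make the difference between the three combination strategies concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes each base model outputs a probability for the positive class and a log score standing in for its log marginal likelihood. All function names, the uniform model prior, the Dirichlet(1, ..., 1) sampling of weight vectors in the BMC step, and every numeric value are assumptions made for illustration; the paper's actual procedure may differ.

```python
import math
import random


def uniform_combination(probs):
    """UC: average the base models' predicted probabilities with equal weights."""
    return sum(probs) / len(probs)


def normalized_posterior(log_scores):
    """Turn log scores into normalized weights (assumes a uniform prior).

    Subtracting the maximum before exponentiating avoids underflow.
    """
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_combination(probs, weights):
    """Weighted average of predicted probabilities."""
    return sum(p * w for p, w in zip(probs, weights))


def bmc_predict(train_probs, train_labels, test_probs, n_samples=200, seed=0):
    """BMC sketch: average over sampled weight vectors instead of single models.

    train_probs: one row per training instance, one column per base model.
    Each sampled weight vector defines a candidate ensemble; candidates are
    weighted by the likelihood they assign to the training labels.
    """
    rng = random.Random(seed)
    n_models = len(test_probs)
    log_liks, preds = [], []
    for _ in range(n_samples):
        # Normalized Gamma(1, 1) draws give a Dirichlet(1, ..., 1) sample.
        draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_models)]
        weights = [d / sum(draws) for d in draws]
        log_lik = 0.0
        for row, label in zip(train_probs, train_labels):
            p = min(max(weighted_combination(row, weights), 1e-9), 1 - 1e-9)
            log_lik += math.log(p if label == 1 else 1.0 - p)
        log_liks.append(log_lik)
        preds.append(weighted_combination(test_probs, weights))
    return weighted_combination(preds, normalized_posterior(log_liks))


# Hypothetical outputs of three base models for one test instance.
test_probs = [0.9, 0.6, 0.2]
log_scores = [-10.0, -12.0, -30.0]  # hypothetical log marginal likelihoods

uc = uniform_combination(test_probs)
bma_w = normalized_posterior(log_scores)
bma = weighted_combination(test_probs, bma_w)

# Hypothetical training-set predictions (rows) from the same three models.
train_probs = [[0.8, 0.7, 0.4], [0.3, 0.4, 0.6], [0.9, 0.5, 0.1]]
train_labels = [1, 0, 1]
bmc = bmc_predict(train_probs, train_labels, test_probs)
```

In this toy setting the BMA weights concentrate almost entirely on the single highest-scoring model, whereas UC treats all models equally and the BMC sketch weights whole combinations of models by how well they explain the training labels.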

Keywords

Bayesian methods
Rule learning
Ensemble methods
Interpretability


1 Present address: National Cancer Institute, 9609 Medical Center Dr, SG 7E310, Rockville, MD 20850, United States.