Robust biomarker discovery for microbiome-wide association studies

doi:10.1016/j.ymeth.2019.06.012

Methods

Volume 173, 15 February 2020, Pages 44-51

Highlights

• Ensemble feature selection for microbiome studies with stability and robustness.
• Deep Forest for understanding the disease-microorganisms relationships.
• Method for the discovery of disease-related microbial markers.

Abstract

With advances in high-throughput sequencing technology, massive amounts of microbiome data have accumulated, from environmental investigations to human studies. Microbiome-wide association studies examine the relationship between the microbiome and human health or the environment. Recently, Deep Neural Networks (DNNs) have shown encouraging results owing to their layer-wise representation learning ability. However, DNNs are regarded as black boxes and require large amounts of training data, which makes them impractical for conducting microbiome-wide association studies directly. Meanwhile, microbiome data are high-dimensional and noisy, and a single feature selection method applied to this kind of dataset is often unstable. In this work, we introduce a deep learning model named Deep Forest to conduct microbiome-wide association studies and propose an ensemble feature selection method to guide the identification of microbial biomarkers. The experiments showed that our ensemble feature selection method based on Deep Forest had good stability and robustness. The feature selection results could guide the discovery of microbial biomarkers and help to diagnose microbe-related diseases. The code is available at https://github.com/MicroAVA/MWAS-Biomarkers.git.

Introduction

The human body is inhabited by a large number of microorganisms (such as fungi, viruses and archaea) that are distributed across body sites such as the mouth, skin and gut [1]. Many studies have shown that microbial imbalance (changes in microbial composition or metabolism) in humans can lead to disease; for example, diseases caused by disorders of the human intestinal microbes include obesity [2], inflammatory bowel disease [3] and type 2 diabetes [4]. Even mental illnesses (such as depression and Alzheimer's disease) may be closely related to intestinal microbial disorders. The pathway through which intestinal microbes affect the human brain is known as the gut-brain axis [5]. By analogy with genome-wide association studies (GWAS), studies of the associations between the microbiome and phenotypes are called microbiome-wide association studies (MWAS); they attempt to find disease-associated microbial biomarkers. Such biomarkers will benefit the early diagnosis of diseases, the development of microbial treatments and the understanding of disease mechanisms [6], [7].

With the development of high-throughput sequencing technology, the microbial DNA in a sample can be sequenced simultaneously. After pipeline analysis, we can identify most of the microbial species in the environment and their functions in the community [8]. However, analyzing the microbiome and identifying microbial biomarkers for disease prediction are challenging. Firstly, a metagenomic sample is represented by its microbial taxonomic composition (the relative abundance of microbial taxa), which is sparse and noisy. Secondly, the sample size is small and the dimensionality is high. In addition, the features depend on one another because of the interactions between microorganisms [9]. These problems limit the accuracy and reliability of metagenomic sample classification and feature selection. Recent studies have shown that deep learning models perform encouragingly owing to their representation learning ability [10], [11]. The purpose of representation learning is to learn representations of the data that make it easier to extract useful information when building classifiers or other predictors [12]. However, Deep Neural Networks (DNNs) may not be suitable for microbiome studies because they require an excessive amount of training data, which is often impractical to obtain. In addition, DNNs are often regarded as black boxes, which makes it difficult to use them for feature selection.

By removing features that are redundant or irrelevant to the problem, feature selection has become indispensable in knowledge discovery tasks. It usually requires identifying the most informative subset of features. Because biological data are noisy (noise usually arises from sample processing, sequencing, etc.), small in sample size and high-dimensional, a good feature selection algorithm should produce a robust result. There have been many research efforts to obtain a stable feature selection algorithm [13]. Among them, ensemble feature selection is considered a promising approach that requires neither complex feature transformations nor prior knowledge of the underlying domain. Ensemble feature selection considers multiple feature selection methods instead of one, much as ensemble learning combines multiple classifiers in order to reach a good classification. There are two main steps in ensemble feature selection. Firstly, each feature selection algorithm produces a subset of features. Secondly, these subsets are combined to form a final feature list, a step usually called rank aggregation [14]. Ensemble learning is believed to benefit from the following three aspects. From a statistical perspective, a single learner may over-fit and generalize poorly; combining multiple classifiers reduces this risk. From a computational point of view, learning algorithms tend to fall into local optima; this risk can be alleviated by combining the results of multiple runs. Finally, from the perspective of representation, multiple learners cover a larger hypothesis space than a single one [15]. Therefore, it is reasonable to believe that ensemble feature selection can achieve promising results.
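The two steps described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it aggregates by mean rank (a Borda-style rule), which is one of several possible rank aggregation rules and is not necessarily the one used in this work. The function name aggregate_ranks and the taxon names are hypothetical.

```python
from collections import defaultdict

def aggregate_ranks(rankings, top_k):
    """Combine several ranked feature lists by mean rank (a Borda-style rule).

    Each ranking lists features from most to least important. A feature that
    a selector did not return is assigned that selector's worst rank.
    """
    features = set().union(*rankings)
    positions = defaultdict(list)
    for ranking in rankings:
        rank_of = {feature: i for i, feature in enumerate(ranking)}
        worst = len(ranking)
        for feature in features:
            positions[feature].append(rank_of.get(feature, worst))
    mean_rank = {f: sum(r) / len(r) for f, r in positions.items()}
    # Sort by mean rank, breaking ties alphabetically for reproducibility.
    return sorted(mean_rank, key=lambda f: (mean_rank[f], f))[:top_k]

# Step 1 (assumed already done): three selectors each return a ranked list.
rankings = [
    ["taxon_A", "taxon_B", "taxon_C", "taxon_D"],
    ["taxon_B", "taxon_A", "taxon_D", "taxon_C"],
    ["taxon_A", "taxon_C", "taxon_B", "taxon_E"],
]

# Step 2: rank aggregation into a final feature list.
selected = aggregate_ranks(rankings, top_k=3)
print(selected)
```

A feature that ranks well under every selector keeps a low mean rank, while one that a single selector happens to favor is pushed down, which is the intuition behind the stability argument.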

In this paper, we propose an ensemble feature selection method based on Deep Forest to conduct microbiome-wide association studies. Deep Forest can perform not only microbial sample classification but also feature selection. Compared with traditional machine learning methods, Deep Forest can reach a deeper level through layer-wise learning in the classification task. Therefore, it can be used as a deep learning model for representation learning. More importantly, Deep Forest does not require a large amount of training data, yet it shows better performance with fewer hyper-parameters than DNNs. In the feature selection task, Deep Forest is regarded as an ensemble of random forests, which enables it to conduct ensemble feature selection. Our experiments show that Deep Forest has good stability and robustness. In other words, Deep Forest benefits from both deep representation learning and ensemble learning.
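To make the layer-wise idea concrete, the sketch below shows how one cascade level can hand information to the next. It follows the general Deep Forest design, in which each forest in a level outputs a class-probability vector and these vectors are concatenated with the original features to form the next level's input. It is an assumption that the present method follows this design exactly. The forests are not trained here, their outputs are hypothetical numbers, and the helper names are made up for illustration.

```python
def augment_features(original, class_vectors):
    """Concatenate the original features with each forest's class-probability vector."""
    augmented = list(original)
    for vector in class_vectors:
        augmented.extend(vector)
    return augmented

def average_class_vector(class_vectors):
    """Average the forests' class-probability vectors, as done at the last level."""
    n = len(class_vectors)
    return [sum(column) / n for column in zip(*class_vectors)]

# Hypothetical relative abundances of four taxa in one sample.
sample = [0.10, 0.00, 0.65, 0.25]

# Hypothetical (healthy, disease) probabilities from two forests in one level.
level_output = [[0.30, 0.70], [0.40, 0.60]]

# Input to the next level: 4 original features + 2 forests x 2 classes.
next_input = augment_features(sample, level_output)

# If this were the last level, the prediction would average the forests.
prediction = average_class_vector(level_output)
print(len(next_input), prediction)
```

Because every forest in every level is a tree ensemble with its own feature importances, the cascade naturally supplies the multiple rankings that an ensemble feature selection step needs.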

Section snippets

Related work

Microbiome-wide association studies require not only metagenomic sample classification but also feature selection. Therefore, in this section we introduce the related work from these two aspects. Because machine learning offers many classifiers, we focus on those used for the classification of metagenomic samples, and we then present feature selection methods, especially research on ensemble feature selection.

Our approach

Feature selection is an important step in microbiome-wide association studies. To date, no ensemble feature selection method focuses on microbiome-wide association studies, so we first introduce a deep learning model (one that is not a Deep Neural Network) and then propose our ensemble feature selection method.

Datasets

We use three datasets from the MetAML package [19]: Cirrhosis, Type 2 Diabetes (T2D) and Obesity (Table 1). The cirrhosis dataset is composed of 114 patients and 118 healthy subjects. The T2D dataset comprises 170 patients and 174 healthy subjects. The obesity dataset comes from a study of 292 individuals, of whom 89 had a body mass index (BMI) lower than 25 kg/m² and 164 had a BMI greater than 30 kg/m². Each dataset is produced by whole metagenome

Discussion

Biomarker discovery is important to microbiome-wide association studies. However, dealing with multi-omics data is challenging because of its biological complexity and high level of technical and biological noise. As a result, a good feature selection algorithm should be robust. Our feature selection method based on Deep Forest is an ensemble approach that benefits from both ensemble learning and deep learning. However, there are many aggregation methods for ensemble feature selection.
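One common way to quantify the robustness requirement discussed above is to repeat the selection on resampled data and measure how much the selected subsets overlap. The sketch below uses the mean pairwise Jaccard index for this purpose. This is an assumption made for illustration, since this excerpt does not state which stability measure the authors used, and the selected subsets shown are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two feature subsets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)

def selection_stability(subsets):
    """Mean pairwise Jaccard index over the subsets selected in repeated runs."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two selected subsets")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical subsets selected on three resampled versions of one dataset.
runs = [
    {"taxon_A", "taxon_B", "taxon_C"},
    {"taxon_A", "taxon_B", "taxon_D"},
    {"taxon_A", "taxon_B", "taxon_C"},
]

stability = selection_stability(runs)
print(round(stability, 3))
```

A value of 1 means every run selected the same features, and values near 0 mean the selections barely overlap. Comparing this score across aggregation methods is one way to choose among them.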

Conclusion

In this work, we proposed an ensemble feature selection method based on Deep Forest to conduct microbiome-wide association studies. Our method achieved good results on three datasets. First, Deep Forest achieved better classification performance than traditional machine learning methods such as SVMs and kNNs. It also achieved better classification results than CNNs. More importantly, Deep Forest has a cascade structure that enables it to do deep representation learning with less

Acknowledgement

This research is supported by the National Key Research and Development Program of China (2017YFC0909502) and the National Natural Science Foundation of China (No. 61532008 and 61872157).

References (41)

A.D. Kostic et al., The microbiome in inflammatory bowel disease: current status and the future ahead, Gastroenterology (2014)
D. Knights et al., Human-associated microbial signatures: examining their predictive value, Cell Host Microbe (2011)
J. Li, Feature selection: a data perspective, ACM Comput. Surv. CSUR (2018)
L. Wang et al., Feature selection methods for big data bioinformatics: a survey from the search perspective, Methods (2016)
B. Pes et al., Exploiting the ensemble paradigm for stable feature selection: a case study on high-dimensional genomic data, Inf. Fusion (2017)
B. Seijo-Pardo et al., Ensemble feature selection: homogeneous and heterogeneous approaches, Knowl.-Based Syst. (2017)
Z. He et al., Stable feature selection for biomarker discovery, Comput. Biol. Chem. (2010)
P.J. Turnbaugh et al., The human microbiome project, Nature (2007)
G.K. John et al., The gut microbiome and obesity, Curr. Oncol. Rep. (2016)
J. Qin, A metagenome-wide association study of gut microbiota in type 2 diabetes, Nature (2012)
M. Carabotti et al., The gut-brain axis: interactions between enteric microbiota, central and enteric nervous systems, Ann. Gastroenterol. Q. Publ. Hell. Soc. Gastroenterol. (2015)
J.A. Gilbert, Microbiome-wide association studies link dynamic microbial consortia to disease, Nature (2016)
J. Wang et al., Metagenome-wide association studies: fine-mining the microbiome, Nat. Rev. Microbiol. (2016)
A.M. Comeau et al., Microbiome helper: a custom and streamlined workflow for microbiome research, mSystems (2017)
H. Li, Microbiome, metagenomics, and high-dimensional compositional data analysis, Annu. Rev. Stat. Its Appl. (2015)
T. Ching et al., Opportunities and obstacles for deep learning in biology and medicine, J. R. Soc. Interface (2018)
M. Wainberg et al., Deep learning in biomedicine, Nat. Biotechnol. (2018)
Y. Bengio et al., Representation learning: a review and new perspectives, IEEE Trans. Pattern Anal. Mach. Intell. (2013)
A.B. Brahim et al., Ensemble feature selection for high dimensional data: a new method and a comparative study, Adv. Data Anal. Classif. (2018)
P. Yang et al., A review of ensemble methods in bioinformatics, Curr. Bioinforma. (2010)
