Robust biomarker discovery for microbiome-wide association studies
Introduction
The human body is inhabited by a large number of microorganisms (such as fungi, viruses and archaea) that are distributed across body sites such as the mouth, skin and gut [1]. Many studies have shown that microbial imbalance (changes in microbial composition or metabolism) in humans can lead to disease. For example, disorders of the intestinal microbiota have been linked to obesity [2], inflammatory bowel disease [3] and type 2 diabetes [4]. Even disorders of the brain (such as depression and Alzheimer's disease) may be closely related to intestinal microbial disorders. The pathway through which intestinal microbes affect the human brain is known as the gut-brain axis [5]. By analogy with genome-wide association studies (GWAS), studies of the associations between the microbiome and phenotypes are called microbiome-wide association studies (MWAS); they attempt to find disease-associated microbial biomarkers. Such biomarkers benefit the early diagnosis of disease, the development of microbial treatments and the understanding of disease mechanisms [6], [7].
With the development of high-throughput sequencing technology, the microbial DNA in a sample can be sequenced simultaneously. After pipeline analysis, we can identify most of the microbial species in the environment and their functions in the community [8]. However, analyzing the microbiome and identifying microbial biomarkers for disease prediction remain challenging. Firstly, a metagenomic sample is represented by its microbial taxonomic composition (the relative abundance of microbial taxa), which is sparse and noisy. Secondly, the sample size is small while the dimensionality is high. In addition, the features depend on one another because of interactions between microorganisms [9]. These problems limit the accuracy and reliability of metagenomic sample classification and feature selection. Recent studies have reported encouraging performance from deep learning models, owing to their representation learning ability [10], [11]. The purpose of representation learning is to learn representations of the data that make it easier to extract useful information when building classifiers or other predictors [12]. However, deep neural networks (DNNs) may not be suitable for microbiome studies because they require large amounts of training data, which are often unavailable. In addition, DNNs are often regarded as black boxes, which makes feature selection difficult.
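To make the first challenge concrete, the following minimal sketch converts raw per-taxon read counts into relative abundances and measures how sparse the resulting profiles are. The counts and the function names `relative_abundance` and `sparsity` are hypothetical and chosen only for illustration; they are not taken from the paper's data or pipeline.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts for one sample into proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def sparsity(profiles):
    """Fraction of zero entries across a list of abundance profiles."""
    n_entries = sum(len(p) for p in profiles)
    n_zeros = sum(1 for p in profiles for v in p if v == 0)
    return n_zeros / n_entries


# Hypothetical read counts: 3 samples x 5 taxa (not data from the paper).
raw_counts = [
    [120, 0, 0, 30, 0],
    [0, 45, 0, 0, 5],
    [10, 0, 0, 0, 90],
]
profiles = [relative_abundance(sample) for sample in raw_counts]
print(profiles[0])         # proportions for the first sample, summing to 1
print(sparsity(profiles))  # share of taxa that are absent across all samples
```

Even in this toy table most entries are zero, which is the kind of profile a classifier or feature selector has to cope with.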
By removing features that are redundant or irrelevant to the problem, feature selection has become indispensable in knowledge discovery tasks. It usually requires identifying the most informative subset of features. Because biological data are noisy (the noise usually arises from sample processing, sequencing, etc.), small in sample size and high-dimensional, a good feature selection algorithm should produce robust results. Many research efforts have aimed at stable feature selection algorithms [13]. Among them, ensemble feature selection is considered a promising approach that requires neither complex feature transformations nor prior knowledge of the underlying domain. Ensemble feature selection considers multiple feature selection methods instead of one, similar to ensemble learning, which combines multiple classifiers in order to reach a good classification. There are two main steps in ensemble feature selection. Firstly, each feature selection algorithm produces a subset of features. Secondly, these subsets are combined into a final feature list, a step usually called rank aggregation [14]. Ensemble learning is believed to benefit from the following three aspects. From a statistical perspective, a single learner may over-fit and generalize poorly; combining multiple classifiers reduces this risk. From a computational perspective, learning algorithms tend to fall into local optima; this risk can be alleviated by combining the results of multiple runs. Finally, from a representational perspective, multiple learners together cover a larger hypothesis space than a single learner [15]. Therefore, it is reasonable to believe that ensemble feature selection can achieve promising results.
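The two steps described above can be sketched as follows. This is only an illustration under stated assumptions: the three score tables stand in for the outputs of three hypothetical feature selectors, every selector is assumed to score every feature, and mean rank is just one of many possible rank aggregation rules, not necessarily the one used in this paper.

```python
def rank_features(scores):
    """Turn {feature: score} into {feature: rank}; rank 1 is the highest score.

    Ties are broken alphabetically by feature name so the result is deterministic.
    """
    ordered = sorted(scores, key=lambda f: (-scores[f], f))
    return {feature: position for position, feature in enumerate(ordered, start=1)}


def aggregate_by_mean_rank(score_tables, top_k):
    """Step 2: merge the per-selector rankings into one list by mean rank.

    Assumes every table scores the same set of features.
    """
    rankings = [rank_features(scores) for scores in score_tables]
    features = list(rankings[0])
    mean_rank = {f: sum(r[f] for r in rankings) / len(rankings) for f in features}
    ordered = sorted(features, key=lambda f: (mean_rank[f], f))
    return ordered[:top_k]


# Step 1 (assumed already done): three hypothetical selectors score each taxon.
selector_scores = [
    {"taxon_a": 0.9, "taxon_b": 0.1, "taxon_c": 0.5, "taxon_d": 0.3},
    {"taxon_a": 0.7, "taxon_b": 0.2, "taxon_c": 0.8, "taxon_d": 0.1},
    {"taxon_a": 0.6, "taxon_b": 0.4, "taxon_c": 0.5, "taxon_d": 0.2},
]
print(aggregate_by_mean_rank(selector_scores, top_k=2))
```

Aggregating ranks rather than raw scores avoids having to put scores from different selectors on a common scale, at the cost of discarding how large the score differences were.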
In this paper, we propose an ensemble feature selection method based on Deep Forest for microbiome-wide association studies. Deep Forest can perform both microbial sample classification and feature selection. Compared with traditional machine learning methods, Deep Forest can reach a deeper level through layer-wise learning in the classification task, so it can serve as a deep learning model for representation learning. More importantly, Deep Forest does not require large amounts of training data, and it shows better performance with fewer hyper-parameters than DNNs. In the feature selection task, Deep Forest can be regarded as an ensemble of random forests, which enables it to conduct ensemble feature selection. Our experiments show that Deep Forest has good stability and robustness. In other words, Deep Forest benefits from both deep representation learning and ensemble learning.
Section snippets
Related work
Microbiome-wide association studies require not only metagenomic sample classification but also feature selection. Therefore, in this section we introduce the related work from these two aspects. Among the many classifiers in machine learning, we focus on those used for the classification of metagenomic samples; we then present feature selection methods, especially related work on ensemble feature selection.
Our approach
Feature selection is an important step in microbiome-wide association studies. To date, there is no ensemble feature selection method that focuses on microbiome-wide association studies. We therefore first introduce a deep learning model (one that is not a deep neural network) and then propose our ensemble feature selection method.
Datasets
We use three datasets from the MetAML package [19]: Cirrhosis, Type 2 Diabetes (T2D) and Obesity (Table 1). The cirrhosis dataset comprises 114 patients and 118 healthy subjects. The T2D dataset comprises 170 patients and 174 healthy subjects. The obesity dataset comes from a study of 292 individuals, of whom 89 had a body mass index (BMI) lower than 25 kg/m² and 164 had a BMI greater than 30 kg/m². Each dataset is produced by whole metagenome
Discussion
Biomarker discovery is important to microbiome-wide association studies. However, dealing with multi-omics data is challenging because of its biological complexity and high level of technical and biological noise. As a result, a good feature selection algorithm should be robust. Our feature selection method based on Deep Forest is an ensemble approach, which benefits from both ensemble learning and deep learning. However, there are many aggregation methods for ensemble feature selection.
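One common way to quantify the robustness mentioned above is to rerun feature selection on resampled data and measure how much the selected subsets overlap. The sketch below uses the average pairwise Jaccard index; this choice of measure, the function names and the example subsets are assumptions made for illustration and are not claimed to be the stability measure used in this paper.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two feature subsets: |intersection| / |union|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def selection_stability(subsets):
    """Average pairwise Jaccard index over all pairs of selected subsets."""
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical subsets selected on three resampled versions of one dataset.
runs = [
    ["taxon_a", "taxon_b", "taxon_c"],
    ["taxon_a", "taxon_b", "taxon_d"],
    ["taxon_a", "taxon_b", "taxon_c"],
]
print(selection_stability(runs))  # 1.0 would mean identical subsets in every run
```

A value near 1 indicates that the method keeps selecting the same biomarkers when the training data are perturbed; a value near 0 indicates that the selection is driven largely by noise.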
Conclusion
In this work, we proposed an ensemble feature selection method based on Deep Forest for microbiome-wide association studies. Our method achieved good results on three datasets. First, Deep Forest achieved better classification performance than traditional machine learning methods such as SVMs and kNNs. It also achieved better classification results than CNNs. More importantly, Deep Forest has a cascade structure, which enables it to do deep representation learning with less
Acknowledgement
This research is supported by the National Key Research and Development Program of China (2017YFC0909502) and the National Natural Science Foundation of China (No. 61532008 and 61872157).
References (41)
et al. The microbiome in inflammatory bowel disease: current status and the future ahead. Gastroenterology (2014)
et al. Human-associated microbial signatures: examining their predictive value. Cell Host Microbe (2011)
Feature selection: a data perspective. ACM Comput. Surv. CSUR (2018)
et al. Feature selection methods for big data bioinformatics: a survey from the search perspective. Methods (2016)
et al. Exploiting the ensemble paradigm for stable feature selection: a case study on high-dimensional genomic data. Inf. Fusion (2017)
et al. Ensemble feature selection: homogeneous and heterogeneous approaches. Knowl.-Based Syst. (2017)
et al. Stable feature selection for biomarker discovery. Comput. Biol. Chem. (2010)
et al. The human microbiome project. Nature (2007)
et al. The gut microbiome and obesity. Curr. Oncol. Rep. (2016)
A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature (2012)
The gut-brain axis: interactions between enteric microbiota, central and enteric nervous systems. Ann. Gastroenterol. Q. Publ. Hell. Soc. Gastroenterol.
Microbiome-wide association studies link dynamic microbial consortia to disease. Nature
Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol.
Microbiome helper: a custom and streamlined workflow for microbiome research. mSystems
Microbiome, metagenomics, and high-dimensional compositional data analysis. Annu. Rev. Stat. Its Appl.
Opportunities and obstacles for deep learning in biology and medicine. J. R. Soc. Interface
Deep learning in biomedicine. Nat. Biotechnol.
Representation learning: a review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.
Ensemble feature selection for high dimensional data: a new method and a comparative study. Adv. Data Anal. Classif.
A review of ensemble methods in bioinformatics. Curr. Bioinforma.