A Self-organizing Deep Auto-Encoder approach for Classification of Complex Diseases using SNP Genomics Data

https://doi.org/10.1016/j.asoc.2020.106718

Highlights

  • Self-Organizing Deep Auto-Encoders are designed for classifying complex diseases based on SNP genomics data.

  • Self-Organizing Deep Auto-Encoder can construct its structure automatically.

  • It removes the need for trial and error in discovering the network structure.

  • It decreases the time spent and the computational burden in the training and operation phases.

  • The proposed algorithm has achieved better accuracy and F-score than existing algorithms.

Abstract

Recently, many Machine Learning algorithms have been used to identify significant Single Nucleotide Polymorphisms (SNPs) in various human diseases. However, SNP detection and healthy-patient classification face several principal obstacles. The curse of dimensionality is the main challenge. In addition, the number of samples is far smaller than the number of SNPs, and the numbers of healthy and patient samples can be unequal. These challenges make feature selection and classification very difficult. The main goal of the current study is to combine various algorithms to find the most effective way of analyzing SNP data. To this end, an efficient method is proposed to identify significant SNPs and classify healthy and patient samples. First, Mean Encoding is used to convert the nominal SNP data to numeric values. Then a two-step filter method is used for feature selection, which removes the irrelevant and redundant features. Finally, the proposed deep auto-encoder, which constructs its structure automatically based on the input data, is employed for classification. To evaluate the proposed approach, we apply it to five different SNP datasets, covering thyroid cancer, mental retardation, breast cancer, colorectal cancer, and autism, which were obtained from the Gene Expression Omnibus (GEO) database. Based on the selected features, the proposed method classifies healthy and patient samples in thyroid cancer, mental retardation, breast cancer, colorectal cancer, and autism with 100%, 94.4%, 100%, 96%, and 99.1% accuracy, respectively. The results indicate that it performs with high efficiency compared with other published works.

Introduction

Deoxyribonucleic acid (DNA) is the building block of the human genome. There are approximately three billion base pairs in the double helix of DNA. More than 99% of them are the same among all populations, and less than 1% differ among individuals. The majority of DNA changes happen as Single Nucleotide Polymorphisms (SNPs). Studies have shown that SNPs can be very significant in association with complex diseases.

Many Machine Learning algorithms have been employed to identify significant SNPs, and many models have been developed to classify healthy and patient samples based on SNP data. D. T. Evans proposed two methods to select significant SNPs, namely Chi-squared Sort and Difference Sort [1], and applied the Support Vector Machine (SVM) to classify healthy and patient samples. In another study, N. Batnyam et al. used popular feature selection algorithms to choose the significant SNPs, including Relief-F, Feature Selection Based on Distance Discriminant (FSDD), Feature Selection Based on R-value (RFS), and Algorithm Based on Feature Clearness (CBFS) [2]. The authors then employed conventional classifiers such as K-Nearest Neighbor (KNN), Artificial Gene Making (AGM), and SVM to classify SNP data. In addition, the Feature Fusion Method (FFM) was used to generate new features by combining existing ones to improve classification accuracy. A. Boutorh et al. proposed a novel method based on hybrid Association Rule Mining (ARM) and Artificial Neural Network (ANN) [3]. The authors applied ARM to select informative features and used Grammatical Evolution (GE) to optimize ARM. The SNP data were classified by an ANN whose parameters were set by a Genetic Algorithm. More recently, R. Alzubi et al. recommended a hybrid feature selection method, involving Conditional Mutual Information Maximization (CMIM) as a filter method and Support Vector Machine-Recursive Feature Elimination (SVM-RFE) as a wrapper method [4]. The authors employed four classifiers, namely Naive Bayes, Linear Discriminant Analysis, KNN, and SVM, to distinguish between healthy and patient groups. In other recent studies, Uppu et al. applied a deep feedforward neural network to classify healthy and patient samples based on SNPs in simulated datasets, without using any feature selection algorithm [5], [6].

However, several principal obstacles remain in the field of SNP detection. The curse of dimensionality is the main challenge because the dimension of SNP data is very high (up to one million). In high-dimensional data, many features are typically irrelevant or redundant; these features decrease the performance of the classifier and increase the computational cost. Additionally, the number of samples (healthy or patient) is far smaller than the number of SNPs, which means the SNP data are sparse. The numbers of healthy and patient samples can also be unequal, which means the SNP data are unbalanced. Sparse and unbalanced data pose further difficulties in most studies. Besides, the nominal data in the SNP dataset must be converted to numeric data, which raises the question of which encoding method to use. Considering all these factors, developing an efficient algorithm that involves both feature selection and classification is hard and complicated.
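To illustrate why unbalanced data complicate evaluation, the following minimal sketch compares accuracy with the F1-score on a hypothetical cohort. The 90/10 class split and the helper name `accuracy_and_f1` are assumptions made for illustration; they are not taken from the study.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; F1 is computed for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there are no positives at all.
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical unbalanced cohort: 90 healthy (0) and 10 patient (1) samples.
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that labels every sample as healthy.
y_pred = [0] * 100

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}, F1={f1:.2f}")
```

Here the degenerate classifier looks strong on accuracy while never detecting a patient, which is why a class-sensitive metric such as the F1-score is usually reported alongside accuracy for unbalanced data.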

In this context, Feature Selection (FS) algorithms play an essential role because they can identify irrelevant and redundant features and reduce dimensionality by removing them. There are many FS algorithms, and each may be suitable for a particular kind of data. Therefore, we determine which of them is best for each specific SNP dataset and can select the significant SNPs that separate healthy and patient samples with high accuracy.
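As a rough illustration of how a filter can remove irrelevant features first and redundant features second, here is a minimal sketch. It is not the study's algorithm: the relevance score (absolute Pearson correlation with the label), the greedy redundancy check, the function name `two_step_filter`, and the threshold values are all assumptions chosen for simplicity.

```python
import numpy as np


def two_step_filter(X, y, n_relevant=50, redundancy_threshold=0.9):
    """Step 1: keep the n_relevant features most correlated with the label.
    Step 2: greedily drop any feature highly correlated with one already kept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    norms = np.sqrt((Xc ** 2).sum(axis=0))
    # Constant columns get an infinite norm so that their correlation becomes 0.
    safe_norms = np.where(norms == 0, np.inf, norms)
    y_norm = np.sqrt((yc ** 2).sum())
    y_norm = y_norm if y_norm > 0 else np.inf
    relevance = np.abs(Xc.T @ yc) / (safe_norms * y_norm)
    ranked = np.argsort(-relevance)[:n_relevant]

    kept = []
    for j in ranked:
        redundant = any(
            abs(Xc[:, j] @ Xc[:, k]) / (safe_norms[j] * safe_norms[k])
            >= redundancy_threshold
            for k in kept
        )
        if not redundant:
            kept.append(int(j))
    return kept


# Toy data: feature 1 duplicates feature 0, feature 2 is unrelated to the label,
# and feature 3 is moderately related to it.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.column_stack([
    y,                          # perfectly relevant
    2 * y + 1,                  # redundant copy of feature 0
    [0, 1, 0, 1, 0, 1, 0, 1],   # irrelevant
    [0, 0, 0, 1, 0, 1, 1, 1],   # moderately relevant
])
selected = two_step_filter(X, y, n_relevant=3)
print(selected)
```

On the toy data, the irrelevant feature is cut in step 1 and one of the two duplicated features is cut in step 2, leaving one copy of the perfectly relevant feature plus the moderately relevant one.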

Additionally, we need a powerful model to classify SNP data into healthy and patient groups. The classification task is carried out based on the significant SNPs selected by the FS algorithm. The classification performance indicates whether or not the selected SNPs have a meaningful impact on complex diseases. In this study, we focus on deep learning methods such as deep auto-encoders. Deep learning is a subfield of representation learning that detects multiple levels of representation. High-level representations (or features) capture more aspects of the data. Research on deep learning was initiated by Geoff Hinton’s group in 2006 [7]. These methods are constructed by combining multiple nonlinear mappings (or transformations) to obtain more abstract representations of the data. A deep model is built by stacking unsupervised representation learning models to form a deep representation. This area has been growing rapidly, and many deep learning methods have been widely discussed and reviewed in recent years. They have succeeded in many applications, especially in computer vision tasks, and can therefore be useful in health informatics applications. Researchers have mainly focused on important applications of deep learning in the fields of translational bioinformatics, medical imaging, pervasive sensing, medical informatics, and public health [8]. Although deep learning methods have succeeded in many applications, the underlying theory is not well understood. There is also no clear understanding of which architecture performs better than the others. It is challenging to determine which structure, how many layers, and how many nodes in each layer are appropriate for a specific task [9]. Specialized knowledge is also required to determine sensible values for parameters such as the learning rate and the regularization coefficient.

Bengio offered practical recommendations for the gradient-based training of deep architectures to determine parameters such as the learning rate, momentum, regularization coefficient, number of training iterations, and other parameters [10]. He suggested that using the same number of nodes in all layers generally works better than or the same as using a decreasing or increasing size [11]. In another study, the authors stated that over-fitting is a serious problem in deep neural networks with a large number of parameters, and that such networks are also slow to use. They proposed the dropout technique to address this problem [12]. In [13], the authors recommended a framework to select the hyper-parameters, including the number of layers and nodes. It contains two basic approaches: manual and automatic. In the manual approach, an expert determines the hyper-parameters, which requires understanding how the hyper-parameters affect the model. Automatic selection uses grid search and random search, which eliminate the need for experts in the model design process but increase the time spent and the computational cost. In another study, the authors added a new penalty term to the loss function, which forces the parameters of some neurons to be zero [14]. However, the main question remains: how many layers (auto-encoders) and how many nodes (neurons) in each layer are suitable for a specific task?
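To give a sense of why an exhaustive grid search over the network structure becomes expensive, the sketch below counts the candidate configurations in a small, hypothetical search space. The candidate depths, layer widths, and learning rates are assumptions for illustration only and do not come from the study.

```python
from itertools import product

# Hypothetical search space (assumed values, not taken from the paper).
layer_counts = [1, 2, 3, 4]
nodes_per_layer = [16, 32, 64, 128, 256]
learning_rates = [1e-1, 1e-2, 1e-3]


def enumerate_architectures(depths, node_choices):
    """Yield every layer-size tuple when each layer picks its width independently."""
    for depth in depths:
        yield from product(node_choices, repeat=depth)


def count_architectures(depths, node_choices):
    """Closed-form count: sum over depths of |node_choices| ** depth."""
    return sum(len(node_choices) ** depth for depth in depths)


n_arch = count_architectures(layer_counts, nodes_per_layer)  # 5 + 25 + 125 + 625
n_runs = n_arch * len(learning_rates)
print(n_arch, n_runs)  # 780 2340
```

Even this small grid implies 2340 training runs, and the count grows exponentially with depth, which is the cost that a structure-learning approach aims to avoid.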

Here, we discuss some of the essential questions that have been driving research on the application of Artificial Intelligence (AI) in medical science. Specifically, which encoding method is suitable for SNP data and can improve feature selection and classification performance? Which FS algorithm is best for each specific SNP dataset and can select significant SNPs? What makes one representation better than another in deep auto-encoders? How many layers, and how many nodes in each layer, are appropriate? We try to answer these questions and propose a new approach to identify the significant SNPs and a new classifier to classify the (case and control) samples in complex diseases. Finally, the proposed method is applied to five different SNP datasets, and the results show that our approach succeeds in this area with high accuracy.

This paper is structured as follows. Section 2 provides a framework of the whole study. The details of the SNP datasets are described in Section 2.1. In Sections 2.2, 2.3, and 2.4, systems and background theories are discussed, covering the pre-processing method, the proposed feature selection algorithm, and the proposed classifier. In Section 3, we present our experimental results, where we apply the proposed approach to several SNP datasets and compare it with previous works.

Methodology

The proposed process involves three stages: (A) A pre-processing stage, which consists of encoding nominal SNP data and removing or replacing missing values in SNP data. (B) A feature selection stage, in which significant SNPs are selected by a suitable FS algorithm. (C) A classification stage, in which the self-organizing auto-encoders are used to classify SNP data. Also, selected SNPs are evaluated according to classification metrics such as accuracy and F1-score. The
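The encoding step in stage (A) can be illustrated with a minimal sketch of mean (target) encoding, in which each genotype category of a SNP is replaced by the mean class label observed for that category in the training data. The exact variant used in the study is not given in this excerpt, so the function names, the toy genotypes, and the fallback to the overall label mean for unseen genotypes are assumptions.

```python
from collections import defaultdict


def fit_mean_encoding(genotypes, labels):
    """Map each genotype category to the mean label of the training samples carrying it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for genotype, label in zip(genotypes, labels):
        sums[genotype] += label
        counts[genotype] += 1
    return {genotype: sums[genotype] / counts[genotype] for genotype in counts}


def apply_mean_encoding(genotypes, mapping, default):
    """Encode genotypes; categories unseen during fitting fall back to `default`."""
    return [mapping.get(genotype, default) for genotype in genotypes]


# Toy training data for a single SNP (labels: 0 = healthy, 1 = patient).
train_genotypes = ["AA", "AG", "AG", "GG", "AA", "GG", "GG", "AG"]
train_labels = [0, 1, 0, 1, 0, 1, 1, 1]

mapping = fit_mean_encoding(train_genotypes, train_labels)
prior = sum(train_labels) / len(train_labels)  # assumed fallback: overall label mean
encoded = apply_mean_encoding(["AA", "GG", "AC"], mapping, default=prior)
print(encoded)  # [0.0, 1.0, 0.625]
```

Fitting the mapping on the training split only, and then applying it to the validation and test splits, avoids leaking label information into the evaluation.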

Experimental results

Briefly, the following steps were carried out for all SNP data, and the simulation results are reported in this section.

1- The SNP data were preprocessed: features whose proportion of missing values exceeded a user-defined threshold (such as 10%) were removed, and the remaining missing values were replaced by suitable values.

2- The SNP data were divided into three parts, namely training, validation, and test sets, comprising 70%, 10%, and 20% of the data, respectively.

3-
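Steps 1 and 2 above can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are marked as NaN in an already-numeric matrix, the "suitable value" used for replacement is taken to be the most frequent observed value in each column, and the split is a simple random shuffle. The function names and the toy data are hypothetical.

```python
import numpy as np


def drop_and_impute(X, max_missing_fraction=0.10):
    """Drop columns with too many missing values, then fill the rest with the column mode."""
    X = np.asarray(X, dtype=float)
    missing_fraction = np.isnan(X).mean(axis=0)
    kept = np.flatnonzero(missing_fraction <= max_missing_fraction)
    X_kept = X[:, kept].copy()
    for c in range(X_kept.shape[1]):
        col = X_kept[:, c]  # view into X_kept, so assignments below modify it in place
        observed = col[~np.isnan(col)]
        if observed.size == 0:
            continue  # nothing to impute from (only possible if the threshold is >= 1)
        values, counts = np.unique(observed, return_counts=True)
        col[np.isnan(col)] = values[np.argmax(counts)]  # assumed rule: most frequent value
    return X_kept, kept


def split_indices(n_samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Return shuffled index arrays for the training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


# Toy data: 20 samples, 3 features with 0%, 5%, and 25% missing values.
X = np.zeros((20, 3))
X[:, 1] = 2.0
X[0, 1] = np.nan
X[:5, 2] = np.nan

X_clean, kept_columns = drop_and_impute(X)
train_idx, val_idx, test_idx = split_indices(X_clean.shape[0])
print(kept_columns, len(train_idx), len(val_idx), len(test_idx))
```

For unbalanced cohorts, a stratified split that preserves the healthy-to-patient ratio in each part may be preferable to the plain shuffle used here.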

Conclusion

Human genome sequencing has achieved excellent success in medical science and has illustrated the importance and effectiveness of genotype in complex diseases. In the current study, we tried to build a framework that has the potential to analyze SNP data. In this regard, we proposed a new method for the selection and classification of significant SNPs in complex diseases. According to our proposed method, the SNP data were preprocessed to eliminate SNPs with high missing

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The Cancer Control Research Center, Cancer Control Foundation, Iran University of Medical Sciences, Tehran, Iran supported this work as project No: CCF-97065. In addition, thanks are due to everyone whose suggestions have helped to improve the quality of this paper. All authors have read and approved the submitted manuscript.

References (42)

  • Alzubi R. et al., A hybrid Feature Selection Method for complex diseases SNPs, IEEE Access (2018)
  • Uppu S., A deep learning approach to detect SNP interactions, J. Softw. (2016)
  • S. Uppu, A. Krishna, P. Gopalan, Towards deep learning in genome-wide association interaction studies, in: Pacific Asia...
  • Hinton G.E. et al., Reducing the dimensionality of data with neural networks, Science (2006)
  • Ravì D., Deep learning for health informatics, IEEE J. Biomed. Health Inf. (2017)
  • Bengio Y., Practical recommendations for gradient-based training of deep architectures
  • Larochelle H. et al., Exploring strategies for training deep neural networks, J. Mach. Learn. Res. (2009)
  • Srivastava N. et al., Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. (2014)
  • Goodfellow I. et al., Deep Learning (2016)
  • Alvarez J.M. et al., Learning the number of neurons in deep networks
  • B. Luzón-Toro, et al. Identification of epistatic interactions through genome-wide association studies in sporadic...

    1

    All authors each made a significant contribution to the research.