Ensemble-based active learning using fuzzy-rough approach for cancer sample classification

https://doi.org/10.1016/j.engappai.2020.103591

Abstract

Background and Objective: Classification of cancer from gene expression data is one of the major research areas in machine learning and medical science. Conventional supervised methods often fail to produce the desired classification accuracy because gene expression data contain too few training samples to train the system. An ensemble-based active learning technique can be effective in this situation: each base classifier identifies a few informative samples, and the decisions of all the base classifiers are combined to obtain the most informative samples. These most informative samples are labeled by subject experts and added to the training set, which can improve the classification accuracy.

Method: We propose a novel ensemble-based active learning method using a fuzzy-rough approach for cancer sample classification from microarray gene expression data. The proposed method is able to deal with the uncertainty, overlap and indiscernibility usually present in the subtype classes of gene expression data and can improve the accuracy of the individual base classifiers in the presence of limited training samples.

Results: The proposed method is validated using eight microarray gene expression datasets. The performance of the proposed method in terms of classification accuracy, precision, recall, F1-measures and kappa is compared with six other methods. The improvements in accuracy achieved by the proposed method compared to its nearest competitive methods are 2.96%, 9.34%, 0.93%, 3.69%, 7.2% and 4.53% respectively for Colon cancer, Prostate cancer, SRBCT, Ovarian cancer, DLBCL and Central nervous system datasets. Results of the paired t-test justify the statistical relevance of the results in favor of the proposed method for most of the datasets.

Conclusion: The proposed method is an effective general-purpose ensemble-based active learning method adopting the fuzzy-rough concept and can therefore be applied to other classification problems in the future.

Introduction

Cancer is a severe disease caused by abnormal and uncontrolled growth of cells. Cancer cells usually behave differently from normal cells and can spread to other parts of the body. It is the second-leading cause of death worldwide, and approximately 9.6 million people die from cancer every year according to the Union for International Cancer Control (UICC), Switzerland (https://www.worldcancerday.org/what-cancer). Therefore, classification of cancer subtypes at an early stage has become a vital area of research worldwide.

Traditional cancer classification is based on the morphological appearance of the tumor and on clinical tests. Traditional diagnostic methods are often time consuming, usually expensive and sometimes inaccurate. These methods also often rely on the expert’s observation to differentiate distinct cancer subtype classes, as most cancers are molecularly different and follow distinct clinical procedures.

Microarray data (Stekel, 2003, Maroulis et al., 2006) are often adopted to classify cancer samples in order to provide a low-cost diagnosis. A microarray measures thousands of gene expression profiles simultaneously (Stekel, 2003). In microarray data the number of genes is very large compared to the number of samples (Du et al., 2014), which gives rise to the curse of dimensionality. Additionally, the subtypes of cancer classes are often indiscernible, vague, overlapping and ambiguous (Pawlak, 1991). Therefore, traditional computational methods may not achieve the desired accuracy, as the number of training samples is not adequate. Hence, it becomes crucial to design a classifier that can handle the above-mentioned challenges and produce high classification accuracy (Lu and Han, 2003).

Several machine learning algorithms have been widely applied to microarray gene expression data analysis using supervised (i.e., classification) (Dettling and Buhlmann, 2003, Rapaport et al., 2007, Tan et al., 2011, Maniruzzaman et al., 2019), unsupervised (i.e., clustering) (Dettling and Buhlmann, 2002, Sturn et al., 2002, Jiang et al., 2004), semi-supervised clustering (Doan et al., 2011, Priscilla and Swamynathan, 2013, Wang and Pan, 2014), semi-supervised classification (Shi and Zhang, 2011, Halder and Misra, 2014, Maulik and Chakraborty, 2014), active learning based classification (Halder et al., 2015, Liu, 2004, Vogiatzis and Tsapatsoulis, 2008, Halder and Kumar, 2019, Kumar and Halder, 2019a), and ensemble based classification (Dettling and Buhlmann, 2003, Yang et al., 2010, Osareh and Shadgar, 2013, Tan and Gilbert, 2003, Xiao et al., 2018) frameworks. Researchers have also recently applied fuzzy-set-based uncertainty processing to multi-criteria decision making based on belief entropy (Xiao, 2019b), workflow scheduling in distributed environments (Xiao et al., 2019), and pattern classification problems (Xiao, 2019a).

Generally, classical supervised learning algorithms (Bishop, 2006) require a large number of training samples to classify the unlabeled samples. Labeled samples are often expensive, time consuming and difficult to obtain, whereas unlabeled samples are relatively easy to collect. Moreover, in microarray gene expression data, the classes are often vague, indiscernible, imprecise and overlapping in nature (Maji and Pal, 2012). Therefore, traditional classifiers often fail to achieve the desired accuracy for cancer classification. To address this, researchers have tried to improve the prediction accuracy by using ensemble-based classification (Dettling and Buhlmann, 2003, Yang et al., 2010, Osareh and Shadgar, 2013, Tan and Gilbert, 2003), where the decisions of a few base classifiers are combined, for example by majority voting (Polikar, 2006) or weighted majority voting (Polikar, 2006). Alternatively, researchers have tried to enhance the classification accuracy by adopting active learning (Halder et al., 2015, Liu, 2004, Vogiatzis and Tsapatsoulis, 2008, Halder and Kumar, 2019, Kumar and Halder, 2019a). In active learning, the most informative samples are selected computationally, their labels are obtained from experts, and the newly labeled samples are added to the limited training set. The active learning method thereby iteratively increases the number of training samples, which ultimately helps to improve the prediction accuracy.
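
As an illustration of the two combination rules named above, the following minimal Python sketch shows majority voting and weighted majority voting over the labels predicted by a few base classifiers. The function names, the toy labels and the weights are assumptions made for illustration only; they are not taken from the article, which does not specify an implementation here.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers.

    Ties are broken in favor of the label that appears first in `predictions`.
    """
    return Counter(predictions).most_common(1)[0][0]


def weighted_majority_vote(predictions, weights):
    """Return the label whose supporting classifiers have the largest total weight.

    `weights` is a hypothetical per-classifier reliability score,
    for example a validation accuracy.
    """
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three base classifiers vote on one sample (toy labels).
votes = ["tumor", "normal", "tumor"]
print(majority_vote(votes))                            # tumor
print(weighted_majority_vote(votes, [0.2, 0.9, 0.3]))  # normal
```

The example shows how the two rules can disagree: two weak classifiers outvote one strong classifier under plain majority voting, but not once the votes are weighted.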

Motivated by the individual advantages of ensemble and active learning techniques, and to improve the accuracy further, this article proposes a novel ensemble-based active learning technique using a fuzzy-rough approach to classify cancer samples from gene expression data. The technique can handle the above-mentioned challenges: (i) ambiguity, overlap, vagueness and indiscernibility, (ii) the low predictive accuracy produced by an individual classifier, and (iii) the scarcity of clinically labeled samples.

In this context, an ensemble-based active learning technique is expected to be useful as it judiciously combines the advantages of ensemble learning and active learning. The ensemble learning technique amalgamates the decisions of multiple base classifiers to produce a final decision that is expected to be better than that of any individual base classifier. The active learning technique, in turn, computationally selects a very small number of most informative samples with the help of the ensemble of base classifiers; these unlabeled samples are labeled by subject experts and then added to the small set of training samples to improve the classification accuracy.
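
The query-and-augment cycle of active learning can be sketched generically as below. This is a minimal pool-based loop under stated assumptions, not the authors' algorithm: the samples are scalars, `margin_informativeness` is a hypothetical score that treats a sample as informative when it lies almost equally close to two classes, and `expert_label` stands in for the subject expert.

```python
def margin_informativeness(x, labeled):
    """Hypothetical score: higher when x is nearly equidistant from two classes.

    Assumes `labeled` contains samples from at least two classes.
    """
    nearest = {}
    for value, label in labeled:
        distance = abs(value - x)
        nearest[label] = min(distance, nearest.get(label, distance))
    closest, second = sorted(nearest.values())[:2]
    return -(second - closest)


def expert_label(x):
    """Stand-in for the subject expert (toy ground truth)."""
    return "B" if x >= 5 else "A"


def active_learning(labeled, pool, oracle, informativeness, rounds, batch_size):
    """Repeatedly query the oracle for the most informative pool samples."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        ranked = sorted(pool, key=lambda x: informativeness(x, labeled), reverse=True)
        for x in ranked[:batch_size]:
            labeled.append((x, oracle(x)))  # expert supplies the label
            pool.remove(x)                  # sample leaves the unlabeled pool
    return labeled, pool


training, remaining = active_learning(
    labeled=[(0, "A"), (10, "B")],
    pool=[1, 4, 6, 9],
    oracle=expert_label,
    informativeness=margin_informativeness,
    rounds=2,
    batch_size=1,
)
print(training)   # [(0, 'A'), (10, 'B'), (4, 'A'), (6, 'B')]
print(remaining)  # [1, 9]
```

In this toy run the two samples nearest the class boundary are queried first, while the samples that the current training set already separates clearly stay in the pool.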

Although ensemble-based active learning techniques have been applied with promising results in areas such as gesture recognition (Schumacher et al., 2012), image classification (Beluch et al., 2018) and web document classification (Schnitzer et al., 2014), this technique has not so far been explored for microarray gene expression data analysis. To the best of the authors’ knowledge, the method proposed in this article is the first of its kind to address the cancer classification problem from gene expression data by adopting ensemble-based active learning using rough-fuzzy theory.

The remainder of the article is organized as follows. The background theory related to the proposed method is briefly given in Section 2. Section 3 describes the proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) method. Section 4 describes the experimental evaluation and Section 5 reports the experimental results and discussion. Finally, concluding remarks and future directions of work are given in Section 6.

Section snippets

Background

The proposed method, ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN), merges fuzzy sets, rough sets, active learning and ensemble learning; brief descriptions of these are presented below:

Ensemble-based active learning using fuzzy-rough nearest neighbor classifier

The proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) comprises (i) an ensemble-based active learning step and (ii) an ensemble-based testing step. In the first step, an ensemble-based active learning approach is adopted to search for the ‘most informative’ unlabeled samples by taking the consensus (intersection) of P informative sample sets and to obtain their labels from the subject experts, so that the labeled ‘most informative samples’ can be augmented
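
The consensus step described above, taking the intersection of P informative sample sets, can be sketched as follows. This is only an illustration under assumptions: the excerpt does not say how each base classifier builds its informative set, so the sets here are hypothetical collections of sample indices, and `consensus_informative` is a name introduced for this sketch.

```python
def consensus_informative(informative_sets):
    """Return the samples that every one of the P base classifiers found informative."""
    if not informative_sets:
        return set()
    return set.intersection(*(set(s) for s in informative_sets))


# Hypothetical informative sets (sample indices) from P = 3 base classifiers.
informative_sets = [
    {2, 5, 7, 9},
    {5, 7, 8},
    {1, 5, 7},
]
most_informative = consensus_informative(informative_sets)
print(sorted(most_informative))  # [5, 7]
```

Using the intersection keeps the number of expert queries small, because a sample is sent for labeling only when all base classifiers agree that it is informative; if the sets share no sample, nothing is queried in that round.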

Experimental evaluation

In this section, details of the datasets used in the present investigation and of the compared methods are reported, followed by the performance evaluation measures. Finally, the experimental setup is summarized.

Results and discussions

Table 3 summarizes the average experimental results of 10 simulations on eight gene expression datasets in terms of percentage accuracy, precision, recall, macro F1, micro F1, and kappa achieved by the proposed and compared methods. Bold values represent the best results obtained, and the standard deviation of the accuracies over the 10 simulations is shown after the ± sign next to each percentage accuracy in Table 3.
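
For readers unfamiliar with these measures, the sketch below computes accuracy, macro F1, micro F1 and Cohen's kappa from a list of true and predicted labels using their standard textbook definitions. It is an illustration only: the exact formulas used in the article are not given in this excerpt, and the function name and toy labels are assumptions.

```python
from collections import Counter


def evaluate(y_true, y_pred):
    """Accuracy, macro F1, micro F1 and Cohen's kappa for single-label predictions."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # true positives per class

    accuracy = sum(hits.values()) / n

    def f1(c):
        precision = hits[c] / pred_counts[c] if pred_counts[c] else 0.0
        recall = hits[c] / true_counts[c] if true_counts[c] else 0.0
        total = precision + recall
        return 2 * precision * recall / total if total else 0.0

    macro_f1 = sum(f1(c) for c in labels) / len(labels)
    # With exactly one label per sample, micro-averaged precision and recall
    # both equal accuracy, so micro F1 equals accuracy as well.
    micro_f1 = accuracy
    # Agreement expected by chance from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - expected) / (1 - expected) if expected < 1 else float("nan")
    return {"accuracy": accuracy, "macro_f1": macro_f1, "micro_f1": micro_f1, "kappa": kappa}


scores = evaluate(
    ["A", "A", "A", "B", "B", "B"],
    ["A", "A", "B", "B", "B", "A"],
)
print(scores)
```

In the toy example four of six samples are correct, so accuracy is about 0.67, while kappa is about 0.33 because half of that agreement would be expected by chance.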

It can be seen from the summarized experimental results

Conclusions

A gene expression dataset comprises a limited number of samples compared to the number of genes. However, basic classification techniques require a ‘sufficient’ number of samples to achieve the desired classification accuracy, and a single classifier is usually not adequate for accurate decisions, which yields low prediction accuracy. To address these problems, an ensemble-based active learning technique is proposed, as the ensemble learning technique combines the decisions of multiple base classifiers

References (59)

  • Wang Y. et al. Semi-supervised consensus clustering for gene expression data analysis. BioData Min. (2014)
  • Xiao Y. et al. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. (2018)
  • Zadeh L. Fuzzy sets. Inf. Control (1965)
  • Aha D.W. et al. Instance-based learning algorithms. Mach. Learn. (1991)
  • Alon U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays
  • Beluch W.H. et al. The power of ensembles for active learning in image classification
  • Bishop C. Pattern Recognition and Machine Learning (2006)
  • Breiman L. Bagging predictors. Mach. Learn. (1996)
  • Cohen J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960)
  • Dettling M. Bagboosting for tumor classification with gene expression data. Bioinformatics (2004)
  • Dettling M. et al. Supervised clustering of genes. Genome Biol. (2002)
  • Dettling M. et al. Boosting for tumor classification with gene expression data. Bioinformatics (2003)
  • Doan D. et al. Utilization of gene ontology in semi-supervised clustering
  • Duda R. et al. Pattern Classification (2000)
  • Golub T. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
  • Halder A. et al. Active learning using fuzzy k-NN for cancer classification from microarray gene expression data
  • Halder A. et al. Semi-supervised fuzzy k-NN for cancer classification from microarray gene expression data
  • Jiang D. et al. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. (2004)
  • Keller J. et al. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. (1985)

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103591.

    1 Both authors contributed equally.
