Ensemble-based active learning using fuzzy-rough approach for cancer sample classification

doi:10.1016/j.engappai.2020.103591

Engineering Applications of Artificial Intelligence

Volume 91, May 2020, 103591


Abstract

Background and Objective: Classification of cancer from gene expression data is a major research area in machine learning and medical science. Conventional supervised methods generally fail to reach the desired classification accuracy because gene expression data contain too few training samples to train the system. An ensemble-based active learning technique can be effective in this situation: each base classifier identifies a few informative samples, and the decisions of all base classifiers are combined to obtain the most informative samples. These most informative samples are labeled by subject experts and added to the training set, which can improve classification accuracy.

Method: We propose a novel ensemble-based active learning method using a fuzzy-rough approach for cancer sample classification from microarray gene expression data. The proposed method can deal with the uncertainty, overlap and indiscernibility usually present in the subtype classes of gene expression data, and can improve on the accuracy of the individual base classifiers when training samples are limited.

Results: The proposed method is validated on eight microarray gene expression datasets. Its performance in terms of classification accuracy, precision, recall, $F_{1}$-measures and kappa is compared with six other methods. The improvements in accuracy achieved by the proposed method over its nearest competitor are 2.96%, 9.34%, 0.93%, 3.69%, 7.2% and 4.53% for the Colon cancer, Prostate cancer, SRBCT, Ovarian cancer, DLBCL and Central nervous system datasets, respectively. Results of the paired $t$-test support the statistical significance of the results in favor of the proposed method for most of the datasets.

Conclusion: The proposed method is an effective general-purpose ensemble-based active learning method that adopts the fuzzy-rough concept, and it can therefore be applied to other classification problems in future.

Introduction

Cancer is a terrible disease caused by the abnormal and uncontrolled growth of cells. Cancer cells usually behave differently from normal cells and can spread to other parts of the body. It is the second-leading cause of death worldwide, and approximately 9.6 million people die from cancer every year according to the Union for International Cancer Control (UICC), Switzerland (https://www.worldcancerday.org/what-cancer). Therefore, classifying cancer subtypes at an early stage has become a vital area of research worldwide.

Traditional cancer classification is based on the morphological appearance of the tumor and on clinical tests. Traditional diagnostic methods are often time-consuming, usually expensive and sometimes inaccurate. They are also often limited by the expert’s observation in differentiating distinct cancer subtype classes, as most cancers are molecularly different and follow distinct clinical procedures.

Microarray data (Stekel, 2003, Maroulis et al., 2006) are often used to classify cancer samples in order to provide a low-cost diagnosis. A microarray measures thousands of gene expression profiles simultaneously (Stekel, 2003). In microarray data the number of genes is very large compared to the number of samples (Du et al., 2014), which gives rise to the curse of dimensionality. Additionally, the cancer subtype classes are often indiscernible, vague, overlapping and ambiguous (Pawlak, 1991). Therefore, traditional computational methods may not achieve the desired accuracy, as the number of training samples is not adequate. Hence, it becomes crucial to design a classifier that can handle the above-mentioned challenges and produce high classification accuracy (Lu and Han, 2003).

Several machine learning algorithms have been widely applied to microarray gene expression data analysis using supervised (i.e., classification) (Dettling and Buhlmann, 2003, Rapaport et al., 2007, Tan et al., 2011, Maniruzzaman et al., 2019), unsupervised (i.e., clustering) (Dettling and Buhlmann, 2002, Sturn et al., 2002, Jiang et al., 2004), semi-supervised clustering (Doan et al., 2011, Priscilla and Swamynathan, 2013, Wang and Pan, 2014), semi-supervised classification (Shi and Zhang, 2011, Halder and Misra, 2014, Maulik and Chakraborty, 2014), active learning based classification (Halder et al., 2015, Liu, 2004, Vogiatzis and Tsapatsoulis, 2008, Halder and Kumar, 2019, Kumar and Halder, 2019a), and ensemble based classification (Dettling and Buhlmann, 2003, Yang et al., 2010, Osareh and Shadgar, 2013, Tan and Gilbert, 2003, Xiao et al., 2018) frameworks. Researchers have also recently used fuzzy sets for uncertainty processing in multi-criteria decision making based on belief entropy (Xiao, 2019b), in workflow scheduling in distributed environments (Xiao et al., 2019), and in pattern classification problems (Xiao, 2019a).

Generally, classical supervised learning algorithms (Bishop, 2006) require a large number of training samples to classify unlabeled samples. Labeled samples are often expensive, time-consuming and difficult to obtain, whereas unlabeled samples are relatively easy to collect. Moreover, in microarray gene expression data, the classes are often vague, indiscernible, imprecise and overlapping in nature (Maji and Pal, 2012). Therefore, traditional classifiers often fail to achieve the desired accuracy for cancer classification. Researchers have consequently tried to improve prediction accuracy by using ensemble-based classification (Dettling and Buhlmann, 2003, Yang et al., 2010, Osareh and Shadgar, 2013, Tan and Gilbert, 2003), where the decisions of a few base classifiers are combined by techniques such as majority voting and weighted majority voting (Polikar, 2006). Alternatively, researchers have tried to enhance classification accuracy by adopting active learning (Halder et al., 2015, Liu, 2004, Vogiatzis and Tsapatsoulis, 2008, Halder and Kumar, 2019, Kumar and Halder, 2019a). In active learning, the most informative samples are selected computationally, labeled by experts, and then added to the limited training set. The active learning method thereby iteratively increases the number of training samples, which ultimately helps to improve prediction accuracy.
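The two combination rules mentioned above, majority voting and weighted majority voting, can be illustrated with a minimal Python sketch. The function names, the example labels and the classifier weights are assumptions made for illustration only; they are not taken from the article, and the sketch does not reproduce the authors' implementation.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    # most_common orders equal counts by first occurrence, so ties go to
    # the label that appears first in `predictions`.
    return Counter(predictions).most_common(1)[0][0]

def weighted_majority_vote(predictions, weights):
    """Return the label with the largest total classifier weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical decisions of three base classifiers for one sample
votes = ["tumor", "normal", "tumor"]
# Hypothetical weights, e.g. derived from each classifier's validation accuracy
weights = [0.2, 0.9, 0.3]

plain = majority_vote(votes)
weighted = weighted_majority_vote(votes, weights)
print(plain, weighted)
```

In this toy case the two rules disagree: two of three classifiers vote "tumor", but the single "normal" vote carries more total weight, which shows why the choice of weights matters.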

Motivated by the individual advantages of ensemble and active learning techniques, and to improve accuracy further, this article proposes a novel ensemble-based active learning technique using a fuzzy-rough approach to classify cancer samples from gene expression data. The technique is intended to handle the challenges mentioned above: (i) ambiguity, overlap, vagueness and indiscernibility, (ii) the low predictive accuracy of an individual classifier, and (iii) the scarcity of clinically labeled samples.

In this context, an ensemble-based active learning technique is expected to be useful because it judiciously combines the advantages of ensemble learning and active learning. Ensemble learning amalgamates the decisions of multiple base classifiers to produce a final decision that is expected to be better than that of any individual base classifier. Active learning, in turn, uses the multiple base classifiers of the ensemble to computationally select a very small number of the most informative unlabeled samples; these are labeled by subject experts and then added to the small training set to improve classification accuracy.

Although ensemble-based active learning techniques have been applied with promising results in areas such as gesture recognition (Schumacher et al., 2012), image classification (Beluch et al., 2018) and web document classification (Schnitzer et al., 2014), the technique has not so far been explored for microarray gene expression data analysis. To the best of the authors’ knowledge, the method proposed in this article is the first of its kind to address the cancer classification problem from gene expression data by adopting ensemble-based active learning with rough-fuzzy theory.

The remainder of the article is organized as follows. Section 2 briefly gives the background theory related to the proposed method. Section 3 describes the proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) method. Section 4 describes the experimental evaluation, and Section 5 reports the experimental results and discussion. Finally, concluding remarks and future directions of work are given in Section 6.

Section snippets

Background

The proposed method, ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN), merges fuzzy sets, rough sets, active learning and ensemble learning; brief descriptions of these are presented below:

Ensemble-based active learning using fuzzy-rough nearest neighbor classifier

The proposed ensemble-based active learning using fuzzy-rough nearest neighbor classifier (EnALFRNN) comprises (i) an ensemble-based active learning step and (ii) an ensemble-based testing step. In the first step, an ensemble-based active learning approach is adopted to search for the ‘most informative’ unlabeled samples by taking the consensus (intersection) of $P$ informative sample sets to get the labels from the subject experts, so that the labeled ‘most informative samples’ can be augmented
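The consensus step described in this snippet, taking the intersection of $P$ informative sample sets, can be sketched in Python as follows. This is only an illustration under stated assumptions: the margin rule in `informative_set`, the `threshold` value and the membership scores are hypothetical, since the snippet does not say how each base classifier builds its informative set. Only the intersection itself follows the text.

```python
def informative_set(memberships, threshold=0.2):
    """Ids whose two highest class memberships differ by less than threshold.

    memberships: dict mapping sample id -> list of class-membership scores.
    The margin rule and the threshold are assumptions for illustration.
    """
    chosen = set()
    for sample_id, scores in memberships.items():
        top, second = sorted(scores, reverse=True)[:2]
        if top - second < threshold:
            chosen.add(sample_id)
    return chosen

def consensus(informative_sets):
    """Intersection of the P informative sets, one per base classifier."""
    sets = list(informative_sets)
    return set.intersection(*sets) if sets else set()

# Hypothetical membership scores from P = 3 base classifiers, two classes
outputs = [
    {"s1": [0.55, 0.45], "s2": [0.90, 0.10], "s3": [0.52, 0.48]},
    {"s1": [0.48, 0.52], "s2": [0.70, 0.30], "s3": [0.58, 0.42]},
    {"s1": [0.51, 0.49], "s2": [0.85, 0.15], "s3": [0.30, 0.70]},
]

# Only samples that every base classifier finds informative are queried
most_informative = consensus(informative_set(m) for m in outputs)
print(sorted(most_informative))
```

Because an intersection is used, a sample is sent to the expert only when all base classifiers agree that it is informative, which keeps the number of label queries small. The intersection can also be empty, a case a practical implementation would need to handle.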

Experimental evaluation

This section reports the details of the datasets used in the present investigation and the compared methods, followed by the performance evaluation measures. Finally, the experimental setup is summarized.

Results and discussions

Table 3 summarizes the average experimental results of $10$ simulations on eight gene expression datasets in terms of percentage accuracy, precision, recall, macro $F_{1}$, micro $F_{1}$, and kappa achieved by the proposed and compared methods. Bold values represent the best results obtained, and the standard deviation of the accuracies over the $10$ simulations is shown with a $\pm$ sign next to each percentage accuracy in Table 3.
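For readers unfamiliar with these measures, the following Python sketch computes them from true and predicted labels. It is an illustration under assumptions, not the authors' evaluation code: precision and recall are macro-averaged over classes here, which the snippet does not specify, and the function name `evaluate` and the toy labels are hypothetical. A non-empty label list is assumed.

```python
from collections import Counter

def evaluate(y_true, y_pred):
    """Accuracy, macro precision/recall/F1, micro F1 and Cohen's kappa."""
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    pairs = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)

    accuracy = sum(pairs[(c, c)] for c in labels) / n

    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = pairs[(c, c)]
        p = tp / pred_counts[c] if pred_counts[c] else 0.0
        r = tp / true_counts[c] if true_counts[c] else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)

    # Chance agreement estimated from the marginal label frequencies
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0

    k = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "macro_f1": sum(f1s) / k,
        # With exactly one label per sample, micro-averaged F1 equals accuracy
        "micro_f1": accuracy,
        "kappa": kappa,
    }

# Hypothetical labels for six test samples
y_true = ["a", "a", "a", "b", "b", "b"]
y_pred = ["a", "a", "b", "b", "b", "a"]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Kappa is useful alongside accuracy because it discounts the agreement expected by chance: in the toy example four of six predictions are correct, yet half of that agreement would be expected from the label frequencies alone.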

It can be seen from the summarized experimental results

Conclusions

A gene expression dataset comprises a limited number of samples compared to the number of genes. However, basic classification techniques require a ‘sufficient’ number of samples to achieve the desired classification accuracy, and a single classifier is usually not adequate to make accurate decisions, which yields low prediction accuracy. To address these problems, an ensemble-based active learning technique is proposed, as the ensemble learning technique combines the decision of multiple base classifiers

References (59)

Du D. et al. A novel forward gene selection algorithm for microarray data. Neurocomputing (2014)
Halder A. et al. Aggregation pheromone metaphor for semi-supervised classification. Pattern Recognit. (2013)
Halder A. et al. Active learning using rough fuzzy classifier for cancer prediction from microarray gene expression data. J. Biomed. Inform. (2019)
Jensen R. et al. Fuzzy-rough nearest neighbour classification and prediction. Theoret. Comput. Sci. (2011)
Lu Y. et al. Cancer classification using gene expression data. Inf. Syst. Spec. Issue: Data Manage. Bioinform. (2003)
Maniruzzaman M. et al. Statistical characterization and classification of colon microarray gene expression data using multiple machine learning paradigms. Comput. Methods Programs Biomed. (2019)
Maroulis D. et al. Microarray-md: A system for exploratory analysis of microarray gene expression data. Comput. Methods Programs Biomed. (2006)
Radzikowska A.M. et al. A comparative study of fuzzy rough sets. Fuzzy Sets and Systems (2002)
Singh D. et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002)
Vogiatzis D. et al. Active learning for microarray data. Internat. J. Approx. Reason. (2008)
Wang Y. et al. Semi-supervised consensus clustering for gene expression data analysis. BioData Min. (2014)
Xiao Y. et al. A deep learning-based multi-model ensemble method for cancer prediction. Comput. Methods Programs Biomed. (2018)
Zadeh L. Fuzzy sets. Inf. Control (1965)
Aha D.W. et al. Instance-based learning algorithms. Mach. Learn. (1991)
Alon U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays
Beluch W.H. et al. The power of ensembles for active learning in image classification
Bishop C. Pattern Recognition and Machine Learning (2006)
Breiman L. Bagging predictors. Mach. Learn. (1996)
Cohen J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. (1960)
Dettling M. Bagboosting for tumor classification with gene expression data. Bioinformatics (2004)
Dettling M. et al. Supervised clustering of genes. Genome Biol. (2002)
Dettling M. et al. Boosting for tumor classification with gene expression data. Bioinformatics (2003)
Doan D. et al. Utilization of gene ontology in semi-supervised clustering
Duda R. et al. Pattern Classification (2000)
Golub T. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999)
Halder A. et al. Active learning using fuzzy k-NN for cancer classification from microarray gene expression data
Halder A. et al. Semi-supervised fuzzy k-NN for cancer classification from microarray gene expression data
Jiang D. et al. Cluster analysis for gene expression data: A survey. IEEE Trans. Knowl. Data Eng. (2004)
Keller J. et al. A fuzzy k-nearest neighbor algorithm. IEEE Trans. Syst. Man Cybern. (1985)


^☆: No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2020.103591.

¹: Both authors contributed equally.
