Detecting biomarkers from microarray data using distributed correlation based gene selection

Shukla, Alok Kumar; Tripathi, Diwakar

doi:10.1007/s13258-020-00916-w


Research Article
Published: 10 February 2020

Volume 42, pages 449–465 (2020)

Genes & Genomics

Alok Kumar Shukla¹ &
Diwakar Tripathi²


Abstract

Background

Over the past few decades, DNA microarray technology has emerged as a prevailing process for early identification of cancer subtypes. Several feature selection (FS) techniques have been widely applied to identify cancer from microarray gene data, but very few studies have examined distributing the feature selection process for detecting cancer subtypes.

Objective

Not all gene expression values are needed for prediction. The objective of this research article is to select discriminative biomarkers using a distributed FS method that helps in the accurate diagnosis of cancer subtypes. Traditional feature selection techniques have several drawbacks; for example, seemingly unrelated features that could perform well in terms of classification accuracy when combined with a suitable subset of genes are left out of the selection.

Method

To overcome this issue, this paper introduces a new filter-based gene selection method that selects the highly relevant genes for distinguishing tissues in a gene expression dataset. The method computes the gene–gene and gene–class relations and simultaneously identifies a subset of essential genes. It is tested on the Diffuse Large B-cell Lymphoma (DLBCL) dataset using well-known classification techniques: support vector machine, naïve Bayes, k-nearest neighbor, and decision tree.
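As a rough illustration of the idea, the sketch below shows one way a distributed correlation-based filter could be arranged. It is not the authors' implementation: the abstract does not specify the correlation measure, the partitioning scheme, or the merge step, so the absolute Pearson correlation, the random split of genes into disjoint blocks, the standard CFS merit formula, and every function and gene name here are assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return abs(cov / math.sqrt(vx * vy))


def cfs_merit(subset, genes, labels):
    """CFS merit k*r_cf / sqrt(k + k*(k-1)*r_ff): rewards gene-class
    correlation (r_cf) and penalises gene-gene correlation (r_ff)."""
    k = len(subset)
    r_cf = sum(pearson(genes[g], labels) for g in subset) / k
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = sum(pearson(genes[a], genes[b]) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(candidates, genes, labels):
    """Forward selection: add the gene that raises the merit most,
    and stop as soon as no candidate improves it."""
    selected, best = [], 0.0
    remaining = list(candidates)
    while remaining:
        score, gene = max(
            (cfs_merit(selected + [g], genes, labels), g) for g in remaining
        )
        if score <= best:
            break
        selected.append(gene)
        remaining.remove(gene)
        best = score
    return selected


def distributed_cfs(genes, labels, n_parts=2, seed=0):
    """Assumed scheme: split genes into disjoint blocks, select within each
    block independently, then run one final selection on the merged result."""
    names = sorted(genes)
    random.Random(seed).shuffle(names)
    blocks = [names[i::n_parts] for i in range(n_parts)]
    merged = [g for block in blocks for g in greedy_cfs(block, genes, labels)]
    return greedy_cfs(sorted(merged), genes, labels)


# Hypothetical toy data: six samples, two classes, three made-up genes.
labels = [0, 0, 0, 1, 1, 1]
genes = {
    "g_signal": [0.1, 0.2, 0.1, 0.9, 1.0, 0.8],  # tracks the class
    "g_copy": [0.2, 0.4, 0.2, 1.8, 2.0, 1.6],    # redundant with g_signal
    "g_noise": [0.5, 0.1, 0.9, 0.4, 0.6, 0.2],   # weakly related to the class
}
selected = distributed_cfs(genes, labels)
print(selected)
```

On the toy data, the redundancy penalty in the merit keeps a perfectly correlated duplicate from being added alongside the gene it copies, which is the gene–gene part of the criterion. In a real setting each block would be processed on a separate worker; the loop over blocks is sequential here only to keep the sketch self-contained.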

Results

Results on the biological DLBCL dataset demonstrate that the proposed method is a promising tool for predicting cancer type, with a prediction accuracy of 97.62%, precision of 94.23%, sensitivity of 94.12%, F-measure of 90.12%, and ROC value of 99.75%.
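For readers unfamiliar with these measures, the following minimal sketch computes accuracy, precision, sensitivity, and F-measure from binary confusion-matrix counts using their textbook definitions. The function name and the counts are hypothetical and are not taken from the paper; the abstract does not say how the reported figures were averaged across folds or classifiers, and the ROC value is omitted because it requires per-sample scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Textbook metrics from true/false positive and negative counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    denom = precision + sensitivity
    # F-measure is the harmonic mean of precision and sensitivity.
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only; these are NOT the paper's results.
scores = binary_metrics(tp=18, fp=1, fn=2, tn=21)
for name, value in scores.items():
    print(f"{name}: {value:.2%}")
```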

Conclusion

The experimental results show that the proposed method significantly improves classification accuracy and execution time compared with existing standard algorithms applied to the non-partitioned dataset. Furthermore, the extracted genes are biologically sound and agree with the outcomes of relevant biomedical studies.



Author information

Authors and Affiliations

Department of Computer Science and Engineering, G L Bajaj Institute of Technology and Management, Greater Noida, Uttar Pradesh, India
Alok Kumar Shukla
SRM University, Amaravati, India
Diwakar Tripathi


Corresponding author

Correspondence to Alok Kumar Shukla.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This study was performed using available datasets. In compliance with ethical standards, there were no human or animal participants, and therefore the study did not require ethics approval.

Research involving human and animal participants

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Shukla, A.K., Tripathi, D. Detecting biomarkers from microarray data using distributed correlation based gene selection. Genes Genom 42, 449–465 (2020). https://doi.org/10.1007/s13258-020-00916-w


Received: 10 February 2019
Accepted: 23 January 2020
Published: 10 February 2020
Issue Date: April 2020
DOI: https://doi.org/10.1007/s13258-020-00916-w
