Skip to main content
Log in

Detecting biomarkers from microarray data using distributed correlation based gene selection

  • Research Article
  • Published:
Genes & Genomics Aims and scope Submit manuscript

Abstract

Background

Over the past few decades, DNA microarray technology has emerged as a prevailing process for early identification of cancer subtypes. Several feature selection (FS) techniques have been widely applied for identifying cancer from microarray gene data but only very few studies have been conducted on distributing the feature selection process for detecting cancer subtypes.

Objective

Not all the gene expressions are needed in prediction, this research article objective is to select discriminative biomarkers by using distributed FS method which helps in accurately diagnosis of cancer subtype. Traditional feature selection techniques have several drawbacks like unrelated features that could perform well in terms of classification accuracy with a suitable subset of genes will be left out of the selection.

Method

To overcome the issue, in this paper a new filter-based method for gene selection is introduced which can select the highly relevant genes for distinguishing tissues from the gene expression dataset. In addition, it is used to compute the relation between gene–gene and gene–class and simultaneously identify subset of essential genes. Our method is tested on Diffuse Large B cell Lymphoma (DLBCL) dataset by using well-known classification techniques such as support vector machine, naïve Bayes, k-nearest neighbor, and decision tree.

Results

Results on biological DLBCL dataset demonstrate that the proposed method provides promising tools for the prediction of cancer type, with the prediction accuracy of 97.62%, precision of 94.23%, sensitivity of 94.12%, F-measure of 90.12%, and ROC value of 99.75%.

Conclusion

The experimental results reveal the fact that the proposed method is significantly improved classification accuracy and execution time, compared to existing standard algorithms when applied to the non-partitioned dataset. Furthermore, the extracted genes are biologically sound and agree with the outcome of relevant biomedical studies.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  • Agarwalla P, Mukhopadhyay S (2018) Bi-stage hierarchical selection of pathway genes for cancer progression using a swarm based computational approach. Appl Soft Comput 62:230–250

    Article  Google Scholar 

  • Alirezaei M, Taghi S, Niaki A, Armin S, Niaki A (2019) A bi-objective hybrid optimization algorithm to reduce noise and data dimension in diabetes diagnosis using support vector machines. Expert Syst Appl 127:47–57

    Article  Google Scholar 

  • Ang JC, Mirzal A, Haron H, Nuzly H, Hamed A (2016) Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection. IEEE/ACM Trans Comput Biol Bioinf 13(5):971–989

    Article  Google Scholar 

  • Apolloni J, Leguizamón G, Alba E (2016) Two hybrid wrapper-filter feature selection algorithms applied to high-dimensional microarray experiments. Appl Soft Comput 38:922–932

    Article  Google Scholar 

  • Bolón-Canedo V, Sánchez-Maroño N, Alonso-Betanzos A (2015) Distributed feature selection: an application to microarray data classification. Appl Soft Comput 30:136–150

    Article  Google Scholar 

  • Daniel RP, Luis R (2019) Distributed ReliefF based feature selection in spark. Knowl Inf Syst 57(1):1–20

    Google Scholar 

  • Dara RA, Makrehchi M, Kamel MS (2010) Filter-based data partitioning for training multiple classifier systems. IEEE Trans Knowl Data Eng 22(4):508–522

    Article  Google Scholar 

  • Edsgärd D, Johnsson P, Sandberg R (2018) Identification of spatial expression trends in single-cell gene expression data. Nat Methods 15(5):339–342

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  • Fabris F, Freitas AA, Tullet JMA (2016) An extensive empirical comparison of probabilistic hierarchical classifiers in datasets of ageing-related genes. IEEE ACM Trans Comput Biol Bioinf 13(6):1045–1058

    Article  Google Scholar 

  • Ferreira AJ, Figueiredo MAT (2012) Efficient feature selection filters for high-dimensional data. Pattern Recognit Lett 33(13):1794–1804

    Article  Google Scholar 

  • Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Mach Learn 29:131–163

    Article  Google Scholar 

  • Gonzalez-lopez J, Ventura S, Cano A (2019) Distributed multi-label feature selection using individual mutual information measures. Knowl based Syst 188:105052

    Article  Google Scholar 

  • Gutkin M, Shamir R, Dror G (2009) SlimPLS: a method for feature selection in gene expression-based DISEASE classification. PLoS One 4(7):6416

    Article  CAS  Google Scholar 

  • Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3(3):1157–1182

    Google Scholar 

  • Han J, Pei J, Kamber M (2006) Data mining: concepts and techniques. Morgan Kaufmann Elsevier, San Francisco

    Google Scholar 

  • Hu L, Gao W, Zhao K, Zhang P, Wang F (2018) Feature selection considering two types of feature relevancy and feature interdependency. Expert Syst Appl 93:423–434

    Article  Google Scholar 

  • Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. Int Jt Conf Artif Intell 14(2):1137–1145

    Google Scholar 

  • Liu J, Lin Y, Lin M (2017) Feature selection based on quality of information. Neurocomputing 255(10):11–22

    Google Scholar 

  • Macgregor PF, Squire JA (2002) Application of microarrays to the analysis of gene expression in cancer. Clin Chem 48(8):1170–1177

    Article  CAS  PubMed  Google Scholar 

  • Maulik U, Mukhopadhyay A, Chakraborty D (2013) Gene-expression-based cancer subtypes prediction through feature selection and transductive SVM. IEEE Trans Biomed Eng 60(4):1111–1117

    Article  PubMed  Google Scholar 

  • Medjahed SA, Saadi TA, Benyettou A, Ouali M (2017) Kernel-based learning and feature selection analysis for cancer diagnosis. Appl Soft Comput 51(04):39–48

    Article  Google Scholar 

  • Mollaee M, Moattar MH (2016) A novel feature extraction approach based on ensemble feature selection and modified discriminant independent component analysis for microarray data classification. Biocybern Biomed Eng 36(3):1–9

    Article  Google Scholar 

  • Mukhopadhyay A, Maulik U (2013) An SVM-wrapped multiobjective evolutionary feature selection approach for identifying cancer-MicroRNA markers. IEEE Trans Nanobiosci 12(4):275–281

    Article  Google Scholar 

  • Nguyen BH, Xue B, Andreae P (2019) A new binary particle swarm optimization approach : momentum and dynamic balance between exploration and exploitation. IEEE Trans Cybern 1–15

  • Palma-Mendoza R-J, de-Marcos L, Rodriguez D (2018) Distributed correlation-based feature selection in spark. Inf Sci (NY) 496:287–299

    Article  Google Scholar 

  • Pang H, Goerge SL, Hui K, Tong T, George SL, Hui K, Tong T (2012) Gene selection using iterative feature elimination random forests for survival outcomes. IEEE ACM Trans Comput Biol Bioinf 9(5):997–1003

    Google Scholar 

  • Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

    Article  PubMed  Google Scholar 

  • Qu Y, Li R, Deng A, Shang C, Shen Q (2019). Non-unique decision differential entropy-based feature selection. Neurocomputing

  • Quinlan JR (1993) C4.5: programs for machine learning. Elsevier, New York

    Google Scholar 

  • Ruiz R, Riquelme JC, Aguilar-ruiz JS (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognit Lett 39:2383–2392

    Article  Google Scholar 

  • Shukla AK (2020) Multi-population adaptive genetic algorithm for selection of microarray biomarkers. Neural Comput Appl 1–30

  • Shukla AK, Singh P, Vardhan M (2019a) A hybrid framework for optimal feature subset selection. J Intell Fuzzy Syst 36(3):2247–2259

    Article  Google Scholar 

  • Shukla AK, Singh P, Vardhan M (2019b) A new hybrid wrapper TLBO and SA with SVM approach for gene expression data. Inf Sci (NY) 503:238–254

    Article  Google Scholar 

  • Shukla AK, Singh P, Vardhan M (2019c) A new hybrid feature subset selection framework based on binary genetic algorithm and information theory. Int J Comput Intell Appl 18(03):1950020

    Article  Google Scholar 

  • Shukla AK, Singh P, Vardhan M (2020) An adaptive inertia weight teaching-learning-based optimization algorithm and its applications. Appl Math Model 77:309–326

    Article  Google Scholar 

  • Stevens KN, Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory 13(1):21–27

    Article  Google Scholar 

  • Sun Y (2007) Iterative RELIEF for feature weighting: algorithms, theories, and applications. IEEE Trans Pattern Anal Mach Intell 29(6):1035–1051

    Article  PubMed  Google Scholar 

  • Tang J, Zhou S (2016) A new approach for feature selection from microarray data based on mutual information. IEEE ACM Trans Comput Biol Bioinf 13(6):1004–1015

    Article  Google Scholar 

  • Venkataramana L, Gracia S, Rajavel J, Dodda R (2019) Improving classification accuracy of cancer types using parallel hybrid feature selection on microarray gene expression data. Genes Genom 41(11):1301–1313

    Article  Google Scholar 

  • Wang A, An N, Chen G, Li L, Alterovitz G (2015) Accelerating wrapper-based feature selection with K-nearest-neighbor. Knowl Based Syst 83:81–91

    Article  Google Scholar 

  • Wang A, An N, Yang J, Chen G, Li L, Alterovitz G (2017) Wrapper-based gene selection with Markov blanket. Comput Biol Med 81:11–23

    Article  CAS  PubMed  Google Scholar 

  • Wang H, Tan L, Niu B (2019) Feature selection for classification of microarray gene expression cancers using bacterial colony optimization with multi-dimensional population. Swarm Evol Comput 48:172–181

    Article  Google Scholar 

  • Wu X, Kumar V, Ross QJ, Ghosh J, Yang Q, Motoda H, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

    Article  Google Scholar 

  • Wu HC, Wei XG, Chan SC (2017) Novel consensus gene selection criteria for distributed gpu partial least squares-based gene microarray analysis in diffused large B cell lymphoma (DLBCL) and related findings. IEEE ACM Trans Comput Biol Bioinf 59:1–14

    Google Scholar 

  • Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224

    Google Scholar 

  • Zhao L, Chen Z, Hu Y, Min G, Jiang Z (2016) Distributed feature selection for efficient economic big data analysis. IEEE Trans Big Data 13(9):1–10

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Alok Kumar Shukla.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This study was performed using available datasets, as per my compliance with ethical standards there were no human or animal participants, and therefore, the study did not require ethics approval.

Research involving human and animal participants

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Shukla, A.K., Tripathi, D. Detecting biomarkers from microarray data using distributed correlation based gene selection. Genes Genom 42, 449–465 (2020). https://doi.org/10.1007/s13258-020-00916-w

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13258-020-00916-w

Keywords

Navigation