Abstract
Traditional classifiers often fail to reach the desired classification accuracy because microRNA (miRNA) gene expression cancer datasets contain too few training samples. To address this, we propose a novel semi-supervised ensemble learning (SSEL) strategy that combines the advantages of semi-supervised learning and ensemble learning and produces better results than its individual constituent classifiers. The proposed method is validated on eight publicly available miRNA gene expression datasets of pancreatic and colorectal cancers and compared with six other state-of-the-art methods in terms of classification accuracy, precision, recall, macro \(F_{1}\)-measure and kappa. The experimental results show that the proposed SSEL method significantly outperforms the compared methods for cancer sample classification. Statistical significance tests, receiver operating characteristic (ROC) curves and area under the curve (AUC) values support the results in favor of the proposed method.
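Two of the evaluation measures named above, the macro \(F_{1}\)-measure and kappa, are less familiar than plain accuracy. The following is a minimal sketch of how they can be computed from true and predicted labels, assuming the standard textbook definitions (macro \(F_{1}\) as the unweighted mean of per-class \(F_{1}\) scores, and Cohen's kappa as agreement corrected for chance). The function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    n = len(y_true)
    observed = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected by chance from the two label distributions.
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if expected == 1.0:
        # Degenerate case (one class only); returning 0.0 is a convention
        # chosen for this sketch, since kappa is undefined here.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, for illustration only).
y_true = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
y_pred = ["tumor", "tumor", "normal", "normal", "normal", "tumor"]

print(round(macro_f1(y_true, y_pred), 3))     # 0.667
print(round(cohen_kappa(y_true, y_pred), 3))  # 0.333
```

In this toy case accuracy and macro \(F_{1}\) are both 2/3, while kappa is only 1/3 because half of the agreement would be expected by chance on balanced classes. This is why kappa is often reported alongside accuracy on small datasets.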
About this article
Cite this article
Marak, D.C.B., Halder, A. & Kumar, A. Semi-supervised Ensemble Learning for Efficient Cancer Sample Classification from miRNA Gene Expression Data. New Gener. Comput. 39, 487–513 (2021). https://doi.org/10.1007/s00354-021-00123-5
DOI: https://doi.org/10.1007/s00354-021-00123-5