
Missing data techniques in classification for cardiovascular dysautonomias diagnosis

  • Original Article
Medical & Biological Engineering & Computing

Abstract

Missing data (MD) is a common and unavoidable problem for data mining (DM)-based decision systems in e-health, since many historical medical datasets contain a large number of missing values. A pre-processing stage is therefore usually required to handle missing values before building any DM-based decision system. The purpose of this paper is to evaluate the impact of MD techniques on classification systems for cardiovascular dysautonomias diagnosis. We analyzed and compared the accuracy rates of four classification techniques, namely random forest (RF), support vector machines (SVM), C4.5 decision tree, and Naive Bayes (NB), under two MD techniques: deletion and k-nearest neighbors (KNN) imputation. A total of 216 experiments (4 classifiers × 3 missingness mechanisms × 2 MD techniques × 9 MD percentages) were carried out using three missingness mechanisms (MCAR: missing completely at random; MAR: missing at random; NMAR: not missing at random), two MD techniques (deletion and KNN imputation), and nine MD percentages from 10 to 90%, over a dataset collected from the autonomic nervous system (ANS) unit of the University Hospital Avicenne in Morocco. The results suggest that using KNN imputation rather than deletion improves the accuracy rates of the four classifiers. Moreover, the MD percentage has a negative impact on classification performance regardless of the MD mechanism and MD technique used: the accuracy rates of the four classifiers decrease as the MD percentage increases.
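The experimental design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It enumerates the experiment grid and shows, on purely numeric toy rows, how MCAR injection, deletion, and KNN imputation might be set up. Several points are assumptions made only for illustration: the nine MD percentages are taken in steps of 10, an "MD percentage" is read as the share of rows that receive one missing value, features are numeric, distances are Euclidean over the observed features, and all function names are hypothetical.

```python
import itertools
import math
import random

CLASSIFIERS = ["RF", "SVM", "C4.5", "NB"]
MECHANISMS = ["MCAR", "MAR", "NMAR"]
MD_TECHNIQUES = ["deletion", "KNN imputation"]
MD_PERCENTAGES = list(range(10, 100, 10))  # assumed steps of 10: 10, 20, ..., 90


def experiment_grid():
    """Every (classifier, mechanism, MD technique, MD percentage) combination."""
    return list(itertools.product(CLASSIFIERS, MECHANISMS,
                                  MD_TECHNIQUES, MD_PERCENTAGES))


def inject_mcar(rows, percentage, rng):
    """Blank one random feature in a random `percentage` % of the rows.

    Rows and features are picked uniformly at random, independently of the
    data values, which is what makes the mechanism MCAR (assumed reading).
    """
    out = [list(r) for r in rows]
    n_missing = round(len(out) * percentage / 100)
    for i in rng.sample(range(len(out)), n_missing):
        out[i][rng.randrange(len(out[i]))] = None
    return out


def delete_incomplete(rows):
    """Deletion: keep only the rows without any missing value."""
    return [r for r in rows if None not in r]


def knn_impute(rows, k=3):
    """Fill each missing value with the mean of that feature over the k
    nearest complete rows, measured on the features the row does have."""
    complete = [r for r in rows if None not in r]
    if not complete:
        raise ValueError("KNN imputation needs at least one complete row")
    imputed = []
    for r in rows:
        if None not in r:
            imputed.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]

        def dist(c):
            return math.sqrt(sum((r[j] - c[j]) ** 2 for j in observed))

        neighbours = sorted(complete, key=dist)[:k]
        imputed.append([
            v if v is not None
            else sum(c[j] for c in neighbours) / len(neighbours)
            for j, v in enumerate(r)
        ])
    return imputed
```

Deletion shrinks the training set as the MD percentage grows, whereas imputation keeps every row at the cost of estimated values. The MAR and NMAR mechanisms would additionally require the choice of which cells to blank to depend on observed or unobserved values, respectively, which this sketch does not cover.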

Acknowledgment

This work was conducted within the research project MPHR-PPR1/09-2015-2018. The authors would like to thank the Moroccan MESRSFC and CNRST for their support.

Funding

This work is also part of the GINSENG-UMU project (TIN2015-70259-C2-2-R), supported by the Spanish Ministry of Economy, Industry and Competitiveness and European FEDER funds.

Author information

Corresponding author

Correspondence to Ali Idri.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Idri, A., Kadi, I., Abnane, I. et al. Missing data techniques in classification for cardiovascular dysautonomias diagnosis. Med Biol Eng Comput 58, 2863–2878 (2020). https://doi.org/10.1007/s11517-020-02266-x

  • DOI: https://doi.org/10.1007/s11517-020-02266-x
