Missing data techniques in classification for cardiovascular dysautonomias diagnosis

Idri, Ali; Kadi, Ilham; Abnane, Ibtissam; Fernandez-Aleman, José Luis

doi:10.1007/s11517-020-02266-x

Original Article
Published: 24 September 2020

Volume 58, pages 2863–2878, (2020)

Medical & Biological Engineering & Computing

Ali Idri¹,²,
Ilham Kadi¹,
Ibtissam Abnane¹ &
…
José Luis Fernandez-Aleman³

Abstract

Missing data (MD) is a common and unavoidable problem for data mining (DM)-based decision systems in e-health, since many historical medical datasets contain a large number of missing values. A pre-processing stage is therefore usually required to handle missing values before building any DM-based decision system. The purpose of this paper is to evaluate the impact of MD techniques on classification systems for cardiovascular dysautonomias diagnosis. We analyzed and compared the accuracy rates of four classification techniques, namely random forest (RF), support vector machines (SVM), C4.5 decision tree, and Naive Bayes (NB), under two MD techniques: deletion and imputation with k-nearest neighbors (KNN). A total of 216 experiments (4 classifiers × 3 missingness mechanisms × 2 MD techniques × 9 MD percentages) were carried out using three missingness mechanisms (MCAR: missing completely at random, MAR: missing at random, and NMAR: not missing at random), two MD techniques (deletion and KNN imputation), and nine MD percentages from 10 to 90%, on a dataset collected from the autonomic nervous system (ANS) unit of the University Hospital Avicenne in Morocco. The results suggest that using KNN imputation rather than deletion improves the accuracy rates of the four classifiers. Moreover, the accuracy rates of the four classifiers decrease as the MD percentage increases, regardless of the MD mechanism and MD technique used.
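To make the experimental design concrete, the following is a minimal Python sketch, not the authors' implementation. It enumerates the factor grid described in the abstract and shows toy versions of MCAR injection, deletion, and KNN imputation. The function names (`inject_mcar`, `delete_incomplete`, `knn_impute`), the choice to count the MD percentage over cells, the default `k=3`, and the distance measure are all assumptions made for illustration; the abstract does not specify them.

```python
import itertools
import math
import random

# Factors named in the abstract; their product gives the number of experiments.
CLASSIFIERS = ["RF", "SVM", "C4.5", "NB"]
MECHANISMS = ["MCAR", "MAR", "NMAR"]
TECHNIQUES = ["deletion", "knn_imputation"]
PERCENTAGES = list(range(10, 100, 10))  # 10, 20, ..., 90

experiments = list(
    itertools.product(CLASSIFIERS, MECHANISMS, TECHNIQUES, PERCENTAGES)
)


def inject_mcar(rows, percentage, seed=0):
    """Return a copy of rows with `percentage`% of cells set to None.

    Cells are picked uniformly at random, independent of any value (MCAR).
    Counting the percentage over cells is an assumption of this sketch.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    n_missing = round(len(cells) * percentage / 100)
    out = [list(row) for row in rows]
    for i, j in rng.sample(cells, n_missing):
        out[i][j] = None
    return out


def delete_incomplete(rows):
    """Deletion: keep only the rows that have no missing value."""
    return [row for row in rows if None not in row]


def knn_impute(rows, k=3):
    """Fill each missing cell with the mean of that feature over k donors.

    Donors are rows where the feature is observed; distance is the root mean
    squared difference over features observed in both rows (an assumption).
    Cells with no usable donor are left as None.
    """
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = []
            for d, other in enumerate(rows):
                if d == i or other[j] is None:
                    continue
                shared = [
                    (a - b) ** 2
                    for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if shared:
                    dist = math.sqrt(sum(shared) / len(shared))
                    donors.append((dist, other[j]))
            donors.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                filled[i][j] = sum(nearest) / len(nearest)
    return filled
```

For example, `inject_mcar(data, 30)` followed by either `delete_incomplete(...)` or `knn_impute(...)` yields the two kinds of training set that the study compares. Deletion shrinks the dataset as the MD percentage grows, whereas imputation keeps every row at the cost of estimated values.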

Acknowledgment

This work was conducted within the research project MPHR-PPR1/09-2015-2018. The authors would like to thank the Moroccan MESRSFC and CNRST for their support.

Funding

This work is also part of the GINSENG-UMU (TIN2015-70259-C2-2-R) projects, supported by the Spanish Ministry of Economy, Industry and Competitiveness and European FEDER funds.

Author information

Authors and Affiliations

Software Project Management Research Team, Mohammed V University, Rabat, Morocco
Ali Idri, Ilham Kadi & Ibtissam Abnane
CSEHS-MSDA, Mohammed VI Polytechnic University, Ben Guerir, Morocco
Ali Idri
Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Murcia, Spain
José Luis Fernandez-Aleman

Corresponding author

Correspondence to Ali Idri.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Idri, A., Kadi, I., Abnane, I. et al. Missing data techniques in classification for cardiovascular dysautonomias diagnosis. Med Biol Eng Comput 58, 2863–2878 (2020). https://doi.org/10.1007/s11517-020-02266-x

Received: 22 November 2018
Accepted: 08 September 2020
Published: 24 September 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s11517-020-02266-x
