
Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect


Abstract

Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can make datasets problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serves to evaluate classifier compatibility and to suggest a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to classifier performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls metrics. Finally, some prospects for the adaptation of this family of metrics to singular (more complex) problems are discussed.



Notes

  1. https://github.com/jdpastri/morphology-metrics.

  2. https://github.com/jdpastri/morphology-metrics.

References

  1. Aggarwal C (2014) Data classification: algorithms and applications. Chapman & Hall/CRC. https://doi.org/10.1201/b17320

  2. Ahmed M (2019) Data summarization: a survey. Knowl Information Syst 58(2):249–273. https://doi.org/10.1007/s10115-018-1183-0

  3. Alejo R, Valdovinos RM, García V, Pacheco-Sanchez JH (2013) A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognit Lett 34(4):380–388. https://doi.org/10.1016/j.patrec.2012.09.003

  4. Alpaydin E (2016) Machine learning: the new AI. MIT Press, Cambridge

  5. Alshomrani S, Bawakid A, Shim SO, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl-Based Syst 73:1–17. https://doi.org/10.1016/j.knosys.2014.09.002

  6. Anuradha Gupta G (2014) A self explanatory review of decision tree classifiers. ICRAIE. https://doi.org/10.1109/ICRAIE.2014.6909245

  7. Astorino A, Fuduli A, Gaudioso M, Vocaturo E (2019) Multiple instance learning algorithm for medical image classification. SEBD 2400:1–8

  8. Barboza F, Kimura H, Altman E (2017) Machine learning models and bankruptcy prediction. Expert Syst Appl 83:405–417. https://doi.org/10.1016/j.eswa.2017.04.006

  9. Baumgartner R, Somorjai R (2006) Data complexity assessment in undersampled classification of high-dimensional biomedical data. Pattern Recognit Lett 27:1383–1389. https://doi.org/10.1016/j.patrec.2006.01.006

  10. Ben-Israel D, Jacobs W, Casha S, Lang S, Ryu W, de Lotbiniere-Bassett M, Cadotte D (2020) The impact of machine learning on patient care: a systematic review. Artif Intell Med. https://doi.org/10.1016/j.artmed.2019.101785

  11. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(10):281–305

  12. Bernadó-Mansilla E, Ho T (2005) Domain of competence of XCS classifier system in complexity measurement space. IEEE Trans Evol Comput 9(1):82–104. https://doi.org/10.1109/TEVC.2004.840153

  13. Bielza C, Li G, Larrañaga P (2011) Multi-dimensional classification with bayesian networks. Int J Approx Reason 52(6):705–727. https://doi.org/10.1016/j.ijar.2011.01.007

  14. Borchani H, Varando G, Bielza C, Larrañaga P (2015) A survey on multi-output regression. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5(5):216–233. https://doi.org/10.1002/widm.1157

  15. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771. https://doi.org/10.1016/j.patcog.2004.03.009

  16. Cano JR (2013) Analysis of data complexity measures for classification. Expert Syst Appl 40(12):4820–4831. https://doi.org/10.1016/j.eswa.2013.02.025

  17. Carbonneau MA, Cheplygina V, Granger E, Gagnon G (2016) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognit. https://doi.org/10.1016/j.patcog.2017.10.009

  18. Charte D, Charte F, García S, Herrera F (2019) A snapshot on nonstandard supervised learning problems: taxonomy, relationships, problem transformations and algorithm adaptations. Prog Artif Intell 8(1):1–14. https://doi.org/10.1007/s13748-018-00167-7

  19. Charte F, Rivera AJ, del Jesus MJ, Herrera F (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163:3–16. https://doi.org/10.1016/j.neucom.2014.08.091

  20. Cózar J, Fernández A, Herrera F, Gámez JA (2019) A metahierarchical rule decision system to design robust fuzzy classifiers based on data complexity. IEEE Trans Fuzzy Syst 27(4):701–715. https://doi.org/10.1109/TFUZZ.2018.2866967

  21. Das S, Datta S, Chaudhuri BB (2018) Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recognit. 81:674–693. https://doi.org/10.1016/j.patcog.2018.03.008

  22. Diedenhofen B, Musch J (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLOS ONE 10(4):1–12. https://doi.org/10.1371/journal.pone.0121945

  23. Diedenhofen B cocor function | R Documentation. URL https://www.rdocumentation.org/packages/cocor/versions/1.1-3/topics/cocor

  24. Fernández A, Carmona CJ, Del Jesus MJ, Herrera F (2017) A pareto based ensemble with feature and instance selection for learning from multi-class imbalanced datasets. Int J Neural Syst. https://doi.org/10.1142/S0129065717500289

  25. Fernández A, García S, Galar M, Prati R, Krawczyk B, Herrera F (2018) Learning from imbalanced data sets. Springer. https://doi.org/10.1007/978-3-319-98074-4

  26. Feurer M, Hutter F (2019) Hyperparameter optimization. Springer, Berlin. https://doi.org/10.1007/978-3-030-05318-5_1

  27. Galar M, Fernández A, Barrenechea E, Herrera F (2014) Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf Sci 264:135–157. https://doi.org/10.1016/j.ins.2013.12.053

  28. Galar M, Fernández A, Tartas EB, Sola HB, Herrera F (2011) An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recognit 44(8):1761–1776. https://doi.org/10.1016/j.patcog.2011.01.017

  29. Garcia LPF, Carvalho ACPdLFd, Lorena AC (2015) Effect of label noise in the complexity of classification problems. Neurocomputing. https://doi.org/10.1016/j.neucom.2014.10.085

  30. García S, Luengo J, Herrera F (2016) Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowl-Based Syst 98:1–29. https://doi.org/10.1016/j.knosys.2015.12.006

  31. Geng X (2016) Label distribution learning. IEEE Trans Knowl Data Eng 28(7):1734–1748. https://doi.org/10.1109/TKDE.2016.2545658

  32. Gu B, Sheng V, Tay K, Romano W, Li S (2015) Incremental support vector learning for ordinal regression. IEEE Trans Neural Netw Learn Syst 26(7):1403–1416. https://doi.org/10.1109/TNNLS.2014.2342533

  33. Gupta MR, Bengio S, Weston J (2014) Training highly multiclass classifiers. J Mach Learn Res 15(1):1461–1492. https://dl.acm.org/doi/10.5555/2627435.2638582

  34. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186. https://doi.org/10.1023/A:1010920819831

  35. Herrera F, Charte F, Rivera AJ, Jesus MJd (2016) Multilabel classification: problem analysis, metrics and techniques. Springer, Berlin. https://doi.org/10.1007/978-3-319-41111-8

  36. Herrera F, Ventura S, Bello R, Cornelis C, Zafra A, Sánchez-Tarragó D, Vluymans S (2016) Multiple instance learning: foundations and algorithms. Springer, Berlin. https://doi.org/10.1007/978-3-319-47759-6

  37. Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300. https://doi.org/10.1109/34.990132

  38. Hoekstra A, Duin R (1996) On the nonlinearity of pattern classifiers. In: Proceedings of 13th International Conference on Pattern Recognition, vol. 4, pp. 271–275 vol.4. https://doi.org/10.1109/ICPR.1996.547429. ISSN: 1051-4651

  39. Hornik K Weka_classifier_trees function | R Documentation. URL https://www.rdocumentation.org/packages/RWeka/versions/0.4-42/topics/Weka_classifier_trees

  40. Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916. https://doi.org/10.1016/j.artint.2008.08.002

  41. Hutter F, Kotthoff L, Vanschoren J (2019) Automated machine learning - methods, systems challenges. Springer, Berlin

  42. Katakis I, Tsoumakas G, Vlahavas I (2008) Multilabel text classification for automated tag suggestion. Proc. ECML PKDD08 Discovery Challenge p. 9

  43. Krawczyk B, Triguero I, García S, Woźniak M, Herrera F (2019) Instance reduction for one-class classification. Knowl Inf Syst 59(3):601–628. https://doi.org/10.1007/s10115-018-1220-z

  44. Leevy J, Khoshgoftaar T, Bauder R, Seliya N (2018) A survey on addressing high-class imbalance in big data. J Big Data. https://doi.org/10.1186/s40537-018-0151-6

  45. Leyva E, González A, Pérez R (2015) A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans Knowl Data Eng 27(2):354–367. https://doi.org/10.1109/TKDE.2014.2327034

  46. Lorena A, Costa I, Spolaôr N, de Souto M (2012) Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing 75:33–42. https://doi.org/10.1016/j.neucom.2011.03.054

  47. Lorena AC, Garcia LPF, Lehmann J, Souto MCP, Ho TK (2019) How Complex is your classification problem? A survey on measuring classification complexity. ACM Comput Surv 52(5):34. https://doi.org/10.1145/3347711

  48. Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936. https://doi.org/10.1007/s00500-010-0625-8

  49. Luengo J, García-Gil D, Ramírez-Gallego S, García S, Herrera F (2020) Big data preprocessing: enabling smart data. Springer, Berlin. https://doi.org/10.1007/978-3-030-39105-8

  50. Luengo J, Herrera F (2010) Domains of competence of fuzzy rule based classification systems with data complexity measures: A case of study using a fuzzy hybrid genetic based machine learning method. Fuzzy Sets Syst 161(1):3–19. https://doi.org/10.1016/j.fss.2009.04.001

  51. Luengo J, Herrera F (2012) Shared domains of competence of approximate learning models using measures of separability of classes. Inf Sci 185(1):43–65. https://doi.org/10.1016/j.ins.2011.09.022

  52. Luengo J, Herrera F (2015) An automatic extraction method of the domains of competence for learning classifiers using data complexity measures. Knowl Inf Syst 42(1):147–180. https://doi.org/10.1007/s10115-013-0700-4

  53. Luo G (2016) A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw Model Anal Health Inform Bioinform. https://doi.org/10.1007/s13721-016-0125-6

  54. Luque A, Carrasco A, Martín A, de las Heras A (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit 91:216–231. https://doi.org/10.1016/j.patcog.2019.02.023

  55. López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141. https://doi.org/10.1016/j.ins.2013.07.007

  56. Ma Y (2018) Data complexity analysis for software defect detection. Int J Perform Eng. https://doi.org/10.23940/ijpe.18.08.p5.16951704

  57. Manukyan A, Ceyhan E (2016) Classification of Imbalanced Data with a Geometric Digraph Family. J. Mach. Learn. Res. https://dl.acm.org/doi/abs/10.5555/2946645.3053471

  58. Martínez Torres J, Iglesias Comesaña C, García-Nieto PJ (2019) Review: machine learning techniques applied to cybersecurity. Int J Mach Learn Cybern 10(10):2823–2836. https://doi.org/10.1007/s13042-018-00906-1

  59. Mazurowski M, Malof J, Tourassi G (2011) Comparative analysis of instance selection algorithms for instance-based classifiers in the context of medical decision support. Phys Med Biol 56(2):473–489. https://doi.org/10.1088/0031-9155/56/2/012

  60. Meyer D naiveBayes function | R Documentation. URL https://www.rdocumentation.org/packages/e1071/versions/1.7-2/topics/naiveBayes

  61. Morais G, Prati RC (2013) Complex Network Measures for Data Set Characterization. In: 2013 Brazilian Conference on Intelligent Systems, pp. 12–18. https://doi.org/10.1109/BRACIS.2013.11

  62. Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2016) Can classification performance be predicted by complexity measures? a study using microarray data. Knowl Inf Syst. https://doi.org/10.1007/s10115-016-1003-3

  63. Orriols-Puig A, Macia N, Ho TK (2010) Documentation for the data complexity library in C++. Universitat Ramon Llull, La Salle 196:1–40

  64. Prati RC, Luengo J, Herrera F (2019) Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 60:63–97. https://doi.org/10.1007/s10115-018-1244-4

  65. Rodriguez D, Dolado J, Tuya J (2015) Bayesian concepts in software testing: An initial review. In: A-TEST 2015: Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, pp. 41–46. https://doi.org/10.1145/2804322.2804329

  66. Schliep K kknn function | R Documentation. https://www.rdocumentation.org/packages/kknn/versions/1.3.1%20/topics/kknn

  67. Scopus: Document Search. URL https://www.scopus.com/search/form.uri?display=basic

  68. Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA

  69. Singh S (2003) Multiresolution estimates of classification complexity. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2003.1251146

  70. Sun S, Mao L, Dong Z, Wu L (2019) Multiview Machine Learning, 1st edn. Springer, Berlin

  71. Sáez JA, Luengo J, Herrera F (2013) Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognit 46(1):355–364. https://doi.org/10.1016/j.patcog.2012.07.009

  72. Tanwani AK, Farooq M (2010) Classification potential vs. classification accuracy: a comprehensive study of evolutionary algorithms with biomedical datasets. In: Bacardit J, Browne W, Drugowitsch J, Bernadó-Mansilla E, Butz MV (eds) Learning classifier systems. Lecture Notes in Computer Science, pp 127–144. Springer, Berlin. https://doi.org/10.1007/978-3-642-17508-4_9

  73. Triguero I, González S, Moyano JM, García S, Alcalá-Fdez J, Luengo J, Fernández A, Jesús MJd, Sánchez L, Herrera F (2017) KEEL 3.0: an open source software for multi-stage analysis in data mining. Int J Comput Intell Syst 10(1):1238–1249. https://doi.org/10.2991/ijcis.10.1.82

  74. Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70. https://doi.org/10.1016/j.ins.2019.08.062

  75. Wojciechowski S, Wilk S (2017) Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data. Found Comput Decis Sci 42(2):149–176. https://doi.org/10.1515/fcds-2017-0007

  76. Zhao J, Xie X, Xu X, Sun S (2017) Multi-view learning overview: recent progress and new challenges. Inf Fusion 38:43–54. https://doi.org/10.1016/j.inffus.2017.02.007

  77. Zhu X (2005) Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, USA. AAI3179046 ISBN-10: 0542190591

  78. Zou GY (2007) Toward using confidence intervals to compare correlations. Psychol Methods 12(4):399–413. https://doi.org/10.1037/1082-989X.12.4.399

Acknowledgements

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under project TIN2017-89517-P, including European Regional Development Funds, and the Andalusian regional project P18-FR-4961. This work is part of the PRII2018-02 Intensification Program from the University of Granada and the FPU National Program (Ref. FPU17/04069).

Author information

Corresponding author

Correspondence to José Daniel Pascual-Triana.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Description of other complexity metrics

1.1 Feature overlap

This group of metrics assesses how well the individual features discriminate between the classes. If there is at least one feature with low overlap between the classes, the classification should be easier and, thus, obtain better results. The same applies to combinations of features and to regions of the n-dimensional space of each dataset. Five different metrics can be enumerated (a minimal sketch of F1 follows the list):

  • F1: this is the Maximum Fisher’s Discriminant Ratio, which measures how easily the classes can be separated using individual features (columns) of the data. It compares the dispersion within each class with the dispersion between the classes. Higher values indicate less feature overlap and, thus, a less complex dataset.

  • F1v: this is the Directional Vector Maximum Fisher’s Discriminant Ratio from [63]. It complements F1 by searching for the projection vector, and its associated hyperplane, that best separates the classes, instead of considering each feature in isolation. Greater values indicate a lower complexity.

  • F2: this metric estimates the volume of the overlapping region. For each feature, it computes the ratio between the overlap of the class value ranges and the full range of that feature, and the per-feature ratios are multiplied to obtain the overlapping volume ratio. Greater values indicate more overlap, which increases the complexity.

  • F3: this is the Maximum Individual Feature Efficiency, the largest ratio between the number of points that lie outside the overlapping region of a feature and the total number of points. Greater values indicate less complexity.

  • F4: this is the Collective Feature Efficiency from [63]. It is based on an iterative use of F3 over the dataset, each time choosing the most efficient feature and setting aside the non-overlapped points of that feature, until there are no more points or features. F4 indicates the ratio of points that have been discerned over the total number of points, so greater values of F4 signal lower complexity.
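As an illustration, the following is a minimal sketch of F1, assuming the classical multiclass formulation in which the between-class dispersion of each feature is divided by its within-class dispersion and the maximum ratio over all features is reported; the function and variable names are illustrative and do not correspond to any published implementation.

```python
# Minimal sketch of F1 (Maximum Fisher's Discriminant Ratio).
# Assumption: higher values mean better-separated (less complex) data.
import numpy as np

def f1_max_fisher_ratio(X, y):
    """X: (n_samples, n_features) array; y: (n_samples,) class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    ratios = []
    for j in range(X.shape[1]):
        between = 0.0  # dispersion of class means around the overall mean
        within = 0.0   # dispersion of points around their own class mean
        for c in classes:
            xc = X[y == c, j]
            between += xc.size * (xc.mean() - overall_mean[j]) ** 2
            within += ((xc - xc.mean()) ** 2).sum()
        ratios.append(between / within if within > 0 else np.inf)
    return max(ratios)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    print(f1_max_fisher_ratio(X, y))  # large value: well-separated classes
```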

1.2 Linearity

These measures assess how easily the different classes can be separated by hyperplanes, which would lead to easier classification. There are three main metrics in this category (a minimal sketch of L2 follows the list):

  • L1: this is the Sum of the Error Distance by Linear Programming. After fitting the linear classifier, the total error distance of the misclassified points to the separating hyperplane is computed and divided by the total number of points. The larger this ratio, the larger L1 will be, indicating higher complexity.

  • L2: this is the Error Rate of the Linear Classifier, that is, the number of misclassified points divided by the total number of points. The larger L2 is, the more complex the problem.

  • L3: this is the Non-Linearity of a Linear Classifier, from [38]. New points are generated by interpolating pairs of points that share a class, and they are classified by a linear model trained on the original data. L3 is the ratio of misclassified interpolated points. A higher value of L3 indicates more complex boundaries and, thus, a more complex problem.
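A minimal sketch of L2 is given below, assuming a linear support vector machine as the underlying linear classifier (any other linear model could be substituted); the helper name is illustrative.

```python
# Minimal sketch of L2 (Error Rate of the Linear Classifier).
# Assumption: a linear SVM stands in for "the linear classifier".
import numpy as np
from sklearn.svm import LinearSVC

def l2_linear_error_rate(X, y):
    """Fraction of points misclassified by a linear classifier trained on (X, y)."""
    clf = LinearSVC(max_iter=10000).fit(X, y)
    return float(np.mean(clf.predict(X) != y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    print(l2_linear_error_rate(X, y))  # overlapping classes: non-negligible error
```

In the same spirit, L3 would reuse such a fitted linear model, but evaluate it on points interpolated within each class rather than on the training points themselves.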

1.3 Dimensionality

This set of measures reflects the data sparsity that can arise from high dimensionality. When a dataset has low-density or even empty regions, the model might fail to correctly classify new data that falls there. Three metrics stand out (a minimal sketch follows the list):

  • T2: this is the Average Number of Features per Dimension, that is, the number of features of the dataset divided by the number of points. Greater values indicate fewer points per feature, which causes sparsity and a higher complexity.

  • T3: this is the Average Number of PCA Dimensions per Point, that is, the number of attributes selected by PCA divided by the number of points of the dataset. Greater values signal a higher complexity.

  • T4: this is the Ratio of the PCA Dimension to the Original Dimension, that is, the number of attributes selected by PCA divided by the original dimensionality. Greater values indicate that more features are necessary to explain the data variability and, usually, a higher complexity.
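A minimal sketch of the three measures follows, under the assumption that the "PCA dimension" is the number of principal components needed to retain 95% of the variance; the threshold is an illustrative choice rather than a value prescribed by the original metrics.

```python
# Minimal sketch of the dimensionality metrics T2, T3 and T4.
# Assumption: "PCA dimension" = components retaining 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

def dimensionality_metrics(X, variance_threshold=0.95):
    n_points, n_features = X.shape
    pca = PCA(n_components=variance_threshold).fit(X)
    pca_dim = pca.n_components_          # components kept by the threshold
    t2 = n_features / n_points           # features per point
    t3 = pca_dim / n_points              # PCA dimensions per point
    t4 = pca_dim / n_features            # PCA dimension over original dimension
    return t2, t3, t4

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    print(dimensionality_metrics(X))
```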

1.4 Class balance

These metrics measure the differences in the number of elements of each class, which could favour the predominant class during classification. The two most common metrics are listed below (a minimal sketch follows the list):

  • C1: this is the Entropy of Class Proportions. The higher the value (closer to 1), the more balanced the dataset, which usually indicates a lower complexity.

  • C2: this is the Imbalance Ratio, using the multiclass modification in [72]. It takes the value 0 for balanced problems, and higher values (up to 1) indicate more imbalance.
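A minimal sketch of both measures is shown below, assuming the normalised entropy formulation for C1 and, for C2, the multiclass imbalance ratio of [72] rescaled as 1 - 1/IR so that 0 corresponds to a balanced problem; the function name is illustrative.

```python
# Minimal sketch of the class-balance metrics C1 and C2.
# Assumptions: C1 is the entropy of class proportions normalised by log(n_classes);
# C2 = 1 - 1/IR, with IR the multiclass imbalance ratio.
import numpy as np

def class_balance_metrics(y):
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    n, nc = counts.sum(), counts.size
    p = counts / n
    c1 = float(-(p * np.log(p)).sum() / np.log(nc))       # 1 = perfectly balanced
    ir = (nc - 1) / nc * float((counts / (n - counts)).sum())
    c2 = 1.0 - 1.0 / ir                                    # 0 = balanced
    return c1, c2

if __name__ == "__main__":
    y = np.array([0] * 90 + [1] * 10)   # strongly imbalanced binary problem
    print(class_balance_metrics(y))     # low C1, high C2
```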

1.5 Network properties

These metrics study the properties of a graph built from the data, using the distances between data points to generate it. To this end, each point becomes a node, and the procedures in [29, 61] and [77] are followed to decide the edges, which only join nearby points that belong to the same class. The three basic metrics are the following (a minimal sketch follows the list):

  • Density: this is the Average Density of the Network, obtained as the ratio between the number of edges of the graph and the maximum possible number of edges for that graph. The more edges, the lower the complexity.

  • Clustering Coefficient: this metric is the mean, over all points, of the ratio between the number of edges among each point’s neighbours and the maximum possible number of edges among them. It signals the tendency to form cliques. The higher the value, the more complex the dataset.

  • Hubs: this is the Mean Hub Score of the graph. The hub score measures the importance of each node, computed iteratively from both its connections and their hub scores. The higher the value, the more complex the dataset.
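The sketch below builds one such graph and computes the three measures, assuming an epsilon-neighbourhood graph over min-max-normalised Euclidean distances (epsilon = 0.15 is an illustrative threshold, not a value fixed by the text) from which edges joining points of different classes are discarded; networkx supplies the density, clustering coefficient and HITS hub scores.

```python
# Minimal sketch of the network measures: density, clustering coefficient
# and mean hub score. Assumptions: epsilon-neighbourhood graph on scaled
# Euclidean distances, same-class edges only, epsilon = 0.15.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def network_measures(X, y, eps=0.15):
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span                       # scale features to [0, 1]
    dist = squareform(pdist(Xs)) / np.sqrt(Xs.shape[1])   # distances scaled to [0, 1]
    g = nx.Graph()
    g.add_nodes_from(range(len(y)))
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if dist[i, j] < eps and y[i] == y[j]:         # keep close same-class pairs
                g.add_edge(i, j)
    if g.number_of_edges() == 0:                          # degenerate graph: no structure
        return 0.0, 0.0, 0.0
    density = nx.density(g)
    clustering = nx.average_clustering(g)
    hub_scores, _ = nx.hits(g)                            # HITS hub scores per node
    return density, clustering, float(np.mean(list(hub_scores.values())))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
    y = np.array([0] * 60 + [1] * 60)
    print(network_measures(X, y))
```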

About this article

Cite this article

Pascual-Triana, J.D., Charte, D., Andrés Arroyo, M. et al. Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect. Knowl Inf Syst 63, 1961–1989 (2021). https://doi.org/10.1007/s10115-021-01577-1

