
Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect


Abstract

Data Science and Machine Learning have become fundamental assets for companies and research institutions alike. As one of their fields, supervised classification allows for class prediction of new samples, learning from given training data. However, some properties can make datasets problematic to classify. In order to evaluate a dataset a priori, data complexity metrics have been used extensively. They provide information regarding different intrinsic characteristics of the data, which serves to evaluate classifier compatibility and to suggest a course of action that improves performance. However, most complexity metrics focus on just one characteristic of the data, which can be insufficient to properly evaluate the dataset with respect to classifier performance. In fact, class overlap, a very detrimental feature for the classification process (especially when imbalance among class labels is also present), is hard to assess. This research work focuses on revisiting complexity metrics based on data morphology. In accordance with their nature, the premise is that they provide both good estimates of class overlap and strong correlations with classification performance. For that purpose, a novel family of metrics has been developed. Being based on ball coverage by classes, they are named Overlap Number of Balls metrics. Finally, some prospects for the adaptation of this family of metrics to singular (more complex) problems are discussed.



Notes

  1. https://github.com/jdpastri/morphology-metrics.

  2. https://github.com/jdpastri/morphology-metrics.

References

  1. Aggarwal C (2014) Data classification: algorithms and applications. Chapman & Hall/CRC. https://doi.org/10.1201/b17320

  2. Ahmed M (2019) Data summarization: a survey. Knowl Information Syst 58(2):249–273. https://doi.org/10.1007/s10115-018-1183-0

  3. Alejo R, Valdovinos RM, García V, Pacheco-Sanchez JH (2013) A hybrid method to face class overlap and class imbalance on neural networks and multi-class scenarios. Pattern Recognit Lett 34(4):380–388. https://doi.org/10.1016/j.patrec.2012.09.003

  4. Alpaydin E (2016) Machine learning: the new AI. MIT Press, Cambridge

  5. Alshomrani S, Bawakid A, Shim SO, Fernández A, Herrera F (2015) A proposal for evolutionary fuzzy systems using feature weighting: dealing with overlapping in imbalanced datasets. Knowl-Based Syst 73:1–17. https://doi.org/10.1016/j.knosys.2014.09.002

  6. Anuradha Gupta G (2014) A self explanatory review of decision tree classifiers. ICRAIE. https://doi.org/10.1109/ICRAIE.2014.6909245

  7. Astorino A, Fuduli A, Gaudioso M, Vocaturo E (2019) Multiple instance learning algorithm for medical image classification. SEBD 2400:1–8

  8. Barboza F, Kimura H, Altman E (2017) Machine learning models and bankruptcy prediction. Expert Syst Appl 83:405–417. https://doi.org/10.1016/j.eswa.2017.04.006

  9. Baumgartner R, Somorjai R (2006) Data complexity assessment in undersampled classification of high-dimensional biomedical data. Pattern Recognit Lett 27:1383–1389. https://doi.org/10.1016/j.patrec.2006.01.006

  10. Ben-Israel D, Jacobs W, Casha S, Lang S, Ryu W, de Lotbiniere-Bassett M, Cadotte D (2020) The impact of machine learning on patient care: a systematic review. Artif Intell Med. https://doi.org/10.1016/j.artmed.2019.101785

  11. Bergstra J, Bengio Y (2012) Random search for hyper-parameter optimization. J Mach Learn Res 13(10):281–305

  12. Bernadó-Mansilla E, Ho T (2005) Domain of competence of XCS classifier system in complexity measurement space. IEEE Trans Evol Comput 9(1):82–104. https://doi.org/10.1109/TEVC.2004.840153

  13. Bielza C, Li G, Larrañaga P (2011) Multi-dimensional classification with bayesian networks. Int J Approx Reason 52(6):705–727. https://doi.org/10.1016/j.ijar.2011.01.007

  14. Borchani H, Varando G, Bielza C, Larrañaga P (2015) A survey on multi-output regression. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 5(5):216–233. https://doi.org/10.1002/widm.1157

  15. Boutell MR, Luo J, Shen X, Brown CM (2004) Learning multi-label scene classification. Pattern Recognit 37(9):1757–1771. https://doi.org/10.1016/j.patcog.2004.03.009

  16. Cano JR (2013) Analysis of data complexity measures for classification. Expert Syst Appl 40(12):4820–4831. https://doi.org/10.1016/j.eswa.2013.02.025

  17. Carbonneau MA, Cheplygina V, Granger E, Gagnon G (2016) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognit. https://doi.org/10.1016/j.patcog.2017.10.009

  18. Charte D, Charte F, García S, Herrera F (2019) A snapshot on nonstandard supervised learning problems: taxonomy, relationships, problem transformations and algorithm adaptations. Prog Artif Intell 8(1):1–14. https://doi.org/10.1007/s13748-018-00167-7

  19. Charte F, Rivera AJ, del Jesus MJ, Herrera F (2015) Addressing imbalance in multilabel classification: measures and random resampling algorithms. Neurocomputing 163:3–16. https://doi.org/10.1016/j.neucom.2014.08.091

  20. Cózar J, Fernández A, Herrera F, Gámez JA (2019) A metahierarchical rule decision system to design robust fuzzy classifiers based on data complexity. IEEE Trans Fuzzy Syst 27(4):701–715. https://doi.org/10.1109/TFUZZ.2018.2866967

  21. Das S, Datta S, Chaudhuri BB (2018) Handling data irregularities in classification: foundations, trends, and future challenges. Pattern Recognit. 81:674–693. https://doi.org/10.1016/j.patcog.2018.03.008

  22. Diedenhofen B, Musch J (2015) cocor: a comprehensive solution for the statistical comparison of correlations. PLOS ONE 10(4):1–12. https://doi.org/10.1371/journal.pone.0121945

  23. Diedenhofen B cocor function | R Documentation. URL https://www.rdocumentation.org/packages/cocor/versions/1.1-3/topics/cocor

  24. Fernández A, Carmona CJ, Del Jesus MJ, Herrera F (2017) A pareto based ensemble with feature and instance selection for learning from multi-class imbalanced datasets. Int J Neural Syst. https://doi.org/10.1142/S0129065717500289

  25. Fernández A, García S, Galar M, Prati R, Krawczyk B, Herrera F (2018) Learning from imbalanced data sets. Springer. https://doi.org/10.1007/978-3-319-98074-4

  26. Feurer M, Hutter F (2019) Hyperparameter optimization. Springer, Berlin. https://doi.org/10.1007/978-3-030-05318-5_1

  27. Galar M, Fernández A, Barrenechea E, Herrera F (2014) Empowering difficult classes with a similarity-based aggregation in multi-class classification problems. Inf Sci 264:135–157. https://doi.org/10.1016/j.ins.2013.12.053

  28. Galar M, Fernández A, Tartas EB, Sola HB, Herrera F (2011) An overview of ensemble methods for binary classifiers in multi-class problems: experimental study on one-vs-one and one-vs-all schemes. Pattern Recognit 44(8):1761–1776. https://doi.org/10.1016/j.patcog.2011.01.017

  29. Garcia LPF, Carvalho ACPdLFd, Lorena AC (2015) Effect of label noise in the complexity of classification problems. Neurocomputing. https://doi.org/10.1016/j.neucom.2014.10.085

  30. García S, Luengo J, Herrera F (2016) Tutorial on practical tips of the most influential data preprocessing algorithms in data mining. Knowl-Based Syst 98:1–29. https://doi.org/10.1016/j.knosys.2015.12.006

  31. Geng X (2016) Label distribution learning. IEEE Trans Knowl Data Eng 28(7):1734–1748. https://doi.org/10.1109/TKDE.2016.2545658

  32. Gu B, Sheng V, Tay K, Romano W, Li S (2015) Incremental support vector learning for ordinal regression. IEEE Trans Neural Netw Learn Syst 26(7):1403–1416. https://doi.org/10.1109/TNNLS.2014.2342533

  33. Gupta MR, Bengio S, Weston J (2014) Training highly multiclass classifiers. J Mach Learn Res 15(1):1461–1492. https://dl.acm.org/doi/10.5555/2627435.2638582

  34. Hand DJ, Till RJ (2001) A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn 45(2):171–186. https://doi.org/10.1023/A:1010920819831

  35. Herrera F, Charte F, Rivera AJ, Jesus MJd (2016) Multilabel classification: problem analysis, metrics and techniques. Springer, Berlin. https://doi.org/10.1007/978-3-319-41111-8

  36. Herrera F, Ventura S, Bello R, Cornelis C, Zafra A, Sánchez-Tarragó D, Vluymans S (2016) Multiple instance learning: foundations and algorithms. Springer, Berlin. https://doi.org/10.1007/978-3-319-47759-6

  37. Ho TK, Basu M (2002) Complexity measures of supervised classification problems. IEEE Trans Pattern Anal Mach Intell 24(3):289–300. https://doi.org/10.1109/34.990132

  38. Hoekstra A, Duin R (1996) On the nonlinearity of pattern classifiers. In: Proceedings of 13th International Conference on Pattern Recognition, vol. 4, pp. 271–275 vol.4. https://doi.org/10.1109/ICPR.1996.547429. ISSN: 1051-4651

  39. Hornik K Weka_classifier_trees function | R Documentation. URL https://www.rdocumentation.org/packages/RWeka/versions/0.4-42/topics/Weka_classifier_trees

  40. Hüllermeier E, Fürnkranz J, Cheng W, Brinker K (2008) Label ranking by learning pairwise preferences. Artif Intell 172(16–17):1897–1916. https://doi.org/10.1016/j.artint.2008.08.002

  41. Hutter F, Kotthoff L, Vanschoren J (2019) Automated machine learning - methods, systems challenges. Springer, Berlin

  42. Katakis I, Tsoumakas G, Vlahavas I (2008) Multilabel text classification for automated tag suggestion. Proc. ECML PKDD08 Discovery Challenge p. 9

  43. Krawczyk B, Triguero I, García S, Woźniak M, Herrera F (2019) Instance reduction for one-class classification. Knowl Inf Syst 59(3):601–628. https://doi.org/10.1007/s10115-018-1220-z

  44. Leevy J, Khoshgoftaar T, Bauder R, Seliya N (2018) A survey on addressing high-class imbalance in big data. J Big Data. https://doi.org/10.1186/s40537-018-0151-6

  45. Leyva E, González A, Pérez R (2015) A set of complexity measures designed for applying meta-learning to instance selection. IEEE Trans Knowl Data Eng 27(2):354–367. https://doi.org/10.1109/TKDE.2014.2327034

  46. Lorena A, Costa I, Spolaôr N, de Souto M (2012) Analysis of complexity indices for classification problems: cancer gene expression data. Neurocomputing 75:33–42. https://doi.org/10.1016/j.neucom.2011.03.054

  47. Lorena AC, Garcia LPF, Lehmann J, Souto MCP, Ho TK (2019) How Complex is your classification problem? A survey on measuring classification complexity. ACM Comput Surv 52(5):34. https://doi.org/10.1145/3347711

  48. Luengo J, Fernández A, García S, Herrera F (2011) Addressing data complexity for imbalanced data sets: analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Comput 15(10):1909–1936. https://doi.org/10.1007/s00500-010-0625-8

  49. Luengo J, García-Gil D, Ramírez-Gallego S, García S, Herrera F (2020) Big data preprocessing: enabling smart data. Springer, Berlin. https://doi.org/10.1007/978-3-030-39105-8

  50. Luengo J, Herrera F (2010) Domains of competence of fuzzy rule based classification systems with data complexity measures: A case of study using a fuzzy hybrid genetic based machine learning method. Fuzzy Sets Syst 161(1):3–19. https://doi.org/10.1016/j.fss.2009.04.001

  51. Luengo J, Herrera F (2012) Shared domains of competence of approximate learning models using measures of separability of classes. Inf Sci 185(1):43–65. https://doi.org/10.1016/j.ins.2011.09.022

  52. Luengo J, Herrera F (2015) An automatic extraction method of the domains of competence for learning classifiers using data complexity measures. Knowl Inf Syst 42(1):147–180. https://doi.org/10.1007/s10115-013-0700-4

  53. Luo G (2016) A review of automatic selection methods for machine learning algorithms and hyper-parameter values. Netw Model Anal Health Inform Bioinform. https://doi.org/10.1007/s13721-016-0125-6

  54. Luque A, Carrasco A, Martín A, de las Heras A (2019) The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit 91:216–231. https://doi.org/10.1016/j.patcog.2019.02.023

  55. López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141. https://doi.org/10.1016/j.ins.2013.07.007

  56. Ma Y (2018) Data complexity analysis for software defect detection. Int J Perform Eng. https://doi.org/10.23940/ijpe.18.08.p5.16951704

  57. Manukyan A, Ceyhan E (2016) Classification of Imbalanced Data with a Geometric Digraph Family. J. Mach. Learn. Res. https://dl.acm.org/doi/abs/10.5555/2946645.3053471

  58. Martínez Torres J, Iglesias Comesaña C, García-Nieto PJ (2019) Review: machine learning techniques applied to cybersecurity. Int J Mach Learn Cybern 10(10):2823–2836. https://doi.org/10.1007/s13042-018-00906-1

  59. Mazurowski M, Malof J, Tourassi G (2011) Comparative analysis of instance selection algorithms for instance-based classifiers in the context of medical decision support. Phys Med Biol 56(2):473–489. https://doi.org/10.1088/0031-9155/56/2/012

  60. Meyer D naiveBayes function | R Documentation. URL https://www.rdocumentation.org/packages/e1071/versions/1.7-2/topics/naiveBayes

  61. Morais G, Prati RC (2013) Complex Network Measures for Data Set Characterization. In: 2013 Brazilian Conference on Intelligent Systems, pp. 12–18. https://doi.org/10.1109/BRACIS.2013.11

  62. Morán-Fernández L, Bolón-Canedo V, Alonso-Betanzos A (2016) Can classification performance be predicted by complexity measures? a study using microarray data. Knowl Inf Syst. https://doi.org/10.1007/s10115-016-1003-3

  63. Orriols-Puig A, Macia N, Ho TK (2010) Documentation for the data complexity library in C++. Universitat Ramon Llull, La Salle 196:1–40

  64. Prati RC, Luengo J, Herrera F (2019) Emerging topics and challenges of learning from noisy data in nonstandard classification: a survey beyond binary class noise. Knowl Inf Syst 60:63–97. https://doi.org/10.1007/s10115-018-1244-4

  65. Rodriguez D, Dolado J, Tuya J (2015) Bayesian concepts in software testing: An initial review. In: A-TEST 2015: Proceedings of the 6th International Workshop on Automating Test Case Design, Selection and Evaluation, pp. 41–46. https://doi.org/10.1145/2804322.2804329

  66. Schliep K kknn function | R Documentation. https://www.rdocumentation.org/packages/kknn/versions/1.3.1%20/topics/kknn

  67. Scopus: Document Search. URL https://www.scopus.com/search/form.uri?display=basic

  68. Shalev-Shwartz S, Ben-David S (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, USA

  69. Singh S (2003) Multiresolution estimates of classification complexity. IEEE Trans Pattern Anal Mach Intell. https://doi.org/10.1109/TPAMI.2003.1251146

  70. Sun S, Mao L, Dong Z, Wu L (2019) Multiview Machine Learning, 1st edn. Springer, Berlin

  71. Sáez JA, Luengo J, Herrera F (2013) Predicting noise filtering efficacy with data complexity measures for nearest neighbor classification. Pattern Recognit 46(1):355–364. https://doi.org/10.1016/j.patcog.2012.07.009

  72. Tanwani AK, Farooq M (2010) Classification potential vs. classification accuracy: a comprehensive study of evolutionary algorithms with biomedical datasets. In: Bacardit J, Browne W, Drugowitsch J, Bernadó-Mansilla E, Butz MV (eds) Learning classifier systems. Lecture Notes in Computer Science, pp 127–144. Springer, Berlin. https://doi.org/10.1007/978-3-642-17508-4_9

  73. Triguero I, González S, Moyano JM, García S, Alcalá-Fdez J, Luengo J, Fernández A, Jesús MJd, Sánchez L, Herrera F (2017) KEEL 3.0: an open source software for multi-stage analysis in data mining. Int J Comput Intell Syst 10(1):1238–1249. https://doi.org/10.2991/ijcis.10.1.82

  74. Vuttipittayamongkol P, Elyan E (2020) Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inf Sci 509:47–70. https://doi.org/10.1016/j.ins.2019.08.062

  75. Wojciechowski S, Wilk S (2017) Difficulty factors and preprocessing in imbalanced data sets: an experimental study on artificial data. Found Comput Decis Sci 42(2):149–176. https://doi.org/10.1515/fcds-2017-0007

  76. Zhao J, Xie X, Xu X, Sun S (2017) Multi-view learning overview: recent progress and new challenges. Inf Fusion 38:43–54. https://doi.org/10.1016/j.inffus.2017.02.007

  77. Zhu X (2005) Semi-supervised learning with graphs. PhD thesis, Carnegie Mellon University, USA. AAI3179046 ISBN-10: 0542190591

  78. Zou GY (2007) Toward using confidence intervals to compare correlations. Psychol Methods 12(4):399–413. https://doi.org/10.1037/1082-989X.12.4.399

Acknowledgements

This work has been partially supported by the Spanish Ministry of Economy and Competitiveness under project TIN2017-89517-P, including European Regional Development Funds, and the Andalusian regional project P18-FR-4961. This work is part of the PRII2018-02 Intensification Program from the University of Granada and the FPU National Program (Ref. FPU17/04069).

Author information

Corresponding author

Correspondence to José Daniel Pascual-Triana.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Description of other complexity metrics

1.1 Feature overlap

This group of metrics assesses how well the individual features discriminate between the classes. If there is at least one feature with low overlap between the classes, the classification should be easier and, thus, obtain better results. The same applies to combinations of features and to regions of the n-dimensional space of each dataset. Five different metrics can be enumerated (a minimal sketch of F1 follows the list):

  • F1: this is the Maximum Fisher’s Discriminant Ratio, which measures how easily the classes can be separated using individual features (columns) of the data. It compares the dispersion within each class with the dispersion between the classes. Higher values indicate less feature overlap and, thus, a less complex dataset.

  • F1v: this is the Directional Vector Maximum Fisher’s Discriminant Ratio from [63]. It complements F1 by searching for the projection vector, and its associated hyperplane, that best separates the classes, instead of considering each feature in isolation. Greater values indicate a lower complexity.

  • F2: this metric estimates the volume of the overlapping region. For each feature, it computes the ratio between the overlap of the class value ranges and the full range of that feature, and the per-feature ratios are multiplied to obtain the overlapping volume ratio. Greater values indicate more overlap, which increases the complexity.

  • F3: this is the Maximum Individual Feature Efficiency, the largest ratio between the number of points that lie outside the overlapping region of a feature and the total number of points. Greater values indicate less complexity.

  • F4: this is the Collective Feature Efficiency from [63]. It is based on an iterative use of F3 over the dataset, each time choosing the most efficient feature and setting aside the non-overlapped points of that feature, until there are no more points or features. F4 indicates the ratio of points that have been discerned over the total number of points, so greater values of F4 signal lower complexity.
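As an illustration, the following is a minimal sketch of F1, assuming the classical multiclass formulation in which the between-class dispersion of each feature is divided by its within-class dispersion and the maximum ratio over all features is reported; the function and variable names are illustrative and do not correspond to any published implementation.

```python
# Minimal sketch of F1 (Maximum Fisher's Discriminant Ratio).
# Assumption: higher values mean better-separated (less complex) data.
import numpy as np

def f1_max_fisher_ratio(X, y):
    """X: (n_samples, n_features) array; y: (n_samples,) class labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    ratios = []
    for j in range(X.shape[1]):
        between = 0.0  # dispersion of class means around the overall mean
        within = 0.0   # dispersion of points around their own class mean
        for c in classes:
            xc = X[y == c, j]
            between += xc.size * (xc.mean() - overall_mean[j]) ** 2
            within += ((xc - xc.mean()) ** 2).sum()
        ratios.append(between / within if within > 0 else np.inf)
    return max(ratios)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)
    print(f1_max_fisher_ratio(X, y))  # large value: well-separated classes
```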

1.2 Linearity

These measures assess how easily the different classes can be separated by hyperplanes, which would lead to easier classification. There are three main metrics in this category (a minimal sketch of L2 follows the list):

  • L1: this is the Sum of the Error Distance by Linear Programming. After fitting the linear classifier, the total error distance of the misclassified points to the separating hyperplane is computed and divided by the total number of points. The larger this ratio, the larger L1 will be, indicating higher complexity.

  • L2: this is the Error Rate of the Linear Classifier, that is, the number of misclassified points divided by the total number of points. The larger L2 is, the more complex the problem.

  • L3: this is the Non-Linearity of a Linear Classifier, from [38]. New points are generated by interpolating pairs of points that share a class, and they are classified by a linear model trained on the original data. L3 is the ratio of misclassified interpolated points. A higher value of L3 indicates more complex boundaries and, thus, a more complex problem.
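A minimal sketch of L2 is given below, assuming a linear support vector machine as the underlying linear classifier (any other linear model could be substituted); the helper name is illustrative.

```python
# Minimal sketch of L2 (Error Rate of the Linear Classifier).
# Assumption: a linear SVM stands in for "the linear classifier".
import numpy as np
from sklearn.svm import LinearSVC

def l2_linear_error_rate(X, y):
    """Fraction of points misclassified by a linear classifier trained on (X, y)."""
    clf = LinearSVC(max_iter=10000).fit(X, y)
    return float(np.mean(clf.predict(X) != y))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)
    print(l2_linear_error_rate(X, y))  # overlapping classes: non-negligible error
```

In the same spirit, L3 would reuse such a fitted linear model, but evaluate it on points interpolated within each class rather than on the training points themselves.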

1.3 Dimensionality

This set of measures reflects the data sparsity that can arise from high dimensionality. When a dataset has low-density or even empty regions, the model might fail to correctly classify new data that falls there. Three metrics stand out (a minimal sketch follows the list):

  • T2: this is the Average Number of Features per Dimension, that is, the number of features of the dataset divided by the number of points. Greater values indicate fewer points per feature, which causes sparsity and a higher complexity.

  • T3: this is the Average Number of PCA Dimensions per Point, that is, the number of attributes selected by PCA divided by the number of points of the dataset. Greater values signal a higher complexity.

  • T4: this is the Ratio of the PCA Dimension to the Original Dimension, that is, the number of attributes selected by PCA divided by the original dimensionality. Greater values indicate that more features are necessary to explain the data variability and, usually, a higher complexity.
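A minimal sketch of the three measures follows, under the assumption that the "PCA dimension" is the number of principal components needed to retain 95% of the variance; the threshold is an illustrative choice rather than a value prescribed by the original metrics.

```python
# Minimal sketch of the dimensionality metrics T2, T3 and T4.
# Assumption: "PCA dimension" = components retaining 95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

def dimensionality_metrics(X, variance_threshold=0.95):
    n_points, n_features = X.shape
    pca = PCA(n_components=variance_threshold).fit(X)
    pca_dim = pca.n_components_          # components kept by the threshold
    t2 = n_features / n_points           # features per point
    t3 = pca_dim / n_points              # PCA dimensions per point
    t4 = pca_dim / n_features            # PCA dimension over original dimension
    return t2, t3, t4

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    print(dimensionality_metrics(X))
```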

1.4 Class balance

These metrics measure the differences in the number of elements of each class, which could favour the predominant class during classification. The two most common metrics are listed below (a minimal sketch follows the list):

  • C1: this is the Entropy of Class Proportions. The higher the value (closer to 1), the more balanced the dataset, which usually indicates a lower complexity.

  • C2: this is the Imbalance Ratio, using the multiclass modification in [72]. It takes the value 0 for balanced problems, and higher values (up to 1) indicate more imbalance.
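A minimal sketch of both measures is shown below, assuming the normalised entropy formulation for C1 and, for C2, the multiclass imbalance ratio of [72] rescaled as 1 - 1/IR so that 0 corresponds to a balanced problem; the function name is illustrative.

```python
# Minimal sketch of the class-balance metrics C1 and C2.
# Assumptions: C1 is the entropy of class proportions normalised by log(n_classes);
# C2 = 1 - 1/IR, with IR the multiclass imbalance ratio.
import numpy as np

def class_balance_metrics(y):
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    n, nc = counts.sum(), counts.size
    p = counts / n
    c1 = float(-(p * np.log(p)).sum() / np.log(nc))       # 1 = perfectly balanced
    ir = (nc - 1) / nc * float((counts / (n - counts)).sum())
    c2 = 1.0 - 1.0 / ir                                    # 0 = balanced
    return c1, c2

if __name__ == "__main__":
    y = np.array([0] * 90 + [1] * 10)   # strongly imbalanced binary problem
    print(class_balance_metrics(y))     # low C1, high C2
```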

1.5 Network properties

These metrics study the properties of a graph built from the data, using the distances between data points to generate it. To this end, each point becomes a node, and the procedures in [29, 61] and [77] are followed to decide the edges, which only join nearby points that belong to the same class. The three basic metrics are the following (a minimal sketch follows the list):

  • Density: this is the Average Density of the Network, obtained as the ratio between the number of edges of the graph and the maximum possible number of edges for that graph. The more edges, the lower the complexity.

  • Clustering Coefficient: this metric is the mean, over all points, of the ratio between the number of edges among each point’s neighbours and the maximum possible number of edges among them. It signals the tendency to form cliques. The higher the value, the more complex the dataset.

  • Hubs: this is the Mean Hub Score of the graph. The hub score measures the importance of each node, computed iteratively from both its connections and their hub scores. The higher the value, the more complex the dataset.
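The sketch below builds one such graph and computes the three measures, assuming an epsilon-neighbourhood graph over min-max-normalised Euclidean distances (epsilon = 0.15 is an illustrative threshold, not a value fixed by the text) from which edges joining points of different classes are discarded; networkx supplies the density, clustering coefficient and HITS hub scores.

```python
# Minimal sketch of the network measures: density, clustering coefficient
# and mean hub score. Assumptions: epsilon-neighbourhood graph on scaled
# Euclidean distances, same-class edges only, epsilon = 0.15.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def network_measures(X, y, eps=0.15):
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span                       # scale features to [0, 1]
    dist = squareform(pdist(Xs)) / np.sqrt(Xs.shape[1])   # distances scaled to [0, 1]
    g = nx.Graph()
    g.add_nodes_from(range(len(y)))
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            if dist[i, j] < eps and y[i] == y[j]:         # keep close same-class pairs
                g.add_edge(i, j)
    if g.number_of_edges() == 0:                          # degenerate graph: no structure
        return 0.0, 0.0, 0.0
    density = nx.density(g)
    clustering = nx.average_clustering(g)
    hub_scores, _ = nx.hits(g)                            # HITS hub scores per node
    return density, clustering, float(np.mean(list(hub_scores.values())))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
    y = np.array([0] * 60 + [1] * 60)
    print(network_measures(X, y))
```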

About this article

Cite this article

Pascual-Triana, J.D., Charte, D., Andrés Arroyo, M. et al. Revisiting data complexity metrics based on morphology for overlap and imbalance: snapshot, new overlap number of balls metrics and singular problems prospect. Knowl Inf Syst 63, 1961–1989 (2021). https://doi.org/10.1007/s10115-021-01577-1

