Abstract
In this paper, multivalued data, or multiple-valued variables, are defined. They typically arise when there is some intrinsic uncertainty in data production, as a result of imprecise measuring instruments, as in image recognition, human judgments and so on. So far, contributions in the symbolic data analysis literature have provided data preprocessing criteria that allow the use of standard methods such as factorial analysis, clustering, discriminant analysis and tree-based methods. As an alternative, this paper introduces a methodology for supervised classification, the so-called Dynamic CLASSification TREE (D-CLASS TREE), which deals simultaneously with both standard and multivalued data. To this end, an innovative partitioning criterion and a tree-growing algorithm are defined. The main result is a dynamic tree structure characterized by the simultaneous presence of binary and ternary partitions. A real-world case study is considered to show the advantages of the proposed methodology and the main issues in interpreting the final results. A comparative study with other approaches dealing with the same types of data is also presented. The comparison highlights that, even though the results are quite similar in terms of error rates, the proposed D-CLASS TREE returns a more interpretable tree-based structure.
Acknowledgements
The authors would like to thank Prof. A. Baroni of the Campania University “Luigi Vanvitelli” (Italy) for kindly providing the skin lesions data set, and two anonymous reviewers whose comments greatly contributed to improving the quality of the manuscript.
Cite this article
Aria, M., D’Ambrosio, A., Iorio, C. et al. Dynamic recursive tree-based partitioning for malignant melanoma identification in skin lesion dermoscopic images. Stat Papers 61, 1645–1661 (2020). https://doi.org/10.1007/s00362-018-0997-x
DOI: https://doi.org/10.1007/s00362-018-0997-x