Abstract
Most data mining algorithms are designed for traditional data objects, referred to as certain data objects, which carry no uncertainty information and are each represented by a single point. Capturing uncertainty can improve the accuracy of the results that algorithms produce. Uncertainty in data objects can be modeled in different ways; two of the most popular are (1) representing each object by a group of points and (2) representing each object by a probability density function (pdf). Objects modeled in these ways are referred to as uncertain data objects. Fuzzy clustering is a well-established field of research for certain data. Fuzzy clustering algorithms generate degrees of membership for assigning objects to clusters, which gives the flexibility to express that an object can belong to more than one cluster. To the best of our knowledge, only one fuzzy clustering algorithm for uncertain data exists in the literature. That algorithm, however, cannot properly create non-convex clusters, and it therefore performs poorly on uncertain data sets with arbitrary-shaped clusters, that is, clusters that are non-convex, unconventional, and possibly nonlinearly separable. In this paper, we propose a novel fuzzy kernel K-medoids clustering algorithm for uncertain objects that works well on data sets with arbitrary-shaped clusters. Experiments on synthetic and real data show that the proposed algorithm outperforms the competing algorithms: certain fuzzy K-medoids and uncertain fuzzy K-medoids.
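The ideas named in the abstract can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details are not given here. It assumes each uncertain object is a group of points, compares objects with a Gaussian kernel averaged over all point pairs, and alternates a standard fuzzy membership update with a medoid update. The function names and the parameters `gamma`, `m`, and `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def object_kernel(a, b, gamma=1.0):
    """Gaussian kernel averaged over all point pairs of two objects (assumed similarity)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (na, nb) squared distances
    return np.exp(-gamma * sq).mean()

def kernel_distances(objects, gamma=1.0):
    """Squared feature-space distances D[i, j] = K[i, i] + K[j, j] - 2 K[i, j]."""
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            K[i, j] = K[j, i] = object_kernel(objects[i], objects[j], gamma)
    diag = np.diag(K)
    D = diag[:, None] + diag[None, :] - 2.0 * K
    return np.maximum(D, 0.0)  # clip tiny negative values caused by rounding

def fuzzy_memberships(D, medoids, m=2.0, eps=1e-12):
    """Membership degrees with fuzzifier m > 1; each row sums to 1."""
    inv = (D[:, medoids] + eps) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_kernel_k_medoids(objects, k, gamma=1.0, m=2.0, max_iter=100, seed=0):
    """Alternate membership and medoid updates until the medoids stop changing."""
    D = kernel_distances(objects, gamma)
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(objects), size=k, replace=False)
    for _ in range(max_iter):
        U = fuzzy_memberships(D, medoids, m)
        # cost[j, c] = sum_i U[i, c]^m * D[i, j]: cost of using object j as medoid of cluster c
        cost = D.T @ (U ** m)
        new_medoids = cost.argmin(axis=0)
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, fuzzy_memberships(D, medoids, m)
```

Calling `fuzzy_kernel_k_medoids(objects, k=2)` on a list of `(n_points, d)` arrays returns the medoid indices and a membership matrix with one row per object. Like other medoid-based methods, this sketch can stop at a local optimum that depends on the random initialization.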
Acknowledgements
The authors would like to thank the editor and anonymous reviewers for their valuable comments and suggestions which helped to improve the quality of this paper.
Cite this article
Tavakkol, B., Son, Y. Fuzzy kernel K-medoids clustering algorithm for uncertain data objects. Pattern Anal Applic 24, 1287–1302 (2021). https://doi.org/10.1007/s10044-021-00983-z
DOI: https://doi.org/10.1007/s10044-021-00983-z