Abstract
Numerous works have applied multi-view clustering algorithms to document clustering. A challenging problem in document clustering is the choice of similarity metric. Existing multi-view document clustering methods mostly rely on two measurements: the cosine similarity (CS) and the Euclidean distance (ED). The former does not consider the magnitude difference (MD) between two vectors. The latter cannot capture the divergence between two vectors that lie at the same ED. In this paper, we propose five new similarity metrics. They overcome the drawbacks of CS and ED by measuring the divergence between documents that have the same ED while taking their magnitudes into account. Furthermore, we propose a multi-view document clustering scheme based on the proposed similarity metrics. First, CS, ED, the triangle's area similarity and sector's area similarity metric, and our five similarity metrics are applied to every view of a dataset to generate the corresponding similarity matrices. Next, we run clustering algorithms on these similarity matrices to evaluate the performance of each single view. Finally, we aggregate these similarity matrices into a unified similarity matrix and apply a spectral clustering algorithm to it to generate the final clusters. The experimental results show that the proposed similarity functions measure the similarity between documents more accurately than the existing metrics, and that the proposed clustering scheme considerably outperforms state-of-the-art algorithms.
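The two shortcomings of CS and ED described above can be illustrated with a minimal sketch. The toy vectors `q`, `x`, and `y` and the helper names are hypothetical, chosen only for illustration; the sketch does not implement any of the five proposed metrics.

```python
import math


def norm(u):
    """Euclidean length (magnitude) of a vector."""
    return math.sqrt(sum(a * a for a in u))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (norm(u) * norm(v))


def euclidean_distance(u, v):
    """Straight-line distance between u and v; ignores their angle."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical toy "document vectors" (assumed values, not from the paper).
q = (1.0, 0.0)  # reference document
x = (2.0, 0.0)  # same direction as q, twice the magnitude
y = (1.0, 1.0)  # same ED from q as x, but a different direction

# CS treats q and x as identical although their magnitudes differ.
cs_qx = cosine_similarity(q, x)
md_qx = abs(norm(q) - norm(x))

# ED places x and y equally far from q although they diverge in angle.
ed_qx = euclidean_distance(q, x)
ed_qy = euclidean_distance(q, y)
cs_qy = cosine_similarity(q, y)

print("CS(q, x) =", cs_qx, "| magnitude difference =", md_qx)
print("ED(q, x) =", ed_qx, "| ED(q, y) =", ed_qy)
print("CS(q, y) =", cs_qy)
```

In this toy case CS cannot separate `x` from `q`, and ED cannot separate `x` from `y`, which is the gap a metric combining angle and magnitude is meant to close.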
Notes
https://github.com/Geovhbn/MLRSSC
Acknowledgements
This work is supported by the National Science Foundation of China (nos. 61772435, 61976182, 61876157), the Fundamental Research Funds for the Central Universities (no. 220710004005040177), and the Sichuan Key R&D project (no. 2020YFG0035).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Diallo, B., Hu, J., Li, T. et al. Multi-view document clustering based on geometrical similarity measurement. Int. J. Mach. Learn. & Cyber. 13, 663–675 (2022). https://doi.org/10.1007/s13042-021-01295-8
DOI: https://doi.org/10.1007/s13042-021-01295-8