Multi-view document clustering based on geometrical similarity measurement

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Many studies have applied multi-view clustering algorithms to document clustering. A central challenge in document clustering is the choice of similarity metric. Existing multi-view document clustering methods mostly rely on two measures: the cosine similarity (CS) and the Euclidean distance (ED). The first does not consider the magnitude difference (MD) between the two vectors. The second cannot capture the divergence between two vectors that share the same ED. In this paper, we propose five new similarity metrics. They overcome the drawbacks of CS and ED by measuring the divergence between documents with the same ED while taking their magnitudes into account. Furthermore, we propose a multi-view document clustering scheme based on the proposed similarity metrics. First, CS, ED, the triangle’s area similarity and sector’s area similarity metric, and our five similarity metrics are applied to every view of a dataset to generate a corresponding similarity matrix. Next, we run clustering algorithms on these similarity matrices to evaluate single-view performance. Finally, we aggregate these similarity matrices into a unified similarity matrix and apply a spectral clustering algorithm to it to generate the final clusters. The experimental results show that the proposed similarity functions measure the similarity between documents more accurately than the existing metrics, and that the proposed clustering scheme considerably outperforms state-of-the-art algorithms.
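The two limitations named in the abstract, and the overall shape of the pipeline, can be illustrated with a minimal sketch. It is not the authors' implementation: the five proposed metrics are not defined in the abstract and are not reproduced here, plain cosine similarity stands in for the per-view metric, the element-wise mean is an assumed aggregation rule, and a two-way split on the Fiedler vector is a simple stand-in for the spectral clustering step. All function names and the toy data are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; insensitive to vector magnitude."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Straight-line distance between a and b; insensitive to the angle."""
    return float(np.linalg.norm(a - b))

# Limitation of CS: b is a scaled copy of a, so CS is (approximately) 1
# even though the magnitudes differ by a factor of 3.
a = np.array([1.0, 2.0])
b = 3.0 * a

# Limitation of ED: p and q are both at distance 1 from r,
# but p points in the same direction as r while q is at 45 degrees.
r = np.array([1.0, 0.0])
p = np.array([2.0, 0.0])
q = np.array([1.0, 1.0])

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows (documents) of one view."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def unified_similarity(views):
    """ASSUMPTION: aggregate the per-view matrices by their element-wise mean."""
    return np.mean([cosine_similarity_matrix(V) for V in views], axis=0)

def spectral_bipartition(S):
    """Two-cluster spectral split using the sign of the Fiedler vector
    of the unnormalized graph Laplacian L = D - S."""
    L = np.diag(S.sum(axis=1)) - S
    _, vecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)

# Toy data: 4 documents seen through 2 views (hypothetical feature values).
view1 = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
view2 = np.array([[1.0, 0.0, 0.1], [1.0, 0.1, 0.0],
                  [0.0, 1.0, 1.0], [0.1, 1.0, 0.9]])

S = unified_similarity([view1, view2])
labels = spectral_bipartition(S)

print("CS(a, b) =", cosine_similarity(a, b), " ED(a, b) =", euclidean_distance(a, b))
print("ED(r, p) =", euclidean_distance(r, p), " ED(r, q) =", euclidean_distance(r, q))
print("CS(r, p) =", cosine_similarity(r, p), " CS(r, q) =", cosine_similarity(r, q))
print("cluster labels:", labels)
```

In this toy setting, documents 0 and 1 fall in one cluster and documents 2 and 3 in the other; which cluster receives label 0 depends on the sign of the eigenvector and is arbitrary. For more than two clusters, a full spectral clustering routine would replace `spectral_bipartition`.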

Notes

  1. https://sites.google.com/site/fawadsyed/

  2. https://github.com/taki0112/

  3. https://linqs-data.soe.ucsc.edu/public/lbc/citeseer.tgz

  4. http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz

  5. http://membres-lig.imag.fr/grimal/data/Cornell.tar.gz

  6. http://membres-lig.imag.fr/grimal/data/Texas.tar.gz

  7. http://lig-membres.imag.fr/grimal/data/Washington.tar.gz

  8. http://lig-membres.imag.fr/grimal/data/Wisconsin.tar.gz

  9. https://github.com/Geovhbn/MLRSSC

Acknowledgements

This work is supported by the National Science Foundation of China (nos. 61772435, 61976182, 61876157), the Fundamental Research Funds for the Central Universities (no. 220710004005040177), and the Sichuan Key R&D project (no. 2020YFG0035).

Author information

Corresponding author

Correspondence to Jie Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Diallo, B., Hu, J., Li, T. et al. Multi-view document clustering based on geometrical similarity measurement. Int. J. Mach. Learn. & Cyber. 13, 663–675 (2022). https://doi.org/10.1007/s13042-021-01295-8

  • DOI: https://doi.org/10.1007/s13042-021-01295-8