

A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures
Journal of Cloud Computing (IF 3.7), Pub Date: 2020-01-17, DOI: 10.1186/s13677-019-0150-y
Maria Th. Kotouza, Fotis E. Psomopoulos, Pericles A. Mitkas

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchical structure that, combined with the produced clusters, can be useful in managing documents: it makes browsing and navigation easier and quicker, and it allows only relevant information to be returned for users' queries by leveraging the structural relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm that composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module that uses graph theory to obtain insights into the connections between the leaf clusters. The feature vectors representing each document are derived from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving a significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups, and the results are promising.
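The abstract names the tree-building metrics but does not define them, so the following Python sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes each document is a vector of discretized topic weights (one bin index per topic), takes Identity to be the fraction of topics on which every document in a cluster shares the same bin, and takes Entropy to be the Shannon entropy of the bin distribution averaged over topics. Bin Similarity is left out because the abstract gives no hint of its definition. The function names, the split rule and the threshold values are all hypothetical.

```python
from collections import Counter
from math import log2


def bin_frequencies(cluster):
    """For each topic position, count how many documents fall in each bin."""
    return [Counter(column) for column in zip(*cluster)]


def identity(cluster):
    """Assumed definition: fraction of topics where all documents share one bin."""
    freqs = bin_frequencies(cluster)
    return sum(1 for f in freqs if len(f) == 1) / len(freqs)


def mean_entropy(cluster):
    """Assumed definition: Shannon entropy (bits) per topic, averaged over topics."""
    n_docs = len(cluster)
    total = 0.0
    for f in bin_frequencies(cluster):
        total -= sum((c / n_docs) * log2(c / n_docs) for c in f.values())
    return total / len(cluster[0])


def should_split(cluster, min_identity=0.8, max_entropy=0.5):
    """Hypothetical threshold rule: keep splitting while the cluster is too mixed."""
    return identity(cluster) < min_identity or mean_entropy(cluster) > max_entropy


# Four documents, three topics; each value is a bin index (illustrative data only).
cluster = [
    [0, 2, 1],
    [0, 2, 3],
    [0, 1, 1],
    [0, 2, 1],
]

print(identity(cluster))
print(mean_entropy(cluster))
print(should_split(cluster))
```

Because both metrics are computed from per-topic bin counts rather than from pairwise document distances, a sketch like this scales with the number of documents times the number of topics, which is consistent with the memory and runtime savings the abstract claims for a frequency-based approach.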


Updated: 2020-04-16
