An Efficient Algorithm for Spatio-Textual Object Cluster Join, Big Data Research

Big Data Research (IF 3.5), Pub Date: 2021-02-05, DOI: 10.1016/j.bdr.2021.100191
Mingming Chen, Ning Wang, Daxin Zhu, Jedi S. Shang

With the proliferation of GPS-enabled devices and location-based services, spatio-textual objects play an indispensable role in spatial data management, and supporting join operations among groups of such objects is of great importance. In this paper, we study a novel problem, the spatio-textual object cluster join (STOC-Join). Given two sets of spatio-textual objects $D_{1}$ and $D_{2}$ and a similarity threshold θ, STOC-Join finds all pairs of object clusters whose spatio-textual similarity is no less than θ. The problem arises in a variety of application scenarios, including location-based event detection, location-based data cleaning, and, more generally, pre-processing of location-based social media data. Processing STOC-Join efficiently is challenging in three respects: (1) how to define and compute the spatio-textual similarity between two clusters of spatio-textual objects effectively; (2) how to cluster a large number of spatio-textual objects efficiently; and (3) how to find similar cluster pairs efficiently while filtering out unqualified candidate pairs. To address these challenges, we define an effective and easy-to-compute similarity metric that measures the aggregated similarity between two groups of spatio-textual objects. Based on this metric, we propose a novel two-phase matching algorithm that clusters a large number of spatio-textual objects and finds all qualifying cluster pairs efficiently. Experiments on large real-life datasets confirm that the proposed two-phase matching algorithm is considerably more efficient than straightforward methods.
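To make the problem statement concrete, the following Python sketch shows one possible straightforward baseline for STOC-Join: cluster each dataset, then compare every cluster pair against θ. The abstract does not give the paper's similarity metric, clustering method, or two-phase algorithm, so everything here is an assumption made for illustration: the names `STObject`, `grid_cluster`, `cluster_similarity`, and `naive_stoc_join`, the grid-based clustering, the weighted combination of centroid proximity and token-set Jaccard similarity, and the parameters `alpha`, `max_dist`, and `cell_size`.

```python
from dataclasses import dataclass
from itertools import product
from math import hypot


@dataclass(frozen=True)
class STObject:
    """A spatio-textual object: a 2D location plus a set of keywords."""
    x: float
    y: float
    tokens: frozenset


def grid_cluster(objects, cell_size):
    """Assumed clustering step: group objects by the grid cell they fall in."""
    clusters = {}
    for obj in objects:
        key = (int(obj.x // cell_size), int(obj.y // cell_size))
        clusters.setdefault(key, []).append(obj)
    return list(clusters.values())


def centroid(cluster):
    """Mean location of a non-empty cluster."""
    n = len(cluster)
    return (sum(o.x for o in cluster) / n, sum(o.y for o in cluster) / n)


def cluster_similarity(c1, c2, alpha=0.5, max_dist=10.0):
    """Assumed metric (not the paper's): alpha * spatial + (1 - alpha) * textual."""
    (x1, y1), (x2, y2) = centroid(c1), centroid(c2)
    # Spatial part: centroid distance mapped linearly to [0, 1], clipped at max_dist.
    spatial = max(0.0, 1.0 - hypot(x1 - x2, y1 - y2) / max_dist)
    # Textual part: Jaccard similarity of the clusters' merged keyword sets.
    t1 = frozenset().union(*(o.tokens for o in c1))
    t2 = frozenset().union(*(o.tokens for o in c2))
    union = t1 | t2
    textual = len(t1 & t2) / len(union) if union else 0.0
    return alpha * spatial + (1 - alpha) * textual


def naive_stoc_join(d1, d2, theta, cell_size=1.0):
    """Baseline join: compare every cluster of d1 with every cluster of d2."""
    clusters1 = grid_cluster(d1, cell_size)
    clusters2 = grid_cluster(d2, cell_size)
    return [
        (c1, c2)
        for c1, c2 in product(clusters1, clusters2)
        if cluster_similarity(c1, c2) >= theta
    ]
```

This baseline evaluates all |C1| × |C2| cluster pairs, which is the kind of cost that a filtering step for unqualified candidate pairs, as mentioned in the abstract, would aim to avoid.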

Updated: 2021-02-08
