Big Data Research (IF 3.5), Pub Date: 2021-02-05, DOI: 10.1016/j.bdr.2021.100191
Mingming Chen, Ning Wang, Daxin Zhu, Jedi S. Shang
With the proliferation of GPS-based devices and location-based services, spatio-textual objects play an indispensable role in spatial data management, and it is important to support join operations among groups of such objects. In this paper, we study a novel problem, the spatio-textual object cluster join (STOC-Join). Given two sets of spatio-textual objects and a similarity threshold θ, STOC-Join finds all object cluster pairs whose spatio-textual similarity is no less than θ. The problem is practical in a variety of application scenarios, including location-based event detection, location-based data cleaning, and, more generally, location-based social media data pre-processing. Processing STOC-Join efficiently is challenging in three respects: (1) how to define and compute the spatio-textual similarity between two clusters of spatio-textual objects effectively; (2) how to cluster a large number of spatio-textual objects efficiently; and (3) how to find similar cluster pairs efficiently while filtering out unqualified candidate pairs. To address these challenges, we define an effective and easy-to-compute similarity metric that measures the aggregated similarity between two groups of spatio-textual objects. Based on this metric, we propose a novel two-phase matching algorithm that clusters a large number of spatio-textual objects and then finds all qualifying cluster pairs efficiently. Our experiments on large real-life datasets confirm that the proposed two-phase matching algorithm is substantially more efficient than straightforward methods.
Title (translated from the Chinese listing): Efficient Algorithms for Spatio-Textual Object Cluster Join
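To make the problem statement concrete, the following is a minimal sketch of a brute-force STOC-Join of the kind the abstract calls a straightforward method. It is not the paper's two-phase algorithm, and the abstract does not give the paper's similarity metric. Everything here is an assumption made for illustration: the object representation (a coordinate pair plus a keyword set), the per-object score (a weighted mix of distance-based proximity and Jaccard keyword overlap, with hypothetical parameters `alpha` and `max_dist`), the aggregate (a symmetric average of each object's best match), and the premise that clusters have already been formed.

```python
from itertools import product
from math import hypot

# Hypothetical representation: an object is ((x, y), set_of_keywords),
# and a cluster is a non-empty list of such objects.

def object_similarity(a, b, alpha=0.5, max_dist=10.0):
    """Assumed metric: alpha * spatial proximity + (1 - alpha) * Jaccard overlap."""
    (ax, ay), a_words = a
    (bx, by), b_words = b
    # Proximity falls linearly from 1 (same point) to 0 (max_dist or farther).
    spatial = max(0.0, 1.0 - hypot(ax - bx, ay - by) / max_dist)
    union = a_words | b_words
    textual = len(a_words & b_words) / len(union) if union else 0.0
    return alpha * spatial + (1 - alpha) * textual

def cluster_similarity(c1, c2, **params):
    """Assumed aggregate: symmetric average of each object's best match."""
    def directed(src, dst):
        best = (max(object_similarity(o, p, **params) for p in dst) for o in src)
        return sum(best) / len(src)
    return 0.5 * (directed(c1, c2) + directed(c2, c1))

def naive_stoc_join(clusters_a, clusters_b, theta, **params):
    """Brute force: score every cluster pair, keep index pairs scoring >= theta."""
    pairs = product(enumerate(clusters_a), enumerate(clusters_b))
    return [(i, j) for (i, ca), (j, cb) in pairs
            if cluster_similarity(ca, cb, **params) >= theta]

clusters_a = [[((0, 0), {"coffee", "cafe"}), ((1, 0), {"coffee"})]]
clusters_b = [[((0, 1), {"coffee", "cafe"})], [((100, 100), {"museum"})]]
print(naive_stoc_join(clusters_a, clusters_b, theta=0.5))  # prints [(0, 0)]
```

The sketch scores every cluster pair, so its cost grows with the product of the two cluster counts and of the cluster sizes. That quadratic cost is what the filtering step in challenge (3) is meant to avoid.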