Estimating record linkage costs in distributed environments
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-05-20, DOI: 10.1016/j.jpdc.2020.05.003
Dimas Cassimiro Nascimento, Carlos Eduardo Santos Pires, Tiago Brasileiro Araujo, Demetrio Gomes Mestre

Record Linkage (RL) is the task of identifying duplicate entities in one or more datasets. In the era of Big Data, this task has gained considerable attention because its intrinsic complexity is quadratic in the size of the dataset. In practice, the task can be outsourced to a cloud service, so a service customer may want to estimate the cost of a record linkage solution before executing it. Because the execution time of a record linkage solution depends on the combination of algorithms used, their parameter values, and the cloud infrastructure employed, it is hard to estimate infrastructure costs a priori. Besides informing customer costs, such estimates are also important for evaluating whether a given set of RL parameter values will satisfy predefined time and budget restrictions. To tackle these challenges, we propose a theoretical model for estimating RL costs that takes into account the main steps that may influence the execution time of the RL task. We also propose an algorithm, denoted TBF, for evaluating the feasibility of RL parameter values given a set of predefined customer restrictions. We evaluate the efficacy of the proposed model, combined with regression techniques, using record linkage results processed in real distributed environments. The experimental results show that the choice of regression technique has a significant influence on the estimated record linkage costs. Moreover, we conclude that specific regression techniques are more suitable than others for estimating record linkage costs, depending on the evaluated scenario.
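To make the quadratic growth and the time and budget restrictions concrete, here is a minimal Python sketch. It is not the cost model or the TBF algorithm from the paper, whose details the abstract does not give. It assumes a naive all-pairs comparison with no blocking, perfectly linear speed-up across workers, and hypothetical parameters (`secs_per_comparison`, `n_workers`, `price_per_worker_hour`) chosen only for illustration.

```python
def candidate_pairs(n_records: int) -> int:
    """Pairwise comparisons needed to deduplicate n records with no blocking."""
    return n_records * (n_records - 1) // 2


def estimate_cost(n_records, secs_per_comparison, n_workers, price_per_worker_hour):
    """Return (wall-clock hours, monetary cost) under the stated assumptions.

    Assumption: comparisons are spread evenly over the workers with no
    overhead, which real distributed environments rarely achieve.
    """
    pairs = candidate_pairs(n_records)
    hours = pairs * secs_per_comparison / n_workers / 3600
    cost = hours * n_workers * price_per_worker_hour
    return hours, cost


def is_feasible(hours, cost, max_hours, max_budget):
    """Check an estimate against hypothetical customer time and budget limits."""
    return hours <= max_hours and cost <= max_budget


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        hours, cost = estimate_cost(n, 1e-4, n_workers=4, price_per_worker_hour=0.5)
        print(n, candidate_pairs(n), round(hours, 4), round(cost, 4),
              is_feasible(hours, cost, max_hours=0.01, max_budget=1.0))
```

Doubling the number of records roughly quadruples the number of comparisons, which is why an a priori estimate matters before committing to a parameter setting. Under this simplified model, adding workers shortens the wall-clock time but leaves the total cost unchanged.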




Updated: 2020-05-20