CoPart: a context-based partitioning technique for big data
Journal of Big Data ( IF 8.6 ) Pub Date : 2021-01-19 , DOI: 10.1186/s40537-021-00410-4
Sara Migliorini , Alberto Belussi , Elisa Quintarelli , Damiano Carra

The MapReduce programming paradigm is frequently used to process and analyse huge amounts of data. This paradigm relies on the ability to apply the same operation in parallel to independent chunks of data. As a consequence, the overall performance greatly depends on the way data are partitioned among the various computation nodes. The default partitioning technique provided by systems like Hadoop or Spark basically performs a random subdivision of the input records, without considering their nature or the correlation between them. Even if such an approach can be appropriate in the simplest case, where all the input records always have to be analyzed, it becomes a limitation for sophisticated analyses, in which correlations between records can be exploited to prune unnecessary computations in advance. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which takes data correlation into account in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and query execution times.
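The abstract does not detail CoPart's actual algorithm, but the core idea — grouping correlated records into the same split using distribution-aware bins over contextual attributes — can be sketched as follows. All names here (`copart_like_split`, the `time`/`temp` attributes, the equi-depth binning) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch: partition records into splits by contextual
# attributes, with per-dimension boundaries that follow the data
# distribution (equi-depth bins) rather than a uniform grid.
from collections import defaultdict

def quantile_boundaries(values, n_bins):
    """Distribution-aware split points for one contextual dimension."""
    s = sorted(values)
    return [s[(i * len(s)) // n_bins] for i in range(1, n_bins)]

def copart_like_split(records, context_keys, n_bins=2):
    """Assign each record to the multi-dimensional cell (split) that
    its contextual attribute values fall into."""
    bounds = {
        k: quantile_boundaries([r[k] for r in records], n_bins)
        for k in context_keys
    }
    splits = defaultdict(list)
    for r in records:
        cell = tuple(
            sum(r[k] >= b for b in bounds[k])  # bin index per dimension
            for k in context_keys
        )
        splits[cell].append(r)
    return dict(splits)

records = [
    {"time": 1, "temp": 10}, {"time": 2, "temp": 11},
    {"time": 9, "temp": 30}, {"time": 10, "temp": 31},
]
splits = copart_like_split(records, ["time", "temp"])
# Correlated records (nearby time/temp) land in the same split, so a
# query restricted to one context region can skip the other splits.
```

A random partitioner, by contrast, would scatter these four records across splits, forcing every split to be scanned even for a narrow contextual query.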




Updated: 2021-01-19