Configuring Parallelism for Hybrid Layouts Using Multi-Objective Optimization.
Big Data (IF 2.6), Pub Date: 2020-06-01, DOI: 10.1089/big.2019.0068
Rana Faisal Munir 1,2, Alberto Abelló 1, Oscar Romero 1, Maik Thiele 2, Wolfgang Lehner 2

Modern organizations typically store their data in a raw format in data lakes. These data are then processed and usually stored under hybrid layouts, because such layouts support projection and selection operations and thus make it possible (when required) to read less data from disk. However, distributed processing frameworks (e.g., Hadoop, Spark) do not exploit this well when analytical queries are posed. These frameworks divide the data into multiple partitions and process each partition in a separate task, creating tasks based on the total file size rather than the actual size of the data to be read. This typically leads to launching more tasks than needed, which in turn increases query execution time and wastes significant computing resources. To use resources more efficiently and reduce query execution time, we propose a method that decides the number of tasks based on the data actually being read. To this end, we first propose a cost-based model for estimating the size of data read in hybrid layouts. Next, we use the estimated read size in a multi-objective optimization method to decide the number of tasks and the computational resources to be used. We prototyped our solution for Apache Parquet and Spark and found that our estimates are highly correlated (0.96) with real executions. Further, using TPC-H we show that our recommended configurations are only 5.6% away from the Pareto front and provide a 2.1× speedup compared with the default configuration.
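To make the approach concrete, below is a minimal, self-contained Python sketch of the idea, not the authors' actual cost model: it estimates the read size of a Parquet-like hybrid layout from hypothetical per-column sizes and a predicate selectivity, derives a task count from that estimate rather than from the total file size, and filters candidate task counts to a Pareto front over two objectives (estimated runtime and resources used). All column statistics, the 128 MiB split size, and the toy runtime model are assumptions for illustration only.

```python
import math

# Hypothetical compressed on-disk sizes per column for a Parquet-like
# table; illustrative numbers, not taken from the paper.
COLUMN_BYTES = {
    "l_orderkey": 200_000_000,
    "l_quantity": 150_000_000,
    "l_extendedprice": 180_000_000,
    "l_comment": 900_000_000,
}

def estimated_read_bytes(projected_columns, selectivity):
    """Read-size estimate for a hybrid layout: only projected columns
    are fetched, and predicate pushdown skips data roughly in
    proportion to the predicate's selectivity."""
    return sum(COLUMN_BYTES[c] for c in projected_columns) * selectivity

def recommended_tasks(read_bytes, split_bytes=128 * 1024**2):
    """Task count derived from the data actually read, instead of the
    total file size that default splitting would use."""
    return max(1, math.ceil(read_bytes / split_bytes))

def pareto_front(points):
    """Keep (runtime, resources) points not dominated by any other."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p
                       for q in points)]

read = estimated_read_bytes(["l_orderkey", "l_quantity"], selectivity=0.2)
candidates = []
for tasks in (1, 2, 4, 8, 16, 32):
    # Toy runtime model: scan time shrinks with parallelism, while
    # per-task scheduling overhead grows with the task count.
    runtime = read / (tasks * 50e6) + 0.5 * tasks
    candidates.append((runtime, tasks))

print(f"estimated read: {read / 1024**2:.1f} MiB")
print(f"tasks from read size: {recommended_tasks(read)}")
print("Pareto-optimal (runtime s, tasks):", sorted(pareto_front(candidates)))
```

On this toy model the front keeps only the non-dominated configurations, exposing the runtime-versus-resources trade-off from which a single recommendation can then be picked.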

Updated: 2020-06-01