Cache-Based Multi-Query Optimization for Data-Intensive Scalable Computing Frameworks, Information Systems Frontiers

Information Systems Frontiers (IF 5.9), Pub Date: 2020-03-04, DOI: 10.1007/s10796-020-09995-2
Pietro Michiardi, Damiano Carra, Sara Migliorini

In modern large-scale distributed systems, analytics jobs submitted by various users often share similar work, for example scanning and processing the same subset of data. Instead of optimizing jobs independently, which may result in redundant and wasteful processing, multi-query optimization techniques can be employed to save a considerable amount of cluster resources. In this work, we introduce a novel method combining in-memory cache primitives and multi-query optimization, to improve the efficiency of data-intensive, scalable computing frameworks. By careful selection and exploitation of common (sub)expressions, while satisfying memory constraints, our method transforms a batch of queries into a new, more efficient one which avoids unnecessary recomputations. To find feasible and efficient execution plans, our method uses a cost-based optimization formulation akin to the multiple-choice knapsack problem. Extensive experiments on a prototype implementation of our system show significant benefits of worksharing for both TPC-DS workloads and detailed micro-benchmarks.
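To make the knapsack analogy concrete, here is a minimal sketch of a multiple-choice knapsack selection in Python. It is not the authors' formulation or cost model: the function name `select_cache_plans`, the (memory, saving) pairs, and the integer memory units are all assumptions made for illustration. Each group stands for one shared (sub)expression with several hypothetical ways of caching it; at most one option is picked per group, and the total memory must fit the budget.

```python
from typing import List, Tuple

# Hypothetical candidate: (memory units needed, estimated cost saved).
Option = Tuple[int, float]


def select_cache_plans(groups: List[List[Option]], budget: int) -> Tuple[float, List[int]]:
    """Pick at most one option per group to maximize total saving within budget.

    Returns (best_saving, choices), where choices[g] is the index of the option
    picked in group g, or -1 if nothing from that group is cached.
    """
    # dp[m]: best saving over the groups seen so far using at most m memory units.
    dp = [0.0] * (budget + 1)
    # picks[m]: the per-group choices that achieve dp[m].
    picks: List[List[int]] = [[] for _ in range(budget + 1)]

    for options in groups:
        # Default for every capacity: skip this group entirely.
        new_dp = dp[:]
        new_picks = [p + [-1] for p in picks]
        for m in range(budget + 1):
            for idx, (mem, saving) in enumerate(options):
                if mem <= m and dp[m - mem] + saving > new_dp[m]:
                    new_dp[m] = dp[m - mem] + saving
                    new_picks[m] = picks[m - mem] + [idx]
        dp, picks = new_dp, new_picks

    return dp[budget], picks[budget]


if __name__ == "__main__":
    # Three hypothetical shared sub-expressions, each with its own caching options.
    groups = [
        [(2, 3.0), (4, 5.0)],
        [(3, 4.0)],
        [(1, 1.0), (5, 9.0)],
    ]
    saving, choices = select_cache_plans(groups, budget=6)
    print(saving, choices)
```

This dynamic program runs in time proportional to the budget times the total number of options, which is only practical when memory is discretized coarsely. The paper's actual optimizer may use a different formulation or solver.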

Last updated: 2020-04-21