Adaptive partitioning and indexing for in situ query processing, The VLDB Journal


Adaptive partitioning and indexing for in situ query processing
The VLDB Journal (IF 2.8) Pub Date: 2019-11-15, DOI: 10.1007/s00778-019-00580-x
Matthaios Olma, Manos Karpathiotakis, Ioannis Alagiannis, Manos Athanassoulis, Anastasia Ailamaki

The constant flux of data and queries alike has been pushing the boundaries of data analysis systems. The increasing size of raw data files has made data loading an expensive operation that lengthens the data-to-insight time. To alleviate the loading cost, in situ query processing systems operate directly over raw data and offer instant access to it. At the same time, analytical workloads contain an increasing number of queries. Typically, each query focuses on a constantly shifting, yet small, range. As a result, minimizing workload latency requires bringing the benefits of indexing to in situ query processing. In this paper, we present an online partitioning and indexing scheme, along with a partitioning and indexing tuner tailored for in situ querying engines. The proposed system design improves query execution time by taking user query patterns into account to (i) partition raw data files logically and (ii) build lightweight, partition-specific indexes for each partition. We build an in situ query engine called Slalom to showcase the impact of our design. Slalom employs adaptive partitioning and builds non-obtrusive indexes in different partitions on the fly, based on lightweight monitoring of query access patterns. As a result of its lightweight nature, Slalom achieves efficient query processing over raw data with minimal memory consumption. Our experimentation with both microbenchmarks and real-life workloads shows that Slalom outperforms state-of-the-art in situ engines and achieves query response times comparable to those of a fully indexed DBMS, offering lower cumulative query execution times for query workloads of increasing size and with unpredictable access patterns.
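The general idea in the abstract, logical partitions that receive an index only once queries keep touching them, can be illustrated with a small toy. The sketch below is not Slalom's implementation: the class names `Partition` and `AdaptiveScanner`, the fixed-size partitioning, the in-memory list standing in for a raw file, and the `index_after` hit threshold are all assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    """A logical slice [start, end) of the raw data; no rows are copied."""
    start: int
    end: int
    hits: int = 0                 # queries that returned rows from this slice
    keys: Optional[list] = None   # sorted values, built lazily
    rows: Optional[list] = None   # row ids aligned with `keys`


class AdaptiveScanner:
    """Toy range-query engine over a list of values (hypothetical design)."""

    def __init__(self, values, partition_size=4, index_after=2):
        self.values = values
        self.index_after = index_after  # assumed threshold, not from the paper
        self.partitions = [
            Partition(s, min(s + partition_size, len(values)))
            for s in range(0, len(values), partition_size)
        ]

    def _build_index(self, p):
        # Sort (value, row) pairs once so later queries can binary-search.
        pairs = sorted((self.values[r], r) for r in range(p.start, p.end))
        p.keys = [v for v, _ in pairs]
        p.rows = [r for _, r in pairs]

    def range_query(self, lo, hi):
        """Return sorted row ids whose value v satisfies lo <= v <= hi."""
        result = []
        for p in self.partitions:
            if p.keys is not None:
                # Indexed partition: binary search instead of a scan.
                i = bisect_left(p.keys, lo)
                j = bisect_right(p.keys, hi)
                matched = p.rows[i:j]
            else:
                # Unindexed partition: scan the raw values in place.
                matched = [r for r in range(p.start, p.end)
                           if lo <= self.values[r] <= hi]
            if matched:
                # Lightweight monitoring: count the hit, index when hot.
                p.hits += 1
                if p.keys is None and p.hits >= self.index_after:
                    self._build_index(p)
            result.extend(matched)
        return sorted(result)
```

With `AdaptiveScanner([5, 1, 9, 3, 7, 2, 8, 6])`, two calls to `range_query(1, 1)` make only the first partition hot, so only that partition is indexed; a later `range_query(1, 3)` then mixes an index lookup on the first partition with a plain scan of the second. Partitions that queries never touch cost no index memory, which is the trade-off the abstract points to.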


Updated: 2019-11-15
