FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data,arXiv - CS - Databases

arXiv - CS - Databases, Pub Date: 2021-01-09, DOI: arxiv-2101.03298
Shuyuan Yan, Bolin Ding, Wei Guo, Jingren Zhou, Zhewei Wei, Xiaowei Jiang, Sheng Xu

Interactive response time is important in analytical pipelines, because it lets users explore a sufficient number of possibilities and make informed business decisions. We consider a forecasting pipeline with large volumes of high-dimensional time series data. Real-time forecasting can be conducted in two steps. First, we specify the portion of data to be focused on and the measure to be predicted by slicing, dicing, and aggregating the data. Second, a forecasting model is trained on the aggregated results to predict the trend of the specified measure. While a number of forecasting models are available, the first step is the performance bottleneck. A natural idea is to use sampling to obtain approximate aggregations in real time as the input for training the forecasting model. Our scalable real-time forecasting system FlashP (Flash Prediction) is built on this idea, and this paper resolves two major challenges: first, how approximate aggregations affect the fitting of forecasting models and the forecasting results; and second, accordingly, which sampling algorithms should be used to obtain these approximate aggregations and how large the samples should be. We introduce a new sampling scheme, called GSW sampling, and analyze error bounds for estimating aggregations using GSW samples. We show how to construct compact GSW samples when multiple measures are to be analyzed. FlashP is deployed in Alibaba for data scientists to analyze and predict the status of advertisement slots in real time. We conduct experiments to evaluate our solution and compare it with alternatives on real data.
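The two-step pipeline described in the abstract can be illustrated with a minimal sketch. This is not the FlashP implementation and does not implement GSW sampling, whose definition is not given here. As a stand-in, it assumes uniform Bernoulli sampling with inverse-probability (Horvitz-Thompson) scaling for the approximate aggregation, and an ordinary least-squares linear trend as the forecasting model. The data generator, the segment labels, and all function names are hypothetical.

```python
import random


def make_rows(num_days=30, rows_per_day=2000, seed=0):
    """Generate synthetic (day, segment, measure) rows. Hypothetical data."""
    rng = random.Random(seed)
    rows = []
    for day in range(num_days):
        for _ in range(rows_per_day):
            segment = rng.choice(["A", "B", "C"])
            measure = rng.uniform(0.5, 1.5) * (10 + 0.2 * day)
            rows.append((day, segment, measure))
    return rows


def bernoulli_sample(rows, rate, seed=1):
    """Keep each row independently with probability `rate` (assumed scheme)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < rate]


def exact_daily_sum(rows, segment, num_days):
    """Step 1, exact: slice on `segment`, then SUM(measure) grouped by day."""
    totals = [0.0] * num_days
    for day, seg, measure in rows:
        if seg == segment:
            totals[day] += measure
    return totals


def approx_daily_sum(sample, segment, num_days, rate):
    """Step 1, approximate: scale each sampled row by 1 / rate."""
    totals = [0.0] * num_days
    for day, seg, measure in sample:
        if seg == segment:
            totals[day] += measure / rate
    return totals


def fit_linear_trend(series):
    """Least-squares fit of series[t] ~ intercept + slope * t."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(series, horizon):
    """Step 2: extrapolate the fitted trend `horizon` steps past the series."""
    slope, intercept = fit_linear_trend(series)
    n = len(series)
    return [intercept + slope * (n + h) for h in range(horizon)]


if __name__ == "__main__":
    rate = 0.1
    rows = make_rows()
    sample = bernoulli_sample(rows, rate)
    exact = forecast(exact_daily_sum(rows, "A", 30), horizon=7)
    approx = forecast(approx_daily_sum(sample, "A", 30, rate), horizon=7)
    for e, a in zip(exact, approx):
        print(f"exact={e:10.1f}  approx={a:10.1f}  rel_err={abs(a - e) / e:.3f}")
```

In this sketch the approximate aggregation scans only the sample, which is where the latency saving would come from, while the forecasting step is unchanged. How the sampling error propagates into the fitted model is exactly the question the abstract raises; the sketch only lets one observe it empirically on toy data.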


Updated: 2021-01-12
