FlashP: An Analytical Pipeline for Real-time Forecasting of Time-Series Relational Data
arXiv - CS - Databases, Pub Date: 2021-01-09, DOI: arxiv-2101.03298. Authors: Shuyuan Yan, Bolin Ding, Wei Guo, Jingren Zhou, Zhewei Wei, Xiaowei Jiang, Sheng Xu
Interactive response time is important in analytical pipelines for users to
explore a sufficient number of possibilities and make informed business
decisions. We consider a forecasting pipeline with large volumes of
high-dimensional time series data. Real-time forecasting can be conducted in
two steps. First, we specify the portion of data to be focused on and the
measure to be predicted by slicing, dicing, and aggregating the data. Second, a
forecasting model is trained on the aggregated results to predict the trend of
the specified measure. While there are a number of forecasting models
available, the first step is the performance bottleneck. A natural idea is to
utilize sampling to obtain approximate aggregations in real time as the input
to train the forecasting model. Our scalable real-time forecasting system
FlashP (Flash Prediction) is built on this idea, with two major
challenges to be resolved in this paper: first, we need to determine how
approximate aggregations affect the fitting of forecasting models and the
forecasting results; and second, accordingly, which sampling algorithms we
should use to obtain these approximate aggregations and how large the samples
should be. We introduce a new sampling scheme, called GSW sampling, and analyze error
bounds for estimating aggregations using GSW samples. We also show how to
construct compact GSW samples in the presence of multiple measures to be
analyzed. FlashP is deployed at Alibaba for data scientists to analyze and
predict the status of advertisement slots in real time. We conduct experiments
to evaluate our solution and compare it with alternatives on real data.
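
The two-step pipeline described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: uniform Bernoulli sampling with inverse-probability scaling stands in for GSW sampling, whose definition the abstract does not give, and a least-squares linear trend stands in for the forecasting model. All names (`make_rows`, `bernoulli_sample`, `approx_daily_sum`, `fit_linear_trend`) and the synthetic `day` / `region` / `impressions` schema are hypothetical.

```python
import random


def make_rows(days=30, per_day=2000, seed=1):
    """Build synthetic time-series relational data (hypothetical schema)."""
    rng = random.Random(seed)
    rows = []
    for day in range(days):
        for _ in range(per_day):
            rows.append({
                "day": day,
                "region": rng.choice(["east", "west"]),
                "impressions": 10 + day + rng.random(),
            })
    return rows


def bernoulli_sample(rows, rate, seed=0):
    """Keep each row independently with probability `rate`.

    Assumption: plain uniform sampling, used here only as a stand-in
    for the GSW sampling scheme introduced in the paper.
    """
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < rate]


def approx_daily_sum(sample, rate, predicate, measure):
    """Step 1: estimate SUM(measure) per day over rows matching `predicate`.

    Each sampled row is scaled by 1/rate, which makes the estimate
    unbiased under Bernoulli sampling.
    """
    totals = {}
    for row in sample:
        if predicate(row):
            day = row["day"]
            totals[day] = totals.get(day, 0.0) + row[measure] / rate
    return totals


def fit_linear_trend(series):
    """Step 2: least-squares fit of y = a + b * t; returns (a, b).

    Assumption: a linear trend is only a placeholder for a real
    forecasting model.
    """
    ts = sorted(series)
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(series[t] for t in ts) / n
    var_t = sum((t - mean_t) ** 2 for t in ts)
    cov = sum((t - mean_t) * (series[t] - mean_y) for t in ts)
    b = cov / var_t
    return mean_y - b * mean_t, b


if __name__ == "__main__":
    rows = make_rows()
    rate = 0.1
    sample = bernoulli_sample(rows, rate)
    east = lambda r: r["region"] == "east"  # the "slice" of interest
    approx = approx_daily_sum(sample, rate, east, "impressions")
    a, b = fit_linear_trend(approx)
    print("forecast for day 30:", a + b * 30)
```

In this sketch the aggregation touches only the sampled rows, so its cost shrinks roughly in proportion to the sampling rate, while the model is trained on noisy per-day totals. How that noise propagates into the fitted model and the forecasts is the first of the two challenges the abstract names.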
Updated: 2021-01-12