General Temporally Biased Sampling Schemes for Online Model Management
ACM Transactions on Database Systems (IF 2.2), Pub Date: 2019-12-09, DOI: 10.1145/3360903
Brian Hentschel, Peter J. Haas, Yuanyuan Tian

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified “decay function.” We then periodically retrain the models on the current sample. This approach speeds up the training process relative to training on all of the data. Moreover, time-biasing lets the models adapt to recent changes in the data while—unlike in a sliding-window approach—still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (Targeted-Size Time-Biased Sampling (T-TBS)) that probabilistically maintains a target sample size and a novel reservoir-based scheme (Reservoir-Based Time-Biased Sampling (R-TBS)) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. If the decay function is exponential, then control over the decay rate is complete, and R-TBS maximizes both expected sample size and sample-size stability. For general decay functions, the actual item inclusion probabilities can be made arbitrarily close to the nominal probabilities, and we provide a scheme that allows a tradeoff between sample footprint and sample-size stability. R-TBS rests on the notion of a “fractional sample” and allows for data arrival rates that are unknown and time varying (unlike T-TBS). The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
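As a rough illustration of the idea of temporally biased sampling (not the paper's exact T-TBS or R-TBS procedures), the following minimal Python sketch maintains an exponentially time-biased Bernoulli sample whose size is controlled only in expectation, in the spirit of T-TBS. The batch-per-time-step structure, the parameter names target_size and decay_rate, and the steady-state size calculation are assumptions made here for illustration.

```python
import math
import random

def time_biased_bernoulli_sample(batches, target_size, decay_rate):
    """Sketch of exponentially time-biased Bernoulli sampling.

    At each time step, every item already in the sample is retained with
    probability exp(-decay_rate), so an item that arrived k steps ago
    remains with probability exp(-decay_rate * k).  New arrivals are
    accepted with a probability chosen so that, for a roughly constant
    batch size b, the expected sample size settles near target_size.
    The size is controlled only in expectation, unlike a reservoir-based
    scheme with a hard upper bound.
    """
    sample = []
    for batch in batches:
        # Age the current sample: drop each item w.p. 1 - exp(-decay_rate).
        keep_prob = math.exp(-decay_rate)
        sample = [x for x in sample if random.random() < keep_prob]

        # Accept new items so the expected steady-state size is target_size.
        b = max(len(batch), 1)
        accept_prob = min(1.0, target_size * (1.0 - keep_prob) / b)
        sample.extend(x for x in batch if random.random() < accept_prob)
        yield list(sample)

# Example: three batches of streaming items; recent items dominate the sample.
stream = [[("t0", i) for i in range(1000)],
          [("t1", i) for i in range(1000)],
          [("t2", i) for i in range(1000)]]
for snapshot in time_biased_bernoulli_sample(stream, target_size=200, decay_rate=0.5):
    print(len(snapshot))
```

A Bernoulli scheme like this can overshoot or undershoot the target size; the reservoir-based R-TBS scheme described in the abstract additionally guarantees a hard upper bound on the sample size while controlling the decay rate.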

Updated: 2019-12-09